The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
CSR to switch the memory consistency model of user mode at run-time from
RVWMO to TSO. The active consistency model can therefore be switched on a
per-hart basis and managed by the kernel on a per-process basis.
This patchset implements basic Ssdtso support and adds a prctl API on top
so that user-space processes can switch to a stronger memory consistency
model (than the kernel was written for) at run-time.
The patchset also comes with a short documentation of the prctl API.
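As a rough illustration of how a user-space process might request the stronger model, consider the sketch below. The macro names are assumptions about what the patched <linux/prctl.h> exports (they are not quoted in this cover letter), and request_tso() is a hypothetical helper. No numeric prctl values are invented: on a system whose headers lack the proposed call, the helper only reports ENOSYS.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: PR_SET_MEMORY_CONSISTENCY_MODEL and
 * PR_MEMORY_CONSISTENCY_MODEL_RISCV_TSO are guesses at the names the
 * patched uapi header provides. If they are absent, nothing is called.
 *
 * Returns 0 on success, -1 on failure (errno set).
 */
int request_tso(void)
{
#if defined(PR_SET_MEMORY_CONSISTENCY_MODEL) && \
    defined(PR_MEMORY_CONSISTENCY_MODEL_RISCV_TSO)
	/* Ask the kernel to run this process under TSO from now on. */
	return prctl(PR_SET_MEMORY_CONSISTENCY_MODEL,
		     PR_MEMORY_CONSISTENCY_MODEL_RISCV_TSO, 0, 0, 0);
#else
	/* Headers without the proposed prctl: report "not implemented". */
	errno = ENOSYS;
	return -1;
#endif
}
```

Since this series removes the ability to switch back from TSO to WMO, a caller would presumably issue such a request once, early in process start-up.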
This series is based on the third draft of the Ssdtso specification
which can be found here:
https://github.com/riscv/riscv-ssdtso/releases/tag/v1.0-draft3
Note that the Ssdtso specification is still under development
(i.e., neither frozen nor ratified), which is also the reason
why this series is marked as RFC.
This series saw the following changes since v1:
* Reordered/restructured patches
* Fixed build issues
* Addressed typos
* Removed ability to switch TSO->WMO
* Moved the state from per-thread to per-process
* Reschedule all CPUs after switching
* Some cleanups in the documentation
* Added compatibility with Ztso (spec change in draft 3)
This patchset can also be found in this GitHub branch:
https://github.com/cmuellner/linux/tree/ssdtso-v2
A QEMU implementation of DTSO can be found in this GitHub branch:
https://github.com/cmuellner/qemu/tree/ssdtso-v2
Christoph Müllner (6):
mm: Add dynamic memory consistency model switching
uapi: prctl: Add new prctl call to set/get the memory consistency
model
RISC-V: Enable dynamic memory consistency model support with Ssdtso
RISC-V: Implement prctl call to set/get the memory consistency model
RISC-V: Expose Ssdtso via hwprobe API
RISC-V: selftests: Add DTSO tests
Documentation/arch/riscv/hwprobe.rst | 3 +
.../mm/dynamic-memory-consistency-model.rst | 86 ++++++++++++++++
Documentation/mm/index.rst | 1 +
arch/Kconfig | 14 +++
arch/riscv/Kconfig | 11 +++
arch/riscv/include/asm/csr.h | 1 +
arch/riscv/include/asm/dtso.h | 97 +++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/processor.h | 7 ++
arch/riscv/include/asm/switch_to.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 3 +
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/dtso.c | 67 +++++++++++++
arch/riscv/kernel/sys_hwprobe.c | 2 +
include/linux/sched.h | 5 +
include/uapi/linux/prctl.h | 5 +
kernel/sys.c | 12 +++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/dtso/.gitignore | 1 +
tools/testing/selftests/riscv/dtso/Makefile | 11 +++
tools/testing/selftests/riscv/dtso/dtso.c | 82 ++++++++++++++++
23 files changed, 416 insertions(+), 1 deletion(-)
create mode 100644 Documentation/mm/dynamic-memory-consistency-model.rst
create mode 100644 arch/riscv/include/asm/dtso.h
create mode 100644 arch/riscv/kernel/dtso.c
create mode 100644 tools/testing/selftests/riscv/dtso/.gitignore
create mode 100644 tools/testing/selftests/riscv/dtso/Makefile
create mode 100644 tools/testing/selftests/riscv/dtso/dtso.c
--
2.43.0
reuseport_addr_any.sh currently skips the DCCP tests and
pmtu.sh skips all the FOU/GUE-related cases: add the missing
config options.
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that this does not include the - still missing - OVS-related
option, so pmtu.sh will keep skipping such cases. Such tests
would still fail in the virtme environment even with the relevant
kernel options enabled, as they have a hard-to-solve dependency
on systemd/dbus.
The longer-term plan is to move such test cases into the openvswitch
directory. One short-term option to avoid skips in selftests results
while retaining the potential code coverage would be to disable the
ovs tests by default but keep them reachable via pmtu.sh command line
arguments.
---
tools/testing/selftests/net/config | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 3b749addd364..54d21e2911a9 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -13,6 +13,10 @@ CONFIG_IPV6=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_VETH=y
CONFIG_NET_IPVTI=y
+CONFIG_NET_FOU=y
+CONFIG_NET_FOU_IP_TUNNELS=y
+CONFIG_NET_IPIP=y
+CONFIG_IPV6_SIT=y
CONFIG_IPV6_VTI=y
CONFIG_DUMMY=y
CONFIG_BRIDGE_VLAN_FILTERING=y
@@ -24,6 +28,7 @@ CONFIG_IFB=y
CONFIG_INET_DIAG=y
CONFIG_INET_ESP=y
CONFIG_INET_ESP_OFFLOAD=y
+CONFIG_IP_DCCP=m
CONFIG_IP_GRE=m
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
--
2.43.0
Here's a follow-up from my RFC series last year:
https://lore.kernel.org/lkml/20221004093131.40392-1-thuth@redhat.com/T/
and from v1 earlier this year:
https://lore.kernel.org/kvm/20230712075910.22480-1-thuth@redhat.com/
The basic idea of this series is now to use the kselftest_harness.h
framework to get TAP output in the tests, so that it is easier
for the user to see what is going on, and e.g. to be able to
detect whether a certain test is part of the test binary or not
(which is useful when tests get extended in the course of time).
v2:
- Dropped the "Rename the ASSERT_EQ macro" patch (already merged)
- Split the fixes in the sync_regs_test into separate patches
(see the first two patches)
- Introduce the KVM_ONE_VCPU_TEST_SUITE() macro as suggested
by Sean (see third patch) and use it in the following patches
- Add a new patch to convert vmx_pmu_caps_test.c, too
Thomas Huth (7):
KVM: selftests: x86: sync_regs_test: Use vcpu_run() where appropriate
KVM: selftests: x86: sync_regs_test: Get regs structure before
modifying it
KVM: selftests: Add a macro to define a test with one vcpu
KVM: selftests: x86: Use TAP interface in the sync_regs test
KVM: selftests: x86: Use TAP interface in the fix_hypercall test
KVM: selftests: x86: Use TAP interface in the vmx_pmu_caps test
KVM: selftests: x86: Use TAP interface in the userspace_msr_exit test
.../selftests/kvm/include/kvm_test_harness.h | 35 +++++
.../selftests/kvm/x86_64/fix_hypercall_test.c | 27 ++--
.../selftests/kvm/x86_64/sync_regs_test.c | 121 +++++++++++++-----
.../kvm/x86_64/userspace_msr_exit_test.c | 19 +--
.../selftests/kvm/x86_64/vmx_pmu_caps_test.c | 50 ++------
5 files changed, 160 insertions(+), 92 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/kvm_test_harness.h
--
2.41.0
On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> instructions.
> What they have is the *semantics* of the instructions, but no actual opcodes
> to encode them.
> I argued for them in the RISC-V memory group, but it was considered to be
> outside the scope of that group.
(Sorry for the late, late reply; just recalled this thread...)
That's right. AFAICT, the discussion about the native load-acquire
and store-release instructions was revived somewhere last year within
the RVI community, culminating in the so called Zalasr-proposal [1];
Brendan, Hans and Andrew (+ Cc) might be able to provide more up-to-
date information about the status/plans for that proposal.
(Remark that RISC-V did introduce LR/SC and AMO instructions with
acquire/release semantics separately, cf. the so-called A-extension.)
Andrea
[1] https://github.com/mehnadnerd/riscv-zalasr
The gro self-tests send the packets to be aggregated with
multiple write operations.
When running in a slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue by explicitly ignoring failures for that case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
I experimented with a largish refactor replacing the multiple writes
with a single GSO packet, but exhausted my time budget before reaching
any good result.
---
tools/testing/selftests/net/gro.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..114b5281a3f5 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,10 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [ ${test} == "large" -a -n "${KSFT_MACHINE_SLOW}" ]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0
If KUnit is built as a module, and it's unloaded, the kunit_bus is not
unregistered. This causes an error if it's then re-loaded later, as we
try to re-register the bus.
Unregister the bus and root_device on shutdown, if it looks valid.
In addition, be more specific about the value of kunit_bus_device. It
is:
- a valid struct device* if the kunit_bus initialised correctly.
- an ERR_PTR if it failed to initialise.
- NULL before initialisation and after shutdown.
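The three pointer states listed above can be illustrated with a small user-space sketch. It re-implements the idea behind the kernel's ERR_PTR()/IS_ERR_OR_NULL() helpers under assumed names (err_ptr(), is_err_or_null(), classify_bus_device()); it is not the code in lib/kunit/device.c, only a model of why a single IS_ERR_OR_NULL() check is enough to skip both the "failed" and the "never initialised / already shut down" cases.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bound the kernel uses: the top 4095 addresses encode errnos. */
#define MAX_ERRNO 4095

/* Hypothetical stand-in for ERR_PTR(): encode a negative errno. */
static inline void *err_ptr(long error)
{
	return (void *)error;
}

/* Hypothetical stand-in for IS_ERR_OR_NULL(). */
static inline int is_err_or_null(const void *p)
{
	return !p || (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

enum bus_state {
	BUS_UNINITIALISED,	/* NULL: before init or after shutdown */
	BUS_FAILED,		/* error pointer: init failed */
	BUS_VALID,		/* real device pointer */
};

/* Map a pointer value to one of the three documented states. */
enum bus_state classify_bus_device(const void *dev)
{
	if (!dev)
		return BUS_UNINITIALISED;
	if ((uintptr_t)dev >= (uintptr_t)-MAX_ERRNO)
		return BUS_FAILED;
	return BUS_VALID;
}
```

Only BUS_VALID requires unregistering anything, which is why resetting the pointer to NULL after shutdown makes a second shutdown (or a later reload) safe.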
Fixes: d03c720e03bd ("kunit: Add APIs for managing devices")
Signed-off-by: David Gow <davidgow(a)google.com>
---
This will hopefully resolve some of the issues linked to from:
https://lore.kernel.org/intel-gfx/DM4PR11MB614179CB9C387842D8E8BB40B97C2@DM…
---
lib/kunit/device-impl.h | 2 ++
lib/kunit/device.c | 14 ++++++++++++++
lib/kunit/test.c | 3 +++
3 files changed, 19 insertions(+)
diff --git a/lib/kunit/device-impl.h b/lib/kunit/device-impl.h
index 54bd55836405..5fcd48ff0f36 100644
--- a/lib/kunit/device-impl.h
+++ b/lib/kunit/device-impl.h
@@ -13,5 +13,7 @@
// For internal use only -- registers the kunit_bus.
int kunit_bus_init(void);
+// For internal use only -- unregisters the kunit_bus.
+void kunit_bus_shutdown(void);
#endif //_KUNIT_DEVICE_IMPL_H
diff --git a/lib/kunit/device.c b/lib/kunit/device.c
index 074c6dd2e36a..644a38a1f5b1 100644
--- a/lib/kunit/device.c
+++ b/lib/kunit/device.c
@@ -54,6 +54,20 @@ int kunit_bus_init(void)
return error;
}
+/* Unregister the 'kunit_bus' in case the KUnit module is unloaded. */
+void kunit_bus_shutdown(void)
+{
+ /* Make sure the bus exists before we unregister it. */
+ if (IS_ERR_OR_NULL(kunit_bus_device))
+ return;
+
+ bus_unregister(&kunit_bus_type);
+
+ root_device_unregister(kunit_bus_device);
+
+ kunit_bus_device = NULL;
+}
+
/* Release a 'fake' KUnit device. */
static void kunit_device_release(struct device *d)
{
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 31a5a992e646..1d1475578515 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -928,6 +928,9 @@ static void __exit kunit_exit(void)
#ifdef CONFIG_MODULES
unregister_module_notifier(&kunit_mod_nb);
#endif
+
+ kunit_bus_shutdown();
+
kunit_debugfs_cleanup();
}
module_exit(kunit_exit);
--
2.43.0.429.g432eaa2c6b-goog
Other mechanisms for querying the peak memory usage of either a process
or v1 memory cgroup allow for resetting the high watermark. Restore
parity with those mechanisms.
For example:
- Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
the high watermark.
- Writing "5" to the clear_refs pseudo-file in a process's proc
directory resets the peak RSS.
This change copies the cgroup v1 behavior so any write to the
memory.peak and memory.swap.peak pseudo-files resets the high watermark
to the current usage.
This behavior is particularly useful for work scheduling systems that
need to track memory usage of worker processes/cgroups per-work-item.
Since memory can't be squeezed like CPU can (the OOM-killer has
opinions), these systems need to track the peak memory usage to compute
system/container fullness when binpacking workitems.
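A work scheduler could use the new behavior roughly as sketched below: read the peak after a work item finishes, then write to the same file to start the next measurement from the current usage. The helper names (read_long(), reset_peak()) are hypothetical, and the caller is assumed to pass the full path to a memory.peak or memory.swap.peak file in its own cgroup hierarchy.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a single-value file such as memory.peak; returns -1 on error. */
long read_long(const char *path)
{
	FILE *fp = fopen(path, "r");
	long val = -1;

	if (!fp)
		return -1;
	if (fscanf(fp, "%ld", &val) != 1)
		val = -1;
	fclose(fp);
	return val;
}

/*
 * Reset a peak file by writing a single newline. With this patch the
 * written content is ignored; any non-empty write resets the watermark.
 * Returns 0 on success, -1 on error.
 */
int reset_peak(const char *path)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "\n", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

On kernels without this change, memory.peak is read-only, so reset_peak() would be expected to fail and the caller has to fall back to creating a fresh cgroup per work item.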
Signed-off-by: David Finkel <davidf(a)vimeo.com>
---
Documentation/admin-guide/cgroup-v2.rst | 20 +++---
mm/memcontrol.c | 23 ++++++
.../selftests/cgroup/test_memcontrol.c | 72 ++++++++++++++++---
3 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..95af0628dc44 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1305,11 +1305,13 @@ PAGE_SIZE multiple when read back.
reclaim induced by memory.reclaim.
memory.peak
- A read-only single value file which exists on non-root
- cgroups.
+ A read-write single value file which exists on non-root cgroups.
+
+ The max memory usage recorded for the cgroup and its descendants since
+ either the creation of the cgroup or the most recent reset.
- The max memory usage recorded for the cgroup and its
- descendants since the creation of the cgroup.
+ Any non-empty write to this file resets it to the current memory usage.
+ All content written is completely ignored.
memory.oom.group
A read-write single value file which exists on non-root
@@ -1626,11 +1628,13 @@ PAGE_SIZE multiple when read back.
Healthy workloads are not expected to reach this limit.
memory.swap.peak
- A read-only single value file which exists on non-root
- cgroups.
+ A read-write single value file which exists on non-root cgroups.
+
+ The max swap usage recorded for the cgroup and its descendants since
+ the creation of the cgroup or the most recent reset.
- The max swap usage recorded for the cgroup and its
- descendants since the creation of the cgroup.
+ Any non-empty write to this file resets it to the current swap usage.
+ All content written is completely ignored.
memory.swap.max
A read-write single value file which exists on non-root
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c1061df9cd1..b04af158922d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,6 +25,7 @@
* Copyright (C) 2020 Alibaba, Inc, Alex Shi
*/
+#include <linux/cgroup-defs.h>
#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
@@ -6635,6 +6636,16 @@ static u64 memory_peak_read(struct cgroup_subsys_state *css,
return (u64)memcg->memory.watermark * PAGE_SIZE;
}
+static ssize_t memory_peak_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ page_counter_reset_watermark(&memcg->memory);
+
+ return nbytes;
+}
+
static int memory_min_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -6947,6 +6958,7 @@ static struct cftype memory_files[] = {
.name = "peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_peak_read,
+ .write = memory_peak_write,
},
{
.name = "min",
@@ -7917,6 +7929,16 @@ static u64 swap_peak_read(struct cgroup_subsys_state *css,
return (u64)memcg->swap.watermark * PAGE_SIZE;
}
+static ssize_t swap_peak_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ page_counter_reset_watermark(&memcg->swap);
+
+ return nbytes;
+}
+
static int swap_high_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
@@ -7999,6 +8021,7 @@ static struct cftype swap_files[] = {
.name = "swap.peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = swap_peak_read,
+ .write = swap_peak_write,
},
{
.name = "swap.events",
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c7c9572003a8..0326c317f1f2 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -161,12 +161,12 @@ static int alloc_pagecache_50M_check(const char *cgroup, void *arg)
/*
* This test create a memory cgroup, allocates
* some anonymous memory and some pagecache
- * and check memory.current and some memory.stat values.
+ * and checks memory.current, memory.peak, and some memory.stat values.
*/
-static int test_memcg_current(const char *root)
+static int test_memcg_current_peak(const char *root)
{
int ret = KSFT_FAIL;
- long current;
+ long current, peak, peak_reset;
char *memcg;
memcg = cg_name(root, "memcg_test");
@@ -180,12 +180,32 @@ static int test_memcg_current(const char *root)
if (current != 0)
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak != 0)
+ goto cleanup;
+
if (cg_run(memcg, alloc_anon_50M_check, NULL))
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(50))
+ goto cleanup;
+
+ peak_reset = cg_write(memcg, "memory.peak", "\n");
+ if (peak_reset != 0)
+ goto cleanup;
+
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak > MB(30))
+ goto cleanup;
+
if (cg_run(memcg, alloc_pagecache_50M_check, NULL))
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(50))
+ goto cleanup;
+
ret = KSFT_PASS;
cleanup:
@@ -815,13 +835,14 @@ static int alloc_anon_50M_check_swap(const char *cgroup, void *arg)
/*
* This test checks that memory.swap.max limits the amount of
- * anonymous memory which can be swapped out.
+ * anonymous memory which can be swapped out. Additionally, it verifies that
+ * memory.swap.peak reflects the high watermark and can be reset.
*/
-static int test_memcg_swap_max(const char *root)
+static int test_memcg_swap_max_peak(const char *root)
{
int ret = KSFT_FAIL;
char *memcg;
- long max;
+ long max, peak;
if (!is_swap_enabled())
return KSFT_SKIP;
@@ -838,6 +859,12 @@ static int test_memcg_swap_max(const char *root)
goto cleanup;
}
+ if (cg_read_long(memcg, "memory.swap.peak"))
+ goto cleanup;
+
+ if (cg_read_long(memcg, "memory.peak"))
+ goto cleanup;
+
if (cg_read_strcmp(memcg, "memory.max", "max\n"))
goto cleanup;
@@ -860,6 +887,27 @@ static int test_memcg_swap_max(const char *root)
if (cg_read_key_long(memcg, "memory.events", "oom_kill ") != 1)
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(29))
+ goto cleanup;
+
+ peak = cg_read_long(memcg, "memory.swap.peak");
+ if (peak < MB(29))
+ goto cleanup;
+
+ if (cg_write(memcg, "memory.swap.peak", "\n"))
+ goto cleanup;
+
+ if (cg_read_long(memcg, "memory.swap.peak") > MB(10))
+ goto cleanup;
+
+
+ if (cg_write(memcg, "memory.peak", "\n"))
+ goto cleanup;
+
+ if (cg_read_long(memcg, "memory.peak"))
+ goto cleanup;
+
if (cg_run(memcg, alloc_anon_50M_check_swap, (void *)MB(30)))
goto cleanup;
@@ -867,6 +915,14 @@ static int test_memcg_swap_max(const char *root)
if (max <= 0)
goto cleanup;
+ peak = cg_read_long(memcg, "memory.peak");
+ if (peak < MB(29))
+ goto cleanup;
+
+ peak = cg_read_long(memcg, "memory.swap.peak");
+ if (peak < MB(19))
+ goto cleanup;
+
ret = KSFT_PASS;
cleanup:
@@ -1293,7 +1349,7 @@ struct memcg_test {
const char *name;
} tests[] = {
T(test_memcg_subtree_control),
- T(test_memcg_current),
+ T(test_memcg_current_peak),
T(test_memcg_min),
T(test_memcg_low),
T(test_memcg_high),
@@ -1301,7 +1357,7 @@ struct memcg_test {
T(test_memcg_max),
T(test_memcg_reclaim),
T(test_memcg_oom_events),
- T(test_memcg_swap_max),
+ T(test_memcg_swap_max_peak),
T(test_memcg_sock),
T(test_memcg_oom_group_leaf_events),
T(test_memcg_oom_group_parent_events),
--
2.39.2
Continue the DAMON selftests' test coverage improvement work, along with
a trivial improvement of the test code itself. The sequence of the patches
in this patchset is as follows.
The first five patches add two tests for DAMON core functionality. They
begin with three patches (patches 1-3) that update the test-purpose
DAMON sysfs interface wrapper to support DAMOS quota, stats, and apply
interval features, respectively. The fourth patch implements and adds a
selftest for DAMOS quota feature, using the DAMON sysfs interface
wrapper's newly added support of the quota and the stats feature. The
fifth patch further implements and adds a selftest for DAMOS apply
interval using the DAMON sysfs interface wrapper's newly added support
of the apply interval and the stats feature.
Two patches (patches 6 and 7) that implement and add two corner-case
handling selftests follow. They aim to keep two previously fixed bugs
from recurring.
Finally, a patch that makes the DAMON debugfs selftests' dependency
checker use /proc/mounts instead of a hard-coded mount point assumption
follows.
SeongJae Park (8):
selftests/damon/_damon_sysfs: support DAMOS quota
selftests/damon/_damon_sysfs: support DAMOS stats
selftests/damon/_damon_sysfs: support DAMOS apply interval
selftests/damon: add a test for DAMOS quota
selftests/damon: add a test for DAMOS apply intervals
selftests/damon: add a test for a race between target_ids_read() and
dbgfs_before_terminate()
selftests/damon: add a test for the pid leak of
dbgfs_target_ids_write()
selftests/damon/_chk_dependency: get debugfs mount point from
/proc/mounts
tools/testing/selftests/damon/.gitignore | 2 +
tools/testing/selftests/damon/Makefile | 5 ++
.../selftests/damon/_chk_dependency.sh | 9 ++-
tools/testing/selftests/damon/_damon_sysfs.py | 77 ++++++++++++++++--
.../selftests/damon/damos_apply_interval.py | 67 ++++++++++++++++
tools/testing/selftests/damon/damos_quota.py | 67 ++++++++++++++++
.../damon/debugfs_target_ids_pid_leak.c | 68 ++++++++++++++++
.../damon/debugfs_target_ids_pid_leak.sh | 22 +++++
...fs_target_ids_read_before_terminate_race.c | 80 +++++++++++++++++++
...s_target_ids_read_before_terminate_race.sh | 14 ++++
10 files changed, 403 insertions(+), 8 deletions(-)
create mode 100755 tools/testing/selftests/damon/damos_apply_interval.py
create mode 100755 tools/testing/selftests/damon/damos_quota.py
create mode 100644 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.c
create mode 100755 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh
create mode 100644 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.c
create mode 100755 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.sh
base-commit: f51e629727d8cc526a3156a2c80489b8f050410f
--
2.39.2
cmsg_ipv6 test requests tcpdump to capture 4 packets,
and sends until tcpdump quits. Only the first packet
is "real", however, and the rest are basic UDP packets.
So if tcpdump doesn't start in time it will miss
the real packet and only capture the UDP ones.
This makes the test fail on slow machines (no KVM or with
debug enabled) 100% of the time, while it passes in fast
environments.
Repeat the "real" / expected packet.
Fixes: 9657ad09e1fa ("selftests: net: test IPV6_TCLASS")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/cmsg_ipv6.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/cmsg_ipv6.sh b/tools/testing/selftests/net/cmsg_ipv6.sh
index f30bd57d5e38..8bc23fb4c82b 100755
--- a/tools/testing/selftests/net/cmsg_ipv6.sh
+++ b/tools/testing/selftests/net/cmsg_ipv6.sh
@@ -89,7 +89,7 @@ for ovr in setsock cmsg both diff; do
check_result $? 0 "TCLASS $prot $ovr - pass"
while [ -d /proc/$BG ]; do
- $NSEXE ./cmsg_sender -6 -p u $TGT6 1234
+ $NSEXE ./cmsg_sender -6 -p $p $m $((TOS2)) $TGT6 1234
done
tcpdump -r $TMPF -v 2>&1 | grep "class $TOS2" >> /dev/null
@@ -126,7 +126,7 @@ for ovr in setsock cmsg both diff; do
check_result $? 0 "HOPLIMIT $prot $ovr - pass"
while [ -d /proc/$BG ]; do
- $NSEXE ./cmsg_sender -6 -p u $TGT6 1234
+ $NSEXE ./cmsg_sender -6 -p $p $m $LIM $TGT6 1234
done
tcpdump -r $TMPF -v 2>&1 | grep "hlim $LIM[^0-9]" >> /dev/null
--
2.43.0
Add specification for test metadata to the KTAP v2 spec.
KTAP v1 only specifies the output format of very basic test information:
test result and test name. Any additional test information either gets
added to general diagnostic data or is not included in the output at all.
The purpose of KTAP metadata is to create a framework to include and
easily identify additional important test information in KTAP.
KTAP metadata could include any test information that is pertinent for
user interaction before or after the running of the test. For example,
the test file path or the test speed.
Since this includes a large variety of information, this specification
will recognize notable types of KTAP metadata to ensure consistent format
across test frameworks. See the full list of types in the specification.
Example of KTAP Metadata:
KTAP version 2
# ktap_test: main
# ktap_arch: uml
1..1
KTAP version 2
# ktap_test: suite_1
# ktap_subsystem: example
# ktap_test_file: lib/test.c
1..2
ok 1 test_1
# ktap_test: test_2
# ktap_speed: very_slow
# custom_is_flaky: true
ok 2 test_2
ok 1 suite_1
The changes to the KTAP specification outline the format, location, and
different types of metadata.
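To give a feel for how a consumer might recognize such lines, here is a minimal C sketch that splits a "# <prefix>_<metadata type>: <metadata value>" line into its three parts. It is not the KUnit parser linked below; struct ktap_meta and ktap_parse_meta() are hypothetical names, the buffer sizes are arbitrary, and a single space after the colon is assumed, as in the examples.

```c
#include <stdio.h>
#include <string.h>

struct ktap_meta {
	char prefix[32];	/* e.g. "ktap", "kselftest", "custom" */
	char type[64];		/* e.g. "speed", "is_flaky" */
	char value[256];	/* rest of the line */
};

/*
 * Parse one metadata line; leading indentation (nesting) is skipped.
 * Returns 0 on success, -1 if the line is not a metadata line.
 */
int ktap_parse_meta(const char *line, struct ktap_meta *m)
{
	const char *us, *colon;
	size_t n;

	line += strspn(line, " \t");
	if (strncmp(line, "# ", 2) != 0)
		return -1;
	line += 2;

	/* The prefix ends at the first '_' and has no ' ' or ':'. */
	us = strchr(line, '_');
	if (!us || us == line)
		return -1;
	n = (size_t)(us - line);
	if (n >= sizeof(m->prefix) || strcspn(line, " :") < n)
		return -1;
	memcpy(m->prefix, line, n);
	m->prefix[n] = '\0';

	/* The type runs up to the ':' and contains no spaces. */
	colon = strchr(us + 1, ':');
	if (!colon || colon == us + 1 || colon[1] != ' ')
		return -1;
	n = (size_t)(colon - (us + 1));
	if (n >= sizeof(m->type) || strcspn(us + 1, " ") < n)
		return -1;
	memcpy(m->type, us + 1, n);
	m->type[n] = '\0';

	/* The value is everything after ": ", minus the line ending. */
	snprintf(m->value, sizeof(m->value), "%s", colon + 2);
	m->value[strcspn(m->value, "\r\n")] = '\0';
	return 0;
}
```

Because the prefix may not contain "_", splitting at the first underscore is unambiguous even for types such as "is_flaky"; ordinary diagnostic lines like "# suite_1 passed" are rejected because they lack the colon.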
Here is a link to a version of the KUnit parser that is able to parse test
metadata lines for KTAP version 2. Note this includes test metadata
lines for the main level of KTAP.
Link: https://kunit-review.googlesource.com/c/linux/+/5889
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Documentation/dev-tools/ktap.rst | 163 ++++++++++++++++++++++++++++++-
1 file changed, 159 insertions(+), 4 deletions(-)
diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
index ff77f4aaa6ef..4480eaf5bbc3 100644
--- a/Documentation/dev-tools/ktap.rst
+++ b/Documentation/dev-tools/ktap.rst
@@ -17,19 +17,20 @@ KTAP test results describe a series of tests (which may be nested: i.e., test
can have subtests), each of which can contain both diagnostic data -- e.g., log
lines -- and a final result. The test structure and results are
machine-readable, whereas the diagnostic data is unstructured and is there to
-aid human debugging.
+aid human debugging. One exception to this is test metadata lines - a type
+of diagnostic lines. Test metadata is used to identify important supplemental
+test information and can be machine-readable.
KTAP output is built from four different types of lines:
- Version lines
- Plan lines
- Test case result lines
-- Diagnostic lines
+- Diagnostic lines (including test metadata)
In general, valid KTAP output should also form valid TAP output, but some
information, in particular nested test results, may be lost. Also note that
there is a stagnant draft specification for TAP14, KTAP diverges from this in
-a couple of places (notably the "Subtest" header), which are described where
-relevant later in this document.
+a couple of places, which are described where relevant later in this document.
Version lines
-------------
@@ -166,6 +167,154 @@ even if they do not start with a "#": this is to capture any other useful
kernel output which may help debug the test. It is nevertheless recommended
that tests always prefix any diagnostic output they have with a "#" character.
+KTAP metadata lines
+-------------------
+
+KTAP metadata lines are a subset of diagnostic lines that are used to include
+and easily identify important supplemental test information in KTAP.
+
+.. code-block:: none
+
+ # <prefix>_<metadata type>: <metadata value>
+
+The <prefix> indicates where to find the specification for the type of
+metadata. The metadata types listed below use the prefix "ktap" (See Types of
+KTAP Metadata).
+
+Types that are instead specified by an individual test framework use the
+framework name as the prefix. For example, a metadata type documented by the
+kselftest specification would use the prefix "kselftest". Any metadata type
+that is not listed in a specification must use the prefix "custom". Note the
+prefix must not include spaces or the characters ":" or "_".
+
+The format of <metadata type> and <value> varies based on the type. See the
+individual specification. For "custom" types the <metadata type> can be any
+string excluding ":", spaces, or newline characters and the <value> can be any
+string.
+
+**Location:**
+
+The first KTAP metadata entry for a test must be "# ktap_test: <test name>",
+which acts as a header to associate metadata with the correct test.
+
+For test cases, the location of the metadata is between the prior test result
+line and the current test result line. For test suites, the location of the
+metadata is between the suite's version line and test plan line. See the
+example below.
+
+KTAP metadata for a test does not need to be contiguous. For example, a kernel
+warning or other diagnostic output could interrupt metadata lines. However, it
+is recommended to keep a test's metadata lines together when possible, as this
+improves readability.
+
+**Here is an example of using KTAP metadata:**
+
+::
+
+ KTAP version 2
+ # ktap_test: main
+ # ktap_arch: uml
+ 1..1
+ KTAP version 2
+ # ktap_test: suite_1
+ # ktap_subsystem: example
+ # ktap_test_file: lib/test.c
+ 1..2
+ ok 1 test_1
+ # ktap_test: test_2
+ # ktap_speed: very_slow
+ # custom_is_flaky: true
+ ok 2 test_2
+ # suite_1 passed
+ ok 1 suite_1
+
+In this example, the tests are running on UML. The test suite "suite_1" is part
+of the subsystem "example" and belongs to the file "lib/test.c". It has
+two subtests, "test_1" and "test_2". The subtest "test_2" has a speed of
+"very_slow" and has been marked with a custom KTAP metadata type called
+"custom_is_flaky" with the value of "true".
+
+**Types of KTAP Metadata:**
+
+This is the current list of KTAP metadata types recognized in this
+specification. Note that all of these metadata types are optional (except for
+ktap_test as the KTAP metadata header).
+
+- ``ktap_test``: Name of test (used as header of KTAP metadata). This should
+ match the test name printed in the test result line: "ok 1 [test_name]".
+
+- ``ktap_module``: Name of the module containing the test
+
+- ``ktap_subsystem``: Name of the subsystem being tested
+
+- ``ktap_start_time``: Time tests started in ISO8601 format
+
+ - Example: "# ktap_start_time: 2024-01-09T13:09:01.990000+00:00"
+
+- ``ktap_duration``: Time taken (in seconds) to execute the test
+
+ - Example: "# ktap_duration: 10.154s"
+
+- ``ktap_speed``: Category of how fast test runs: "normal", "slow", or
+ "very_slow"
+
+- ``ktap_test_file``: Path to source file containing the test. This metadata
+ line can be repeated if the test is spread across multiple files.
+
+ - Example: "# ktap_test_file: lib/test.c"
+
+- ``ktap_generated_file``: Description of and path to file generated during
+ test execution. This could be a core dump, generated filesystem image, some
+ form of visual output (for graphics drivers), etc. This metadata line can be
+ repeated to attach multiple files to the test.
+
+ - Example: "# ktap_generated_file: Core dump: /var/lib/systemd/coredump/hello.core"
+
+- ``ktap_log_file``: Path to file containing kernel log test output
+
+ - Example: "# ktap_log_file: /sys/kernel/debugfs/kunit/example/results"
+
+- ``ktap_error_file``: Path to file containing context for test failure or
+ error. This could include the difference between optimal test output and
+ actual test output.
+
+ - Example: "# ktap_error_file: fs/results/example.out.bad"
+
+- ``ktap_results_url``: Link to webpage describing this test run and its
+ results
+
+ - Example: "# ktap_results_url: https://kcidb.kernelci.org/hello"
+
+- ``ktap_arch``: Architecture used during test run
+
+ - Example: "# ktap_arch: x86_64"
+
+- ``ktap_compiler``: Compiler used during test run
+
+ - Example: "# ktap_compiler: gcc (GCC) 10.1.1 20200507 (Red Hat 10.1.1-1)"
+
+- ``ktap_repository_url``: Link to git repository of the checked out code.
+
+ - Example: "# ktap_repository_url: https://github.com/torvalds/linux.git"
+
+- ``ktap_git_branch``: Name of git branch of checked out code
+
+ - Example: "# ktap_git_branch: kselftest/kunit"
+
+- ``ktap_kernel_version``: Version of Linux Kernel being used during test run
+
+ - Example: "# ktap_kernel_version: 6.7-rc1"
+
+- ``ktap_commit_hash``: The full git commit hash of the checked out base code.
+
+ - Example: "# ktap_commit_hash: 064725faf8ec2e6e36d51e22d3b86d2707f0f47f"
+
+**Other Metadata Types:**
+
+There can also be KTAP metadata that is not included in the recognized list
+above. This metadata must be prefixed with the test framework, e.g. "kselftest",
+or with the prefix "custom". For example, "# custom_batch: 20".
+
Unknown lines
-------------
@@ -206,6 +355,7 @@ An example of a test with two nested subtests:
KTAP version 2
1..1
KTAP version 2
+ # ktap_test: example
1..2
ok 1 test_1
not ok 2 test_2
@@ -219,6 +369,7 @@ An example format with multiple levels of nested testing:
KTAP version 2
1..2
KTAP version 2
+ # ktap_test: example_test_1
1..2
KTAP version 2
1..2
@@ -254,6 +405,7 @@ Example KTAP output
KTAP version 2
1..1
KTAP version 2
+ # ktap_test: main_test
1..3
KTAP version 2
1..1
@@ -261,11 +413,14 @@ Example KTAP output
ok 1 test_1
ok 1 example_test_1
KTAP version 2
+ # ktap_test: example_test_2
+ # ktap_speed: slow
1..2
ok 1 test_1 # SKIP test_1 skipped
ok 2 test_2
ok 2 example_test_2
KTAP version 2
+ # ktap_test: example_test_3
1..3
ok 1 test_1
# test_2: FAIL
base-commit: 906f02e42adfbd5ae70d328ee71656ecb602aaf5
--
2.43.0.429.g432eaa2c6b-goog
The seccomp benchmark test (for validating the benefit of bitmaps) can
be sensitive to scheduling speed, so pin the process to a single CPU,
which appears to significantly improve reliability, and loosen the
"close enough" checking to allow up to 10% variance instead of 1%.
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202402061002.3a8722fd-oliver.sang@intel.com
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Will Drewry <wad(a)chromium.org>
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
v2:
- improve comment about selecting CPU (broonie)
- loosen variance check from 1% to 10%
v1: https://lore.kernel.org/all/20240206095642.work.502-kees@kernel.org/
---
.../selftests/seccomp/seccomp_benchmark.c | 38 ++++++++++++++++++-
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 5b5c9d558dee..9d7aa5a730e0 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,7 +4,9 @@
*/
#define _GNU_SOURCE
#include <assert.h>
+#include <err.h>
#include <limits.h>
+#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
@@ -76,8 +78,12 @@ unsigned long long calibrate(void)
bool approx(int i_one, int i_two)
{
- double one = i_one, one_bump = one * 0.01;
- double two = i_two, two_bump = two * 0.01;
+ /*
+ * This continues to be a noisy test. Instead of a 1% comparison
+ * go with 10%.
+ */
+ double one = i_one, one_bump = one * 0.1;
+ double two = i_two, two_bump = two * 0.1;
one_bump = one + MAX(one_bump, 2.0);
two_bump = two + MAX(two_bump, 2.0);
@@ -119,6 +125,32 @@ long compare(const char *name_one, const char *name_eval, const char *name_two,
return good ? 0 : 1;
}
+/* Pin to a single CPU so the benchmark won't bounce around the system. */
+void affinity(void)
+{
+ long cpu;
+ ulong ncores = sysconf(_SC_NPROCESSORS_CONF);
+ cpu_set_t *setp = CPU_ALLOC(ncores);
+ ulong setsz = CPU_ALLOC_SIZE(ncores);
+
+ /*
+ * Totally unscientific way to avoid CPUs that might be busier:
+ * choose the highest CPU instead of the lowest.
+ */
+ for (cpu = ncores - 1; cpu >= 0; cpu--) {
+ CPU_ZERO_S(setsz, setp);
+ CPU_SET_S(cpu, setsz, setp);
+ if (sched_setaffinity(getpid(), setsz, setp) == -1)
+ continue;
+ printf("Pinned to CPU %lu of %lu\n", cpu + 1, ncores);
+ goto out;
+ }
+ fprintf(stderr, "Could not set CPU affinity -- calibration may not work well\n");
+
+out:
+ CPU_FREE(setp);
+}
+
int main(int argc, char *argv[])
{
struct sock_filter bitmap_filter[] = {
@@ -153,6 +185,8 @@ int main(int argc, char *argv[])
system("grep -H . /proc/sys/net/core/bpf_jit_enable");
system("grep -H . /proc/sys/net/core/bpf_jit_harden");
+ affinity();
+
if (argc > 1)
samples = strtoull(argv[1], NULL, 0);
else
--
2.34.1
I have been working on this steadily but struggled to find a cleanly
integrated way to implement a tty frontend, until Guilherme pointed out
that multi-backend and tty frontend are actually two separate entities.
This submission is the second iteration of my efforts. Notable changes
from v1:
1. pstore.backend no longer acts as "registered backend", but "backends
eligible for registration".
2. drop subdir since it would break user space
3. drop tty frontend since I haven't yet devised a satisfactory
implementation strategy
A heartfelt thank you to Kees and Guilherme for your suggestions.
I firmly believe that a tty frontend is crucial for kdump debugging,
and I am still working on one. I hope to complete it in the future
with a deeper understanding of the tty driver :)
Yuanhe Shu (3):
pstore: add multi-backend support
Documentation: adjust pstore backend related document
tools/testing: adjust pstore backend related selftest
Documentation/ABI/testing/pstore | 8 +-
.../admin-guide/kernel-parameters.txt | 4 +-
fs/pstore/ftrace.c | 29 ++-
fs/pstore/inode.c | 19 +-
fs/pstore/internal.h | 4 +-
fs/pstore/platform.c | 225 ++++++++++++------
fs/pstore/pmsg.c | 24 +-
include/linux/pstore.h | 29 +++
tools/testing/selftests/pstore/common_tests | 8 +-
.../selftests/pstore/pstore_post_reboot_tests | 65 ++---
tools/testing/selftests/pstore/pstore_tests | 2 +-
11 files changed, 293 insertions(+), 124 deletions(-)
--
2.39.3
From: Willem de Bruijn <willemb(a)google.com>
This test is time sensitive. It may fail on virtual machines and for
debug builds.
Continue to run in these environments to get code coverage. But
optionally suppress failure for timing errors (only). This is
controlled with environment variable KSFT_MACHINE_SLOW.
The test continues to return 0 (KSFT_PASS), rather than KSFT_XFAIL
as previously discussed. Making so_txtime.c return KSFT_XFAIL and
then making so_txtime.sh distinguish that result from KSFT_FAIL
and pass it on added a bunch of (fragile bash) boilerplate, while the
result is interpreted the same as KSFT_PASS anyway.
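The behaviour described above can be sketched as a small helper. This is
an illustration only: timing_check_status() is a hypothetical name that
does not exist in so_txtime.c, and it returns a status where the patch
calls exit(1) directly.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns the exit status a timing check should
 * produce. A timing error is always reported on stderr, but it only
 * counts as a failure when KSFT_MACHINE_SLOW is not set.
 */
static int timing_check_status(long long tstop, long long texpect,
			       int variance_us)
{
	if (llabs(tstop - texpect) <= variance_us)
		return 0;

	fprintf(stderr, "exceeds variance (%d us)\n", variance_us);
	return getenv("KSFT_MACHINE_SLOW") ? 0 : 1;
}
```

Only the timing error is suppressed this way; other errors, such as a
payload mismatch, would still fail the test.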
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/so_txtime.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/so_txtime.c b/tools/testing/selftests/net/so_txtime.c
index 2672ac0b6d1f..8457b7ccbc09 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -134,8 +134,11 @@ static void do_recv_one(int fdr, struct timed_send *ts)
if (rbuf[0] != ts->data)
error(1, 0, "payload mismatch. expected %c", ts->data);
- if (llabs(tstop - texpect) > cfg_variance_us)
- error(1, 0, "exceeds variance (%d us)", cfg_variance_us);
+ if (llabs(tstop - texpect) > cfg_variance_us) {
+ fprintf(stderr, "exceeds variance (%d us)\n", cfg_variance_us);
+ if (!getenv("KSFT_MACHINE_SLOW"))
+ exit(1);
+ }
}
static void do_recv_verify_empty(int fdr)
--
2.43.0.429.g432eaa2c6b-goog
Selftests here check not only that connect()/accept() for
TCP-AO/TCP-MD5/non-signed-TCP combinations do/don't establish
connections, but also counters: those are per-AO-key, per-socket and
per-netns.
The counters are checked on the server's side, as the server listener
has TCP-AO/TCP-MD5/no keys for different peers. All tests run in
the same namespaces with the same veth pair, created in test_init().
After close() in both client and server, the sides go through
the regular FIN/ACK + FIN/ACK sequence, which happens in the background.
If the selftest has already started a new testing scenario and read
the per-netns counters, it may fail in the end iff it doesn't expect
the TCPAOGood per-netns counters to go up during the test.
Let's just kill both TCP-AO sides - that will avoid any asynchronous
background TCP-AO segments going to either side.
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Closes: https://lore.kernel.org/all/20240201132153.4d68f45e@kernel.org/T/#u
Fixes: 6f0c472a6815 ("selftests/net: Add TCP-AO + TCP-MD5 + no sign listen socket tests")
Signed-off-by: Dmitry Safonov <dima(a)arista.com>
---
tools/testing/selftests/net/tcp_ao/unsigned-md5.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/net/tcp_ao/unsigned-md5.c b/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
index c5b568cd7d90..6b59a652159f 100644
--- a/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
+++ b/tools/testing/selftests/net/tcp_ao/unsigned-md5.c
@@ -110,9 +110,9 @@ static void try_accept(const char *tst_name, unsigned int port,
test_tcp_ao_counters_cmp(tst_name, &ao_cnt1, &ao_cnt2, cnt_expected);
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
if (sk > 0)
- close(sk);
+ test_kill_sk(sk);
}
static void server_add_routes(void)
@@ -302,10 +302,10 @@ static void try_connect(const char *tst_name, unsigned int port,
test_ok("%s: connected", tst_name);
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
/* _test_connect_socket() cleans up on failure */
if (ret > 0)
- close(sk);
+ test_kill_sk(sk);
}
#define PREINSTALL_MD5_FIRST BIT(0)
@@ -486,10 +486,10 @@ static void try_to_add(const char *tst_name, unsigned int port,
}
out:
- synchronize_threads(); /* close() */
+ synchronize_threads(); /* test_kill_sk() */
/* _test_connect_socket() cleans up on failure */
if (ret > 0)
- close(sk);
+ test_kill_sk(sk);
}
static void client_add_ip(union tcp_addr *client, const char *ip)
---
base-commit: 021533194476035883300d60fbb3136426ac8ea5
change-id: 20240202-unsigned-md5-netns-counters-35134409362a
Best regards,
--
Dmitry Safonov <dima(a)arista.com>
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with Commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that would validate if this feature
works correctly.
The selftest needs to verify if writing non-contiguous CBMs to the
schemata file behaves as expected in comparison to the information about
non-contiguous CBMs support.
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. The patch also implements functionality
similar to that of the bash script included in the cover letter
of the original non-contiguous CBMs in Intel CAT series [3].
Changelog v4:
- Changes to error failure return values in non-contiguous test.
- Some minor text refactoring without functional changes.
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 2 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 10 +-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 94 ++++++++++++++++---
7 files changed, 192 insertions(+), 24 deletions(-)
--
2.43.0
From: Jeff Xu <jeffxu(a)chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(). These can leave an empty space, which can
then be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
In addition: mmap() has two related changes.
The PROT_SEAL bit in prot field of mmap(). When present, it marks
the map sealed since creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.
Applications that don't care about sealing will see their behavior
unchanged. Those that need sealing support opt in by adding
MAP_SEALABLE in mmap().
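A minimal user-space sketch of calling mseal() might look like the
following. It rests on several assumptions: try_mseal() and seal_demo()
are hypothetical names, the syscall number is taken from the build
headers (__NR_mseal) rather than from this series, and the value of
MAP_SEALABLE is not known here, so the mapping is created without it. On
a kernel that requires MAP_SEALABLE as described above, sealing this
mapping is expected to fail and the demo reports 0.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: seal [addr, addr + len) if the build headers know
 * the mseal syscall number, otherwise fail with ENOSYS. The flags argument
 * is reserved and passed as 0.
 */
static int try_mseal(void *addr, size_t len)
{
#ifdef __NR_mseal
	return (int)syscall(__NR_mseal, addr, len, 0UL);
#else
	(void)addr;
	(void)len;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Map one anonymous read-only page and try to seal it. Returns 1 if the
 * page was sealed and a later mprotect() was refused, 0 if sealing was
 * not available or not permitted, and -1 on an unexpected result.
 */
static int seal_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, page, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS,
		       -1, 0);

	if (p == MAP_FAILED)
		return -1;
	if (try_mseal(p, page) != 0) {
		munmap(p, page);	/* not sealed, so unmapping still works */
		return 0;
	}
	/* A sealed mapping must not change its permission bits. */
	return mprotect(p, page, PROT_READ | PROT_WRITE) == -1 ? 1 : -1;
}
```

Note that a successfully sealed page cannot be unmapped afterwards, so
the demo deliberately leaves it mapped until the process exits.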
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.o…
V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 215 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/mman-common.h | 8 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/internal.h | 48 +
mm/madvise.c | 12 +
mm/mmap.c | 35 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 343 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 2024 +++++++++++++++++++
33 files changed, 2756 insertions(+), 3 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.429.g432eaa2c6b-goog
hugetlb_madv_vs_map selftest was not part of the mm test-suite since we
didn't have a fix for the problem it found.
Now that the problem is already fixed (see previous commit), let's
enable this selftest in the default test-suite.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f2..50e2094ed761 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -253,6 +253,7 @@ nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
echo 1 > /proc/sys/vm/nr_hugepages
CATEGORY="hugetlb" run_test ./hugetlb_fault_after_madv
+CATEGORY="hugetlb" run_test ./hugetlb_madv_vs_map
# Restore the previous number of huge pages, since further tests rely on it
echo "$nr_hugepages_tmp" > /proc/sys/vm/nr_hugepages
--
2.34.1
=== Description ===
This is a bpf-treewide change that annotates all kfuncs as such inside
.BTF_ids. This annotation eventually allows us to automatically generate
kfunc prototypes from bpftool.
We store this metadata inside a yet-unused flags field inside struct
btf_id_set8 (thanks Kumar!). pahole will be taught where to look.
More details about the full chain of events are available in commit 3's
description.
The accompanying pahole and bpftool changes can be viewed
here on these "frozen" branches [0][1].
[0]: https://github.com/danobi/pahole/tree/kfunc_btf-v3-mailed
[1]: https://github.com/danobi/linux/tree/kfunc_bpftool-mailed
=== Changelog ===
Changes from v3:
* Rebase to bpf-next and add missing annotation on new kfunc
Changes from v2:
* Only WARN() for vmlinux kfuncs
Changes from v1:
* Move WARN_ON() up a call level
* Also return error when kfunc set is not properly tagged
* Use BTF_KFUNCS_START/END instead of flags
* Rename BTF_SET8_KFUNC to BTF_SET8_KFUNCS
Daniel Xu (3):
bpf: btf: Support flags for BTF_SET8 sets
bpf: btf: Add BTF_KFUNCS_START/END macro pair
bpf: treewide: Annotate BPF kfuncs in BTF
Documentation/bpf/kfuncs.rst | 8 +++----
drivers/hid/bpf/hid_bpf_dispatch.c | 8 +++----
fs/verity/measure.c | 4 ++--
include/linux/btf_ids.h | 21 +++++++++++++++----
kernel/bpf/btf.c | 8 +++++++
kernel/bpf/cpumask.c | 4 ++--
kernel/bpf/helpers.c | 8 +++----
kernel/bpf/map_iter.c | 4 ++--
kernel/cgroup/rstat.c | 4 ++--
kernel/trace/bpf_trace.c | 8 +++----
net/bpf/test_run.c | 8 +++----
net/core/filter.c | 20 +++++++++---------
net/core/xdp.c | 4 ++--
net/ipv4/bpf_tcp_ca.c | 4 ++--
net/ipv4/fou_bpf.c | 4 ++--
net/ipv4/tcp_bbr.c | 4 ++--
net/ipv4/tcp_cubic.c | 4 ++--
net/ipv4/tcp_dctcp.c | 4 ++--
net/netfilter/nf_conntrack_bpf.c | 4 ++--
net/netfilter/nf_nat_bpf.c | 4 ++--
net/xfrm/xfrm_interface_bpf.c | 4 ++--
net/xfrm/xfrm_state_bpf.c | 4 ++--
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 8 +++----
23 files changed, 87 insertions(+), 66 deletions(-)
--
2.42.1
When executing the dirty_log_test on some aarch64 machines, it sometimes
triggers the ASSERT:
==== Test Assertion Failure ====
dirty_log_test.c:384: dirty_ring_vcpu_ring_full
pid=14854 tid=14854 errno=22 - Invalid argument
1 0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
2 0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
3 (inlined by) run_test at dirty_log_test.c:802
4 0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
5 0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
6 0x0000ffff9be173c7: ?? ??:0
7 0x0000ffff9be1749f: ?? ??:0
8 0x000000000040206f: _start at ??:?
Didn't continue vcpu even without ring full
The dirty_log_test fails when executing the dirty-ring test because
sem_vcpu_cont and sem_vcpu_stop are non-zero when the
dirty_ring_collect_dirty_pages() function runs. When those two
sem_t variables are non-zero, the dirty_ring_wait_vcpu() at the
beginning of dirty_ring_collect_dirty_pages() will not wait for the
vcpu to stop, but continues to execute the following code. In this case,
before the vcpu stops, if dirty_ring_vcpu_ring_full is true and
dirty_ring_collect_dirty_pages() has passed the check of
dirty_ring_vcpu_ring_full but hasn't yet executed the check of
continued_vcpu, the vcpu stops and sets dirty_ring_vcpu_ring_full to
false. Then dirty_ring_collect_dirty_pages() triggers the ASSERT.
Why can sem_vcpu_cont and sem_vcpu_stop be non-zero? It's because
dirty_ring_before_vcpu_join() executes sem_post(&sem_vcpu_cont)
at the end of each dirty-ring test. This can lead to two cases:
1. sem_vcpu_cont is non-zero. When we set host_quit to true,
the vcpu_worker sees host_quit immediately and quits. So when
the log_mode_before_vcpu_join() function sets sem_vcpu_cont
to 1, nobody consumes it, since the vcpu_worker has already quit.
2. sem_vcpu_stop is non-zero. When we set host_quit to true,
the vcpu_worker has already entered the guest state. The next time
it exits from the guest state, it sets sem_vcpu_stop to 1 and then
sees host_quit; no one consumes sem_vcpu_stop.
As more and more dirty-ring tests are executed, sem_vcpu_cont and
sem_vcpu_stop can grow larger and larger, so many code paths
no longer wait on the sem_t. This finally causes the problem.
To fix this problem, we can wait a while before setting host_quit to
true, which gives the vcpu time to enter the guest state, so it will
exit again. Then we can wait for the vcpu to exit and let it continue
again, so that the vcpu sees host_quit. Thus sem_vcpu_cont and
sem_vcpu_stop will both be zero when the test finishes.
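The effect of a leftover post can be shown with plain POSIX semaphores.
This is a sketch outside the test itself; wait_would_block() and
pending_posts() are hypothetical helpers, not functions from
dirty_log_test.c.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true if a wait on the semaphore would block,
 * i.e. there is no unconsumed post. Note that a successful
 * sem_trywait() consumes one post.
 */
static bool wait_would_block(sem_t *sem)
{
	if (sem_trywait(sem) == 0)
		return false;
	return errno == EAGAIN;
}

/* Number of posts nobody has consumed yet; expected to be 0 between tests. */
static int pending_posts(sem_t *sem)
{
	int val = 0;

	sem_getvalue(sem, &val);
	return val;
}
```

A semaphore that starts at 0 blocks the first waiter, but after one
unmatched sem_post() the next wait returns immediately, which is the
situation in which dirty_ring_wait_vcpu() stops waiting for the vcpu.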
Signed-off-by: Shaoqin Huang <shahuang(a)redhat.com>
---
v2->v3:
- Rebase to v6.8-rc2.
- Use TEST_ASSERT().
v1->v2:
- Fix the real logic bug, not just fresh the context.
v1: https://lore.kernel.org/all/20231116093536.22256-1-shahuang@redhat.com/
v2: https://lore.kernel.org/all/20231117052210.26396-1-shahuang@redhat.com/
tools/testing/selftests/kvm/dirty_log_test.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 6cbecf499767..dd2d8be390a5 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -417,7 +417,8 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
static void dirty_ring_before_vcpu_join(void)
{
- /* Kick another round of vcpu just to make sure it will quit */
+ /* Wait vcpu exit, and let it continue to see the host_quit. */
+ dirty_ring_wait_vcpu();
sem_post(&sem_vcpu_cont);
}
@@ -719,6 +720,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vm *vm;
unsigned long *bmap;
uint32_t ring_buf_idx = 0;
+ int sem_val;
if (!log_mode_supported()) {
print_skip("Log mode '%s' not supported",
@@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
return;
}
+ sem_getvalue(&sem_vcpu_stop, &sem_val);
+ assert(sem_val == 0);
+ sem_getvalue(&sem_vcpu_cont, &sem_val);
+ assert(sem_val == 0);
+
/*
* We reserve page table for 2 times of extra dirty mem which
* will definitely cover the original (1G+) test range. Here
@@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
sync_global_to_guest(vm, iteration);
}
+ /*
+ *
+ * Before we set host_quit, give the vcpu time to run, to make
+ * sure we consume sem_vcpu_stop and the vcpu consumes
+ * sem_vcpu_cont, keeping the semaphores balanced.
+ */
+ usleep(p->interval * 1000);
/* Tell the vcpu thread to quit */
host_quit = true;
log_mode_before_vcpu_join();
base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3
--
2.40.1
If HUGETLBFS is not enabled then the default_huge_page_size function will
return 0 and cause a divide by 0 error. Add a check to see if the huge page
size is 0 and skip the hugetlb tests if it is.
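The kind of guard described above can be sketched in isolation as
follows. huge_pages_needed() is a hypothetical helper used only for
illustration; the actual patch skips the test case instead of computing
a page count.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of huge pages needed to cover bytes, or 0
 * when huge pages are unavailable (huge_page_size == 0), so the caller
 * can skip instead of dividing by zero.
 */
static size_t huge_pages_needed(size_t bytes, size_t huge_page_size)
{
	if (huge_page_size == 0)
		return 0;
	/* Round up so a partial huge page still counts as one. */
	return (bytes + huge_page_size - 1) / huge_page_size;
}
```

A caller would treat a result of 0 for a non-zero request as "feature
missing" and skip the hugetlb cases.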
Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index cce90a10515a..2b9f8cc52639 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -1517,6 +1517,12 @@ int main(int argc, char *argv[])
continue;
uffd_test_start("%s on %s", test->name, mem_type->name);
+ if ((mem_type->mem_flag == MEM_HUGETLB ||
+ mem_type->mem_flag == MEM_HUGETLB_PRIVATE) &&
+ (default_huge_page_size() == 0)) {
+ uffd_test_skip("huge page size is 0, feature missing?");
+ continue;
+ }
if (!uffd_feature_supported(test)) {
uffd_test_skip("feature missing");
continue;
--
2.43.0.594.gd9cf4e227d-goog
In very slow environments, most big TCP cases including
segmentation and reassembly of big TCP packets have a good
chance to fail: by default the TCP client uses a write size
well below 64K. If the host is slow enough, autocorking is
unable to build real big TCP packets.
Address the issue using much larger write operations.
Note that it is hard to observe the issue without an extremely
slow and/or overloaded environment; reduce the TCP transfer
time to allow for much easier/faster reproducibility.
Fixes: 6bb382bcf742 ("selftests: add a selftest for big tcp")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/big_tcp.sh | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/big_tcp.sh b/tools/testing/selftests/net/big_tcp.sh
index cde9a91c4797..2db9d15cd45f 100755
--- a/tools/testing/selftests/net/big_tcp.sh
+++ b/tools/testing/selftests/net/big_tcp.sh
@@ -122,7 +122,9 @@ do_netperf() {
local netns=$1
[ "$NF" = "6" ] && serip=$SERVER_IP6
- ip net exec $netns netperf -$NF -t TCP_STREAM -H $serip 2>&1 >/dev/null
+
+ # use large write to be sure to generate big tcp packets
+ ip net exec $netns netperf -$NF -t TCP_STREAM -l 1 -H $serip -- -m 262144 2>&1 >/dev/null
}
do_test() {
--
2.43.0
On Mon, Nov 27, 2023 at 11:49:16AM +0000, Felix Huettner wrote:
> conntrack zones are heavily used by tools like openvswitch to run
> multiple virtual "routers" on a single machine. In this context each
> conntrack zone matches to a single router, thereby preventing
> overlapping IPs from becoming issues.
> In these systems it is common to operate on all conntrack entries of a
> given zone, e.g. to delete them when a router is deleted. Previously this
> required these tools to dump the full conntrack table and filter out the
> relevant entries in userspace potentially causing performance issues.
>
> To do this we reuse the existing CTA_ZONE attribute. This was previous
> parsed but not used during dump and flush requests. Now if CTA_ZONE is
> set we filter these operations based on the provided zone.
> However this means that users that previously passed CTA_ZONE will
> experience a difference in functionality.
>
> Alternatively CTA_FILTER could have been used for the same
> functionality. However it is not yet supported during flush requests and
> is only available when using AF_INET or AF_INET6.
For the record, this is applied to nf-next.
Paolo points out that ifconfig is legacy and we should not use it.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
.../drivers/net/netdevsim/udp_tunnel_nic.sh | 40 +++++++++----------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index f98435c502f6..384cfa3d38a6 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -270,7 +270,7 @@ for port in 0 1; do
echo 1 > $NSIM_DEV_SYS/new_port
fi
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="new NIC device created"
exp0=( 0 0 0 0 )
@@ -284,8 +284,8 @@ for port in 0 1; do
msg="VxLAN v4 devices go down"
exp0=( 0 0 0 0 )
- ifconfig vxlan1 down
- ifconfig vxlan0 down
+ ip link set dev vxlan1 down
+ ip link set dev vxlan0 down
check_tables
msg="VxLAN v6 devices"
@@ -293,7 +293,7 @@ for port in 0 1; do
new_vxlan vxlanA 4789 $NSIM_NETDEV 6
for ifc in vxlan0 vxlan1; do
- ifconfig $ifc up
+ ip link set dev $ifc up
done
new_vxlan vxlanB 4789 $NSIM_NETDEV 6
@@ -307,14 +307,14 @@ for port in 0 1; do
new_geneve gnv0 6081
msg="NIC device goes down"
- ifconfig $NSIM_NETDEV down
+ ip link set dev $NSIM_NETDEV down
if [ $port -eq 1 ]; then
exp0=( 0 0 0 0 )
exp1=( 0 0 0 0 )
fi
check_tables
msg="NIC device goes up again"
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
exp0=( `mke 4789 1` `mke 4790 1` 0 0 )
exp1=( `mke 6081 2` 0 0 0 )
check_tables
@@ -433,7 +433,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
overflow_table1 "overflow NIC table"
@@ -491,7 +491,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
overflow_table1 "overflow NIC table"
@@ -548,7 +548,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
overflow_table0 "destroy NIC"
overflow_table1 "destroy NIC"
@@ -578,7 +578,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="create VxLANs v6"
new_vxlan vxlanA0 10000 $NSIM_NETDEV 6
@@ -639,7 +639,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
echo 110 > $NSIM_DEV_DFS/ports/$port/udp_ports_inject_error
@@ -695,7 +695,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="create VxLANs v6"
exp0=( `mke 10000 1` 0 0 0 )
@@ -755,7 +755,7 @@ for port in 0 1; do
echo $port > $NSIM_DEV_SYS/new_port
NSIM_NETDEV=`get_netdev_name old_netdevs`
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
msg="create VxLANs v6"
exp0=( `mke 10000 1` 0 0 0 )
@@ -768,7 +768,7 @@ for port in 0 1; do
check_tables
msg="NIC device goes down"
- ifconfig $NSIM_NETDEV down
+ ip link set dev $NSIM_NETDEV down
if [ $port -eq 1 ]; then
exp0=( 0 0 0 0 )
exp1=( 0 0 0 0 )
@@ -779,7 +779,7 @@ for port in 0 1; do
check_tables
msg="NIC device goes up again"
- ifconfig $NSIM_NETDEV up
+ ip link set dev $NSIM_NETDEV up
exp0=( `mke 10000 1` 0 0 0 )
check_tables
@@ -827,12 +827,12 @@ new_vxlan vxlan1 4789 $NSIM_NETDEV2
msg="VxLAN v4 devices go down"
exp0=( 0 0 0 0 )
-ifconfig vxlan1 down
-ifconfig vxlan0 down
+ip link set dev vxlan1 down
+ip link set dev vxlan0 down
check_tables
for ifc in vxlan0 vxlan1; do
- ifconfig $ifc up
+ ip link set dev $ifc up
done
msg="VxLAN v6 device"
@@ -844,11 +844,11 @@ exp1=( `mke 6081 2` 0 0 0 )
new_geneve gnv0 6081
msg="NIC device goes down"
-ifconfig $NSIM_NETDEV down
+ip link set dev $NSIM_NETDEV down
check_tables
msg="NIC device goes up again"
-ifconfig $NSIM_NETDEV up
+ip link set dev $NSIM_NETDEV up
check_tables
for i in `seq 2`; do
--
2.43.0
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively). I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and makes it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
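As a rough illustration of the mechanism described above, the following sketch models a shadow stack purely in software. It is a conceptual model under stated assumptions, not how CET, GCS or Zicfiss are implemented: `struct shadow_stack`, `shstk_call()`, `shstk_ret()` and the fixed depth are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHSTK_DEPTH 64	/* arbitrary capacity chosen for the model */

struct shadow_stack {
	uintptr_t ret[SHSTK_DEPTH];	/* recorded return addresses only */
	size_t top;
};

/* On a call, record the return address. Returns false on overflow. */
static bool shstk_call(struct shadow_stack *ss, uintptr_t ret_addr)
{
	assert(ss != NULL);
	if (ss->top == SHSTK_DEPTH)
		return false;
	ss->ret[ss->top++] = ret_addr;
	return true;
}

/*
 * On a return, pop the recorded address and compare it with the one taken
 * from the normal stack. A mismatch (for example after a ROP-style
 * overwrite) is reported as false; real hardware would raise a fault.
 */
static bool shstk_ret(struct shadow_stack *ss, uintptr_t ret_addr)
{
	assert(ss != NULL);
	if (ss->top == 0)
		return false;
	return ss->ret[--ss->top] == ret_addr;
}
```

Because the recorded addresses form a clean list of call sites, walking `ret[0..top)` also yields a call stack without parsing frames, which is the second benefit mentioned above.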
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads. Instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). Unlike normal stacks only the shadow stack
size is specified, since issues similar to those that led to the
creation of map_shadow_stack() apply.
Please note that the x86 portions of this code are build tested only: I
don't appear to have a system that can run CET available to me. I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.
A new architecture feature Kconfig option for shadow stacks is added
here. This was suggested as part of the review comments for the arm64
GCS series, and since we need to detect if shadow stacks are supported it
seemed sensible to roll it in here.
[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (5):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
kselftest/clone3: Test shadow stack support
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 59 +++++--
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 1 +
include/uapi/linux/sched.h | 4 +
kernel/fork.c | 22 ++-
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 200 +++++++++++++++++-----
tools/testing/selftests/clone3/clone3_selftests.h | 7 +
12 files changed, 250 insertions(+), 67 deletions(-)
---
base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Another small bunch of fixes, addressing issues outlined by the
netdev CI.
The first 2 patches are just rebased.
The following 2 are new fixes, for even more problems that surfaced
meanwhile.
Paolo Abeni (4):
selftests: net: cut more slack for gro fwd tests.
selftests: net: fix setup_ns usage in rtnetlink.sh
selftests: net: fix tcp listener handling in pmtu.sh
selftests: net: avoid just another constant wait
tools/testing/selftests/net/pmtu.sh | 23 ++++++++++++++-----
tools/testing/selftests/net/rtnetlink.sh | 6 ++---
tools/testing/selftests/net/udpgro_fwd.sh | 14 +++++++++--
tools/testing/selftests/net/udpgso_bench_rx.c | 2 +-
4 files changed, 32 insertions(+), 13 deletions(-)
--
2.43.0
This patch series introduces a new char misc driver, /dev/ntsync, which is used
to implement Windows NT synchronization primitives.
== Background ==
The Wine project emulates the Windows API in user space. One particular part of
that API, namely the NT synchronization primitives, has historically been
implemented via RPC to a dedicated "kernel" process. However, more recent
applications use these APIs more strenuously, and the overhead of RPC has become
a bottleneck.
The NT synchronization APIs are too complex to implement on top of existing
primitives without sacrificing correctness. Certain operations, such as
NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require
direct control over the underlying wait queue, and implementing a wait queue
sufficiently robust for Wine in user space is not possible. This proposed
driver, therefore, implements the problematic interfaces directly in the Linux
kernel.
This driver was presented at Linux Plumbers Conference 2023. For those further
interested in the history of synchronization in Wine and past attempts to solve
this problem in user space, a recording of the presentation can be viewed here:
https://www.youtube.com/watch?v=NjU4nyWyhU8
== Performance ==
The gain in performance varies wildly depending on the application in question
and the user's hardware. For some games NT synchronization is not a bottleneck
and no change can be observed, but for others frame rate improvements of 50 to
150 percent are not atypical. The following table lists frame rate measurements
from a variety of games on a variety of hardware, taken by users Dmitry
Skvortsov, FuzzyQuils, OnMars, and myself:
Game                            Upstream    ntsync      improvement
===========================================================================
Anger Foot                      69          99          43%
Call of Juarez                  99.8        224.1       125%
Dirt 3                          110.6       860.7       678%
Forza Horizon 5                 108         160         48%
Lara Croft: Temple of Osiris    141         326         131%
Metro 2033                      164.4       199.2       21%
Resident Evil 2                 26          77          196%
The Crew                        26          51          96%
Tiny Tina's Wonderlands         130         360         177%
Total War Saga: Troy            109         146         34%
===========================================================================
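The improvement column appears to be the relative gain of the ntsync frame rate over the upstream one, rounded to the nearest percent. A small sketch of that arithmetic, assuming this reading of the table (`improvement_percent()` is a hypothetical helper, not part of the series):

```c
#include <assert.h>

/*
 * Relative frame rate gain in percent, rounded to the nearest integer.
 * Assumes upstream_fps > 0 and ntsync_fps >= upstream_fps, as in the table.
 */
static int improvement_percent(double upstream_fps, double ntsync_fps)
{
	double pct;

	assert(upstream_fps > 0.0);
	pct = (ntsync_fps - upstream_fps) / upstream_fps * 100.0;
	return (int)(pct + 0.5);	/* round half up; valid for pct >= 0 */
}
```

For example, Resident Evil 2 goes from 26 to 77, so the gain is (77 - 26) / 26, about 1.96, reported as 196%.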
== Patches ==
The semantics of the patches are broadly intended to match those of the
corresponding Windows functions. For those not already familiar with the Windows
functions (or their undocumented behaviour), patch 29/29 provides a detailed
specification, and individual patches also include a brief description of the
API they are implementing.
The patches making use of this driver in Wine can be retrieved or browsed here:
https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5
== Implementation ==
Some aspects of the implementation may deserve particular comment:
* In the interest of performance, each object is governed only by a single
spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
objects be changed as a single atomic operation. In order to achieve this, we
first take a device-wide lock ("wait_all_lock") any time we are going to lock
more than one object at a time.
The maximum number of objects that can be used in a vectored wait, and
therefore the maximum that can be locked simultaneously, is 64. This number is
NT's own limit.
The acquisition of multiple spinlocks will degrade performance. This is a
conscious choice, however. Wait-for-all is known to be a very rare operation
in practice, especially with counts that approach the maximum, and it is the
intent of the ntsync driver to optimize wait-for-any at the expense of
wait-for-all as much as possible.
* NT mutexes are tied to their threads on an OS level, and the kernel includes
builtin support for "robust" mutexes. In order to keep the ntsync driver
self-contained and avoid touching more code than necessary, it does not hook
into task exit nor use pids.
Instead, the user space emulator is expected to manage thread IDs and pass
them as an argument to any relevant functions; this is the "owner" field of
ntsync_wait_args and ntsync_mutex_args.
When the emulator detects that a thread dies, it should therefore call
NTSYNC_IOC_KILL_OWNER, which will mark mutexes owned by that thread (if any)
as abandoned.
* This implementation uses a misc device mostly because it seemed like the
simplest and least obtrusive option.
Besides simplicity of implementation, the only particularly interesting
advantage is the ability to create an arbitrary number of "contexts"
(corresponding to Windows virtual machines) which are self-contained and
shareable across multiple processes; this maps nicely to file descriptions
(i.e. struct file). This is not impossible with syscalls of course but would
require an extra argument.
On the other hand, there is no reason to forbid using ntsync by default from
user-mode processes, and (as far as I understand) to do so with a char device
requires explicit configuration by e.g. udev or init. Since this is done with
e.g. fuse, I assume this is the model to follow, but I may have chosen
something deprecated.
* ntsync is module-capable mostly because there was nothing preventing it, and
because it aided development. It is not a hard requirement, though.
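The locking rule from the first point above (one lock per object, plus a device-wide lock taken before locking several objects) can be sketched in user space. This is only a model under stated assumptions: pthread mutexes stand in for the driver's spinlocks, `struct obj` and `try_wait_all()` are hypothetical, the objects are semaphore-like counters, and the array is assumed to hold distinct objects.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAIT_OBJECTS 64	/* NT's limit, per the text above */

struct obj {
	pthread_mutex_t lock;	/* stand-in for the per-object spinlock */
	unsigned int count;	/* > 0 means signaled (semaphore-like) */
};

/* Stand-in for the device-wide "wait_all_lock". */
static pthread_mutex_t wait_all_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Consume one unit from every object, or from none of them.
 * Taking wait_all_lock first serializes all multi-object lockers, so two
 * callers can never hold overlapping object sets in conflicting orders.
 */
static bool try_wait_all(struct obj **objs, size_t n)
{
	bool ok = true;
	size_t i;

	assert(n <= MAX_WAIT_OBJECTS);

	pthread_mutex_lock(&wait_all_lock);
	for (i = 0; i < n; i++)
		pthread_mutex_lock(&objs[i]->lock);

	for (i = 0; i < n; i++)
		if (objs[i]->count == 0)
			ok = false;
	if (ok)
		for (i = 0; i < n; i++)
			objs[i]->count--;

	for (i = n; i-- > 0; )
		pthread_mutex_unlock(&objs[i]->lock);
	pthread_mutex_unlock(&wait_all_lock);
	return ok;
}
```

A single-object operation in this model would take only that object's lock, which is why the common wait-for-any path stays cheap while wait-for-all pays for the extra acquisitions.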
== Previous versions ==
Changes in v2:
* Send the whole series instead of just the first few patches.
* Try to add more description to each patch, as a short documentation of the
functions to be implemented. A more complete documentation of all aspects of
the driver is provided in the contents of the last patch.
* Objects are now files rather than indices into a table. This prevents a
process from changing the state of an object which it should not have access
to. Suggested by Andy Lutomirski.
* Because the device no longer inherently has a table of all objects, marking a
thread's owned mutexes as abandoned is now done through an ioctl on the mutex.
* Change the names of a couple ioctls to be a bit less odd (PUT_SEM -> SEM_POST,
PUT_MUTEX -> MUTEX_UNLOCK), and to reflect that they are ioctls on an object
rather than on the device.
* Pass the timeout for wait functions as a bare u64 (in ns), per Arnd Bergmann,
with U64_MAX used to indicate no timeout. I originally indicated that I would
change the timeout to be relative, but on reflection ended up keeping it as
absolute, as this results in the least number of calls to get the current time
(i.e. one).
* Use compat_ptr_ioctl(), per Arnd Bergmann.
* Remove the fixed minor number and module alias, per Greg Kroah-Hartman.
* Allocate the fds array on stack in setup_wait(). This array takes up 260
bytes.
* Link to v1: https://lore.kernel.org/lkml/20240124004028.16826-1-zfigura@codeweavers.com/
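The timeout convention listed in the v2 changes (an absolute u64 in nanoseconds, with U64_MAX meaning no timeout) can be illustrated with a short sketch. The helper names `make_deadline()` and `wait_expired()` are hypothetical and the saturation policy is an assumption; the driver's own handling may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_TIMEOUT UINT64_MAX	/* sentinel: wait forever */

/*
 * Turn a relative timeout into an absolute deadline. The caller reads the
 * clock once to obtain now_ns; later checks only compare against the result.
 * Saturates just below the sentinel so that a huge relative value neither
 * wraps around nor silently becomes "no timeout".
 */
static uint64_t make_deadline(uint64_t now_ns, uint64_t rel_ns)
{
	assert(now_ns < NO_TIMEOUT);
	if (rel_ns >= NO_TIMEOUT - 1 - now_ns)
		return NO_TIMEOUT - 1;
	return now_ns + rel_ns;
}

/* True once the absolute deadline has passed; never true for NO_TIMEOUT. */
static bool wait_expired(uint64_t deadline_ns, uint64_t now_ns)
{
	if (deadline_ns == NO_TIMEOUT)
		return false;
	return now_ns >= deadline_ns;
}
```

With an absolute deadline the waiter does not need to re-read the clock to recompute a remaining interval after each wakeup, which matches the stated reason for keeping the timeout absolute.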
Elizabeth Figura (29):
ntsync: Introduce the ntsync driver and character device.
ntsync: Introduce NTSYNC_IOC_CREATE_SEM.
ntsync: Introduce NTSYNC_IOC_SEM_POST.
ntsync: Introduce NTSYNC_IOC_WAIT_ANY.
ntsync: Introduce NTSYNC_IOC_WAIT_ALL.
ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX.
ntsync: Introduce NTSYNC_IOC_MUTEX_UNLOCK.
ntsync: Introduce NTSYNC_IOC_MUTEX_KILL.
ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.
ntsync: Introduce NTSYNC_IOC_EVENT_SET.
ntsync: Introduce NTSYNC_IOC_EVENT_RESET.
ntsync: Introduce NTSYNC_IOC_EVENT_PULSE.
ntsync: Introduce NTSYNC_IOC_SEM_READ.
ntsync: Introduce NTSYNC_IOC_MUTEX_READ.
ntsync: Introduce NTSYNC_IOC_EVENT_READ.
ntsync: Introduce alertable waits.
selftests: ntsync: Add some tests for semaphore state.
selftests: ntsync: Add some tests for mutex state.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for manual-reset event state.
selftests: ntsync: Add some tests for auto-reset event state.
selftests: ntsync: Add some tests for wakeup signaling with events.
selftests: ntsync: Add tests for alertable waits.
selftests: ntsync: Add some tests for wakeup signaling via alerts.
maintainers: Add an entry for ntsync.
docs: ntsync: Add documentation for the ntsync uAPI.
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/ioctl/ioctl-number.rst | 2 +
Documentation/userspace-api/ntsync.rst | 390 +++++
MAINTAINERS | 9 +
drivers/misc/Kconfig | 9 +
drivers/misc/Makefile | 1 +
drivers/misc/ntsync.c | 1132 ++++++++++++++
include/uapi/linux/ntsync.h | 58 +
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/drivers/ntsync/Makefile | 8 +
tools/testing/selftests/drivers/ntsync/config | 1 +
.../testing/selftests/drivers/ntsync/ntsync.c | 1300 +++++++++++++++++
12 files changed, 2912 insertions(+)
create mode 100644 Documentation/userspace-api/ntsync.rst
create mode 100644 drivers/misc/ntsync.c
create mode 100644 include/uapi/linux/ntsync.h
create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
create mode 100644 tools/testing/selftests/drivers/ntsync/config
create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c
--
2.43.0
Changelog:
v2:
* Make the swapin test also check for zswap usage (patch 3)
(suggested by Yosry Ahmed)
* Some test simplifications/cleanups (patch 3)
(suggested by Yosry Ahmed).
Fix a broken zswap kselftest due to cgroup zswap writeback counter
renaming, and add 2 zswap kselftests, one to cover the (z)swapin case,
and another to check that no zswapping happens when the cgroup limit is
0.
Also, add the zswap kselftest file to the zswap maintainer entry so that
the get_maintainer.pl script can find the zswap maintainers.
Nhat Pham (3):
selftests: zswap: add zswap selftest file to zswap maintainer entry
selftests: fix the zswap invasive shrink test
selftests: add zswapin and no zswap tests
MAINTAINERS | 1 +
tools/testing/selftests/cgroup/test_zswap.c | 99 ++++++++++++++++++++-
2 files changed, 99 insertions(+), 1 deletion(-)
base-commit: 3a92c45e4ba694381c46994f3fde0d8544a2088b
--
2.39.3
From: Maxim Mikityanskiy <maxim(a)isovalent.com>
The goal of this series is to extend the verifier's capabilities of
tracking scalars when they are spilled to stack, especially when the
spill or fill is narrowing. It also contains a fix by Eduard for
infinite loop detection and a state pruning optimization by Eduard that
compensates for a verification complexity regression introduced by
tracking unbounded scalars. These improvements reduce the surface of
false rejections that I saw while working on the Cilium codebase.
Patches 1-9 of the original series were previously applied in v2.
Patches 1-2 (Maxim): Support the case when boundary checks are first
performed after the register was spilled to the stack.
Patches 3-4 (Maxim): Support narrowing fills.
Patches 5-6 (Eduard): Optimization for state pruning in stacksafe() to
mitigate the verification complexity regression.
veristat -e file,prog,states -f '!states_diff<50' -f '!states_pct<10' -f '!states_a<10' -f '!states_b<10' -C ...
* Without patch 5:
File                  Program   States (A)  States (B)  States (DIFF)
--------------------  --------  ----------  ----------  ----------------
pyperf100.bpf.o       on_event        4878        6528   +1650 (+33.83%)
pyperf180.bpf.o       on_event        6936       11032   +4096 (+59.05%)
pyperf600.bpf.o       on_event       22271       39455  +17184 (+77.16%)
pyperf600_iter.bpf.o  on_event         400         490     +90 (+22.50%)
strobemeta.bpf.o      on_event        4895       14028  +9133 (+186.58%)
* With patch 5:
File                     Program        States (A)  States (B)  States (DIFF)
-----------------------  -------------  ----------  ----------  ---------------
bpf_xdp.o                tail_lb_ipv4         2770        2224   -546 (-19.71%)
pyperf100.bpf.o          on_event             4878        5848   +970 (+19.89%)
pyperf180.bpf.o          on_event             6936        8868  +1932 (+27.85%)
pyperf600.bpf.o          on_event            22271       29656  +7385 (+33.16%)
pyperf600_iter.bpf.o     on_event              400         450    +50 (+12.50%)
xdp_synproxy_kern.bpf.o  syncookie_tc          280         226    -54 (-19.29%)
xdp_synproxy_kern.bpf.o  syncookie_xdp         302         228    -74 (-24.50%)
v2 changes:
Fixed comments in patch 1, moved endianness checks to header files in
patch 12 where possible, added Eduard's ACKs.
v3 changes:
Maxim: Removed __is_scalar_unbounded altogether, addressed Andrii's
comments.
Eduard: Patch #5 (#14 in v2) changed significantly:
- Logical changes:
- Handling of STACK_{MISC,ZERO} mix turned out to be incorrect:
a mix of MISC and ZERO in old state is not equivalent to e.g.
just MISC in current state, because verifier could have deduced
zero scalars from ZERO slots in old state for some loads.
- There is no reason to limit the change only to cases when
old or current stack is a spill of unbounded scalar,
it is valid to compare any 64-bit scalar spill with fake
register impersonating MISC.
- STACK_ZERO vs spilled zero case was dropped,
after recent changes for zero handling by Andrii and Yonghong
it is hard (impossible?) to conjure all ZERO slots for an spi.
=> the case does not make any difference in veristat results.
- Use global static variable for unbound_reg (Andrii)
- Code shuffling to remove duplication in stacksafe() (Andrii)
Eduard Zingerman (2):
bpf: handle scalar spill vs all MISC in stacksafe()
selftests/bpf: states pruning checks for scalar vs STACK_MISC
Maxim Mikityanskiy (4):
bpf: Track spilled unbounded scalars
selftests/bpf: Test tracking spilled unbounded scalars
bpf: Preserve boundaries and track scalars on narrowing fill
selftests/bpf: Add test cases for narrowing fill
include/linux/bpf_verifier.h | 9 +
kernel/bpf/verifier.c | 103 ++++--
.../selftests/bpf/progs/verifier_spill_fill.c | 324 +++++++++++++++++-
3 files changed, 404 insertions(+), 32 deletions(-)
--
2.43.0
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with Commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that would validate if this feature
works correctly.
The selftest needs to verify if writing non-contiguous CBMs to the
schemata file behaves as expected in comparison to the information about
non-contiguous CBMs support.
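The expectation the test has to encode can be sketched as follows. This is a minimal model, not the selftest's code: `cbm_is_contiguous()` and `expected_write_ok()` are hypothetical helpers, masks are assumed to fit in 32 bits and to be non-zero, and other validity rules (such as a minimum number of set bits) are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if all set bits of a non-zero capacity bitmask form a single run. */
static bool cbm_is_contiguous(uint32_t cbm)
{
	uint32_t low;

	if (cbm == 0)
		return false;
	low = cbm & (~cbm + 1);	/* isolate the lowest set bit */
	/*
	 * Adding the lowest bit carries through the lowest run of ones and
	 * clears it. Any bit of cbm that survives belongs to a second run.
	 */
	return (cbm & (cbm + low)) == 0;
}

/*
 * Expected outcome of writing cbm to the schemata file: a contiguous mask
 * is always acceptable, a non-contiguous one only if the platform reports
 * support for non-contiguous CBMs.
 */
static bool expected_write_ok(uint32_t cbm, bool noncont_supported)
{
	assert(cbm != 0);
	return noncont_supported || cbm_is_contiguous(cbm);
}
```

A test would then compare the actual result of the schemata write against `expected_write_ok()` and fail on any mismatch, in either direction.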
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. The patch also implements functionality
similar to that of the bash script included in the cover letter
of the original non-contiguous CBMs in Intel CAT series [3].
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 +++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 4 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 11 ++-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 98 ++++++++++++++++---
7 files changed, 199 insertions(+), 26 deletions(-)
--
2.43.0
Arch maintainers, please ack/review patches.
This is a resend of a series from Frank last year[1]. I worked in Rob's
review comments to unconditionally call unflatten_device_tree() and
fixup/audit calls to of_have_populated_dt() so that behavior doesn't
change.
I need this series so I can add DT based tests in the clk framework.
Either I can merge it through the clk tree once everyone is happy, or
Rob can merge it through the DT tree and provide some branch so I can
base clk patches on it.
Changes from Frank's series[1]:
* Add a DTB loaded kunit test
* Make of_have_populated_dt() return false if the DTB isn't from the
bootloader
* Architecture calls made unconditional so that a root node is always
made
Changes from v1 (https://lore.kernel.org/r/20240112200750.4062441-1-sboyd@kernel.org):
* x86 patch included
* arm64 knocks out initial dtb if acpi is in use
* keep Kconfig hidden but def_bool enabled otherwise
Frank Rowand (2):
of: Create of_root if no dtb provided by firmware
of: unittest: treat missing of_root as error instead of fixing up
Stephen Boyd (5):
arm64: Unconditionally call unflatten_device_tree()
um: Unconditionally call unflatten_device_tree()
x86/of: Unconditionally call unflatten_and_copy_device_tree()
of: Always unflatten in unflatten_and_copy_device_tree()
of: Add KUnit test to confirm DTB is loaded
arch/arm64/kernel/setup.c | 7 +++--
arch/um/kernel/dtb.c | 14 +++++-----
arch/x86/kernel/devicetree.c | 24 +++++++++--------
drivers/of/.kunitconfig | 3 +++
drivers/of/Kconfig | 11 +++++++-
drivers/of/Makefile | 4 ++-
drivers/of/empty_root.dts | 6 +++++
drivers/of/fdt.c | 52 +++++++++++++++++++++++++-----------
drivers/of/of_test.c | 48 +++++++++++++++++++++++++++++++++
drivers/of/platform.c | 3 ---
drivers/of/unittest.c | 16 +++--------
include/linux/of.h | 25 ++++++++++-------
12 files changed, 151 insertions(+), 62 deletions(-)
create mode 100644 drivers/of/.kunitconfig
create mode 100644 drivers/of/empty_root.dts
create mode 100644 drivers/of/of_test.c
Cc: Anton Ivanov <anton.ivanov(a)cambridgegreys.com>
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Frank Rowand <frowand.list(a)gmail.com>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: Richard Weinberger <richard(a)nod.at>
Cc: Rob Herring <robh+dt(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <x86(a)kernel.org>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Saurabh Sengar <ssengar(a)linux.microsoft.com>
[1] https://lore.kernel.org/r/20230317053415.2254616-1-frowand.list@gmail.com
base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/
https://git.kernel.org/pub/scm/linux/kernel/git/sboyd/spmi.git
Add a test case for regression in openvswitch nat that was fixed by
commit e6345d2824a3 ("netfilter: nf_nat: fix action not being set for
all ct states").
Link: https://lore.kernel.org/netdev/20231221224311.130319-1-brad@faucet.nz/
Link: https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/410476.html
Suggested-by: Aaron Conole <aconole(a)redhat.com>
Signed-off-by: Brad Cowie <brad(a)faucet.nz>
---
.../selftests/net/openvswitch/openvswitch.sh | 62 +++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index f8499d4c87f3..87b80bee6df4 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -17,6 +17,7 @@ tests="
ct_connect_v4 ip4-ct-xon: Basic ipv4 tcp connection using ct
connect_v4 ip4-xon: Basic ipv4 ping between two NS
nat_connect_v4 ip4-nat-xon: Basic ipv4 tcp connection via NAT
+ nat_related_v4 ip4-nat-related: ICMP related matches work with SNAT
netlink_checks ovsnl: validate netlink attrs and settings
upcall_interfaces ovs: test the upcall interfaces
drop_reason drop: test drop reasons are emitted"
@@ -473,6 +474,67 @@ test_nat_connect_v4 () {
return 0
}
+# nat_related_v4 test
+# - client->server ip packets go via SNAT
+# - client solicits ICMP destination unreachable packet from server
+# - undo NAT for ICMP reply and test dst ip has been updated
+test_nat_related_v4 () {
+ which nc >/dev/null 2>/dev/null || return $ksft_skip
+
+ sbx_add "test_nat_related_v4" || return $?
+
+ ovs_add_dp "test_nat_related_v4" natrelated4 || return 1
+ info "create namespaces"
+ for ns in client server; do
+ ovs_add_netns_and_veths "test_nat_related_v4" "natrelated4" "$ns" \
+ "${ns:0:1}0" "${ns:0:1}1" || return 1
+ done
+
+ ip netns exec client ip addr add 172.31.110.10/24 dev c1
+ ip netns exec client ip link set c1 up
+ ip netns exec server ip addr add 172.31.110.20/24 dev s1
+ ip netns exec server ip link set s1 up
+
+ ip netns exec server ip route add 192.168.0.20/32 via 172.31.110.10
+
+ # Allow ARP
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "in_port(1),eth(),eth_type(0x0806),arp()" "2" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "in_port(2),eth(),eth_type(0x0806),arp()" "1" || return 1
+
+ # Allow IP traffic from client->server, rewrite source IP with SNAT to 192.168.0.20
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "ct_state(-trk),in_port(1),eth(),eth_type(0x0800),ipv4(dst=172.31.110.20)" \
+ "ct(commit,nat(src=192.168.0.20)),recirc(0x1)" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "recirc_id(0x1),ct_state(+trk-inv),in_port(1),eth(),eth_type(0x0800),ipv4()" \
+ "2" || return 1
+
+ # Allow related ICMP responses back from server and undo NAT to restore original IP
+ # Drop any ICMP related packets where dst ip hasn't been restored back to original IP
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "ct_state(-trk),in_port(2),eth(),eth_type(0x0800),ipv4()" \
+ "ct(commit,nat),recirc(0x2)" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "recirc_id(0x2),ct_state(+rel+trk),in_port(2),eth(),eth_type(0x0800),ipv4(src=172.31.110.20,dst=172.31.110.10,proto=1),icmp()" \
+ "1" || return 1
+ ovs_add_flow "test_nat_related_v4" natrelated4 \
+ "recirc_id(0x2),ct_state(+rel+trk),in_port(2),eth(),eth_type(0x0800),ipv4(dst=192.168.0.20,proto=1),icmp()" \
+ "drop" || return 1
+
+ # Solicit destination unreachable response from server
+ ovs_sbx "test_nat_related_v4" ip netns exec client \
+ bash -c "echo a | nc -u -w 1 172.31.110.20 10000"
+
+ # Check to make sure no packets matched the drop rule with incorrect dst ip
+ python3 "$ovs_base/ovs-dpctl.py" dump-flows natrelated4 \
+ | grep "drop" | grep "packets:0" >/dev/null || return 1
+
+ info "done..."
+ return 0
+}
+
# netlink_validation
# - Create a dp
# - check no warning with "old version" simulation
--
2.34.1
When executing dirty_log_test on some aarch64 machines, it sometimes
triggers the following assertion:
==== Test Assertion Failure ====
dirty_log_test.c:384: dirty_ring_vcpu_ring_full
pid=14854 tid=14854 errno=22 - Invalid argument
1 0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
2 0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
3 (inlined by) run_test at dirty_log_test.c:802
4 0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
5 0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
6 0x0000ffff9be173c7: ?? ??:0
7 0x0000ffff9be1749f: ?? ??:0
8 0x000000000040206f: _start at ??:?
Didn't continue vcpu even without ring full
The dirty_log_test fails while running the dirty-ring test because
sem_vcpu_cont and sem_vcpu_stop hold non-zero values when
dirty_ring_collect_dirty_pages() executes. When those two sem_t
variables are non-zero, the dirty_ring_wait_vcpu() call at the
beginning of dirty_ring_collect_dirty_pages() does not wait for the
vcpu to stop, but continues on to the following code. In this case,
if dirty_ring_vcpu_ring_full is true before the vcpu stops, and
dirty_ring_collect_dirty_pages() has passed the check of
dirty_ring_vcpu_ring_full but has not yet executed the check of
continued_vcpu, the vcpu can stop and set dirty_ring_vcpu_ring_full to
false. dirty_ring_collect_dirty_pages() then triggers the ASSERT.
Why can sem_vcpu_cont and sem_vcpu_stop be non-zero? Because
dirty_ring_before_vcpu_join() executes sem_post(&sem_vcpu_cont)
at the end of each dirty-ring test. This can lead to two cases:
1. sem_vcpu_cont is non-zero. When host_quit is set to true, the
vcpu_worker sees host_quit immediately and quits. The
log_mode_before_vcpu_join() function then sets sem_vcpu_cont
to 1, and since the vcpu_worker has already quit, nothing consumes it.
2. sem_vcpu_stop is non-zero. When host_quit is set to true, the
vcpu_worker has already entered the guest. The next time it exits
from the guest, it sets sem_vcpu_stop to 1 and then sees
host_quit, so nothing consumes sem_vcpu_stop.
As more and more dirty-ring tests are executed, sem_vcpu_cont and
sem_vcpu_stop grow larger and larger, so many code paths no longer
wait on the semaphores. This eventually causes the problem.
To fix this, wait a while before setting host_quit to true, which
gives the vcpu time to enter the guest so that it will exit again.
Then wait for the vcpu to exit and let it continue, at which point
the vcpu sees host_quit. As a result, sem_vcpu_cont and
sem_vcpu_stop are both zero when the test finishes.
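For illustration only (this is not part of the patch), the accumulation
effect can be reproduced with a plain POSIX semaphore. The helper name
count_nonblocking_waits() is hypothetical; the sketch assumes nothing
about dirty_log_test beyond the fact that it uses sem_t counters. Each
sem_post() without a matching waiter leaves the count one higher, and
each leftover count lets one later wait return without blocking:

```c
#include <assert.h>
#include <semaphore.h>

/*
 * Post a semaphore `extra_posts` times with no matching waiter, then
 * count how many waits succeed without blocking. Returns that count,
 * or -1 if the semaphore could not be initialised.
 */
static int count_nonblocking_waits(unsigned int extra_posts)
{
	sem_t sem;
	int passed = 0;
	int val = 0;

	if (sem_init(&sem, 0, 0) != 0)
		return -1;

	/* Unbalanced posts: nobody is waiting, so the count just grows. */
	for (unsigned int i = 0; i < extra_posts; i++)
		sem_post(&sem);

	/* The leftover count is what the sem_getvalue() checks would see. */
	sem_getvalue(&sem, &val);
	assert(val == (int)extra_posts);

	/* sem_trywait() succeeds only while the count is non-zero. */
	while (sem_trywait(&sem) == 0)
		passed++;

	sem_destroy(&sem);
	return passed;
}
```

With a balanced semaphore, count_nonblocking_waits(0) reports no free
passes; every stray post adds exactly one.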
Signed-off-by: Shaoqin Huang <shahuang(a)redhat.com>
---
v1->v2:
- Fix the real logic bug, not just refresh the context.
v1: https://lore.kernel.org/all/20231116093536.22256-1-shahuang@redhat.com/
---
tools/testing/selftests/kvm/dirty_log_test.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..a6e0ff46a07c 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -417,7 +417,8 @@ static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
static void dirty_ring_before_vcpu_join(void)
{
- /* Kick another round of vcpu just to make sure it will quit */
+ /* Wait vcpu exit, and let it continue to see the host_quit. */
+ dirty_ring_wait_vcpu();
sem_post(&sem_vcpu_cont);
}
@@ -719,6 +720,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vm *vm;
unsigned long *bmap;
uint32_t ring_buf_idx = 0;
+ int sem_val;
if (!log_mode_supported()) {
print_skip("Log mode '%s' not supported",
@@ -726,6 +728,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
return;
}
+ sem_getvalue(&sem_vcpu_stop, &sem_val);
+ assert(sem_val == 0);
+ sem_getvalue(&sem_vcpu_cont, &sem_val);
+ assert(sem_val == 0);
+
/*
* We reserve page table for 2 times of extra dirty mem which
* will definitely cover the original (1G+) test range. Here
@@ -825,6 +832,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
sync_global_to_guest(vm, iteration);
}
+ /*
+ *
+ * Before we set the host_quit, let the vcpu has time to run, to make
+ * sure we consume the sem_vcpu_stop and the vcpu consume the
+ * sem_vcpu_cont, to keep the semaphore balance.
+ */
+ usleep(p->interval * 1000);
/* Tell the vcpu thread to quit */
host_quit = true;
log_mode_before_vcpu_join();
--
2.40.1
This series of 9 patches fixes issues mostly identified by CIs not
managed by the MPTCP maintainers. Thank you Linaro (LKFT) and Netdev
maintainers (NIPA) for running our kunit and selftests tests!
For the first patch, it took a bit of time to identify the root cause.
Some MPTCP Join selftest subtests have been "flaky", mostly in slow
environments. It appears to be due to the use of a TCP-specific helper
on an MPTCP socket. A fix for kernels >= v5.15.
Patches 2 to 4 add missing kernel config to support NetFilter tables
needed for IPTables commands. These kconfigs are usually enabled in
default configurations, but apparently not for all architectures.
Patches 2 and 3 can be backported up to v5.11 and the 4th one up to
v5.19.
Patch 5 increases the time limit for MPTCP selftests. It appears that
many CIs execute tests in a VM without acceleration support, e.g. QEMU
without KVM. As a result, the tests take longer. Plus, there are more
and more tests. This patch modifies the timeout added in v5.18.
Patch 6 reduces the maximum rate and delay of the different links in
some Simult Flows selftest subtests. The goal is to let slow VMs reach
the maximum speed. The original rate was introduced in v5.11.
Patch 7 lets CIs change the prefix of the subtest titles, to be able
to run the same selftest multiple times with different parameters. With
different titles, tests will be considered as different and will not
override previous results, as is the case with some CI envs. Subtests
have been introduced in v6.6.
Patches 8 and 9 make some MPTCP Join selftest subtests quicker by stopping
the transfer when the expected events have been seen. Patch 8 can be
backported up to v6.5.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (8):
selftests: mptcp: add missing kconfig for NF Filter
selftests: mptcp: add missing kconfig for NF Filter in v6
selftests: mptcp: add missing kconfig for NF Mangle
selftests: mptcp: increase timeout to 30 min
selftests: mptcp: decrease BW in simult flows
selftests: mptcp: allow changing subtests prefix
selftests: mptcp: join: stop transfer when check is done (part 1)
selftests: mptcp: join: stop transfer when check is done (part 2)
Paolo Abeni (1):
mptcp: fix data re-injection from stale subflow
net/mptcp/protocol.c | 3 ---
tools/testing/selftests/net/mptcp/config | 3 +++
tools/testing/selftests/net/mptcp/mptcp_join.sh | 27 +++++++++--------------
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 2 +-
tools/testing/selftests/net/mptcp/settings | 2 +-
tools/testing/selftests/net/mptcp/simult_flows.sh | 8 +++----
6 files changed, 20 insertions(+), 25 deletions(-)
---
base-commit: c9ec85153fea6873c52ed4f5055c87263f1b54f9
change-id: 20240131-upstream-net-20240131-mptcp-ci-issues-9d68b5601e74
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
mmap() respects the rlimit only for normal users. This test should be
run as a normal user, without root privileges. Also add back the sudo -u
nobody, as run_vmtests.sh is run as root most of the time. Skip the test
if sudo isn't present to lower the privileges.
Fixes: b6221771d468 ("selftests/mm: run_vmtests: remove sudo and conform to tap")
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Please fold this patch in the Fixes patch if needed.
---
tools/testing/selftests/mm/on-fault-limit.c | 6 +++---
tools/testing/selftests/mm/run_vmtests.sh | 7 ++++++-
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/mm/on-fault-limit.c b/tools/testing/selftests/mm/on-fault-limit.c
index 0ea98ffab3589..431c1277d83a1 100644
--- a/tools/testing/selftests/mm/on-fault-limit.c
+++ b/tools/testing/selftests/mm/on-fault-limit.c
@@ -21,7 +21,7 @@ static void test_limit(void)
map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
- ksft_test_result(map == MAP_FAILED, "Failed mmap\n");
+ ksft_test_result(map == MAP_FAILED, "The map failed respecting mlock limits\n");
if (map != MAP_FAILED)
munmap(map, 2 * lims.rlim_max);
@@ -33,8 +33,8 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(1);
- if (getuid())
- ksft_test_result_skip("Require root privileges to run\n");
+ if (!getuid())
+ ksft_test_result_skip("The test must be run from a normal user\n");
else
test_limit();
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 55898d64e2ebf..edd73f871c79a 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -303,7 +303,12 @@ echo "$nr_hugepgs" > /proc/sys/vm/nr_hugepages
CATEGORY="compaction" run_test ./compaction_test
-CATEGORY="mlock" run_test ./on-fault-limit
+if command -v sudo &> /dev/null;
+then
+ CATEGORY="mlock" run_test sudo -u nobody ./on-fault-limit
+else
+ echo "# SKIP ./on-fault-limit"
+fi
CATEGORY="mmap" run_test ./map_populate
--
2.42.0
mmap() respects the rlimit only for normal users. This test should be
run as a normal user, without root privileges.
Fixes: b6221771d468 ("selftests/mm: run_vmtests: remove sudo and conform to tap")
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/on-fault-limit.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/on-fault-limit.c b/tools/testing/selftests/mm/on-fault-limit.c
index 0ea98ffab3589..431c1277d83a1 100644
--- a/tools/testing/selftests/mm/on-fault-limit.c
+++ b/tools/testing/selftests/mm/on-fault-limit.c
@@ -21,7 +21,7 @@ static void test_limit(void)
map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
- ksft_test_result(map == MAP_FAILED, "Failed mmap\n");
+ ksft_test_result(map == MAP_FAILED, "The map failed respecting mlock limits\n");
if (map != MAP_FAILED)
munmap(map, 2 * lims.rlim_max);
@@ -33,8 +33,8 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(1);
- if (getuid())
- ksft_test_result_skip("Require root privileges to run\n");
+ if (!getuid())
+ ksft_test_result_skip("The test must be run from a normal user\n");
else
test_limit();
--
2.42.0
This series tries to address CI failures for the pmtu.sh tests. It
does _not_ attempt to enable all the currently skipped cases, to
avoid adding more entropy.
Tested with:
make -C tools/testing/selftests/ TARGETS=net install
vng --build --config tools/testing/selftests/net/config
vng --run . --user root -- \
./tools/testing/selftests/kselftest_install/run_kselftest.sh \
-t net:pmtu.sh
Paolo Abeni (3):
selftests: net: add missing config for pmtu.sh tests
selftests: net: fix available tunnels detection
selftests: net: don't access /dev/stdout in pmtu.sh
tools/testing/selftests/net/config | 3 +++
tools/testing/selftests/net/pmtu.sh | 18 +++++++++---------
2 files changed, 12 insertions(+), 9 deletions(-)
--
2.43.0
This patch series proposes support for guest VMs running in user
mode, as well as in a canonical linear-address organization.
First, partition the 64-bit canonical linear address space into two
halves belonging to user mode and supervisor mode respectively,
similar to the organization of linear addresses used in Linux.
Currently the linear addresses use the 48-bit canonical format, in
which bits 63:47 of the address are identical.
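As a rough sketch of what the 48-bit canonical rule means (not code from
this series; is_canonical_48() and is_lower_half() are hypothetical
helper names), the check amounts to testing that bits 63:47 are either
all zeros or all ones. Treating the lower half as the user-mode half is
an assumption here, made by analogy with the Linux layout mentioned
above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A 48-bit canonical address has bits 63:47 all equal, i.e. it is the
 * sign extension of its low 48 bits.
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t upper = addr >> 47;	/* bits 63:47, 17 bits wide */

	assert(upper <= 0x1ffff);	/* a 17-bit value by construction */
	return upper == 0 || upper == 0x1ffff;
}

/*
 * Lower canonical half: bits 63:47 are all zero. Assumed here to be
 * the user-mode half, as in Linux.
 */
static bool is_lower_half(uint64_t addr)
{
	return (addr >> 47) == 0;
}
```

For example, 0x00007fffffffffff is the last address of the lower half,
0xffff800000000000 is the first address of the upper half, and
everything in between is non-canonical.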
Second, set up page tables mapping the same guest physical addresses
of the test code and data segments into both the user-mode and
supervisor-mode address spaces. This allows a guest in either runtime
mode, i.e. user or supervisor, to run one code base in the
corresponding linear address space.
Also provide a runtime environment setup API for switching to
user-mode execution.
Zeng Guang (8):
KVM: selftests: x86: Fix bug in addr_arch_gva2gpa()
KVM: selftests: x86: Support guest running on canonical linear-address
organization
KVM: selftests: Add virt_arch_ucall_prealloc() arch specific
implementation
KVM : selftests : Adapt selftest cases to kernel canonical linear
address
KVM: selftests: x86: Prepare setup for user mode support
KVM: selftests: x86: Allow user to access user-mode address and I/O
address space
KVM: selftests: x86: Support vcpu run in user mode
KVM: selftests: x86: Add KVM forced emulation prefix capability
.../selftests/kvm/include/kvm_util_base.h | 20 ++-
.../selftests/kvm/include/x86_64/processor.h | 48 ++++++-
.../selftests/kvm/lib/aarch64/processor.c | 5 +
tools/testing/selftests/kvm/lib/kvm_util.c | 6 +-
.../selftests/kvm/lib/riscv/processor.c | 5 +
.../selftests/kvm/lib/s390x/processor.c | 5 +
.../testing/selftests/kvm/lib/ucall_common.c | 2 +
.../selftests/kvm/lib/x86_64/processor.c | 117 ++++++++++++++----
.../selftests/kvm/set_memory_region_test.c | 13 +-
.../testing/selftests/kvm/x86_64/debug_regs.c | 2 +-
.../kvm/x86_64/userspace_msr_exit_test.c | 9 +-
11 files changed, 195 insertions(+), 37 deletions(-)
--
2.21.3
l2_tos_ttl_inherit.sh verifies the inheritance of tos and ttl
for GRETAP, VXLAN and GENEVE.
Before testing it checks if the required module is available
and if not skips the tests accordingly.
Currently only GRETAP and VXLAN are tested because the GENEVE
module is missing.
Signed-off-by: Matthias May <matthias.may(a)westermo.com>
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 19ff75051660..8d79c024bebf 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -76,6 +76,7 @@ CONFIG_CRYPTO_SM4_GENERIC=y
CONFIG_AMT=m
CONFIG_TUN=y
CONFIG_VXLAN=m
+CONFIG_GENEVE=m
CONFIG_IP_SCTP=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_CRYPTO_ARIA=y
--
2.39.2
From: Willem de Bruijn <willemb(a)google.com>
The test sends packets and compares enqueue, transmit and Ack
timestamps with expected values. It installs netem delays to increase
latency between these points.
The test proves flaky in a virtual environment (vng). Increase the
delays to reduce variance. Scale measurement tolerance accordingly.
Time sensitive tests are difficult to calibrate. Increasing delays 10x
also increases runtime 10x, for one. And it may still prove flaky at
some rate.
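To make the scaled numbers concrete, here is a small arithmetic sketch
(not the txtimestamp implementation; both helper names are hypothetical).
It assumes the ACK delay is one full round trip through both netem
delays and that the tolerance is applied symmetrically, as the
"+/- tolerance" comment in the diff suggests:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round trip: each direction crosses the lo egress and ifb egress delay. */
static int64_t expected_ack_delay_us(int64_t lo_delay_us, int64_t ifb_delay_us)
{
	return 2 * (lo_delay_us + ifb_delay_us);
}

/* True if the measured delay is within +/- tol_us of the expected one. */
static bool within_tolerance(int64_t measured_us, int64_t expected_us,
			     int64_t tol_us)
{
	int64_t diff = measured_us - expected_us;

	assert(tol_us >= 0);
	if (diff < 0)
		diff = -diff;
	return diff <= tol_us;
}
```

With the new 10ms and 20ms delays this gives an expected ACK delay of
60000us, so with an 8000us tolerance a measurement of 66000us would be
accepted and one of 70000us would not.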
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/txtimestamp.sh | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/txtimestamp.sh b/tools/testing/selftests/net/txtimestamp.sh
index 31637769f59f..25baca4b148e 100755
--- a/tools/testing/selftests/net/txtimestamp.sh
+++ b/tools/testing/selftests/net/txtimestamp.sh
@@ -8,13 +8,13 @@ set -e
setup() {
# set 1ms delay on lo egress
- tc qdisc add dev lo root netem delay 1ms
+ tc qdisc add dev lo root netem delay 10ms
# set 2ms delay on ifb0 egress
modprobe ifb
ip link add ifb_netem0 type ifb
ip link set dev ifb_netem0 up
- tc qdisc add dev ifb_netem0 root netem delay 2ms
+ tc qdisc add dev ifb_netem0 root netem delay 20ms
# redirect lo ingress through ifb0 egress
tc qdisc add dev lo handle ffff: ingress
@@ -24,9 +24,11 @@ setup() {
}
run_test_v4v6() {
- # SND will be delayed 1000us
- # ACK will be delayed 6000us: 1 + 2 ms round-trip
- local -r args="$@ -v 1000 -V 6000"
+ # SND will be delayed 10ms
+ # ACK will be delayed 60ms: 10 + 20 ms round-trip
+ # allow +/- tolerance of 8ms
+ # wait for ACK to be queued
+ local -r args="$@ -v 10000 -V 60000 -t 8000 -S 80000"
./txtimestamp ${args} -4 -L 127.0.0.1
./txtimestamp ${args} -6 -L ::1
--
2.43.0.429.g432eaa2c6b-goog
From: Jeff Xu <jeffxu(a)chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(). These can leave an empty space that can
then be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
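A minimal user-space sketch of item 5 follows, under several assumptions:
no libc wrapper exists, so the raw syscall is used; __NR_mseal is only
available when the installed kernel headers define it; and a kernel
implementing this version of the series may refuse to seal a mapping
created without MAP_SEALABLE. seal_blocks_mprotect() is a hypothetical
helper name, and the sketch treats any mseal() failure as "not
available" rather than as an error:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Returns 1 if a page was sealed and a later mprotect() was refused
 * with EPERM, 0 if mseal() is unavailable or refused here, and -1 on
 * any unexpected result.
 */
static int seal_blocks_mprotect(void)
{
#ifdef __NR_mseal
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	assert(page > 0);
	p = mmap(NULL, (size_t)page, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* flags is reserved, so pass 0. */
	if (syscall(__NR_mseal, p, (size_t)page, 0UL) != 0) {
		munmap(p, (size_t)page);
		return 0;
	}

	/* The sealed page cannot be unmapped; it stays until exit. */
	if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) == 0)
		return -1;	/* the seal did not hold */
	return errno == EPERM ? 1 : -1;
#else
	return 0;	/* headers predate mseal() */
#endif
}
```

The EPERM expectation follows the V7 change listed in the change
history of this cover letter.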
In addition, mmap() has two related changes.
The PROT_SEAL bit in the prot field of mmap(). When present, it marks
the map as sealed from creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.
Applications that don't care about sealing can expect their behavior to
remain unchanged. Those that need sealing support opt in by adding
MAP_SEALABLE in mmap().
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
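The zero-fill effect that makes the discard destructive can be observed
from user space without any sealing involved. The following is only an
illustrative sketch (first_byte_after_discard() is a hypothetical helper
name); it relies on the documented Linux behaviour that MADV_DONTNEED on
private anonymous memory causes later reads to return zero-filled pages:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill a private anonymous page with a non-zero pattern, discard it
 * with MADV_DONTNEED, and return the first byte read back afterwards.
 * Returns -1 if the mapping or the madvise() call fails.
 */
static int first_byte_after_discard(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int byte;

	assert(page > 0);
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xeb, (size_t)page);	/* stand-in for "code" bytes */

	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	byte = p[0];	/* the page is faulted back in, zero-filled */
	munmap(p, (size_t)page);
	return byte;
}
```

The pattern written before the discard is gone afterwards, which is why
the operation is gated on write permission for sealed mappings.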
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insights gained from
implementing mimmutable() in OpenBSD.
Change history:
===============
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 183 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/mm.h | 48 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/mman-common.h | 8 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/madvise.c | 12 +
mm/mmap.c | 27 +
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 343 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1997 +++++++++++++++++++
33 files changed, 2690 insertions(+), 2 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.429.g432eaa2c6b-goog
The idea of this RFC is to introduce a way to catalogue and document any tests
that should be executed for changes to a subsystem, as well as to make
checkpatch.pl require a tag in commit messages certifying they were, plus
hopefully make it easier to discover and run them.
This is following a discussion Veronika Kabatova started with a few
(addressed) people at the LPC last year (IIRC), where there was a good deal of
interest for something like this.
Apart from implementing basic support (surely to be improved), two sample
changes are added on top, adding a few test suites (roughly) based on what the
maintainers described earlier. I'm definitely not qualified for describing
them adequately, and don't have the time to dig deeper, but hopefully they
could serve as illustrations, and shouldn't be merged as is.
I would defer to maintainers of the corresponding subsystems and tests to
describe their tests and requirements better. Although I would accept
amendments too, if they prefer it that way.
One bug I know that's definitely there is handling removed files. The
scripts/get_maintainer.pl chokes on non-existing files, failing to output the
required test suites (I'm sure there's a good reason, but I couldn't see it).
My first idea is to only check for required tests upon encountering the '+++
<file>' line, and ignore the '/dev/null' file, but I hope the checkpatch.pl
maintainers could recommend a better way.
Anyway, tell me what you think, and I'll work on polishing this.
Thank you!
Nick
---
Nikolai Kondrashov (3):
MAINTAINERS: Introduce V: field for required tests
MAINTAINERS: Require kvm-xfstests smoke for ext4
MAINTAINERS: Require kunit core tests for framework changes
Documentation/process/submitting-patches.rst | 19 +++++
Documentation/process/tests.rst | 80 ++++++++++++++++++
MAINTAINERS | 8 ++
scripts/checkpatch.pl | 118 ++++++++++++++++++++++++++-
scripts/get_maintainer.pl | 17 +++-
scripts/parse-maintainers.pl | 3 +-
6 files changed, 241 insertions(+), 4 deletions(-)
---
From: Willem de Bruijn <willemb(a)google.com>
This test validates per-band packet limits in FQ. Packets are dropped
rather than enqueued if the limit for their band is reached.
This test is timing sensitive. It queues packets in FQ with a future
delivery time to fill the qdisc.
The test failed in a virtual environment (vng). Increase the delays
to make it more tolerant to environments with timing variance.
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/fq_band_pktlimit.sh | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/net/fq_band_pktlimit.sh b/tools/testing/selftests/net/fq_band_pktlimit.sh
index 24b77bdf41ff..977070ed42b3 100755
--- a/tools/testing/selftests/net/fq_band_pktlimit.sh
+++ b/tools/testing/selftests/net/fq_band_pktlimit.sh
@@ -8,7 +8,7 @@
# 3. send 20 pkts on band A: verify that 0 are queued, 20 dropped
# 4. send 20 pkts on band B: verify that 10 are queued, 10 dropped
#
-# Send packets with a 100ms delay to ensure that previously sent
+# Send packets with a delay to ensure that previously sent
# packets are still queued when later ones are sent.
# Use SO_TXTIME for this.
@@ -29,19 +29,21 @@ ip -6 addr add fdaa::1/128 dev dummy0
ip -6 route add fdaa::/64 dev dummy0
tc qdisc replace dev dummy0 root handle 1: fq quantum 1514 initial_quantum 1514 limit 10
-./cmsg_sender -6 -p u -d 100000 -n 20 fdaa::2 8000
+DELAY=400000
+
+./cmsg_sender -6 -p u -d "${DELAY}" -n 20 fdaa::2 8000
OUT1="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
-./cmsg_sender -6 -p u -d 100000 -n 20 fdaa::2 8000
+./cmsg_sender -6 -p u -d "${DELAY}" -n 20 fdaa::2 8000
OUT2="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
-./cmsg_sender -6 -p u -d 100000 -n 20 -P 7 fdaa::2 8000
+./cmsg_sender -6 -p u -d "${DELAY}" -n 20 -P 7 fdaa::2 8000
OUT3="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
# Initial stats will report zero sent, as all packets are still
-# queued in FQ. Sleep for the delay period (100ms) and see that
+# queued in FQ. Sleep for at least the delay period and see that
# twenty are now sent.
-sleep 0.1
+sleep 0.6
OUT4="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
# Log the output after the test
--
2.43.0.429.g432eaa2c6b-goog
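The relationship between the SO_TXTIME delay and the final sleep in the patch above can be sketched as follows. This is only an illustration, not part of the patch: the helper name delay_to_sleep and the 50% margin are assumptions, chosen because they reproduce the 400000 us delay and 0.6 s sleep seen in the diff.

```shell
#!/bin/sh
# Hypothetical helper (not in the patch): convert a delay given in
# microseconds into a sleep duration in seconds, adding a 50% margin
# so that the sleep always outlasts the delivery delay.
delay_to_sleep() {
	awk -v d="$1" 'BEGIN { printf "%.1f\n", d * 1.5 / 1000000 }'
}

DELAY=400000
SLEEP="$(delay_to_sleep "${DELAY}")"
echo "delay=${DELAY}us sleep=${SLEEP}s"
```

Deriving the sleep from the delay keeps the two values in step if the delay is tuned again later; the patch itself hardcodes both.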
After commit 25ae948b4478 ("selftests/net: add lib.sh") but before commit
2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh"), some net selftests encountered errors when they were being
exported and run. This was because the new net/lib.sh was not exported
along with the tests. The errors were crudely avoided by duplicating some
content between net/lib.sh and net/forwarding/lib.sh in 2114e83381d3.
In order to restore the sourcing of net/lib.sh from net/forwarding/lib.sh
and remove the duplicated content, this series introduces a new selftests
Makefile variable to list extra files to export from other directories and
makes use of it to avoid reintroducing the errors mentioned above.
v2:
* "selftests: Introduce Makefile variable to list shared bash scripts"
Fix rst syntax in Documentation/dev-tools/kselftest.rst (Jakub Kicinski)
v1:
* "selftests: Introduce Makefile variable to list shared bash scripts"
Changed TEST_INCLUDES to take relative paths, like other TEST_* variables.
Paths are adjusted accordingly in the subsequent patches. (Vladimir Oltean)
* selftests: bonding: Change script interpreter
selftests: forwarding: Remove executable bits from lib.sh
Removed from this series, submitted separately.
Since commit 2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh") resolved the test errors, this version of the series is
focused on removing the duplication that was added in that commit. Directly
rebasing the series would reintroduce the problems that 2114e83381d3
avoided before fixing them again. In order to prevent such breakage partway
through the series, patches are reordered and content changed slightly but
there is no diff at the end compared with the simple rebasing approach. I
have dropped most review tags on account of this reordering.
RFC:
https://lore.kernel.org/netdev/20231222135836.992841-1-bpoirier@nvidia.com/
Link: https://lore.kernel.org/netdev/ZXu7dGj7F9Ng8iIX@Laptop-X1/
Benjamin Poirier (5):
selftests: Introduce Makefile variable to list shared bash scripts
selftests: bonding: Add net/forwarding/lib.sh to TEST_INCLUDES
selftests: team: Add shared library scripts to TEST_INCLUDES
selftests: dsa: Replace test symlinks by wrapper script
selftests: forwarding: Redefine relative_path variable
Petr Machata (1):
selftests: forwarding: Remove duplicated lib.sh content
Documentation/dev-tools/kselftest.rst | 12 ++++++
tools/testing/selftests/Makefile | 7 +++-
.../selftests/drivers/net/bonding/Makefile | 7 +++-
.../net/bonding/bond-eth-type-change.sh | 2 +-
.../drivers/net/bonding/bond_topo_2d1c.sh | 2 +-
.../drivers/net/bonding/dev_addr_lists.sh | 2 +-
.../net/bonding/mode-1-recovery-updelay.sh | 2 +-
.../net/bonding/mode-2-recovery-updelay.sh | 2 +-
.../drivers/net/bonding/net_forwarding_lib.sh | 1 -
.../selftests/drivers/net/dsa/Makefile | 18 ++++++++-
.../drivers/net/dsa/bridge_locked_port.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mdb.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mld.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_aware.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_mcast.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_unaware.sh | 2 +-
.../testing/selftests/drivers/net/dsa/lib.sh | 1 -
.../drivers/net/dsa/local_termination.sh | 2 +-
.../drivers/net/dsa/no_forwarding.sh | 2 +-
.../net/dsa/run_net_forwarding_test.sh | 9 +++++
.../selftests/drivers/net/dsa/tc_actions.sh | 2 +-
.../selftests/drivers/net/dsa/tc_common.sh | 1 -
.../drivers/net/dsa/test_bridge_fdb_stress.sh | 2 +-
.../selftests/drivers/net/team/Makefile | 7 ++--
.../drivers/net/team/dev_addr_lists.sh | 4 +-
.../selftests/drivers/net/team/lag_lib.sh | 1 -
.../drivers/net/team/net_forwarding_lib.sh | 1 -
tools/testing/selftests/lib.mk | 19 ++++++++++
.../testing/selftests/net/forwarding/Makefile | 3 ++
tools/testing/selftests/net/forwarding/lib.sh | 37 +++----------------
.../net/forwarding/mirror_gre_lib.sh | 2 +-
.../net/forwarding/mirror_gre_topo_lib.sh | 2 +-
32 files changed, 98 insertions(+), 64 deletions(-)
delete mode 120000 tools/testing/selftests/drivers/net/bonding/net_forwarding_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/lib.sh
create mode 100755 tools/testing/selftests/drivers/net/dsa/run_net_forwarding_test.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/tc_common.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/lag_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/net_forwarding_lib.sh
--
2.43.0
The test is inspired by pmu_event_filter_test, which is implemented for x86. On
the arm64 platform, the same ability to set a PMU event filter is available
through the KVM_ARM_VCPU_PMU_V3_FILTER attribute, so add the test for arm64.
The series first moves some common PMU code from vpmu_counter_access to
lib/aarch64/vpmu.c and include/aarch64/vpmu.h, so that it can be used by
pmu_event_filter_test. It then fixes a bug in [enable|disable]_counter,
and finally implements the test itself.
Changelog:
----------
v2->v3:
- Check pmceid in the guest code instead of the PMU event count, since
different hardware may produce different event counts; checking pmceid
makes the test stable across platforms. [Eric]
- Fix some typos and improve the commit messages.
v1->v2:
- Improve the commit message. [Eric]
- Fix the bug in [enable|disable]_counter. [Raghavendra & Marc]
- Add a check for whether KVM has the KVM_ARM_VCPU_PMU_V3_FILTER attribute.
- Add a check for whether the host PMU supports the test event, via pmceid0.
- Split the test_invalid_filter() to another patch. [Eric]
v1: https://lore.kernel.org/all/20231123063750.2176250-1-shahuang@redhat.com/
v2: https://lore.kernel.org/all/20231129072712.2667337-1-shahuang@redhat.com/
Shaoqin Huang (5):
KVM: selftests: aarch64: Make the [create|destroy]_vpmu_vm() public
KVM: selftests: aarch64: Move pmu helper functions into vpmu.h
KVM: selftests: aarch64: Fix the buggy [enable|disable]_counter
KVM: selftests: aarch64: Introduce pmu_event_filter_test
KVM: selftests: aarch64: Add invalid filter test in
pmu_event_filter_test
tools/testing/selftests/kvm/Makefile | 2 +
.../kvm/aarch64/pmu_event_filter_test.c | 255 ++++++++++++++++++
.../kvm/aarch64/vpmu_counter_access.c | 218 ++-------------
.../selftests/kvm/include/aarch64/vpmu.h | 135 ++++++++++
.../testing/selftests/kvm/lib/aarch64/vpmu.c | 74 +++++
5 files changed, 490 insertions(+), 194 deletions(-)
create mode 100644 tools/testing/selftests/kvm/aarch64/pmu_event_filter_test.c
create mode 100644 tools/testing/selftests/kvm/include/aarch64/vpmu.h
create mode 100644 tools/testing/selftests/kvm/lib/aarch64/vpmu.c
--
2.40.1
One of the test cases in the test_bridge_backup_port.sh selftest relies
on a matchall classifier to drop unrelated traffic so that the Tx drop
counter on the VXLAN device will only be incremented as a result of
traffic generated by the test.
However, the configuration option for the matchall classifier is
missing from the configuration file which might explain the failures we
see in the netdev CI [1].
Fix by adding CONFIG_NET_CLS_MATCHALL to the configuration file.
[1]
# Backup nexthop ID - invalid IDs
# -------------------------------
[...]
# TEST: Forwarding out of vx0 [ OK ]
# TEST: No forwarding using backup nexthop ID [ OK ]
# TEST: Tx drop increased [FAIL]
# TEST: IPv6 address family nexthop as backup nexthop [ OK ]
# TEST: No forwarding out of swp1 [ OK ]
# TEST: Forwarding out of vx0 [ OK ]
# TEST: No forwarding using backup nexthop ID [ OK ]
# TEST: Tx drop increased [FAIL]
[...]
Fixes: b408453053fb ("selftests: net: Add bridge backup port and backup nexthop ID test")
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
---
Jakub, you can apply to net if you want to, but I'm sending this to
net-next since I want to see if it helps the CI. I'm unable to reproduce
this locally.
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 19ff75051660..2bd5f9033ade 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -68,6 +68,7 @@ CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_FLOWER=m
+CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_ACT_TUNNEL_KEY=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_BAREUDP=m
--
2.43.0
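A quick way to spot this kind of omission is to check a config fragment for the options a test depends on. The sketch below is an assumption-laden illustration, not part of the patch: config_has is a hypothetical helper, and it runs against a small sample file rather than the real tools/testing/selftests/net/config.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the option ($1) is enabled (=y or =m)
# in the kselftest-style config fragment ($2).
config_has() {
	grep -Eq "^${1}=(y|m)\$" "${2}"
}

# Build a small sample fragment instead of reading a real kernel tree.
sample="$(mktemp)"
printf '%s\n' 'CONFIG_NET_CLS_FLOWER=m' 'CONFIG_NET_ACT_MIRRED=m' > "${sample}"

for opt in CONFIG_NET_CLS_FLOWER CONFIG_NET_CLS_MATCHALL; do
	if config_has "${opt}" "${sample}"; then
		echo "${opt}: present"
	else
		echo "${opt}: missing"
	fi
done
rm -f "${sample}"
```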
Hi Linus,
Please pull the following kselftest fixes update for Linux 6.8-rc3.
This kselftest fixes update for Linux 6.8-rc3 consists of three
fixes to livepatch, rseq, and seccomp tests.
diff is attached
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6613476e225e090cc9aad49be7fa504e290dd33d:
Linux 6.8-rc1 (2024-01-21 14:11:32 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.8-rc3
for you to fetch changes up to b54761f6e9773350c0d1fb8e1e5aacaba7769d0f:
kselftest/seccomp: Report each expectation we assert as a KTAP test (2024-01-30 08:55:42 -0700)
----------------------------------------------------------------
linux_kselftest-fixes-6.8-rc3
This kselftest fixes update for Linux 6.8-rc3 consists of three
fixes to livepatch, rseq, and seccomp tests.
----------------------------------------------------------------
Joe Lawrence (1):
selftests/livepatch: fix and refactor new dmesg message code
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
Mathieu Desnoyers (1):
selftests/rseq: Do not skip !allowed_cpus for mm_cid
tools/testing/selftests/livepatch/functions.sh | 37 ++++----
.../testing/selftests/rseq/basic_percpu_ops_test.c | 14 ++-
tools/testing/selftests/rseq/param_test.c | 22 +++--
.../testing/selftests/seccomp/seccomp_benchmark.c | 104 +++++++++++++--------
4 files changed, 109 insertions(+), 68 deletions(-)
----------------------------------------------------------------
Hi Linus,
Please pull the following KUnit fixes update for Linux 6.8-rc3.
This kunit fixes update for Linux 6.8-rc3 consists of NULL vs IS_ERR()
bug fixes, documentation update, MAINTAINERS file update to add
Rae Moar as a reviewer, and a fix to run test suites only after module
initialization completes.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6613476e225e090cc9aad49be7fa504e290dd33d:
Linux 6.8-rc1 (2024-01-21 14:11:32 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-fixes-6.8-rc3
for you to fetch changes up to 1a9f2c776d1416c4ea6cb0d0b9917778c41a1a7d:
Documentation: KUnit: Update the instructions on how to test static functions (2024-01-22 07:59:03 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-fixes-6.8-rc3
This kunit fixes update for Linux 6.8-rc3 consists of NULL vs IS_ERR()
bug fixes, documentation update, MAINTAINERS file update to add
Rae Moar as a reviewer, and a fix to run test suites only after module
initialization completes.
----------------------------------------------------------------
Arthur Grillo (1):
Documentation: KUnit: Update the instructions on how to test static functions
Dan Carpenter (2):
kunit: Fix a NULL vs IS_ERR() bug
kunit: device: Fix a NULL vs IS_ERR() check in init()
David Gow (1):
MAINTAINERS: kunit: Add Rae Moar as a reviewer
Marco Pagani (1):
kunit: run test suites only after module initialization completes
Documentation/dev-tools/kunit/usage.rst | 19 +++++++++++++++++--
MAINTAINERS | 1 +
lib/kunit/device.c | 4 ++--
lib/kunit/executor.c | 4 ++++
lib/kunit/kunit-test.c | 2 +-
lib/kunit/test.c | 14 +++++++++++---
6 files changed, 36 insertions(+), 8 deletions(-)
----------------------------------------------------------------
On riscv, mmap currently returns an address from the largest address
space that can fit entirely inside of the hint address. This makes it
such that the hint address is almost never returned. This patch raises
the mappable area up to and including the hint address. This allows mmap
to often return the hint address, which allows a performance improvement
over searching for a valid address as well as making the behavior more
similar to other architectures.
Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>
---
Charlie Jenkins (3):
riscv: mm: Use hint address in mmap if available
selftests: riscv: Generalize mm selftests
docs: riscv: Define behavior of mmap
Documentation/arch/riscv/vm-layout.rst | 16 ++--
arch/riscv/include/asm/processor.h | 21 ++----
tools/testing/selftests/riscv/mm/mmap_bottomup.c | 20 +----
tools/testing/selftests/riscv/mm/mmap_default.c | 20 +----
tools/testing/selftests/riscv/mm/mmap_test.h | 93 +++++++++++++-----------
5 files changed, 66 insertions(+), 104 deletions(-)
---
base-commit: 556e2d17cae620d549c5474b1ece053430cd50bc
change-id: 20240119-use_mmap_hint_address-f9f4b1b6f5f1
--
- Charlie
Fix a broken zswap kselftest due to cgroup zswap writeback counter
renaming, and add a kselftest to cover the (z)swapin case.
Also, add the zswap kselftest file to the zswap maintainer entry so that
the get_maintainer.pl script can find the zswap maintainers.
Nhat Pham (3):
selftests: zswap: add zswap selftest file to zswap maintainer entry
selftests: fix the zswap invasive shrink test
selftests: add test for zswapin
MAINTAINERS | 1 +
tools/testing/selftests/cgroup/test_zswap.c | 69 ++++++++++++++++++++-
2 files changed, 67 insertions(+), 3 deletions(-)
base-commit: d162e170f1181b4305494843e1976584ddf2b72e
--
2.39.3
From: Björn Töpel <bjorn(a)rivosinc.com>
Here's the fourth try. The "make install" target for the BPF selftests
is missing a bunch of files, which makes the BPF machine flavor fail
(e.g. cpuv4).
This series aims to fix that by explicitly installing bpftool, all
the BPF programs, and the utilities defined by TRUNNER_EXTRA_PROGS
for test_progs.
The fact that this series even has a changelog says a lot, but for
those who care:
v4: Added bpftool
v3: Do not use hardcoded file names (Andrii)
v2: Added btf_dump_test_case files
Björn Töpel (3):
selftests/bpf: Remove incorrect object path
selftests/bpf: Make install target copy test_progs extra files
selftests/bpf: Make install target copy bpftool
tools/testing/selftests/bpf/Makefile | 31 +++++++++++++++++-----------
1 file changed, 19 insertions(+), 12 deletions(-)
base-commit: beb53f32698ff9cd0ca442c1f856ea0ecfb82be3
--
2.40.1
Modern OSes use an iptables implementation with nf_tables as a backend,
e.g.:
$ iptables -V
iptables v1.8.8 (nf_tables)
Pablo points out that we need CONFIG_NFT_COMPAT to make that work,
otherwise we see a lot of:
Warning: Extension DNAT revision 0 not supported, missing kernel module?
with DNAT being just an example here, other modules we need
include udp, TTL, length etc.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Location for new entry chosen based on `sort --version-sort`.
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 413ab9abcf1b..ba56f231e109 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -59,6 +59,7 @@ CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_FQ=m
CONFIG_NET_SCH_ETF=m
CONFIG_NET_SCH_NETEM=y
+CONFIG_NFT_COMPAT=m
CONFIG_NF_FLOW_TABLE=m
CONFIG_PSAMPLE=m
CONFIG_TCP_MD5SIG=y
--
2.43.0
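Whether a host is affected depends on the backend that `iptables -V` reports. As a rough sketch (the helper name iptables_backend is hypothetical, and only the "(nf_tables)" and "(legacy)" suffixes are handled), a test script could classify the version line like this:

```shell
#!/bin/sh
# Hypothetical helper: print the backend named in an `iptables -V` line.
# Version lines without a recognised suffix are reported as "unknown".
iptables_backend() {
	case "$1" in
		*"(nf_tables)"*) echo "nf_tables" ;;
		*"(legacy)"*) echo "legacy" ;;
		*) echo "unknown" ;;
	esac
}

# Uses the sample line quoted in the commit message instead of calling
# the real binary, so the sketch runs on hosts without iptables.
iptables_backend "iptables v1.8.8 (nf_tables)"
```

On a real host the argument would be "$(iptables -V)"; a result of nf_tables indicates that CONFIG_NFT_COMPAT is needed for the extensions named above.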
The DAMON debugfs interface was deprecated in February 2023, by commit
5445fcbc4cda ("Docs/admin-guide/mm/damon/usage: add DAMON debugfs
interface deprecation notice"). Make the deprecation hard to overlook
by removing an example usage from the document (patch 1),
renaming the config (patch 2), adding a deprecation notice file to the
debugfs directory (patches 3-5), and renaming the debugfs file that is
essential for real use of DAMON (patches 6-9).
SeongJae Park (9):
Docs/admin-guide/mm/damon/usage: use sysfs interface for tracepoints
example
mm/damon: rename CONFIG_DAMON_DBGFS to DAMON_DBGFS_DEPRECATED
mm/damon/dbgfs: implement deprecation notice file
mm/damon/dbgfs: make debugfs interface deprecation message a macro
Docs/admin-guide/mm/damon/usage: document 'DEPRECATED' file of DAMON
debugfs interface
selftets/damon: prepare for monitor_on file renaming
mm/damon/dbgfs: rename monitor_on file to monitor_on_DEPRECATED
Docs/admin-guide/mm/damon/usage: update for monitor_on renaming
Docs/translations/damon/usage: update for monitor_on renaming
Documentation/admin-guide/mm/damon/usage.rst | 42 +++++++++++--------
.../zh_CN/admin-guide/mm/damon/usage.rst | 20 ++++-----
.../zh_TW/admin-guide/mm/damon/usage.rst | 20 ++++-----
mm/damon/Kconfig | 7 +++-
mm/damon/dbgfs.c | 27 +++++++++---
.../selftests/damon/_chk_dependency.sh | 11 ++++-
.../selftests/damon/_debugfs_common.sh | 7 ++++
.../selftests/damon/debugfs_empty_targets.sh | 12 +++++-
8 files changed, 98 insertions(+), 48 deletions(-)
base-commit: f1ab2f51e99ffb94ce127d132b24be00dc130e6c
--
2.39.2
When walking directory trees, instead of looking for specific files and
running dirname to get the parent folder, traverse all folders and
ignore the ones not containing the desired files. This avoids the need
to call dirname inside the loop, which drastically decreases run time:
Running locally on a mt8192-asurada-spherion, which reports 160 test
cases, has gone from 5.5s to 2.9s, while running remotely with an
nfsroot has gone from 13.5s to 5.5s.
This change has a side-effect, which is that the root DT node now
also shows in the output, even though it isn't expected to bind to a
driver. However there shouldn't be a matching driver for the board
compatible, so the end result will be just an extra skipped test:
ok 1 / # SKIP
Reported-by: Mark Brown <broonie(a)kernel.org>
Closes: https://lore.kernel.org/all/310391e8-fdf2-4c2f-a680-7744eb685177@sirena.org…
Fixes: 14571ab1ad21 ("kselftest: Add new test for detecting unprobed Devicetree devices")
Tested-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
Changes in v2:
- Tweaked commit message
- Added trailer tags
- Rebased on 6.8-rc1
---
tools/testing/selftests/dt/test_unprobed_devices.sh | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/dt/test_unprobed_devices.sh b/tools/testing/selftests/dt/test_unprobed_devices.sh
index b07af2a4c4de..7fae90293a9d 100755
--- a/tools/testing/selftests/dt/test_unprobed_devices.sh
+++ b/tools/testing/selftests/dt/test_unprobed_devices.sh
@@ -33,8 +33,8 @@ if [[ ! -d "${PDT}" ]]; then
fi
nodes_compatible=$(
- for node_compat in $(find ${PDT} -name compatible); do
- node=$(dirname "${node_compat}")
+ for node in $(find ${PDT} -type d); do
+ [ ! -f "${node}"/compatible ] && continue
# Check if node is available
if [[ -e "${node}"/status ]]; then
status=$(tr -d '\000' < "${node}"/status)
@@ -46,10 +46,11 @@ nodes_compatible=$(
nodes_dev_bound=$(
IFS=$'\n'
- for uevent in $(find /sys/devices -name uevent); do
- if [[ -d "$(dirname "${uevent}")"/driver ]]; then
- grep '^OF_FULLNAME=' "${uevent}" | sed -e 's|OF_FULLNAME=||'
- fi
+ for dev_dir in $(find /sys/devices -type d); do
+ [ ! -f "${dev_dir}"/uevent ] && continue
+ [ ! -d "${dev_dir}"/driver ] && continue
+
+ grep '^OF_FULLNAME=' "${dev_dir}"/uevent | sed -e 's|OF_FULLNAME=||'
done
)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240122-dt-kselftest-dirname-perf-fix-7dc421e6dfb0
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
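The traversal change in the patch above can be tried in isolation with the sketch below. It is an illustration under stated assumptions: the function names old_nodes and new_nodes and the throwaway directory tree are hypothetical, and the tree stands in for the real device tree and /sys/devices paths.

```shell
#!/bin/sh
# Build a throwaway tree: two nodes with a "compatible" file, one without.
tree="$(mktemp -d)"
mkdir -p "${tree}/soc/uart" "${tree}/soc/i2c" "${tree}/chosen"
touch "${tree}/soc/uart/compatible" "${tree}/soc/i2c/compatible"

# Old approach: find the files, then spawn dirname once per match.
old_nodes() {
	for f in $(find "$1" -name compatible); do
		dirname "${f}"
	done | sort
}

# New approach: walk all directories and skip those lacking the file,
# so no external process is started inside the loop.
new_nodes() {
	for d in $(find "$1" -type d); do
		[ ! -f "${d}/compatible" ] && continue
		echo "${d}"
	done | sort
}

old_nodes "${tree}"
new_nodes "${tree}"
rm -rf "${tree}"
```

Both functions list the same directories; the second only replaces a process spawn per match with a shell built-in test per directory, which is where the reported speedup comes from.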
The usage of run_vmtests.sh does not include hugetlb, which is a valid
test category.
Add 'hugetlb' to the usage of run_vmtests.sh.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 55898d64e2eb..2ee0a1c4740f 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -65,6 +65,8 @@ separated by spaces:
test copy-on-write semantics
- thp
test transparent huge pages
+- hugetlb
+ test hugetlbfs huge pages
- migration
invoke move_pages(2) to exercise the migration entry code
paths in the kernel
--
2.39.3
From: Christoph Müllner <christoph.muellner(a)vrull.eu>
[ Upstream commit 12c16919652b5873f524c8b361336ecfa5ce5e6b ]
When building the mm tests with a riscv32 compiler, we see a range
of shift-count-overflow errors from shifting 1UL by more than 32 bits
in do_mmaps(). Since the relevant code is only called from code that
is gated by `__riscv_xlen == 64`, we can just apply the same gating
to do_mmaps().
Signed-off-by: Christoph Müllner <christoph.muellner(a)vrull.eu>
Reviewed-by: Alexandre Ghiti <alexghiti(a)rivosinc.com>
Reviewed-by: Andrew Jones <ajones(a)ventanamicro.com>
Link: https://lore.kernel.org/r/20231123185821.2272504-6-christoph.muellner@vrull…
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/riscv/mm/mmap_test.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/riscv/mm/mmap_test.h b/tools/testing/selftests/riscv/mm/mmap_test.h
index 9b8434f62f57..2e0db9c5be6c 100644
--- a/tools/testing/selftests/riscv/mm/mmap_test.h
+++ b/tools/testing/selftests/riscv/mm/mmap_test.h
@@ -18,6 +18,8 @@ struct addresses {
int *on_56_addr;
};
+// Only works on 64 bit
+#if __riscv_xlen == 64
static inline void do_mmaps(struct addresses *mmap_addresses)
{
/*
@@ -50,6 +52,7 @@ static inline void do_mmaps(struct addresses *mmap_addresses)
mmap_addresses->on_56_addr =
mmap(on_56_bits, 5 * sizeof(int), prot, flags, 0, 0);
}
+#endif /* __riscv_xlen == 64 */
static inline int memory_layout(void)
{
--
2.43.0
This patchset adds KVM selftests for LoongArch. Currently only some
common test cases are supported and pass. These test cases are listed
below:
demand_paging_test
dirty_log_perf_test
dirty_log_test
guest_print_test
hardware_disable_test
kvm_binary_stats_test
kvm_create_max_vcpus
kvm_page_table_test
memslot_modification_stress_test
memslot_perf_test
set_memory_region_test
This patchset was originally posted by zhaotianrui; I am continuing
his work.
---
Changes in v6:
1. Refresh the patch based on the latest kernel 6.8-rc1, and add LoongArch
support for the set_memory_region_test test case.
2. Add hardware_disable_test test case.
3. Drop the modification of macro DEFAULT_GUEST_TEST_MEM; it is a problem
in LoongArch binutils, and the issue has been raised with the LoongArch
binutils owners.
Changes in v5:
1. In the LoongArch KVM selftests, DEFAULT_GUEST_TEST_MEM could be
0x130000000, which differs from the default value in memstress.h.
So move the definition of DEFAULT_GUEST_TEST_MEM into the LoongArch
ucall.h, and add an 'ifndef' condition for DEFAULT_GUEST_TEST_MEM
in memstress.h.
Changes in v4:
1. Remove the based-on flag, as the LoongArch KVM patch series
has been accepted into the Linux kernel, so this series can be
applied directly.
Changes in v3:
1. Improve implementation of LoongArch VM page walk.
2. Add exception handler for LoongArch.
3. Add dirty_log_test, dirty_log_perf_test, guest_print_test
test cases for LoongArch.
4. Add __ASSEMBLER__ macro to distinguish asm file and c file.
5. Move ucall_arch_do_ucall to the header file and make it as
static inline to avoid function calls.
6. Change the DEFAULT_GUEST_TEST_MEM base addr for LoongArch.
Changes in v2:
1. Use ".balign 4096" instead of "align 12" to align the assembly code
to 4K in exception.S.
2. LoongArch only supports 3- or 4-level page tables, so remove the
handlers for 2-level page tables.
3. Remove the DEFAULT_LOONGARCH_GUEST_STACK_VADDR_MIN and use the common
DEFAULT_GUEST_STACK_VADDR_MIN to allocate stack memory in guest.
4. Reorganize the test cases supported by LoongArch.
5. Fix some code comments.
6. Add kvm_binary_stats_test test case into LoongArch KVM selftests.
---
Tianrui Zhao (4):
KVM: selftests: Add KVM selftests header files for LoongArch
KVM: selftests: Add core KVM selftests support for LoongArch
KVM: selftests: Add ucall test support for LoongArch
KVM: selftests: Add test cases for LoongArch
tools/testing/selftests/kvm/Makefile | 16 +
.../selftests/kvm/include/kvm_util_base.h | 5 +
.../kvm/include/loongarch/processor.h | 133 +++++++
.../selftests/kvm/include/loongarch/ucall.h | 20 ++
.../selftests/kvm/lib/loongarch/exception.S | 59 ++++
.../selftests/kvm/lib/loongarch/processor.c | 332 ++++++++++++++++++
.../selftests/kvm/lib/loongarch/ucall.c | 38 ++
.../selftests/kvm/set_memory_region_test.c | 2 +-
8 files changed, 604 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kvm/include/loongarch/processor.h
create mode 100644 tools/testing/selftests/kvm/include/loongarch/ucall.h
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/exception.S
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/processor.c
create mode 100644 tools/testing/selftests/kvm/lib/loongarch/ucall.c
base-commit: 7ed2632ec7d72e926b9e8bcc9ad1bb0cd37274bf
--
2.33.0
From: Jo Van Bulck <jo.vanbulck(a)cs.kuleuven.be>
[ Upstream commit 9fd552ee32c6c1e27c125016b87d295bea6faea7 ]
DEFINED only considers symbols, not section names. Hence, replace the
check for .got.plt with the _GLOBAL_OFFSET_TABLE_ symbol and remove other
(non-essential) asserts.
Signed-off-by: Jo Van Bulck <jo.vanbulck(a)cs.kuleuven.be>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko(a)kernel.org>
Link: https://lore.kernel.org/all/20231005153854.25566-10-jo.vanbulck%40cs.kuleuv…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/sgx/test_encl.lds | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/tools/testing/selftests/sgx/test_encl.lds b/tools/testing/selftests/sgx/test_encl.lds
index a1ec64f7d91f..108bc11d1d8c 100644
--- a/tools/testing/selftests/sgx/test_encl.lds
+++ b/tools/testing/selftests/sgx/test_encl.lds
@@ -34,8 +34,4 @@ SECTIONS
}
}
-ASSERT(!DEFINED(.altinstructions), "ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.altinstr_replacement), "ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.discard.retpoline_safe), "RETPOLINE ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.discard.nospec), "RETPOLINE ALTERNATIVES are not supported in enclaves")
-ASSERT(!DEFINED(.got.plt), "Libcalls are not supported in enclaves")
+ASSERT(!DEFINED(_GLOBAL_OFFSET_TABLE_), "Libcalls through GOT are not supported in enclaves")
--
2.43.0
The gro.sh test case relies on gro_flush_timeout to ensure
that all the segments belonging to any given batch are properly
aggregated.
On the other end, the sender is a user-space program transmitting
each packet with a separate write syscall. A busy host and/or
stracing the sender program can make the relevant segments reach
the GRO engine after the flush timeout triggers.
Give the GRO flush timeout more slack, to avoid sporadic self-test
failures.
Fixes: 9af771d2ec04 ("selftests/net: allow GRO coalesce test on veth")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/setup_veth.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/setup_veth.sh b/tools/testing/selftests/net/setup_veth.sh
index a9a1759e035c..1f78a87f6f37 100644
--- a/tools/testing/selftests/net/setup_veth.sh
+++ b/tools/testing/selftests/net/setup_veth.sh
@@ -11,7 +11,7 @@ setup_veth_ns() {
local -r ns_mac="$4"
[[ -e /var/run/netns/"${ns_name}" ]] || ip netns add "${ns_name}"
- echo 100000 > "/sys/class/net/${ns_dev}/gro_flush_timeout"
+ echo 1000000 > "/sys/class/net/${ns_dev}/gro_flush_timeout"
ip link set dev "${ns_dev}" netns "${ns_name}" mtu 65535
ip -netns "${ns_name}" link set dev "${ns_dev}" up
--
2.43.0
The udpgro_frglist self-test uses the BPF classifier, but the
current net self-test configuration does not include it, causing
CI failures:
# selftests: net: udpgro_frglist.sh
# ipv6
# tcp - over veth touching data
# -l 4 -6 -D 2001:db8::1 -t rx -4 -t
# Error: TC classifier not found.
# We have an error talking to the kernel
# Error: TC classifier not found.
# We have an error talking to the kernel
Add the missing knob.
Fixes: edae34a3ed92 ("selftests net: add UDP GRO fraglist + bpf self-tests")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/config | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 8da562a9ae87..ca4423ee6dc9 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -42,6 +42,7 @@ CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_FLOWER=m
+CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_TUNNEL_KEY=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_BAREUDP=m
--
2.43.0
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") fixed a wild-memory-access bug that could have
happened during the loading phase of test suites built and executed as
loadable modules. However, it also introduced a problematic side effect
that causes test suite modules to crash when they attempt to register
fake devices.
When a module is loaded, it traverses the MODULE_STATE_UNFORMED and
MODULE_STATE_COMING states before reaching the normal operating state
MODULE_STATE_LIVE. Finally, when the module is removed, it moves to
MODULE_STATE_GOING before being released. However, if the loading
function load_module() fails between complete_formation() and
do_init_module(), the module goes directly from MODULE_STATE_COMING to
MODULE_STATE_GOING without passing through MODULE_STATE_LIVE.
This behavior was causing kunit_module_exit() to be called without
having first executed kunit_module_init(). Since kunit_module_exit() is
responsible for freeing the memory allocated by kunit_module_init()
through kunit_filter_suites(), this behavior was resulting in a
wild-memory-access bug.
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") fixed this issue by running the tests when the
module is still in MODULE_STATE_COMING. However, modules in that state
are not fully initialized, lacking sysfs kobjects. Therefore, if a test
module attempts to register a fake device, it will inevitably crash.
This patch proposes a different approach to fix the original
wild-memory-access bug while restoring the normal module execution flow
by making kunit_module_exit() able to detect if kunit_module_init() has
previously initialized the test suite set. In this way, test modules
can once again register fake devices without crashing.
This behavior is achieved by checking whether mod->kunit_suites is a
virtual or direct mapping address. If it is a virtual address, then
kunit_module_init() has allocated the suite_set in kunit_filter_suites()
using kmalloc_array(). On the contrary, if mod->kunit_suites is still
pointing to the original address that was set when looking up the
.kunit_test_suites section of the module, then the loading phase has
failed and there's no memory to be freed.
v4:
- rebased on 6.8
- noted that kunit_filter_suites() must return a virtual address
v3:
- add a comment to clarify why the start address is checked
v2:
- add include <linux/mm.h>
Fixes: 2810c1e99867 ("kunit: Fix wild-memory-access bug in kunit_free_suite_set()")
Reviewed-by: David Gow <davidgow(a)google.com>
Tested-by: Rae Moar <rmoar(a)google.com>
Tested-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reviewed-by: Javier Martinez Canillas <javierm(a)redhat.com>
Signed-off-by: Marco Pagani <marpagan(a)redhat.com>
---
lib/kunit/executor.c | 4 ++++
lib/kunit/test.c | 14 +++++++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 717b9599036b..689fff2b2b10 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -146,6 +146,10 @@ void kunit_free_suite_set(struct kunit_suite_set suite_set)
kfree(suite_set.start);
}
+/*
+ * Filter and reallocate test suites. Must return the filtered test suites set
+ * allocated at a valid virtual address or NULL in case of error.
+ */
struct kunit_suite_set
kunit_filter_suites(const struct kunit_suite_set *suite_set,
const char *filter_glob,
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index f95d2093a0aa..31a5a992e646 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -17,6 +17,7 @@
#include <linux/panic.h>
#include <linux/sched/debug.h>
#include <linux/sched.h>
+#include <linux/mm.h>
#include "debugfs.h"
#include "device-impl.h"
@@ -801,12 +802,19 @@ static void kunit_module_exit(struct module *mod)
};
const char *action = kunit_action();
+ /*
+ * Check if the start address is a valid virtual address to detect
+ * if the module load sequence has failed and the suite set has not
+ * been initialized and filtered.
+ */
+ if (!suite_set.start || !virt_addr_valid(suite_set.start))
+ return;
+
if (!action)
__kunit_test_suites_exit(mod->kunit_suites,
mod->num_kunit_suites);
- if (suite_set.start)
- kunit_free_suite_set(suite_set);
+ kunit_free_suite_set(suite_set);
}
static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
@@ -816,12 +824,12 @@ static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
switch (val) {
case MODULE_STATE_LIVE:
+ kunit_module_init(mod);
break;
case MODULE_STATE_GOING:
kunit_module_exit(mod);
break;
case MODULE_STATE_COMING:
- kunit_module_init(mod);
break;
case MODULE_STATE_UNFORMED:
break;
base-commit: 539e582a375dedee95a4fa9ca3f37cdb25c441ec
--
2.43.0
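To illustrate the approach of the kunit patch above, here is a minimal
userspace sketch in C. It is not kernel code. struct fake_module and the
fake_* helpers are hypothetical, and the kernel's virt_addr_valid() check
is modeled by comparing the suites pointer against the original section
pointer, which is only an assumption made for this model. It shows that a
load which fails before MODULE_STATE_LIVE reaches the exit path with
nothing to free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Same names and order as the kernel's enum module_state. */
enum module_state {
	MODULE_STATE_LIVE,
	MODULE_STATE_COMING,
	MODULE_STATE_GOING,
	MODULE_STATE_UNFORMED,
};

/* Hypothetical stand-in for struct module; only what the model needs. */
struct fake_module {
	int *section_suites;	/* models the .kunit_test_suites section */
	int *kunit_suites;	/* models mod->kunit_suites */
	int frees;		/* how often the exit path freed memory */
};

/* Models kunit_module_init(): swap the section pointer for a heap copy. */
static void fake_kunit_module_init(struct fake_module *mod)
{
	int *copy = malloc(sizeof(*copy));

	if (copy) {
		*copy = *mod->section_suites;
		mod->kunit_suites = copy;
	}
}

/* Models the patched kunit_module_exit(): free only what init allocated. */
static void fake_kunit_module_exit(struct fake_module *mod)
{
	/* Assumption: this comparison stands in for virt_addr_valid(). */
	if (!mod->kunit_suites || mod->kunit_suites == mod->section_suites)
		return;

	free(mod->kunit_suites);
	mod->kunit_suites = NULL;
	mod->frees++;
}

/* Models the patched notifier: init at LIVE, exit at GOING. */
static void fake_kunit_module_notify(struct fake_module *mod,
				     enum module_state val)
{
	assert(mod);
	switch (val) {
	case MODULE_STATE_LIVE:
		fake_kunit_module_init(mod);
		break;
	case MODULE_STATE_GOING:
		fake_kunit_module_exit(mod);
		break;
	case MODULE_STATE_COMING:
	case MODULE_STATE_UNFORMED:
		break;
	}
}

/* Walk one load/unload cycle and return the number of frees performed. */
static int simulate_module_cycle(bool fail_before_live)
{
	static int section[1] = { 42 };
	struct fake_module mod = { section, section, 0 };

	fake_kunit_module_notify(&mod, MODULE_STATE_COMING);
	if (!fail_before_live)
		fake_kunit_module_notify(&mod, MODULE_STATE_LIVE);
	fake_kunit_module_notify(&mod, MODULE_STATE_GOING);
	return mod.frees;
}
```

Calling simulate_module_cycle(false) models a normal load and unload,
while simulate_module_cycle(true) models load_module() failing between
complete_formation() and do_init_module().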
This series aims to keep the git status clean after building the
selftests by adding some missing .gitignore files and object inclusion
in existing .gitignore files. This is one of the requirements listed in
the selftests documentation for new tests, but it is not always followed
as desired.
After adding these .gitignore files and including the generated objects,
the working tree appears clean again.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Javier Carrasco (4):
selftests: netfilter: add sctp_collision to gitignore
selftests: uevent: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: thermal: intel: workload_hint: add missing gitignore
tools/testing/selftests/netfilter/.gitignore | 1 +
tools/testing/selftests/thermal/intel/power_floor/.gitignore | 1 +
tools/testing/selftests/thermal/intel/workload_hint/.gitignore | 1 +
tools/testing/selftests/uevent/.gitignore | 1 +
4 files changed, 4 insertions(+)
---
base-commit: 610a9b8f49fbcf1100716370d3b5f6f884a2835a
change-id: 20240101-selftest_gitignore-7da2c503766e
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
For 5-level page tables to be in use, the CPU must support the feature
(the la57 flag) in addition to the CONFIG option being set. Check for the
flag to avoid false test failures on systems whose CPU does not have it
set.
Signed-off-by: Audra Mitchell <audra(a)redhat.com>
---
tools/testing/selftests/mm/va_high_addr_switch.sh | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/va_high_addr_switch.sh b/tools/testing/selftests/mm/va_high_addr_switch.sh
index 45cae7cab27e..a0a75f302904 100755
--- a/tools/testing/selftests/mm/va_high_addr_switch.sh
+++ b/tools/testing/selftests/mm/va_high_addr_switch.sh
@@ -29,9 +29,15 @@ check_supported_x86_64()
# See man 1 gzip under '-f'.
local pg_table_levels=$(gzip -dcfq "${config}" | grep PGTABLE_LEVELS | cut -d'=' -f 2)
+ local cpu_supports_pl5=$(awk '/^flags/ {if (/la57/) {print 0;}
+ else {print 1}; exit}' /proc/cpuinfo 2>/dev/null)
+
if [[ "${pg_table_levels}" -lt 5 ]]; then
echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to run this test"
exit $ksft_skip
+ elif [[ "${cpu_supports_pl5}" -ne 0 ]]; then
+ echo "$0: CPU does not have the necessary la57 flag to support page table level 5"
+ exit $ksft_skip
fi
}
--
2.43.0
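The skip logic of the va_high_addr_switch.sh patch above can be sketched
in C as follows. This is only an illustration under assumptions: the
function names are hypothetical, the cpuinfo text is passed in as a
string instead of being read from /proc/cpuinfo, and unlike the awk
one-liner (which matches la57 anywhere on the line) it matches the flag
as a whole word.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the first line starting with "flags" lists `flag` as a
 * whole word. Only the first such line is examined, like the awk `exit`.
 */
static bool cpuinfo_has_flag(const char *cpuinfo, const char *flag)
{
	const char *line = cpuinfo;
	size_t flen;

	assert(cpuinfo && flag);
	flen = strlen(flag);
	if (flen == 0)
		return false;

	while (line && *line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len >= 5 && strncmp(line, "flags", 5) == 0) {
			const char *end = line + len;
			const char *p = memchr(line, ':', len);

			if (!p)
				return false;
			p++;
			while (p < end) {
				const char *tok;

				/* Skip separators, then measure one token. */
				while (p < end && (*p == ' ' || *p == '\t'))
					p++;
				tok = p;
				while (p < end && *p != ' ' && *p != '\t')
					p++;
				if ((size_t)(p - tok) == flen &&
				    strncmp(tok, flag, flen) == 0)
					return true;
			}
			return false;
		}
		line = eol ? eol + 1 : NULL;
	}
	return false;
}

/* Skip unless 5 levels are configured and the CPU reports la57. */
static bool should_skip(int pgtable_levels, bool has_la57)
{
	return pgtable_levels < 5 || !has_la57;
}
```

With this split, a kernel built with PGTABLE_LEVELS=5 running on a CPU
without la57 is reported as a skip rather than a failure.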
The busywait timeout value is in milliseconds, not seconds, so the
current setting of 2 is too small. On slow/busy hosts (or VMs) the
current timeout can expire even on "correct" execution, causing random
failures. Let's copy the WAIT_TIMEOUT from forwarding/lib.sh and set
BUSYWAIT_TIMEOUT here.
Fixes: 25ae948b4478 ("selftests/net: add lib.sh")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v2: add fixes flag. update possible failures.
---
tools/testing/selftests/net/lib.sh | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index dca549443801..f9fe182dfbd4 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -4,6 +4,9 @@
##############################################################################
# Defines
+WAIT_TIMEOUT=${WAIT_TIMEOUT:=20}
+BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
+
# Kselftest framework requirement - SKIP code is 4.
ksft_skip=4
# namespace list created by setup_ns
@@ -48,7 +51,7 @@ cleanup_ns()
for ns in "$@"; do
ip netns delete "${ns}" &> /dev/null
- if ! busywait 2 ip netns list \| grep -vq "^$ns$" &> /dev/null; then
+ if ! busywait $BUSYWAIT_TIMEOUT ip netns list \| grep -vq "^$ns$" &> /dev/null; then
echo "Warn: Failed to remove namespace $ns"
ret=1
fi
--
2.43.0
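The unit mix-up fixed by the lib.sh patch above can be illustrated with
a small C model of a millisecond-based busywait. This is a sketch, not
the shell helper itself: busywait_ms(), seconds_to_ms() and
counter_expired() are hypothetical names, and the clock is read with
C11 timespec_get() as an assumption for portability.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* WAIT_TIMEOUT is in seconds; the busywait helper expects milliseconds. */
static long long seconds_to_ms(long long seconds)
{
	return seconds * 1000;
}

/* Current wall-clock time in milliseconds. */
static long long now_ms(void)
{
	struct timespec ts;

	/* A monotonic clock would be preferable in real code. */
	timespec_get(&ts, TIME_UTC);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll cond(arg) until it returns true or more than timeout_ms
 * milliseconds have elapsed. Returns true on success, false on timeout.
 */
static bool busywait_ms(long long timeout_ms, bool (*cond)(void *),
			void *arg)
{
	long long start = now_ms();

	assert(cond);
	for (;;) {
		if (cond(arg))
			return true;
		if (now_ms() - start > timeout_ms)
			return false;
	}
}

/* Example condition: true once the counter behind arg has reached zero. */
static bool counter_expired(void *arg)
{
	int *n = arg;

	if (*n > 0)
		(*n)--;
	return *n == 0;
}
```

Passing 2 to such a helper gives the condition only 2 ms to become
true, whereas seconds_to_ms(20) gives it 20000 ms, which corresponds to
the BUSYWAIT_TIMEOUT computed in the patch.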
Patches 1 and 3 are fixes for tdc that were discovered when running it
using defconfig + tc-testing config and against the latest iproute2.
Patch 2 improves the taprio tests.
Patch 4 enables all tdc tests.
Patch 5 fixes the return code of tdc for when a test fails
setup/teardown.
v1->v2: Suggestions by Davide
Pedro Tammela (5):
selftests: tc-testing: add missing netfilter config
selftests: tc-testing: check if 'jq' is available in taprio tests
selftests: tc-testing: adjust fq test to latest iproute2
selftests: tc-testing: enable all tdc tests
selftests: tc-testing: return fail if a test fails in setup/teardown
tools/testing/selftests/tc-testing/config | 1 +
tools/testing/selftests/tc-testing/tc-tests/qdiscs/fq.json | 2 +-
tools/testing/selftests/tc-testing/tc-tests/qdiscs/taprio.json | 2 ++
tools/testing/selftests/tc-testing/tdc.py | 2 +-
tools/testing/selftests/tc-testing/tdc.sh | 3 +--
5 files changed, 6 insertions(+), 4 deletions(-)
--
2.40.1
The default timeout for tests is 45sec, but bench-lookups_ipv6
seems to take around 50sec when running in a VM without
HW acceleration. Give it a 2x margin and set the timeout
to 120sec.
Fixes: d1066c9c58d4 ("selftests/net: Add test/benchmark for removing MKTs")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Long story short I looked at the output for bench-lookups_ipv6
and it seems to be a trivial timeout problem. With this we're at
22/24 passing for TCP AO, the reset case failures aren't as obvious...
CC: shuah(a)kernel.org
CC: 0x7f454c46(a)gmail.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/tcp_ao/settings | 1 +
1 file changed, 1 insertion(+)
create mode 100644 tools/testing/selftests/net/tcp_ao/settings
diff --git a/tools/testing/selftests/net/tcp_ao/settings b/tools/testing/selftests/net/tcp_ao/settings
new file mode 100644
index 000000000000..6091b45d226b
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_ao/settings
@@ -0,0 +1 @@
+timeout=120
--
2.43.0
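As a rough illustration of how a per-directory settings file like the
one added above overrides the 45sec default, here is a small C sketch.
It is not the kselftest runner, which is a shell script; the function
name settings_timeout() and the parsing rules are assumptions made for
this example only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Default per-test timeout in seconds, as stated in the commit message. */
#define DEFAULT_TIMEOUT_SEC 45

/*
 * Return the value of the first "timeout=" line found in the text of a
 * settings file, or the default when no such line exists.
 */
static long settings_timeout(const char *text)
{
	const char *line = text;

	assert(text);
	while (*line) {
		if (strncmp(line, "timeout=", 8) == 0)
			return strtol(line + 8, NULL, 10);
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;	/* step past the newline to the next line */
	}
	return DEFAULT_TIMEOUT_SEC;
}
```

For the file added by this patch, settings_timeout("timeout=120\n")
yields 120, which leaves a bit more than the 2x margin over the roughly
50sec observed for bench-lookups_ipv6.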
This series addresses self-test failures for UDP GRO-related tests.
The first patch addresses the main problem I observe locally: the XDP
program required by such tests, xdp_dummy, is currently built in the
ebpf self-tests directory, and is not available if/when the user targets
net only. Arguably it is more a refactor than a fix, but still targeting
net to hopefully
The second patch fixes the integration of such tests with the build
system.
Patch 3/3 fixes sporadic failures due to races.
Tested with:
make -C tools/testing/selftests/ TARGETS=net install
./tools/testing/selftests/kselftest_install/run_kselftest.sh \
-t "net:udpgro_bench.sh net:udpgro.sh net:udpgro_fwd.sh \
net:udpgro_frglist.sh net:veth.sh"
no failures.
Paolo Abeni (3):
selftests: net: remove dependency on ebpf tests
selftests: net: included needed helper in the install targets
selftests: net: explicitly wait for listener ready
tools/testing/selftests/net/Makefile | 6 ++++--
tools/testing/selftests/net/udpgro.sh | 4 ++--
tools/testing/selftests/net/udpgro_bench.sh | 4 ++--
tools/testing/selftests/net/udpgro_frglist.sh | 6 +++---
tools/testing/selftests/net/udpgro_fwd.sh | 8 +++++---
tools/testing/selftests/net/veth.sh | 4 ++--
tools/testing/selftests/net/xdp_dummy.c | 13 +++++++++++++
7 files changed, 31 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/net/xdp_dummy.c
--
2.43.0
It is still a bit unclear whether each directory should have its own
config file, but assuming they should, let's add one for tcp_ao.
The following tests still fail with this config in place:
- rst_ipv4,
- rst_ipv6,
- bench-lookups_ipv6.
The other 21 pass.
Fixes: d11301f65977 ("selftests/net: Add TCP-AO ICMPs accept test")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: 0x7f454c46(a)gmail.com
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/tcp_ao/config | 10 ++++++++++
1 file changed, 10 insertions(+)
create mode 100644 tools/testing/selftests/net/tcp_ao/config
diff --git a/tools/testing/selftests/net/tcp_ao/config b/tools/testing/selftests/net/tcp_ao/config
new file mode 100644
index 000000000000..d3277a9de987
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_ao/config
@@ -0,0 +1,10 @@
+CONFIG_CRYPTO_HMAC=y
+CONFIG_CRYPTO_RMD160=y
+CONFIG_CRYPTO_SHA1=y
+CONFIG_IPV6_MULTIPLE_TABLES=y
+CONFIG_IPV6=y
+CONFIG_NET_L3_MASTER_DEV=y
+CONFIG_NET_VRF=y
+CONFIG_TCP_AO=y
+CONFIG_TCP_MD5SIG=y
+CONFIG_VETH=m
--
2.43.0
Currently the seccomp benchmark selftest produces non-standard output,
meaning that while it makes a number of checks of the performance it
observes, the output has to be parsed by humans. This means that automated
systems running this suite of tests are almost certainly ignoring the
results, which isn't ideal for spotting problems. Let's rework things so
that each check that the program does is reported as a test result to
the framework.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Silence checkpatch noise.
- Link to v3: https://lore.kernel.org/r/20240122-b4-kselftest-seccomp-benchmark-ktap-v3-0…
Changes in v3:
- Re-add signoff.
- Link to v2: https://lore.kernel.org/r/20240122-b4-kselftest-seccomp-benchmark-ktap-v2-0…
Changes in v2:
- Rebase onto v6.8-rc1.
- Link to v1: https://lore.kernel.org/r/20231219-b4-kselftest-seccomp-benchmark-ktap-v1-0…
---
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
.../testing/selftests/seccomp/seccomp_benchmark.c | 104 +++++++++++++--------
1 file changed, 64 insertions(+), 40 deletions(-)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20231219-b4-kselftest-seccomp-benchmark-ktap-357603823708
Best regards,
--
Mark Brown <broonie(a)kernel.org>
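The idea of reporting each expectation as its own test result can be
sketched in plain C as below. This does not use the kselftest.h helpers
and is not the code from the series; struct ktap_counts, ktap_plan()
and ktap_result() are hypothetical names, and only the general
"ok N description" / "not ok N description" line shape is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Running totals, so a harness can act on results without parsing text. */
struct ktap_counts {
	unsigned int pass;
	unsigned int fail;
};

/* Announce how many results will follow. */
static void ktap_plan(unsigned int n)
{
	printf("1..%u\n", n);
}

/* Report one expectation as one numbered test result. */
static void ktap_result(struct ktap_counts *c, bool ok, const char *desc)
{
	unsigned int num;

	assert(c && desc);
	if (ok)
		c->pass++;
	else
		c->fail++;
	num = c->pass + c->fail;
	printf("%s %u %s\n", ok ? "ok" : "not ok", num, desc);
}
```

A benchmark would call ktap_plan() once with the number of checks and
then ktap_result() for each comparison it makes, instead of printing
free-form text that a human has to read.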
After commit 25ae948b4478 ("selftests/net: add lib.sh") but before commit
2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh"), some net selftests encountered errors when they were being
exported and run. This was because the new net/lib.sh was not exported
along with the tests. The errors were crudely avoided by duplicating some
content between net/lib.sh and net/forwarding/lib.sh in 2114e83381d3.
In order to restore the sourcing of net/lib.sh from net/forwarding/lib.sh
and remove the duplicated content, this series introduces a new selftests
Makefile variable to list extra files to export from other directories and
makes use of it to avoid reintroducing the errors mentioned above.
v1:
* "selftests: Introduce Makefile variable to list shared bash scripts"
Changed TEST_INCLUDES to take relative paths, like other TEST_* variables.
Paths are adjusted accordingly in the subsequent patches. (Vladimir Oltean)
* selftests: bonding: Change script interpreter
selftests: forwarding: Remove executable bits from lib.sh
Removed from this series, submitted separately.
Since commit 2114e83381d3 ("selftests: forwarding: Avoid failures to source
net/lib.sh") resolved the test errors, this version of the series is
focused on removing the duplication that was added in that commit. Directly
rebasing the series would reintroduce the problems that 2114e83381d3
avoided before fixing them again. In order to prevent such breakage partway
through the series, patches are reordered and content changed slightly but
there is no diff at the end compared with the simple rebasing approach. I
have dropped most review tags on account of this reordering.
RFC:
https://lore.kernel.org/netdev/20231222135836.992841-1-bpoirier@nvidia.com/
Link: https://lore.kernel.org/netdev/ZXu7dGj7F9Ng8iIX@Laptop-X1/
Benjamin Poirier (5):
selftests: Introduce Makefile variable to list shared bash scripts
selftests: bonding: Add net/forwarding/lib.sh to TEST_INCLUDES
selftests: team: Add shared library scripts to TEST_INCLUDES
selftests: dsa: Replace test symlinks by wrapper script
selftests: forwarding: Redefine relative_path variable
Petr Machata (1):
selftests: forwarding: Remove duplicated lib.sh content
Documentation/dev-tools/kselftest.rst | 10 +++++
tools/testing/selftests/Makefile | 7 +++-
.../selftests/drivers/net/bonding/Makefile | 7 +++-
.../net/bonding/bond-eth-type-change.sh | 2 +-
.../drivers/net/bonding/bond_topo_2d1c.sh | 2 +-
.../drivers/net/bonding/dev_addr_lists.sh | 2 +-
.../net/bonding/mode-1-recovery-updelay.sh | 2 +-
.../net/bonding/mode-2-recovery-updelay.sh | 2 +-
.../drivers/net/bonding/net_forwarding_lib.sh | 1 -
.../selftests/drivers/net/dsa/Makefile | 18 ++++++++-
.../drivers/net/dsa/bridge_locked_port.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mdb.sh | 2 +-
.../selftests/drivers/net/dsa/bridge_mld.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_aware.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_mcast.sh | 2 +-
.../drivers/net/dsa/bridge_vlan_unaware.sh | 2 +-
.../testing/selftests/drivers/net/dsa/lib.sh | 1 -
.../drivers/net/dsa/local_termination.sh | 2 +-
.../drivers/net/dsa/no_forwarding.sh | 2 +-
.../net/dsa/run_net_forwarding_test.sh | 9 +++++
.../selftests/drivers/net/dsa/tc_actions.sh | 2 +-
.../selftests/drivers/net/dsa/tc_common.sh | 1 -
.../drivers/net/dsa/test_bridge_fdb_stress.sh | 2 +-
.../selftests/drivers/net/team/Makefile | 7 ++--
.../drivers/net/team/dev_addr_lists.sh | 4 +-
.../selftests/drivers/net/team/lag_lib.sh | 1 -
.../drivers/net/team/net_forwarding_lib.sh | 1 -
tools/testing/selftests/lib.mk | 19 ++++++++++
.../testing/selftests/net/forwarding/Makefile | 3 ++
tools/testing/selftests/net/forwarding/lib.sh | 37 +++----------------
.../net/forwarding/mirror_gre_lib.sh | 2 +-
.../net/forwarding/mirror_gre_topo_lib.sh | 2 +-
32 files changed, 96 insertions(+), 64 deletions(-)
delete mode 120000 tools/testing/selftests/drivers/net/bonding/net_forwarding_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/lib.sh
create mode 100755 tools/testing/selftests/drivers/net/dsa/run_net_forwarding_test.sh
delete mode 120000 tools/testing/selftests/drivers/net/dsa/tc_common.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/lag_lib.sh
delete mode 120000 tools/testing/selftests/drivers/net/team/net_forwarding_lib.sh
--
2.43.0
Hi,
These two patches fix an issue when the user running net_test is not
root. The second patch simplifies test error logs.
Regards,
Mickaël Salaün (2):
selftests/landlock: Fix capability for net_test
selftests/landlock: Clean up error logs related to capabilities
tools/testing/selftests/landlock/common.h | 88 ++++++++++++---------
tools/testing/selftests/landlock/net_test.c | 5 +-
2 files changed, 55 insertions(+), 38 deletions(-)
--
2.43.0
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with Commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that would validate if this feature
works correctly.
The selftest needs to verify if writing non-contiguous CBMs to the
schemata file behaves as expected in comparison to the information about
non-contiguous CBMs support.
The patch series is based on a rework of resctrl selftests that's currently in
review [1]. The patch also implements functionality similar to that of
the bash script included in the cover letter of the original
non-contiguous CBMs in Intel CAT series [2].
Changelog v2:
- Rebase onto v3 of [1] series.
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (3):
selftests/resctrl: Add helpers for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 80 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 4 +-
tools/testing/selftests/resctrl/mba_test.c | 5 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 12 ++-
.../testing/selftests/resctrl/resctrl_tests.c | 18 ++--
tools/testing/selftests/resctrl/resctrlfs.c | 86 ++++++++++++++++---
7 files changed, 185 insertions(+), 26 deletions(-)
--
2.43.0
On Tue, 2024-01-23 at 08:45 -0800, Jakub Kicinski wrote:
> On Mon, 22 Jan 2024 15:34:34 +0000 (UTC) Kalle Valo wrote:
> > Lukas Bulwahn (1):
> > wifi: cfg80211/mac80211: remove dependency on non-existing option
>
> BTW we run all kernel's kunit tests on netdev periodically by doing:
>
> ./tools/testing/kunit/kunit.py run --all
>
> and AFAICT the WiFi tests don't pop up there :(
>
> https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-01-23--15-0…
>
> Is that on purpose?
No, but honestly, I didn't even really know about it, which is mostly
for lack of bothering to look - because we run them in a different way
now both in upstream hostap and internally (internally we also have
another file with metadata to tie them to other bits of the whole
project that isn't interesting upstream).
Looks like that needs adjustments to the config file there, mostly? I
can see about adding that, probably not that hard, at least for
mac80211/cfg80211.
++kunit folks:
We're also adding unit tests to iwlwifi (slowly), any idea if we should
enable that here also? It _is_ now possible to build PCI stuff on kunit,
but it requires some additional config options (virt-pci etc.), not sure
that's desirable here? It doesn't need it at runtime for the tests, of
course.
johannes
This patch abstracts the envcfg CSR in the kernel (as is done for other
homonymous CSRs). CSR_ENVCFG is used as an alias for CSR_SENVCFG or
CSR_MENVCFG depending on how the kernel is compiled.
Additionally it changes CBZE enabling to start using CSR_ENVCFG instead of
CSR_SENVCFG.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/csr.h | 2 ++
arch/riscv/kernel/cpufeature.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 306a19a5509c..b3400517b0a9 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -415,6 +415,7 @@
# define CSR_STATUS CSR_MSTATUS
# define CSR_IE CSR_MIE
# define CSR_TVEC CSR_MTVEC
+# define CSR_ENVCFG CSR_MENVCFG
# define CSR_SCRATCH CSR_MSCRATCH
# define CSR_EPC CSR_MEPC
# define CSR_CAUSE CSR_MCAUSE
@@ -439,6 +440,7 @@
# define CSR_STATUS CSR_SSTATUS
# define CSR_IE CSR_SIE
# define CSR_TVEC CSR_STVEC
+# define CSR_ENVCFG CSR_SENVCFG
# define CSR_SCRATCH CSR_SSCRATCH
# define CSR_EPC CSR_SEPC
# define CSR_CAUSE CSR_SCAUSE
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index b3785ffc1570..98623393fd1f 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -725,7 +725,7 @@ arch_initcall(check_unaligned_access_all_cpus);
void riscv_user_isa_enable(void)
{
if (riscv_cpu_has_extension_unlikely(smp_processor_id(), RISCV_ISA_EXT_ZICBOZ))
- csr_set(CSR_SENVCFG, ENVCFG_CBZE);
+ csr_set(CSR_ENVCFG, ENVCFG_CBZE);
}
#ifdef CONFIG_RISCV_ALTERNATIVE
--
2.43.0
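The aliasing done by the patch above can be modeled in userspace C as
follows. This is a sketch under assumptions: the CSR file is a plain
array, the fake_* helpers are hypothetical, the CSR numbers and the
CBZE bit position are given as I understand them from the kernel's
csr.h, and CONFIG_RISCV_M_MODE is assumed to be the build option that
selects the machine-mode variant.

```c
#include <assert.h>
#include <stdbool.h>

#define CSR_SENVCFG	0x10a
#define CSR_MENVCFG	0x30a
#define ENVCFG_CBZE	(1UL << 7)

/* One name for callers; the build configuration picks the real CSR. */
#ifdef CONFIG_RISCV_M_MODE
# define CSR_ENVCFG	CSR_MENVCFG
#else
# define CSR_ENVCFG	CSR_SENVCFG
#endif

#define NUM_CSRS	4096

/* Hypothetical CSR file standing in for the hart's registers. */
static unsigned long fake_csr_file[NUM_CSRS];

/* Models csr_set(): OR the given bits into the CSR. */
static void fake_csr_set(unsigned int csr, unsigned long bits)
{
	assert(csr < NUM_CSRS);
	fake_csr_file[csr] |= bits;
}

/* Read back a modeled CSR value. */
static unsigned long fake_csr_read(unsigned int csr)
{
	assert(csr < NUM_CSRS);
	return fake_csr_file[csr];
}

/* Models riscv_user_isa_enable() after the patch. */
static void fake_user_isa_enable(bool has_zicboz)
{
	if (has_zicboz)
		fake_csr_set(CSR_ENVCFG, ENVCFG_CBZE);
}
```

The caller no longer names the supervisor or machine variant, so the
same line works for both kinds of build.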
From e097eed364b4ef5f9d5e6c7ef22685bf34021555 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 12 Dec 2023 14:28:59 -0800
Subject: [PATCH 02/28] riscv: envcfg save and restore on trap entry/exit
The envcfg CSR defines enabling bits for cache management instructions and soon
will control enabling for control flow integrity and pointer masking features.
Control flow integrity enabling for forward cfi and backward cfi is controlled
via envcfg and thus needs to be enabled on a per-thread basis.
This patch creates a placeholder for the envcfg CSR in `thread_info` and adds
logic to save and restore it on trap entry and exit.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/thread_info.h | 1 +
arch/riscv/kernel/asm-offsets.c | 1 +
arch/riscv/kernel/entry.S | 4 ++++
3 files changed, 6 insertions(+)
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 574779900bfb..320bc899a63b 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -57,6 +57,7 @@ struct thread_info {
long user_sp; /* User stack pointer */
int cpu;
unsigned long syscall_work; /* SYSCALL_WORK_ flags */
+ unsigned long envcfg;
#ifdef CONFIG_SHADOW_CALL_STACK
void *scs_base;
void *scs_sp;
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..cdd8f095c30c 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -39,6 +39,7 @@ void asm_offsets(void)
OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
OFFSET(TASK_TI_USER_SP, task_struct, thread_info.user_sp);
+ OFFSET(TASK_TI_ENVCFG, task_struct, thread_info.envcfg);
#ifdef CONFIG_SHADOW_CALL_STACK
OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
#endif
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 54ca4564a926..63c3855ba80d 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -129,6 +129,10 @@ SYM_CODE_START_NOALIGN(ret_from_exception)
addi s0, sp, PT_SIZE_ON_STACK
REG_S s0, TASK_TI_KERNEL_SP(tp)
+ /* restore envcfg bits for current thread */
+ REG_L s0, TASK_TI_ENVCFG(tp)
+ csrw CSR_ENVCFG, s0
+
/* Save the kernel shadow call stack pointer */
scs_save_current
--
2.43.0
From 00561993452d050e19bd386d6d03a2a8aeb92ea2 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 29 Dec 2023 18:22:25 -0800
Subject: [PATCH 03/28] riscv: define default value for envcfg
Defines a base default value for envcfg per task. By default all tasks
should have cache zeroing capability. Any future capabilities can be
turned on in addition to this base value.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/csr.h | 2 ++
arch/riscv/kernel/process.c | 1 +
2 files changed, 3 insertions(+)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index b3400517b0a9..01ba87954da2 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -202,6 +202,8 @@
#define ENVCFG_CBIE_FLUSH _AC(0x1, UL)
#define ENVCFG_CBIE_INV _AC(0x3, UL)
#define ENVCFG_FIOM _AC(0x1, UL)
+/* by default all threads should be able to zero cache */
+#define ENVCFG_BASE ENVCFG_CBZE
/* Smstateen bits */
#define SMSTATEEN0_AIA_IMSIC_SHIFT 58
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 4f21d970a129..2420123444c4 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -152,6 +152,7 @@ void start_thread(struct pt_regs *regs, unsigned long pc,
else
regs->status |= SR_UXL_64;
#endif
+ current->thread_info.envcfg = ENVCFG_BASE;
}
void flush_thread(void)
--
2.43.0
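Taken together, the two envcfg patches above give every task its own
envcfg value that is written to the CSR on the way back to user mode.
A minimal userspace model in C, assuming hypothetical fake_* names, a
single variable standing in for the CSR, and a made-up
FAKE_FEATURE_BIT for a per-thread feature, could look like this:

```c
#include <assert.h>

#define ENVCFG_CBZE		(1UL << 7)
/* By default all threads should be able to zero cache blocks. */
#define ENVCFG_BASE		ENVCFG_CBZE
/* Hypothetical per-thread feature bit, for illustration only. */
#define FAKE_FEATURE_BIT	(1UL << 2)

/* Hypothetical stand-in for the envcfg field added to thread_info. */
struct fake_thread_info {
	unsigned long envcfg;
};

/* Models the hart's envcfg CSR. */
static unsigned long fake_envcfg_csr;

/* Models the start_thread() hunk: every new task gets the base value. */
static void fake_start_thread(struct fake_thread_info *ti)
{
	assert(ti);
	ti->envcfg = ENVCFG_BASE;
}

/* Models the ret_from_exception hunk: load the saved value, write the CSR. */
static void fake_ret_to_user(const struct fake_thread_info *ti)
{
	assert(ti);
	fake_envcfg_csr = ti->envcfg;
}
```

Because the CSR is rewritten from thread_info on each return to user
mode, a bit enabled for one thread does not leak into another thread
that runs later on the same hart.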
From 87902dd95726e86bcd9d791cc73ca0a88f66d24a Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 15 Dec 2023 16:58:07 -0800
Subject: [PATCH 04/28] riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv
riscv will need an implementation of exit_thread to clean up the shadow
stack when a thread exits. If the current thread has shadow stack enabled,
a shadow stack is allocated by default for any new thread.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/Kconfig | 1 +
arch/riscv/kernel/process.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 95a2a06acc6a..9d386e9edc45 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -142,6 +142,7 @@ config RISCV
select HAVE_RSEQ
select HAVE_STACKPROTECTOR
select HAVE_SYSCALL_TRACEPOINTS
+ select HAVE_EXIT_THREAD
select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 2420123444c4..c249cf3d8083 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -192,6 +192,11 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return 0;
}
+void exit_thread(struct task_struct *tsk)
+{
+ return;
+}
+
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
unsigned long clone_flags = args->flags;
--
2.43.0
From bc80126c6b94228de149d767eb21d2e8d98f08df Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Sun, 15 Jan 2023 23:28:54 -0800
Subject: [PATCH 05/28] riscv: zicfiss/zicfilp enumeration
This patch adds support for detecting zicfiss and zicfilp. zicfiss and zicfilp
stand for the unprivileged integer spec extensions for shadow stack and branch
tracking on indirect branches, respectively.
This patch looks for zicfiss and zicfilp in the device tree and accordingly
sets the corresponding bits in the cpu feature bitmap. Furthermore, this patch
adds detection utility functions that return whether shadow stack or landing
pads are supported by the cpu.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/cpufeature.h | 18 ++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 2 ++
arch/riscv/include/asm/processor.h | 1 +
arch/riscv/kernel/cpufeature.c | 2 ++
4 files changed, 23 insertions(+)
diff --git a/arch/riscv/include/asm/cpufeature.h b/arch/riscv/include/asm/cpufeature.h
index a418c3112cd6..216190731c55 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -133,4 +133,22 @@ static __always_inline bool riscv_cpu_has_extension_unlikely(int cpu, const unsi
return __riscv_isa_extension_available(hart_isa[cpu].isa, ext);
}
+static inline bool cpu_supports_shadow_stack(void)
+{
+#ifdef CONFIG_RISCV_USER_CFI
+ return riscv_isa_extension_available(NULL, ZICFISS);
+#else
+ return false;
+#endif
+}
+
+static inline bool cpu_supports_indirect_br_lp_instr(void)
+{
+#ifdef CONFIG_RISCV_USER_CFI
+ return riscv_isa_extension_available(NULL, ZICFILP);
+#else
+ return false;
+#endif
+}
+
#endif
diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index 06d30526ef3b..918165cfb4fa 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -57,6 +57,8 @@
#define RISCV_ISA_EXT_ZIHPM 42
#define RISCV_ISA_EXT_SMSTATEEN 43
#define RISCV_ISA_EXT_ZICOND 44
+#define RISCV_ISA_EXT_ZICFISS 45
+#define RISCV_ISA_EXT_ZICFILP 46
#define RISCV_ISA_EXT_MAX 64
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index f19f861cda54..ee2f51787ff8 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -13,6 +13,7 @@
#include <vdso/processor.h>
#include <asm/ptrace.h>
+#include <asm/hwcap.h>
#ifdef CONFIG_64BIT
#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 98623393fd1f..16624bc9a46b 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -185,6 +185,8 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
__RISCV_ISA_EXT_DATA(svinval, RISCV_ISA_EXT_SVINVAL),
__RISCV_ISA_EXT_DATA(svnapot, RISCV_ISA_EXT_SVNAPOT),
__RISCV_ISA_EXT_DATA(svpbmt, RISCV_ISA_EXT_SVPBMT),
+ __RISCV_ISA_EXT_DATA(zicfiss, RISCV_ISA_EXT_ZICFISS),
+ __RISCV_ISA_EXT_DATA(zicfilp, RISCV_ISA_EXT_ZICFILP),
};
const size_t riscv_isa_ext_count = ARRAY_SIZE(riscv_isa_ext);
--
2.43.0
From f5c0949f18a5458498486b332d97121d4f3def27 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Mon, 16 Jan 2023 00:08:32 -0800
Subject: [PATCH 06/28] riscv: zicfiss/zicfilp extension csr and bit
definitions
The zicfiss and zicfilp extensions are enabled via bit 3 and bit 2 of the
xenvcfg CSR, respectively. menvcfg controls enabling for S/HS mode, henvcfg
controls enabling for VS mode, and senvcfg controls enabling for U/VU mode.
zicfilp extension extends xstatus CSR to hold `expected landing pad` bit.
A trap or interrupt can occur between an indirect jmp/call and target instr.
`expected landing pad` bit from CPU is recorded into xstatus CSR so that when
supervisor performs xret, `expected landing pad` state of CPU can be restored.
zicfiss adds one new CSR
- CSR_SSP: CSR_SSP contains current shadow stack pointer.
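For illustration only, the bit layout described above can be modelled in plain
user-space C. The bit positions are the ones this patch adds to csr.h; the
helper names envcfg_for_task() and elp_expected() are hypothetical and do not
exist in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined by this patch in arch/riscv/include/asm/csr.h. */
#define ENVCFG_LPE  (UINT64_C(1) << 2)   /* zicfilp: landing pad enable */
#define ENVCFG_SSE  (UINT64_C(1) << 3)   /* zicfiss: shadow stack enable */
#define ENVCFG_CBZE (UINT64_C(1) << 7)   /* default base value (ENVCFG_BASE) */
#define SR_SPELP    UINT64_C(0x00800000) /* expected landing pad, S-mode view */

/* Hypothetical helper: compose a per-task envcfg value. */
static uint64_t envcfg_for_task(bool want_shstk, bool want_lp)
{
	uint64_t envcfg = ENVCFG_CBZE;

	if (want_shstk)
		envcfg |= ENVCFG_SSE;
	if (want_lp)
		envcfg |= ENVCFG_LPE;
	return envcfg;
}

/* Hypothetical helper: was a landing pad expected when the trap was taken? */
static bool elp_expected(uint64_t status)
{
	return (status & SR_SPELP) != 0;
}
```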
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/csr.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 01ba87954da2..80fe38d5de4a 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -18,6 +18,15 @@
#define SR_MPP _AC(0x00001800, UL) /* Previously Machine */
#define SR_SUM _AC(0x00040000, UL) /* Supervisor User Memory Access */
+/* zicfilp landing pad status bit */
+#define SR_SPELP _AC(0x00800000, UL)
+#define SR_MPELP _AC(0x020000000000, UL)
+#ifdef CONFIG_RISCV_M_MODE
+#define SR_ELP SR_MPELP
+#else
+#define SR_ELP SR_SPELP
+#endif
+
#define SR_FS _AC(0x00006000, UL) /* Floating-point Status */
#define SR_FS_OFF _AC(0x00000000, UL)
#define SR_FS_INITIAL _AC(0x00002000, UL)
@@ -196,6 +205,8 @@
#define ENVCFG_PBMTE (_AC(1, ULL) << 62)
#define ENVCFG_CBZE (_AC(1, UL) << 7)
#define ENVCFG_CBCFE (_AC(1, UL) << 6)
+#define ENVCFG_LPE (_AC(1, UL) << 2)
+#define ENVCFG_SSE (_AC(1, UL) << 3)
#define ENVCFG_CBIE_SHIFT 4
#define ENVCFG_CBIE (_AC(0x3, UL) << ENVCFG_CBIE_SHIFT)
#define ENVCFG_CBIE_ILL _AC(0x0, UL)
@@ -216,6 +227,11 @@
#define SMSTATEEN0_HSENVCFG (_ULL(1) << SMSTATEEN0_HSENVCFG_SHIFT)
#define SMSTATEEN0_SSTATEEN0_SHIFT 63
#define SMSTATEEN0_SSTATEEN0 (_ULL(1) << SMSTATEEN0_SSTATEEN0_SHIFT)
+/*
+ * zicfiss user mode csr
+ * CSR_SSP holds current shadow stack pointer.
+ */
+#define CSR_SSP 0x011
/* symbolic CSR names: */
#define CSR_CYCLE 0xc00
--
2.43.0
From 0551775c478b9643a7d801e9ed56fe47554e1ba9 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Mon, 16 Jan 2023 03:34:04 -0800
Subject: [PATCH 07/28] riscv: kernel handling on trap entry/exit for user cfi
Carves out space in the arch specific thread struct for cfi status and the
user mode shadow stack on riscv.
This patch does the following:
- defines a new structure cfi_status with a status bit for the cfi feature
- defines shadow stack pointer, base and size in the cfi_status structure
- defines offsets to the new member fields in thread in asm-offsets.c
- saves and restores the shadow stack pointer on trap entry (U --> S) and exit
(S --> U)
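The save/restore logic in the list above can be sketched as a small C model.
This is not the entry.S code; struct cfi_status_model and both functions are
hypothetical stand-ins, and the CSR is represented by a plain value.

```c
#include <stdbool.h>

/* Reduced, hypothetical model of struct cfi_status from this patch. */
struct cfi_status_model {
	unsigned long ubcfi_en : 1;   /* backward cfi enabled for this task */
	unsigned long user_shdw_stk;  /* saved user shadow stack pointer */
};

/* Trap entry: save CSR_SSP only if we came from U mode with cfi enabled. */
static void on_trap_entry(struct cfi_status_model *st, bool from_user,
			  unsigned long csr_ssp)
{
	if (from_user && st->ubcfi_en)
		st->user_shdw_stk = csr_ssp;
}

/* Return to user: yields the value CSR_SSP should hold afterwards. */
static unsigned long on_trap_exit(const struct cfi_status_model *st,
				  unsigned long csr_ssp)
{
	return st->ubcfi_en ? st->user_shdw_stk : csr_ssp;
}
```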
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/processor.h | 1 +
arch/riscv/include/asm/thread_info.h | 3 +++
arch/riscv/include/asm/usercfi.h | 24 ++++++++++++++++++++++++
arch/riscv/kernel/asm-offsets.c | 5 ++++-
arch/riscv/kernel/entry.S | 25 +++++++++++++++++++++++++
5 files changed, 57 insertions(+), 1 deletion(-)
create mode 100644 arch/riscv/include/asm/usercfi.h
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index ee2f51787ff8..d4dc298880fc 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -14,6 +14,7 @@
#include <asm/ptrace.h>
#include <asm/hwcap.h>
+#include <asm/usercfi.h>
#ifdef CONFIG_64BIT
#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 320bc899a63b..6a2acecec546 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -58,6 +58,9 @@ struct thread_info {
int cpu;
unsigned long syscall_work; /* SYSCALL_WORK_ flags */
unsigned long envcfg;
+#ifdef CONFIG_RISCV_USER_CFI
+ struct cfi_status user_cfi_state;
+#endif
#ifdef CONFIG_SHADOW_CALL_STACK
void *scs_base;
void *scs_sp;
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
new file mode 100644
index 000000000000..080d7077d12c
--- /dev/null
+++ b/arch/riscv/include/asm/usercfi.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright (C) 2023 Rivos, Inc.
+ * Deepak Gupta <debug(a)rivosinc.com>
+ */
+#ifndef _ASM_RISCV_USERCFI_H
+#define _ASM_RISCV_USERCFI_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
+#ifdef CONFIG_RISCV_USER_CFI
+struct cfi_status {
+ unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
+ unsigned long rsvd : ((sizeof(unsigned long)*8) - 1);
+ unsigned long user_shdw_stk; /* Current user shadow stack pointer */
+ unsigned long shdw_stk_base; /* Base address of shadow stack */
+ unsigned long shdw_stk_size; /* size of shadow stack */
+};
+
+#endif /* CONFIG_RISCV_USER_CFI */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_RISCV_USERCFI_H */
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index cdd8f095c30c..5e1f412e96ba 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -43,8 +43,11 @@ void asm_offsets(void)
#ifdef CONFIG_SHADOW_CALL_STACK
OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
#endif
-
OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
+#ifdef CONFIG_RISCV_USER_CFI
+ OFFSET(TASK_TI_CFI_STATUS, task_struct, thread_info.user_cfi_state);
+ OFFSET(TASK_TI_USER_SSP, task_struct, thread_info.user_cfi_state.user_shdw_stk);
+#endif
OFFSET(TASK_THREAD_F0, task_struct, thread.fstate.f[0]);
OFFSET(TASK_THREAD_F1, task_struct, thread.fstate.f[1]);
OFFSET(TASK_THREAD_F2, task_struct, thread.fstate.f[2]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 63c3855ba80d..410659e2eadb 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -49,6 +49,21 @@ SYM_CODE_START(handle_exception)
REG_S x5, PT_T0(sp)
save_from_x6_to_x31
+#ifdef CONFIG_RISCV_USER_CFI
+ /*
+ * we need to save cfi status only when previous mode was U
+ */
+ csrr s2, CSR_STATUS
+ andi s2, s2, SR_SPP
+ bnez s2, skip_bcfi_save
+ /* load cfi status word */
+ lw s3, TASK_TI_CFI_STATUS(tp)
+ andi s3, s3, 1
+ beqz s3, skip_bcfi_save
+ csrr s3, CSR_SSP
+ REG_S s3, TASK_TI_USER_SSP(tp) /* save user ssp in thread_info */
+skip_bcfi_save:
+#endif
/*
* Disable user-mode memory access as it should only be set in the
* actual user copy routines.
@@ -141,6 +156,16 @@ SYM_CODE_START_NOALIGN(ret_from_exception)
* structures again.
*/
csrw CSR_SCRATCH, tp
+
+#ifdef CONFIG_RISCV_USER_CFI
+ lw s3, TASK_TI_CFI_STATUS(tp)
+ andi s3, s3, 1
+ beqz s3, skip_bcfi_resume
+ REG_L s3, TASK_TI_USER_SSP(tp) /* restore user ssp from thread struct */
+ csrw CSR_SSP, s3
+skip_bcfi_resume:
+#endif
+
1:
REG_L a0, PT_STATUS(sp)
/*
--
2.43.0
From 78a8bd18df45b83011353c24a75bd6bc00bf84c7 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 6 Dec 2023 17:37:49 -0800
Subject: [PATCH 08/28] mm: Define VM_SHADOW_STACK for RISC-V
VM_SHADOW_STACK is defined by x86 as vm flag to mark a shadow stack vma.
x86 uses VM_HIGH_ARCH_5 bit but that limits shadow stack vma to 64bit only.
arm64 follows same path
https://lore.kernel.org/lkml/20231009-arm64-gcs-v6-12-78e55deaa4dd@kernel.o…
On RISC-V, write-only page table encodings are shadow stack pages. This patch
therefore defines VM_SHADOW_STACK as VM_WRITE alone.
The next set of patches adds guard rails so that no other mm flow can set only
VM_WRITE in a vma except when specifically creating a shadow stack.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/linux/mm.h | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..dfe0e8118669 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -352,7 +352,19 @@ extern unsigned int kobjsize(const void *objp);
* for more details on the guard size.
*/
# define VM_SHADOW_STACK VM_HIGH_ARCH_5
-#else
+#endif
+
+#ifdef CONFIG_RISCV_USER_CFI
+/*
+ * On RISC-V pte encodings for shadow stack is R=0, W=1, X=0 and thus RISCV
+ * choosing to use similar mechanism on vm_flags where VM_WRITE only means
+ * VM_SHADOW_STACK. RISCV as well doesn't support VM_SHADOW_STACK to be set
+ * with VM_SHARED.
+ */
+#define VM_SHADOW_STACK VM_WRITE
+#endif
+
+#ifndef VM_SHADOW_STACK
# define VM_SHADOW_STACK VM_NONE
#endif
--
2.43.0
From 0d6de4bf12ec3349f8b3f96760575f8f660f0ae9 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 29 Dec 2023 18:32:53 -0800
Subject: [PATCH 09/28] mm: abstract shadow stack vma behind
`arch_is_shadow_stack`
x86 has used VM_SHADOW_STACK (alias to VM_HIGH_ARCH_5) to encode shadow
stack VMA. VM_SHADOW_STACK is thus not possible on 32bit. Some arches may
need a way to encode shadow stack on 32bit and 64bit both and they may
encode this information differently in VMAs.
This patch changes the checks of the VM_SHADOW_STACK flag in generic code into
calls to a function `arch_is_shadow_stack`, which returns true if the arch
supports shadow stack and the vma is a shadow stack; otherwise a stub returns
false.
There was a suggestion to name it `vma_is_shadow_stack`. I preferred to
keep the `arch` prefix because the check is specific to each arch.
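As a rough user-space illustration of the RISC-V variant of this check, the
sketch below copies the predicate from the diff. The VM_* values are written
out locally as an assumption about the usual kernel definitions, and
is_shadow_stack_flags() is a hypothetical name for arch_is_shadow_stack().

```c
#include <stdbool.h>

typedef unsigned long vm_flags_t;

/* Assumed to match the usual values in include/linux/mm.h. */
#define VM_READ   0x1UL
#define VM_WRITE  0x2UL
#define VM_EXEC   0x4UL
#define VM_SHARED 0x8UL

/* RISC-V rule: a vma is a shadow stack iff, of R/W/X, only W is set. */
static bool is_shadow_stack_flags(vm_flags_t flags)
{
	return (flags & (VM_WRITE | VM_READ | VM_EXEC)) == VM_WRITE;
}
```

Note that bits outside R/W/X are ignored by the comparison, so rejecting
combinations such as VM_SHARED has to happen elsewhere.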
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/linux/mm.h | 18 +++++++++++++++++-
mm/gup.c | 5 +++--
mm/internal.h | 2 +-
3 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dfe0e8118669..15c70fc677a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -352,6 +352,10 @@ extern unsigned int kobjsize(const void *objp);
* for more details on the guard size.
*/
# define VM_SHADOW_STACK VM_HIGH_ARCH_5
+static inline bool arch_is_shadow_stack(vm_flags_t vm_flags)
+{
+ return (vm_flags & VM_SHADOW_STACK);
+}
#endif
#ifdef CONFIG_RISCV_USER_CFI
@@ -362,10 +366,22 @@ extern unsigned int kobjsize(const void *objp);
* with VM_SHARED.
*/
#define VM_SHADOW_STACK VM_WRITE
+
+static inline bool arch_is_shadow_stack(vm_flags_t vm_flags)
+{
+ return ((vm_flags & (VM_WRITE | VM_READ | VM_EXEC)) == VM_WRITE);
+}
+
#endif
#ifndef VM_SHADOW_STACK
# define VM_SHADOW_STACK VM_NONE
+
+static inline bool arch_is_shadow_stack(vm_flags_t vm_flags)
+{
+ return false;
+}
+
#endif
#if defined(CONFIG_X86)
@@ -3464,7 +3480,7 @@ static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
return stack_guard_gap;
/* See reasoning around the VM_SHADOW_STACK definition */
- if (vma->vm_flags & VM_SHADOW_STACK)
+ if (vma->vm_flags && arch_is_shadow_stack(vma->vm_flags))
return PAGE_SIZE;
return 0;
diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390..45798782ed2c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1051,7 +1051,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
!writable_file_mapping_allowed(vma, gup_flags))
return -EFAULT;
- if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
+ if (!(vm_flags & VM_WRITE) || arch_is_shadow_stack(vm_flags)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
@@ -1069,7 +1069,8 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if (!is_cow_mapping(vm_flags))
return -EFAULT;
}
- } else if (!(vm_flags & VM_READ)) {
+ } else if (!(vm_flags & VM_READ) && !arch_is_shadow_stack(vm_flags)) {
+ /* reads allowed if its shadow stack vma */
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
/*
diff --git a/mm/internal.h b/mm/internal.h
index b61034bd50f5..0abf00c93fe1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -572,7 +572,7 @@ static inline bool is_exec_mapping(vm_flags_t flags)
*/
static inline bool is_stack_mapping(vm_flags_t flags)
{
- return ((flags & VM_STACK) == VM_STACK) || (flags & VM_SHADOW_STACK);
+ return ((flags & VM_STACK) == VM_STACK) || arch_is_shadow_stack(flags);
}
/*
--
2.43.0
From 5511cd8dc11f00e3ec604b8fc5afdb0f632bda9e Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 06:13:49 -0800
Subject: [PATCH 10/28] riscv/mm : Introducing new protection flag
"PROT_SHADOWSTACK"
x86 and arm64 use the VM_SHADOW_STACK (which actually is VM_HIGH_ARCH_5)
vma flag and thus restrict it to 64bit implementations only. RISC-V chooses
to encode the presence of only VM_WRITE in vma flags as a shadow stack vma. This
allows the 32bit RISC-V ecosystem to leverage shadow stack as well.
This means that existing users of `do_mmap` who had been passing only `VM_WRITE`
and expecting read and write permissions would break.
Thus `PROT_SHADOWSTACK` is introduced to let `do_mmap` disambiguate between
read-write and shadow stack mappings. Any kernel driver/module using `do_mmap`
and only passing `VM_WRITE` still gets a read-write mapping, while any user
of `do_mmap` intending to map a shadow stack should pass `PROT_SHADOWSTACK` to
get a shadow stack mapping.
Userspace is still expected to rely on `map_shadow_stack`, so `PROT_SHADOWSTACK`
is not exposed in the uapi headers.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/mman.h | 25 +++++++++++++++++++++++++
mm/mmap.c | 1 +
2 files changed, 26 insertions(+)
create mode 100644 arch/riscv/include/asm/mman.h
diff --git a/arch/riscv/include/asm/mman.h b/arch/riscv/include/asm/mman.h
new file mode 100644
index 000000000000..4902d837e93c
--- /dev/null
+++ b/arch/riscv/include/asm/mman.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MMAN_H__
+#define __ASM_MMAN_H__
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+#include <uapi/asm/mman.h>
+
+/*
+ * Major architectures (x86, aarch64, riscv) have shadow stack now. x86 and
+ * arm64 choose to use VM_SHADOW_STACK (which actually is VM_HIGH_ARCH_5) vma
+ * flag, however that restrict it to 64bit implementation only. risc-v shadow
+ * stack encodings in page tables is PTE.R=0, PTE.W=1, PTE.D=1 which used to be
+ * reserved until now. risc-v is choosing to encode presence of only VM_WRITE in
+ * vma flags as shadow stack vma. However this means that existing users of mmap
+ * (and do_mmap) who were relying on passing only PROT_WRITE (or VM_WRITE from
+ * kernel driver) but still getting read and write mappings, should still work.
+ * x86 and arm64 followed the direction of a new system call `map_shadow_stack`.
+ * risc-v would like to converge on that so that shadow stacks flows are as much
+ * arch agnostic. Thus a conscious decision to define PROT_XXX definition for
+ * shadow stack here (and not exposed to uapi)
+ */
+#define PROT_SHADOWSTACK 0x40
+
+#endif /* ! __ASM_MMAN_H__ */
diff --git a/mm/mmap.c b/mm/mmap.c
index 1971bfffcc03..fab2acf21ce9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
#include <linux/oom.h>
#include <linux/sched/mm.h>
#include <linux/ksm.h>
+#include <linux/processor.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
--
2.43.0
From a3d7e7e08612681c4980091bfebba930b733eef0 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 06:28:08 -0800
Subject: [PATCH 11/28] riscv: Implementing "PROT_SHADOWSTACK" on riscv
This patch implements the new risc-v specific protection flag
`PROT_SHADOWSTACK` (kernel only) on riscv.
The `PROT_SHADOWSTACK` protection flag is limited to the kernel and not exposed
to userspace. Shadow stack is a security construct to protect against ROP
attacks. `map_shadow_stack` is a new syscall to manufacture a shadow stack. In
order to avoid multiple methods of creating a shadow stack, `PROT_SHADOWSTACK`
is not allowed in a user space `mmap` call. `mprotect` does not allow it either,
because `arch_validate_prot` already takes care of this for risc-v.
`arch_calc_vm_prot_bits` is implemented on risc-v to return VM_SHADOW_STACK
(alias for VM_WRITE) if PROT_SHADOWSTACK is supplied (as a call to `do_mmap`
will) and the underlying CPU supports shadow stack. `PROT_WRITE` is converted to
`VM_READ | VM_WRITE` so that existing cases where only `PROT_WRITE` is specified
keep working but don't collide with the `VM_WRITE` only encoding, which now
denotes a shadow stack.
The risc-v `mmap` wrapper adds PROT_READ whenever PROT_WRITE is specified and
PROT_READ is left out.
Earlier, `protection_map[VM_WRITE]` used to pick read-write (and copy on write)
PTE encodings. Now all non-shadow stack writeable mappings pick the
`protection_map[VM_WRITE | VM_READ]` PTE encodings, and
`protection_map[VM_WRITE]` is programmed to pick the PAGE_SHADOWSTACK PTE
encoding.
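The user-facing part of this policy (the checks added to riscv_sys_mmap) can
be sketched in stand-alone C as follows. The PROT_* values are defined locally
for the sketch and sanitize_user_prot() is a hypothetical name; the kernel does
this inline in the mmap wrapper.

```c
#include <errno.h>

/* Local copies for the sketch; PROT_SHADOWSTACK is the value from this series. */
#define PROT_READ        0x1UL
#define PROT_WRITE       0x2UL
#define PROT_SHADOWSTACK 0x40UL

/*
 * Hypothetical helper mirroring the two checks in riscv_sys_mmap:
 * a write-only request is widened to read-write, and PROT_SHADOWSTACK
 * coming from user space is rejected.
 */
static long sanitize_user_prot(unsigned long prot, unsigned long *out)
{
	if ((prot & PROT_WRITE) && !(prot & PROT_READ))
		prot |= PROT_READ;

	if (prot & PROT_SHADOWSTACK)
		return -EINVAL;

	*out = prot;
	return 0;
}
```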
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/mman.h | 17 +++++++++++++++++
arch/riscv/include/asm/pgtable.h | 1 +
arch/riscv/kernel/sys_riscv.c | 19 +++++++++++++++++++
arch/riscv/mm/init.c | 2 +-
4 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/mman.h b/arch/riscv/include/asm/mman.h
index 4902d837e93c..bc09a9c0e81f 100644
--- a/arch/riscv/include/asm/mman.h
+++ b/arch/riscv/include/asm/mman.h
@@ -22,4 +22,21 @@
*/
#define PROT_SHADOWSTACK 0x40
+static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
+ unsigned long pkey __always_unused)
+{
+ unsigned long ret = 0;
+
+ if (cpu_supports_shadow_stack())
+ ret = (prot & PROT_SHADOWSTACK) ? VM_SHADOW_STACK : 0;
+ /*
+ * If PROT_WRITE was specified, force it to VM_READ | VM_WRITE.
+ * Only VM_WRITE means shadow stack.
+ */
+ if (prot & PROT_WRITE)
+ ret = (VM_READ | VM_WRITE);
+ return ret;
+}
+#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+
#endif /* ! __ASM_MMAN_H__ */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 294044429e8e..54a8dde29504 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -184,6 +184,7 @@ extern struct pt_alloc_ops pt_ops __initdata;
#define PAGE_READ_EXEC __pgprot(_PAGE_BASE | _PAGE_READ | _PAGE_EXEC)
#define PAGE_WRITE_EXEC __pgprot(_PAGE_BASE | _PAGE_READ | \
_PAGE_EXEC | _PAGE_WRITE)
+#define PAGE_SHADOWSTACK __pgprot(_PAGE_BASE | _PAGE_WRITE)
#define PAGE_COPY PAGE_READ
#define PAGE_COPY_EXEC PAGE_READ_EXEC
diff --git a/arch/riscv/kernel/sys_riscv.c b/arch/riscv/kernel/sys_riscv.c
index a2ca5b7756a5..2a7cf28a6fe0 100644
--- a/arch/riscv/kernel/sys_riscv.c
+++ b/arch/riscv/kernel/sys_riscv.c
@@ -16,6 +16,7 @@
#include <asm/unistd.h>
#include <asm-generic/mman-common.h>
#include <vdso/vsyscall.h>
+#include <asm/mman.h>
static long riscv_sys_mmap(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
@@ -25,6 +26,24 @@ static long riscv_sys_mmap(unsigned long addr, unsigned long len,
if (unlikely(offset & (~PAGE_MASK >> page_shift_offset)))
return -EINVAL;
+ /*
+ * If only PROT_WRITE is specified then extend that to PROT_READ
+ * protection_map[VM_WRITE] is now going to select shadow stack encodings.
+ * So specifying PROT_WRITE actually should select protection_map [VM_WRITE | VM_READ]
+ * If user wants to create shadow stack then they should use `map_shadow_stack` syscall.
+ */
+ if (unlikely((prot & PROT_WRITE) && !(prot & PROT_READ)))
+ prot |= PROT_READ;
+
+ /*
+ * PROT_SHADOWSTACK is a kernel only protection flag on risc-v.
+ * mmap doesn't expect PROT_SHADOWSTACK to be set by user space.
+ * User space can rely on `map_shadow_stack` syscall to create
+ * shadow stack pages.
+ */
+ if (unlikely(prot & PROT_SHADOWSTACK))
+ return -EINVAL;
+
return ksys_mmap_pgoff(addr, len, prot, flags, fd,
offset >> (PAGE_SHIFT - page_shift_offset));
}
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2e011cbddf3a..f71c2d2c6cbf 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -296,7 +296,7 @@ pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
static const pgprot_t protection_map[16] = {
[VM_NONE] = PAGE_NONE,
[VM_READ] = PAGE_READ,
- [VM_WRITE] = PAGE_COPY,
+ [VM_WRITE] = PAGE_SHADOWSTACK,
[VM_WRITE | VM_READ] = PAGE_COPY,
[VM_EXEC] = PAGE_EXEC,
[VM_EXEC | VM_READ] = PAGE_READ_EXEC,
--
2.43.0
From e7bf650a2d0bb03e189d0077be51515c4ecaf50d Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 09:38:32 -0800
Subject: [PATCH 12/28] riscv mm: manufacture shadow stack pte
This patch implements creating a shadow stack pte (on riscv). Creating a
shadow stack PTE on riscv means clearing RWX and then setting W=1.
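A minimal sketch of that bit manipulation on a raw PTE value, outside the
kernel. The PTE_* names are hypothetical; their values are assumed to mirror
_PAGE_PRESENT, _PAGE_READ, _PAGE_WRITE and _PAGE_EXEC on riscv.

```c
#include <stdint.h>

/* Assumed to mirror the riscv _PAGE_* permission bits. */
#define PTE_V    (UINT64_C(1) << 0)
#define PTE_R    (UINT64_C(1) << 1)
#define PTE_W    (UINT64_C(1) << 2)
#define PTE_X    (UINT64_C(1) << 3)
#define PTE_LEAF (PTE_R | PTE_W | PTE_X)

/* Clear R/W/X, then set only W: the XWR = 010 shadow stack encoding. */
static uint64_t pte_to_shadow_stack(uint64_t pte)
{
	return (pte & ~PTE_LEAF) | PTE_W;
}
```

All other bits, including the valid bit and the PFN field, pass through
unchanged.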
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 54a8dde29504..7ed00b4cc73d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline pte_t pte_mkwrite_novma(pte_t pte)
return __pte(pte_val(pte) | _PAGE_WRITE);
}
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+ /* shadow stack on risc-v is XWR = 010. Clear everything and only set _PAGE_WRITE */
+ return __pte((pte_val(pte) & ~(_PAGE_LEAF)) | _PAGE_WRITE);
+}
+
/* static inline pte_t pte_mkexec(pte_t pte) */
static inline pte_t pte_mkdirty(pte_t pte)
@@ -705,6 +711,12 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
}
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pte)
+{
+ /* shadow stack on risc-v is XWR = 010. Clear everything and only set _PAGE_WRITE */
+ return __pmd((pmd_val(pte) & ~(_PAGE_LEAF)) | _PAGE_WRITE);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pte_pmd(pte_wrprotect(pmd_pte(pmd)));
--
2.43.0
From e5d2f0f05004e8f14cde361a5649177ce8a082f0 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 09:17:54 -0800
Subject: [PATCH 13/28] riscv mmu: teach pte_mkwrite to manufacture shadow
stack PTEs
pte_mkwrite creates PTEs with WRITE encodings for underlying arch. Underlying
arch can have two types of writeable mappings. One that can be written using
regular store instructions. Another one that can only be written using specialized
store instructions (like shadow stack stores). pte_mkwrite can select write PTE
encoding based on VMA range.
On riscv, presence of only VM_WRITE in vma->vm_flags means it's a shadow stack.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
rebase with a30f0ca0fa31cdb2ac3d24b7b5be9e3ae75f4175
Implementation of pte_mkwrite and pmd_mkwrite on riscv
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 7 +++++++
arch/riscv/mm/pgtable.c | 21 +++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 7ed00b4cc73d..9477108e727d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -403,6 +403,10 @@ static inline pte_t pte_wrprotect(pte_t pte)
/* static inline pte_t pte_mkread(pte_t pte) */
+struct vm_area_struct;
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
+#define pte_mkwrite pte_mkwrite
+
static inline pte_t pte_mkwrite_novma(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_WRITE);
@@ -706,6 +710,9 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pte_pmd(pte_mkyoung(pmd_pte(pmd)));
}
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#define pmd_mkwrite pmd_mkwrite
+
static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
{
return pte_pmd(pte_mkwrite_novma(pmd_pte(pmd)));
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index fef4e7328e49..9b1845f93ea1 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -101,3 +101,24 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
return pmd;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+ if (arch_is_shadow_stack(vma->vm_flags))
+ return pte_mkwrite_shstk(pte);
+
+ pte = pte_mkwrite_novma(pte);
+
+ return pte;
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (arch_is_shadow_stack(vma->vm_flags))
+ return pmd_mkwrite_shstk(pmd);
+
+ pmd = pmd_mkwrite_novma(pmd);
+
+ return pmd;
+}
+
--
2.43.0
From 217876b4c2848c78595d9e870a41ca0f2e1c2ea2 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 17 Jan 2023 09:35:13 -0800
Subject: [PATCH 14/28] riscv mmu: write protect and shadow stack
`fork` implements copy on write (COW) by making pages read-only in both child
and parent.
ptep_set_wrprotect and pte_wrprotect clear _PAGE_WRITE in the PTE. The
assumption is that the page is readable and that COW happens on a write fault.
Shadow stack pages are XWR = 010, so clearing the W bit makes them XWR = 000.
This results in a wrong PTE setting which says no permissions but has V=1 and
the PFN field pointing to the final page. The desired behavior is instead to
turn it into a readable page, take an access (load/store) fault on
sspush/sspop (shadow stack) and then perform COW on such pages. This way
regular reads are still allowed and do not lead to COW, which matches the
current behavior of COW on non-shadow stack but writeable memory.
This does not interfere with existing COW for read-write memory either: such
pages always have _PAGE_READ set already, so setting _PAGE_READ is harmless.
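The effect on the two relevant encodings can be checked with a small
user-space model. The PTE_* names are hypothetical and assumed to mirror the
riscv _PAGE_* bits; wrprotect_model() is a stand-in for the new pte_wrprotect.

```c
#include <stdint.h>

/* Assumed to mirror the riscv _PAGE_* permission bits. */
#define PTE_V (UINT64_C(1) << 0)
#define PTE_R (UINT64_C(1) << 1)
#define PTE_W (UINT64_C(1) << 2)

/* Write-protect as in this patch: drop W and force R. */
static uint64_t wrprotect_model(uint64_t pte)
{
	return (pte & ~PTE_W) | PTE_R;
}
```

A shadow stack PTE (V|W) becomes read-only (V|R) rather than the invalid
XWR = 000 encoding, and a regular read-write PTE (V|R|W) ends up as V|R,
exactly as before.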
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/pgtable.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9477108e727d..9802e8d48616 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -398,7 +398,7 @@ static inline int pte_special(pte_t pte)
static inline pte_t pte_wrprotect(pte_t pte)
{
- return __pte(pte_val(pte) & ~(_PAGE_WRITE));
+ return __pte((pte_val(pte) & ~(_PAGE_WRITE)) | (_PAGE_READ));
}
/* static inline pte_t pte_mkread(pte_t pte) */
@@ -594,7 +594,15 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long address, pte_t *ptep)
{
- atomic_long_and(~(unsigned long)_PAGE_WRITE, (atomic_long_t *)ptep);
+ volatile pte_t read_pte = *ptep;
+ /*
+ * ptep_set_wrprotect can be called for shadow stack ranges too.
+ * shadow stack memory is XWR = 010 and thus clearing _PAGE_WRITE will lead to
+ * encoding 000b which is wrong encoding with V = 1. This should lead to page fault
+ * but we dont want this wrong configuration to be set in page tables.
+ */
+ atomic_long_set((atomic_long_t *)ptep,
+ ((pte_val(read_pte) & ~(unsigned long)_PAGE_WRITE) | _PAGE_READ));
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
--
2.43.0
From 1527cdb4739ae3884160eb8824b187717f9d0960 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 15 Dec 2023 11:39:14 -0800
Subject: [PATCH 15/28] riscv/mm: Implement map_shadow_stack() syscall
As discussed extensively in the changelog for the addition of this
syscall on x86 ("x86/shstk: Introduce map_shadow_stack syscall") the
existing mmap() and madvise() syscalls do not map entirely well onto the
security requirements for guarded control stacks since they lead to
windows where memory is allocated but not yet protected or stacks which
are not properly and safely initialised. Instead a new syscall
map_shadow_stack() has been defined which allocates and initialises a
shadow stack page.
This patch implements this syscall for riscv. riscv doesn't require the token
to be set up by the kernel because user mode can do that by itself. However, to
provide compatibility and portability with other architectures, user mode can
specify the token set flag.
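For reference, the token placement that create_rstor_token performs can be
sketched without the ssamoswap write. restore_token_slot() is a hypothetical
helper that only computes where the token goes and what value it holds; the
real code writes it to user memory through the shadow stack AMO.

```c
#include <errno.h>
#include <stdint.h>

#define SHSTK_ENTRY_SIZE sizeof(uintptr_t)

/*
 * Hypothetical helper: the token lives one entry below ssp and its value
 * is ssp itself, i.e. the token is a function of its own address.
 */
static int restore_token_slot(uintptr_t ssp, uintptr_t *slot, uintptr_t *value)
{
	/* Token must be aligned to the shadow stack entry size. */
	if (ssp % SHSTK_ENTRY_SIZE)
		return -EINVAL;

	*slot = ssp - SHSTK_ENTRY_SIZE;
	*value = ssp;
	return 0;
}
```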
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/usercfi.c | 150 ++++++++++++++++++++++++++++++++
include/uapi/asm-generic/mman.h | 1 +
3 files changed, 153 insertions(+)
create mode 100644 arch/riscv/kernel/usercfi.c
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index fee22a3d1b53..8c668269e886 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -102,3 +102,5 @@ obj-$(CONFIG_COMPAT) += compat_vdso/
obj-$(CONFIG_64BIT) += pi/
obj-$(CONFIG_ACPI) += acpi.o
+
+obj-$(CONFIG_RISCV_USER_CFI) += usercfi.o
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
new file mode 100644
index 000000000000..35ede2cbc05b
--- /dev/null
+++ b/arch/riscv/kernel/usercfi.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 Rivos, Inc.
+ * Deepak Gupta <debug(a)rivosinc.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/uaccess.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <linux/syscalls.h>
+#include <linux/prctl.h>
+#include <asm/csr.h>
+#include <asm/usercfi.h>
+
+#define SHSTK_ENTRY_SIZE sizeof(void *)
+
+/*
+ * Writes on shadow stack can either be `sspush` or `ssamoswap`. `sspush` can happen
+ * implicitly on current shadow stack pointed to by CSR_SSP. `ssamoswap` takes pointer to
+ * shadow stack. To keep it simple, we plan to use `ssamoswap` to perform writes on shadow
+ * stack.
+ */
+static noinline unsigned long amo_user_shstk(unsigned long *addr, unsigned long val)
+{
+ /*
+ * In case ssamoswap faults, return -1.
+ * Never expect -1 on shadow stack. Expect return addresses and zero
+ */
+ unsigned long swap = -1;
+
+ __enable_user_access();
+ asm_volatile_goto(
+ ".option push\n"
+ ".option arch, +zicfiss\n"
+#ifdef CONFIG_64BIT
+ "1: ssamoswap.d %0, %2, %1\n"
+#else
+ "1: ssamoswap.w %0, %2, %1\n"
+#endif
+ _ASM_EXTABLE(1b, %l[fault])
+ RISCV_ACQUIRE_BARRIER
+ ".option pop\n"
+ : "=r" (swap), "+A" (*addr)
+ : "r" (val)
+ : "memory"
+ : fault
+ );
+ __disable_user_access();
+ return swap;
+fault:
+ __disable_user_access();
+ return -1;
+}
+
+/*
+ * Create a restore token on the shadow stack. A token is always XLEN wide
+ * and aligned to XLEN.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(ssp, SHSTK_ENTRY_SIZE))
+ return -EINVAL;
+
+ /* On RISC-V we're constructing token to be function of address itself */
+ addr = ssp - SHSTK_ENTRY_SIZE;
+
+ if (amo_user_shstk((unsigned long __user *)addr, (unsigned long) ssp) == -1)
+ return -EFAULT;
+
+ if (token_addr)
+ *token_addr = addr;
+
+ return 0;
+}
+
+static unsigned long allocate_shadow_stack(unsigned long addr, unsigned long size,
+ unsigned long token_offset,
+ bool set_tok)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE;
+ struct mm_struct *mm = current->mm;
+ unsigned long populate, tok_loc = 0;
+
+ if (addr)
+ flags |= MAP_FIXED_NOREPLACE;
+
+ mmap_write_lock(mm);
+ addr = do_mmap(NULL, addr, size, PROT_SHADOWSTACK, flags,
+ VM_SHADOW_STACK, 0, &populate, NULL);
+ mmap_write_unlock(mm);
+
+ if (!set_tok || IS_ERR_VALUE(addr))
+ goto out;
+
+ if (create_rstor_token(addr + token_offset, &tok_loc)) {
+ vm_munmap(addr, size);
+ return -EINVAL;
+ }
+
+ addr = tok_loc;
+
+out:
+ return addr;
+}
+
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+ bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+ unsigned long aligned_size = 0;
+
+ if (!cpu_supports_shadow_stack())
+ return -EOPNOTSUPP;
+
+ /* Anything other than set token should result in invalid param */
+ if (flags & ~SHADOW_STACK_SET_TOKEN)
+ return -EINVAL;
+
+ /*
+ * Unlike other architectures, on RISC-V, SSP pointer is held in CSR_SSP and is available
+ * CSR in all modes. CSR accesses are performed using 12bit index programmed in instruction
+ * itself. This provides static property on register programming and writes to CSR can't
+ * be unintentional from programmer's perspective. As long as programmer has guarded areas
+ * which perform writes to CSR_SSP properly, shadow stack pivoting is not possible. Since
+ * CSR_SSP is writeable by user mode, it itself can setup a shadow stack token subsequent
+ * to allocation. Although in order to provide portability with other architectures (because
+ * `map_shadow_stack` is arch agnostic syscall), RISC-V will follow expectation of a token
+ * flag in flags and if provided in flags, setup a token at the base.
+ */
+
+ /* If there isn't space for a token */
+ if (set_tok && size < SHSTK_ENTRY_SIZE)
+ return -ENOSPC;
+
+ if (addr && (addr % PAGE_SIZE))
+ return -EINVAL;
+
+ aligned_size = PAGE_ALIGN(size);
+ if (aligned_size < size)
+ return -EOVERFLOW;
+
+ return allocate_shadow_stack(addr, aligned_size, size, set_tok);
+}
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 57e8195d0b53..0c0ac6214de6 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -19,4 +19,5 @@
#define MCL_FUTURE 2 /* lock all future mappings */
#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0) /* Set up a restore token in the shadow stack */
#endif /* __ASM_GENERIC_MMAN_H */
--
2.43.0
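The argument checks and token placement performed by map_shadow_stack() in
the patch above can be sketched as a small user-space model. This is only an
illustration under stated assumptions, not the kernel code: all model_* and
MODEL_* names are hypothetical, a 4 KiB page size is assumed, and the
cpu_supports_shadow_stack() check, the actual mapping and the `ssamoswap`
write are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed values for this sketch; the kernel uses PAGE_SIZE and sizeof(void *). */
#define MODEL_PAGE_SIZE   4096UL
#define MODEL_ENTRY_SIZE  sizeof(void *)
#define MODEL_SET_TOKEN   (1UL << 0)	/* mirrors SHADOW_STACK_SET_TOKEN */

struct model_result {
	long err;			/* 0 or a negative errno */
	unsigned long map_size;		/* page-aligned mapping size */
	unsigned long token_off;	/* offset of the token slot from the base */
	bool has_token;
};

/* Hypothetical model of the checks done before and after the mapping. */
static struct model_result model_map_shadow_stack(unsigned long addr,
						  unsigned long size,
						  unsigned int flags)
{
	struct model_result r = { 0 };
	bool set_tok = flags & MODEL_SET_TOKEN;
	unsigned long aligned;

	/* Anything other than the token flag is rejected. */
	if (flags & ~MODEL_SET_TOKEN) { r.err = -EINVAL; return r; }
	/* There must be room for one token entry. */
	if (set_tok && size < MODEL_ENTRY_SIZE) { r.err = -ENOSPC; return r; }
	/* A requested address must be page aligned. */
	if (addr % MODEL_PAGE_SIZE) { r.err = -EINVAL; return r; }

	/* PAGE_ALIGN() wraps around for sizes close to the maximum. */
	aligned = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	if (aligned < size) { r.err = -EOVERFLOW; return r; }
	r.map_size = aligned;

	if (set_tok) {
		/* The token position (base + size) must be entry aligned. */
		if (size % MODEL_ENTRY_SIZE) { r.err = -EINVAL; return r; }
		/* The token occupies the entry just below base + size. */
		r.token_off = size - MODEL_ENTRY_SIZE;
		r.has_token = true;
		assert(r.token_off + MODEL_ENTRY_SIZE <= r.map_size);
	}
	return r;
}
```

In this model a request for 8192 bytes with the token flag yields an 8192-byte
mapping whose token sits in the last entry; the syscall then returns the
address of that token rather than the base of the mapping.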
From 2f73f707afb839a7146833ca8bf13e8a11b26eca Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 15 Dec 2023 13:14:46 -0800
Subject: [PATCH 16/28] riscv/shstk: If needed allocate a new shadow stack on
clone
Userspace specifies CLONE_VM to share the address space and spawn a new
thread. `clone` allows userspace to specify a new stack for the new thread.
However, there is no way to specify a new shadow stack base address without
changing the API. This patch allocates a new shadow stack whenever CLONE_VM
is given.
In case of CLONE_VFORK, the parent is suspended until the child finishes, and
thus the child can use the parent's shadow stack. In case of !CLONE_VM, COW
kicks in because the entire address space is copied from parent to child.
`clone3` is extensible and can provide a mechanism for passing a shadow stack
as an input parameter. This is not settled yet and is being extensively
discussed on the mailing list. Once that's settled, this commit will adapt
to that.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 39 ++++++++++
arch/riscv/kernel/process.c | 10 +++
arch/riscv/kernel/usercfi.c | 121 +++++++++++++++++++++++++++++++
3 files changed, 170 insertions(+)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 080d7077d12c..eb9a0905e72b 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -8,6 +8,9 @@
#ifndef __ASSEMBLY__
#include <linux/types.h>
+struct task_struct;
+struct kernel_clone_args;
+
#ifdef CONFIG_RISCV_USER_CFI
struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
@@ -17,6 +20,42 @@ struct cfi_status {
unsigned long shdw_stk_size; /* size of shadow stack */
};
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args);
+void shstk_release(struct task_struct *tsk);
+void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size);
+void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
+bool is_shstk_enabled(struct task_struct *task);
+
+#else
+
+static inline unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args)
+{
+ return 0;
+}
+
+static inline void shstk_release(struct task_struct *tsk)
+{
+
+}
+
+static inline void set_shstk_base(struct task_struct *task, unsigned long shstk_addr,
+ unsigned long size)
+{
+
+}
+
+static inline void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
+{
+
+}
+
+static inline bool is_shstk_enabled(struct task_struct *task)
+{
+ return false;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index c249cf3d8083..a2b2a686a545 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -26,6 +26,7 @@
#include <asm/cpuidle.h>
#include <asm/vector.h>
#include <asm/cpufeature.h>
+#include <asm/usercfi.h>
register unsigned long gp_in_global __asm__("gp");
@@ -194,6 +195,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
void exit_thread(struct task_struct *tsk)
{
+ shstk_release(tsk);
return;
}
@@ -202,6 +204,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
unsigned long clone_flags = args->flags;
unsigned long usp = args->stack;
unsigned long tls = args->tls;
+ unsigned long ssp = 0;
struct pt_regs *childregs = task_pt_regs(p);
memset(&p->thread.s, 0, sizeof(p->thread.s));
@@ -217,11 +220,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
p->thread.s[0] = (unsigned long)args->fn;
p->thread.s[1] = (unsigned long)args->fn_arg;
} else {
+ /* Allocate a new shadow stack if needed, e.g. for a CLONE_VM thread. */
+ ssp = shstk_alloc_thread_stack(p, args);
+ if (IS_ERR_VALUE(ssp))
+ return PTR_ERR((void *)ssp);
+
*childregs = *(current_pt_regs());
/* Turn off status.VS */
riscv_v_vstate_off(childregs);
if (usp) /* User fork */
childregs->sp = usp;
+ if (ssp) /* if needed, set new ssp */
+ set_active_shstk(p, ssp);
if (clone_flags & CLONE_SETTLS)
childregs->tp = tls;
childregs->a0 = 0; /* Return value of fork() */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index 35ede2cbc05b..36cac0d653f5 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -19,6 +19,41 @@
#define SHSTK_ENTRY_SIZE sizeof(void *)
+bool is_shstk_enabled(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ubcfi_en ? true : false;
+}
+
+void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size)
+{
+ task->thread_info.user_cfi_state.shdw_stk_base = shstk_addr;
+ task->thread_info.user_cfi_state.shdw_stk_size = size;
+}
+
+unsigned long get_shstk_base(struct task_struct *task, unsigned long *size)
+{
+ if (size)
+ *size = task->thread_info.user_cfi_state.shdw_stk_size;
+ return task->thread_info.user_cfi_state.shdw_stk_base;
+}
+
+void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
+{
+ task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
+}
+
+/*
+ * If size is 0, then to be compatible with regular stack we want it to be as big as
+ * regular stack. Else PAGE_ALIGN it and return back
+ */
+static unsigned long calc_shstk_size(unsigned long size)
+{
+ if (size)
+ return PAGE_ALIGN(size);
+
+ return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
/*
* Writes on shadow stack can either be `sspush` or `ssamoswap`. `sspush` can happen
* implicitly on current shadow stack pointed to by CSR_SSP. `ssamoswap` takes pointer to
@@ -148,3 +183,89 @@ SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsi
return allocate_shadow_stack(addr, aligned_size, size, set_tok);
}
+
+/*
+ * This gets called during clone/clone3/fork and is needed to allocate a shadow stack for
+ * cases where CLONE_VM is specified and thus a different stack is specified by the user. We
+ * thus need a separate shadow stack too. How a separate shadow stack is specified by
+ * the user is still being debated. Once that's settled, remove this part of the comment.
+ * This function simply returns 0 if shadow stacks are not supported or if a separate shadow
+ * stack allocation is not needed (like in case of !CLONE_VM)
+ */
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+ const struct kernel_clone_args *args)
+{
+ unsigned long addr, size;
+
+ /* If shadow stack is not supported, return 0 */
+ if (!cpu_supports_shadow_stack())
+ return 0;
+
+ /*
+ * If shadow stack is not enabled on the new thread, skip any
+ * switch to a new shadow stack.
+ */
+ if (!is_shstk_enabled(tsk))
+ return 0;
+
+ /*
+ * For CLONE_VFORK the child will share the parent's shadow stack.
+ * Set base = 0 and size = 0; this is a special marker to track this state
+ * so the freeing logic run for the child knows to leave it alone.
+ */
+ if (args->flags & CLONE_VFORK) {
+ set_shstk_base(tsk, 0, 0);
+ return 0;
+ }
+
+ /*
+ * For !CLONE_VM the child will use a copy of the parent's shadow
+ * stack.
+ */
+ if (!(args->flags & CLONE_VM))
+ return 0;
+
+ /*
+ * reaching here means, CLONE_VM was specified and thus a separate shadow
+ * stack is needed for new cloned thread. Note: below allocation is happening
+ * using current mm.
+ */
+ size = calc_shstk_size(args->stack_size);
+ addr = allocate_shadow_stack(0, size, 0, false);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
+ set_shstk_base(tsk, addr, size);
+
+ return addr + size;
+}
+
+void shstk_release(struct task_struct *tsk)
+{
+ unsigned long base = 0, size = 0;
+ /* If shadow stack is not supported or not enabled, nothing to release */
+ if (!cpu_supports_shadow_stack() ||
+ !is_shstk_enabled(tsk))
+ return;
+
+ /*
+ * When fork() with CLONE_VM fails, the child (tsk) already has a
+ * shadow stack allocated, and exit_thread() calls this function to
+ * free it. In this case the parent (current) and the child share
+ * the same mm struct. Move forward only when they're same.
+ */
+ if (!tsk->mm || tsk->mm != current->mm)
+ return;
+
+ /*
+ * We know shadow stack is enabled but if base is NULL, then
+ * this task is not managing its own shadow stack (CLONE_VFORK). So
+ * skip freeing it.
+ */
+ base = get_shstk_base(tsk, &size);
+ if (!base)
+ return;
+
+ vm_munmap(base, size);
+ set_shstk_base(tsk, 0, 0);
+}
--
2.43.0
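The decision made on clone in the patch above can be summarised with a small
user-space model. It is a sketch under assumptions, not the kernel code: the
model_* and MODEL_* names are hypothetical, a 4 KiB page size is assumed, and
the model follows the intent stated in the comments (a task without shadow
stack enabled gets no new shadow stack).

```c
#include <assert.h>

/* Flag values as defined in the Linux UAPI <linux/sched.h>. */
#define MODEL_CLONE_VM		0x00000100UL
#define MODEL_CLONE_VFORK	0x00004000UL
#define MODEL_PAGE_SIZE		4096ULL		/* assumed page size */
#define MODEL_SZ_4G		0x100000000ULL

enum model_clone_action {
	MODEL_SHSTK_NONE,		/* unsupported or not enabled */
	MODEL_SHSTK_SHARE_PARENT,	/* CLONE_VFORK: child uses the parent's */
	MODEL_SHSTK_COPY_WITH_MM,	/* !CLONE_VM: copied together with the mm */
	MODEL_SHSTK_ALLOC_NEW,		/* CLONE_VM thread: fresh shadow stack */
};

/* Hypothetical model of the checks in shstk_alloc_thread_stack(). */
static enum model_clone_action model_clone_action(int supported, int enabled,
						  unsigned long flags)
{
	if (!supported || !enabled)
		return MODEL_SHSTK_NONE;
	if (flags & MODEL_CLONE_VFORK)
		return MODEL_SHSTK_SHARE_PARENT;
	if (!(flags & MODEL_CLONE_VM))
		return MODEL_SHSTK_COPY_WITH_MM;
	return MODEL_SHSTK_ALLOC_NEW;
}

/*
 * Hypothetical model of calc_shstk_size(): a requested size is page
 * aligned; a size of 0 falls back to min(RLIMIT_STACK, 4G).
 */
static unsigned long long model_shstk_size(unsigned long long requested,
					   unsigned long long stack_rlimit)
{
	unsigned long long size = requested;
	unsigned long long aligned;

	if (!size)
		size = stack_rlimit < MODEL_SZ_4G ? stack_rlimit : MODEL_SZ_4G;

	aligned = (size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	assert(aligned % MODEL_PAGE_SIZE == 0);
	return aligned;
}
```

Note that CLONE_VFORK is checked before CLONE_VM, so a vfork-style clone that
also shares the address space still shares the parent's shadow stack.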
From 5d9631f58cb41a3a682d1d646b24c04f512c06eb Mon Sep 17 00:00:00 2001
From: Mark Brown <broonie(a)kernel.org>
Date: Mon, 9 Oct 2023 13:08:36 +0100
Subject: [PATCH 17/28] prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for
shadow stacks with fairly similar functionality. While x86 is using
arch_prctl() to control the functionality neither arm64 nor riscv uses
that interface so this patch adds arch-agnostic prctl() support to
get and set status of shadow stacks and lock the current configuration to
prevent further changes, with support for turning on and off individual
subfeatures so applications can limit their exposure to features that
they do not need. The features are:
- PR_SHADOW_STACK_ENABLE: Tracking and enforcement of shadow stacks,
including allocation of a shadow stack if one is not already
allocated.
- PR_SHADOW_STACK_WRITE: Writes to specific addresses in the shadow
stack.
- PR_SHADOW_STACK_PUSH: Push additional values onto the shadow stack.
- PR_SHADOW_STACK_DISABLE: Allow to disable shadow stack.
Note once locked, disable must fail.
These features are expected to be inherited by new threads and cleared
on exec(). Unknown features should be rejected for enable but accepted
for locking (in order to allow for future proofing).
This is based on a patch originally written by Deepak Gupta but later
modified by Mark Brown for arm's GCS patch series.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Co-developed-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/linux/mm.h | 3 +++
include/uapi/linux/prctl.h | 22 ++++++++++++++++++++++
kernel/sys.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15c70fc677a3..df248764bcec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4170,5 +4170,8 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
}
+int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status);
+int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
+int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
#endif /* _LINUX_MM_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..3c66ed8f46d8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,26 @@ struct prctl_mm_map {
# define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK 0xc
# define PR_RISCV_V_VSTATE_CTRL_MASK 0x1f
+/*
+ * Get the current shadow stack configuration for the current thread,
+ * this will be the value configured via PR_SET_SHADOW_STACK_STATUS.
+ */
+#define PR_GET_SHADOW_STACK_STATUS 71
+
+/*
+ * Set the current shadow stack configuration. Enabling the shadow
+ * stack will cause a shadow stack to be allocated for the thread.
+ */
+#define PR_SET_SHADOW_STACK_STATUS 72
+# define PR_SHADOW_STACK_ENABLE (1UL << 0)
+# define PR_SHADOW_STACK_WRITE (1UL << 1)
+# define PR_SHADOW_STACK_PUSH (1UL << 2)
+
+/*
+ * Prevent further changes to the specified shadow stack
+ * configuration. All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_SHADOW_STACK_STATUS 73
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index e219fcfa112d..96e8a6b5993a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2301,6 +2301,21 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}
+int __weak arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_set_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
#ifdef CONFIG_ANON_VMA_NAME
@@ -2743,6 +2758,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_V_GET_CONTROL:
error = RISCV_V_GET_CONTROL();
break;
+ case PR_GET_SHADOW_STACK_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_get_shadow_stack_status(me, (unsigned long __user *) arg2);
+ break;
+ case PR_SET_SHADOW_STACK_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_set_shadow_stack_status(me, arg2);
+ break;
+ case PR_LOCK_SHADOW_STACK_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_lock_shadow_stack_status(me, arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.43.0
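The semantics described in the commit message above (unknown features are
rejected when setting but accepted when locking, and a locked bit can no
longer change) can be sketched as a small user-space model. This is an
illustration only: the model_* and MODEL_* names are hypothetical, the
per-bit lock mask is one possible reading of the description, and the errno
returned for touching a locked bit is an assumption, since the patch leaves
that to each architecture.

```c
#include <assert.h>
#include <errno.h>

/* Feature bits as defined in this patch's include/uapi/linux/prctl.h. */
#define MODEL_SHADOW_STACK_ENABLE	(1UL << 0)
#define MODEL_SHADOW_STACK_WRITE	(1UL << 1)
#define MODEL_SHADOW_STACK_PUSH		(1UL << 2)
#define MODEL_SUPPORTED_MASK \
	(MODEL_SHADOW_STACK_ENABLE | MODEL_SHADOW_STACK_WRITE | \
	 MODEL_SHADOW_STACK_PUSH)

struct model_shstk_cfg {
	unsigned long status;	/* currently enabled features */
	unsigned long locked;	/* bits that may no longer change */
};

/* Hypothetical model of PR_SET_SHADOW_STACK_STATUS. */
static int model_set_status(struct model_shstk_cfg *cfg, unsigned long status)
{
	assert(cfg);

	/* Unknown features are rejected when enabling. */
	if (status & ~MODEL_SUPPORTED_MASK)
		return -EINVAL;
	/* Flipping a locked bit fails (errno value assumed for this sketch). */
	if ((status ^ cfg->status) & cfg->locked)
		return -EINVAL;

	cfg->status = status;
	return 0;
}

/* Hypothetical model of PR_LOCK_SHADOW_STACK_STATUS. */
static int model_lock_status(struct model_shstk_cfg *cfg, unsigned long mask)
{
	assert(cfg);

	/* Any bit may be locked, including ones not defined yet. */
	cfg->locked |= mask;
	return 0;
}
```

With this model, a task that enables the shadow stack and then locks all bits
can no longer disable it, while re-applying the same status still succeeds.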
From a4fbd367bc8d3ce589cbda16534d274a109b3666 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 12 Dec 2023 11:59:55 -0800
Subject: [PATCH 18/28] prctl: arch-agnostic prctl for indirect branch tracking
Three architectures (x86, aarch64, riscv) support an indirect branch
tracking feature in a very similar fashion. At a high level, indirect
branch tracking is a CPU feature where the CPU tracks branches which use a
memory operand to perform a control transfer in the program. As part of this
tracking, on an indirect branch the CPU enters a state where it expects a
landing pad instruction at the target; if none is found, the CPU raises a
fault (architecture dependent).
x86 landing pad instr - `ENDBRANCH`
aarch64 landing pad instr - `BTI`
riscv landing pad instr - `lpad`
Given that three major arches support indirect branch tracking, this patch
makes the `prctl` for indirect branch tracking arch agnostic.
To allow userspace to enable this feature for itself, the following prctls
are defined:
- PR_GET_INDIR_BR_LP_STATUS: Gets current configured status for indirect branch
tracking.
- PR_SET_INDIR_BR_LP_STATUS: Sets a configuration for indirect branch tracking
Following status options are allowed
- PR_INDIR_BR_LP_ENABLE: Enables indirect branch tracking on user
thread.
- PR_INDIR_BR_LP_DISABLE: Disables indirect branch tracking on user
thread.
- PR_LOCK_INDIR_BR_LP_STATUS: Locks configured status for indirect branch
tracking for user thread.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
include/uapi/linux/prctl.h | 27 +++++++++++++++++++++++++++
kernel/sys.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 3c66ed8f46d8..b7a8212a068e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,31 @@ struct prctl_mm_map {
*/
#define PR_LOCK_SHADOW_STACK_STATUS 73
+/*
+ * Get the current indirect branch tracking configuration for the current
+ * thread, this will be the value configured via PR_SET_INDIR_BR_LP_STATUS.
+ */
+#define PR_GET_INDIR_BR_LP_STATUS 74
+
+/*
+ * Set the indirect branch tracking configuration. PR_INDIR_BR_LP_ENABLE will
+ * enable cpu feature for user thread, to track all indirect branches and ensure
+ * they land on arch defined landing pad instruction.
+ * x86 - If enabled, an indirect branch must land on `ENDBRANCH` instruction.
+ * aarch64 - If enabled, an indirect branch must land on `BTI` instruction.
+ * riscv - If enabled, an indirect branch must land on `lpad` instruction.
+ * PR_INDIR_BR_LP_DISABLE will disable feature for user thread and indirect
+ * branches will no more be tracked by cpu to land on arch defined landing pad
+ * instruction.
+ */
+#define PR_SET_INDIR_BR_LP_STATUS 75
+# define PR_INDIR_BR_LP_ENABLE (1UL << 0)
+
+/*
+ * Prevent further changes to the specified indirect branch tracking
+ * configuration. All bits may be locked via this call, including
+ * undefined bits.
+ */
+#define PR_LOCK_INDIR_BR_LP_STATUS 76
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 96e8a6b5993a..9e2ebf9d9859 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2316,6 +2316,21 @@ int __weak arch_lock_shadow_stack_status(struct task_struct *t, unsigned long st
return -EINVAL;
}
+int __weak arch_get_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_set_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
+int __weak arch_lock_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+ return -EINVAL;
+}
+
#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
#ifdef CONFIG_ANON_VMA_NAME
@@ -2773,6 +2788,21 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+ case PR_GET_INDIR_BR_LP_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_get_indir_br_lp_status(me, (unsigned long __user *) arg2);
+ break;
+ case PR_SET_INDIR_BR_LP_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_set_indir_br_lp_status(me, arg2);
+ break;
+ case PR_LOCK_INDIR_BR_LP_STATUS:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = arch_lock_indir_br_lp_status(me, arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.43.0
From f2375f9a41e02e9ad8fa6910e12dfae84acbc9ae Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 10:28:35 -0800
Subject: [PATCH 19/28] riscv: Implements arch agnostic shadow stack prctls
Implement the architecture-agnostic prctl() interface for setting and
getting shadow stack status.
The prctls implemented are PR_GET_SHADOW_STACK_STATUS,
PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS.
As part of PR_SET_SHADOW_STACK_STATUS/PR_GET_SHADOW_STACK_STATUS, only
PR_SHADOW_STACK_ENABLE is implemented because RISC-V allows each mode to write
to its own shadow stack using `sspush` or `ssamoswap`.
PR_LOCK_SHADOW_STACK_STATUS locks the current shadow stack enabling
configuration.
The following is not supported:
"Enable shadow stack, then disable and enable again."
It is unclear whether such semantics are useful, so it is better to return an
error code when such a situation arises.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 12 +++-
arch/riscv/kernel/usercfi.c | 105 +++++++++++++++++++++++++++++++
2 files changed, 116 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index eb9a0905e72b..72bcfa773752 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -7,6 +7,7 @@
#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <linux/prctl.h>
struct task_struct;
struct kernel_clone_args;
@@ -14,7 +15,8 @@ struct kernel_clone_args;
#ifdef CONFIG_RISCV_USER_CFI
struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
- unsigned long rsvd : ((sizeof(unsigned long)*8) - 1);
+ unsigned long ubcfi_locked : 1;
+ unsigned long rsvd : ((sizeof(unsigned long)*8) - 2);
unsigned long user_shdw_stk; /* Current user shadow stack pointer */
unsigned long shdw_stk_base; /* Base address of shadow stack */
unsigned long shdw_stk_size; /* size of shadow stack */
@@ -26,6 +28,9 @@ void shstk_release(struct task_struct *tsk);
void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size);
void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
bool is_shstk_enabled(struct task_struct *task);
+bool is_shstk_locked(struct task_struct *task);
+
+#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
#else
@@ -56,6 +61,11 @@ static inline bool is_shstk_enabled(struct task_struct *task)
return false;
}
+static inline bool is_shstk_locked(struct task_struct *task)
+{
+ return false;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index 36cac0d653f5..be3a071272d8 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -24,6 +24,16 @@ bool is_shstk_enabled(struct task_struct *task)
return task->thread_info.user_cfi_state.ubcfi_en ? true : false;
}
+bool is_shstk_allocated(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.shdw_stk_base ? true : false;
+}
+
+bool is_shstk_locked(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ubcfi_locked ? true : false;
+}
+
void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned long size)
{
task->thread_info.user_cfi_state.shdw_stk_base = shstk_addr;
@@ -42,6 +52,21 @@ void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
}
+void set_shstk_status(struct task_struct *task, bool enable)
+{
+ task->thread_info.user_cfi_state.ubcfi_en = enable ? 1 : 0;
+
+ if (enable)
+ task->thread_info.envcfg |= ENVCFG_SSE;
+ else
+ task->thread_info.envcfg &= ~ENVCFG_SSE;
+}
+
+void set_shstk_lock(struct task_struct *task)
+{
+ task->thread_info.user_cfi_state.ubcfi_locked = 1;
+}
+
/*
* If size is 0, then to be compatible with regular stack we want it to be as big as
* regular stack. Else PAGE_ALIGN it and return back
@@ -269,3 +294,83 @@ void shstk_release(struct task_struct *tsk)
vm_munmap(base, size);
set_shstk_base(tsk, 0, 0);
}
+
+int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status)
+{
+ unsigned long bcfi_status = 0;
+
+ if (!cpu_supports_shadow_stack())
+ return -EINVAL;
+
+ /* this means shadow stack is enabled on the task */
+ bcfi_status |= (is_shstk_enabled(t) ? PR_SHADOW_STACK_ENABLE : 0);
+
+ return copy_to_user(status, &bcfi_status, sizeof(bcfi_status)) ? -EFAULT : 0;
+}
+
+int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status)
+{
+ unsigned long size = 0, addr = 0;
+ bool enable_shstk = false;
+
+ if (!cpu_supports_shadow_stack())
+ return -EINVAL;
+
+ /* Reject unknown flags */
+ if (status & ~PR_SHADOW_STACK_SUPPORTED_STATUS_MASK)
+ return -EINVAL;
+
+ /* bcfi status is locked and further can't be modified by user */
+ if (is_shstk_locked(t))
+ return -EINVAL;
+
+ enable_shstk = status & PR_SHADOW_STACK_ENABLE;
+ /* Request is to enable shadow stack and shadow stack is not enabled already */
+ if (enable_shstk && !is_shstk_enabled(t)) {
+ /* shadow stack was allocated and enable request again
+ * no need to support such usecase and return EINVAL.
+ */
+ if (is_shstk_allocated(t))
+ return -EINVAL;
+
+ size = calc_shstk_size(0);
+ addr = allocate_shadow_stack(0, size, 0, false);
+ if (IS_ERR_VALUE(addr))
+ return -ENOMEM;
+ set_shstk_base(t, addr, size);
+ set_active_shstk(t, addr + size);
+ }
+
+ /*
+ * If a request to disable shadow stack happens, let's go ahead and release it
+ * Although, if CLONE_VFORKed child did this, then in that case we will end up
+ * not releasing the shadow stack (because it might be needed in parent). Although
+ * we will disable it for VFORKed child. And if VFORKed child tries to enable again
+ * then in that case, it'll get entirely new shadow stack because following condition
+ * are true
+ * - shadow stack was not enabled for vforked child
+ * - shadow stack base was anyways pointing to 0
+ * This shouldn't be a big issue because we want parent to have availability of shadow
+ * stack whenever VFORKed child releases resources via exit or exec but at the same
+ * time we want VFORKed child to break away and establish new shadow stack if it desires
+ *
+ */
+ if (!enable_shstk)
+ shstk_release(t);
+
+ set_shstk_status(t, enable_shstk);
+ return 0;
+}
+
+int arch_lock_shadow_stack_status(struct task_struct *task,
+ unsigned long arg)
+{
+ /* If shstk not supported or not enabled on task, nothing to lock here */
+ if (!cpu_supports_shadow_stack() ||
+ !is_shstk_enabled(task))
+ return -EINVAL;
+
+ set_shstk_lock(task);
+
+ return 0;
+}
--
2.43.0
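The enable and lock rules implemented for RISC-V in the patch above can be
sketched as a user-space model. This is only an illustration: the model_*
and MODEL_* names are hypothetical, the bit position used for senvcfg.SSE is
an assumption of this sketch, and shadow stack allocation and release are
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PR_SHADOW_STACK_ENABLE	(1UL << 0)
#define MODEL_ENVCFG_SSE		(1UL << 3)	/* assumed position of senvcfg.SSE */

struct model_task {
	bool bcfi_en;		/* mirrors ubcfi_en */
	bool bcfi_locked;	/* mirrors ubcfi_locked */
	unsigned long envcfg;	/* per-task copy of the envcfg CSR value */
};

/* Hypothetical model of arch_set_shadow_stack_status(). */
static int model_riscv_set(struct model_task *t, unsigned long status)
{
	assert(t);

	/* Only PR_SHADOW_STACK_ENABLE is supported. */
	if (status & ~MODEL_PR_SHADOW_STACK_ENABLE)
		return -EINVAL;
	/* A locked configuration cannot be modified any more. */
	if (t->bcfi_locked)
		return -EINVAL;

	t->bcfi_en = status & MODEL_PR_SHADOW_STACK_ENABLE;
	if (t->bcfi_en)
		t->envcfg |= MODEL_ENVCFG_SSE;
	else
		t->envcfg &= ~MODEL_ENVCFG_SSE;
	return 0;
}

/* Hypothetical model of arch_lock_shadow_stack_status(). */
static int model_riscv_lock(struct model_task *t)
{
	assert(t);

	/* Locking is only possible while the shadow stack is enabled. */
	if (!t->bcfi_en)
		return -EINVAL;
	t->bcfi_locked = true;
	return 0;
}
```

The model makes one property of the patch easy to see: locking is rejected
until the shadow stack has been enabled, and after a successful lock every
further set request fails, including a request to disable.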
From 8805582b1d18fb5488c9c8ed650f957c08697cfa Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Tue, 19 Dec 2023 07:58:08 -0800
Subject: [PATCH 20/28] riscv: Implements arch agnostic indirect branch
tracking prctls
prctls implemented are PR_SET_INDIR_BR_LP_STATUS / PR_GET_INDIR_BR_LP_STATUS
and PR_LOCK_INDIR_BR_LP_STATUS.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 17 +++++++-
arch/riscv/kernel/usercfi.c | 74 ++++++++++++++++++++++++++++++++
2 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 72bcfa773752..4bd10dcd48aa 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -16,7 +16,9 @@ struct kernel_clone_args;
struct cfi_status {
unsigned long ubcfi_en : 1; /* Enable for backward cfi. */
unsigned long ubcfi_locked : 1;
- unsigned long rsvd : ((sizeof(unsigned long)*8) - 2);
+ unsigned long ufcfi_en : 1; /* Enable for forward cfi. Note that ELP goes in sstatus */
+ unsigned long ufcfi_locked : 1;
+ unsigned long rsvd : ((sizeof(unsigned long)*8) - 4);
unsigned long user_shdw_stk; /* Current user shadow stack pointer */
unsigned long shdw_stk_base; /* Base address of shadow stack */
unsigned long shdw_stk_size; /* size of shadow stack */
@@ -29,6 +31,8 @@ void set_shstk_base(struct task_struct *task, unsigned long shstk_addr, unsigned
void set_active_shstk(struct task_struct *task, unsigned long shstk_addr);
bool is_shstk_enabled(struct task_struct *task);
bool is_shstk_locked(struct task_struct *task);
+bool is_indir_lp_enabled(struct task_struct *task);
+bool is_indir_lp_locked(struct task_struct *task);
#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
@@ -66,6 +70,17 @@ static inline bool is_shstk_locked(struct task_struct *task)
return false;
}
+static inline bool is_indir_lp_enabled(struct task_struct *task)
+{
+ return false;
+}
+
+static inline bool is_indir_lp_locked(struct task_struct *task)
+
+{
+ return false;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index be3a071272d8..af8cc8f4616c 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -67,6 +67,30 @@ void set_shstk_lock(struct task_struct *task)
task->thread_info.user_cfi_state.ubcfi_locked = 1;
}
+bool is_indir_lp_enabled(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ufcfi_en ? true : false;
+}
+
+bool is_indir_lp_locked(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.ufcfi_locked ? true : false;
+}
+
+void set_indir_lp_status(struct task_struct *task, bool enable)
+{
+ task->thread_info.user_cfi_state.ufcfi_en = enable ? 1 : 0;
+
+ if (enable)
+ task->thread_info.envcfg |= ENVCFG_LPE;
+ else
+ task->thread_info.envcfg &= ~ENVCFG_LPE;
+}
+
+void set_indir_lp_lock(struct task_struct *task)
+{
+ task->thread_info.user_cfi_state.ufcfi_locked = 1;
+}
/*
* If size is 0, then to be compatible with regular stack we want it to be as big as
* regular stack. Else PAGE_ALIGN it and return back
@@ -374,3 +398,53 @@ int arch_lock_shadow_stack_status(struct task_struct *task,
return 0;
}
+
+int arch_get_indir_br_lp_status(struct task_struct *t, unsigned long __user *status)
+{
+ unsigned long fcfi_status = 0;
+
+ if (!cpu_supports_indirect_br_lp_instr())
+ return -EINVAL;
+
+ /* indirect branch tracking is enabled on the task or not */
+ fcfi_status |= (is_indir_lp_enabled(t) ? PR_INDIR_BR_LP_ENABLE : 0);
+
+ return copy_to_user(status, &fcfi_status, sizeof(fcfi_status)) ? -EFAULT : 0;
+}
+
+int arch_set_indir_br_lp_status(struct task_struct *t, unsigned long status)
+{
+ bool enable_indir_lp = false;
+
+ if (!cpu_supports_indirect_br_lp_instr())
+ return -EINVAL;
+
+ /* indirect branch tracking is locked and further can't be modified by user */
+ if (is_indir_lp_locked(t))
+ return -EINVAL;
+
+ /* Reject unknown flags */
+ if (status & ~PR_INDIR_BR_LP_ENABLE)
+ return -EINVAL;
+
+ enable_indir_lp = (status & PR_INDIR_BR_LP_ENABLE) ? true : false;
+ set_indir_lp_status(t, enable_indir_lp);
+
+ return 0;
+}
+
+int arch_lock_indir_br_lp_status(struct task_struct *task,
+ unsigned long arg)
+{
+ /*
+ * If indirect branch tracking is not supported or not enabled on task,
+ * nothing to lock here
+ */
+ if (!cpu_supports_indirect_br_lp_instr() ||
+ !is_indir_lp_enabled(task))
+ return -EINVAL;
+
+ set_indir_lp_lock(task);
+
+ return 0;
+}
--
2.43.0
From 5b7ab396ec70d3e830acc7c9229c040f87db1ae8 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Mon, 11 Dec 2023 20:41:17 -0800
Subject: [PATCH 21/28] riscv/traps: Introduce software check exception
zicfiss / zicfilp introduce a new exception to the privileged ISA, the
`software check exception`, with cause code = 18. This patch implements the
software check exception handler.
Additionally, it implements a cfi violation handler which checks the code in xtval.
If xtval=2, the sw check exception happened because an indirect branch did not
land on a 4 byte aligned PC, did not land on an `lpad` instruction, or the label
value embedded in `lpad` did not match the label value set up in `x7`.
If xtval=3, the sw check exception happened because of a mismatch between the
link register (x1 or x5) and the top of the shadow stack (on execution of `sspopchk`).
In case of a cfi violation, SIGSEGV is raised with code=SEGV_CPERR. SEGV_CPERR was
introduced by the x86 shadow stack patches.
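The dispatch described above can be sketched as a small user-space model.
This is only an illustration of the decision logic, not the kernel code: the
enum `sw_check_action`, the struct `cpu_caps` and the function
`classify_sw_check()` are hypothetical names; only the two tval codes are
taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* tval codes, named after the CFI_TVAL_* defines in this patch */
#define CFI_TVAL_FCFI_CODE 2 /* landing pad (forward edge) violation */
#define CFI_TVAL_BCFI_CODE 3 /* shadow stack (backward edge) violation */

/* Hypothetical outcome of a software check exception (cause=18). */
enum sw_check_action {
	SW_CHECK_SIGSEGV_CPERR, /* user cfi violation: SIGSEGV, code SEGV_CPERR */
	SW_CHECK_UNKNOWN_TRAP,  /* not a known cfi code: unknown trap path */
	SW_CHECK_KERNEL_DIE,    /* exception came from kernel mode */
};

/* Hypothetical stand-in for the cpu_supports_*() helpers. */
struct cpu_caps {
	bool has_zicfilp;
	bool has_zicfiss;
};

static enum sw_check_action classify_sw_check(bool from_user,
					      unsigned long tval,
					      const struct cpu_caps *caps)
{
	assert(caps != NULL);

	/* A software check exception raised in kernel mode is a kernel bug. */
	if (!from_user)
		return SW_CHECK_KERNEL_DIE;

	if (tval == CFI_TVAL_FCFI_CODE && caps->has_zicfilp)
		return SW_CHECK_SIGSEGV_CPERR;
	if (tval == CFI_TVAL_BCFI_CODE && caps->has_zicfiss)
		return SW_CHECK_SIGSEGV_CPERR;

	/* Any other code falls through to the unknown trap handler. */
	return SW_CHECK_UNKNOWN_TRAP;
}
```

In this model a code of 2 or 3 only counts as a cfi violation when the
matching extension is present; everything else from user mode is treated as
an unknown trap.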
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/kernel/entry.S | 3 ++
arch/riscv/kernel/traps.c | 38 +++++++++++++++++++++++++
3 files changed, 42 insertions(+)
diff --git a/arch/riscv/include/asm/asm-prototypes.h b/arch/riscv/include/asm/asm-prototypes.h
index 36b955c762ba..4ba8aea58dd0 100644
--- a/arch/riscv/include/asm/asm-prototypes.h
+++ b/arch/riscv/include/asm/asm-prototypes.h
@@ -24,6 +24,7 @@ DECLARE_DO_ERROR_INFO(do_trap_ecall_u);
DECLARE_DO_ERROR_INFO(do_trap_ecall_s);
DECLARE_DO_ERROR_INFO(do_trap_ecall_m);
DECLARE_DO_ERROR_INFO(do_trap_break);
+DECLARE_DO_ERROR_INFO(do_trap_software_check);
asmlinkage void handle_bad_stack(struct pt_regs *regs);
asmlinkage void do_page_fault(struct pt_regs *regs);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 410659e2eadb..56dfe04094c1 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -369,6 +369,9 @@ SYM_DATA_START_LOCAL(excp_vect_table)
RISCV_PTR do_page_fault /* load page fault */
RISCV_PTR do_trap_unknown
RISCV_PTR do_page_fault /* store page fault */
+ RISCV_PTR do_trap_unknown /* cause=16 */
+ RISCV_PTR do_trap_unknown /* cause=17 */
+ RISCV_PTR do_trap_software_check /* cause=18 is sw check exception */
SYM_DATA_END_LABEL(excp_vect_table, SYM_L_LOCAL, excp_vect_table_end)
#ifndef CONFIG_MMU
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index a1b9be3c4332..9fba263428a1 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -339,6 +339,44 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
}
+#define CFI_TVAL_FCFI_CODE 2
+#define CFI_TVAL_BCFI_CODE 3
+/* handle cfi violations */
+bool handle_user_cfi_violation(struct pt_regs *regs)
+{
+ bool ret = false;
+ unsigned long tval = csr_read(CSR_TVAL);
+
+ if (((tval == CFI_TVAL_FCFI_CODE) && cpu_supports_indirect_br_lp_instr()) ||
+ ((tval == CFI_TVAL_BCFI_CODE) && cpu_supports_shadow_stack())) {
+ do_trap_error(regs, SIGSEGV, SEGV_CPERR, regs->epc,
+ "Oops - control flow violation");
+ ret = true;
+ }
+
+ return ret;
+}
+/*
+ * software check exception is defined with risc-v cfi spec. Software check
+ * exception is raised when:-
+ * a) An indirect branch doesn't land on 4 byte aligned PC or `lpad`
+ * instruction or `label` value programmed in `lpad` instr doesn't
+ * match with value setup in `x7`. reported code in `xtval` is 2.
+ * b) `sspopchk` instruction finds a mismatch between top of shadow stack (ssp)
+ * and x1/x5. reported code in `xtval` is 3.
+ */
+asmlinkage __visible __trap_section void do_trap_software_check(struct pt_regs *regs)
+{
+ if (user_mode(regs)) {
+ /* not a cfi violation, then merge into flow of unknown trap handler */
+ if (!handle_user_cfi_violation(regs))
+ do_trap_unknown(regs);
+ } else {
+ /* sw check exception coming from kernel is a bug in kernel */
+ die(regs, "Kernel BUG");
+ }
+}
+
#ifdef CONFIG_MMU
asmlinkage __visible noinstr void do_page_fault(struct pt_regs *regs)
{
--
2.43.0
From a825cd99de13621b7e6e45cafa202f11cb3c3596 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 4 Jan 2023 19:20:09 -0800
Subject: [PATCH 22/28] riscv sigcontext: adding cfi state field in sigcontext
The shadow stack pointer needs to be saved and restored on signal delivery and
signal return.
The sigcontext embedded in ucontext is extensible. Add cfi state to it, which
can be used to save cfi state before signal delivery and restore cfi state
on sigreturn.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/uapi/asm/sigcontext.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/riscv/include/uapi/asm/sigcontext.h b/arch/riscv/include/uapi/asm/sigcontext.h
index cd4f175dc837..5ccdd94a0855 100644
--- a/arch/riscv/include/uapi/asm/sigcontext.h
+++ b/arch/riscv/include/uapi/asm/sigcontext.h
@@ -21,6 +21,10 @@ struct __sc_riscv_v_state {
struct __riscv_v_ext_state v_state;
} __attribute__((aligned(16)));
+struct __sc_riscv_cfi_state {
+ unsigned long ss_ptr; /* shadow stack pointer */
+ unsigned long rsvd; /* keeping another word reserved in case we need it */
+};
/*
* Signal context structure
*
@@ -29,6 +33,7 @@ struct __sc_riscv_v_state {
*/
struct sigcontext {
struct user_regs_struct sc_regs;
+ struct __sc_riscv_cfi_state sc_cfi_state;
union {
union __riscv_fp_state sc_fpregs;
struct __riscv_extra_ext_header sc_extdesc;
--
2.43.0
From 463c2e77fb9a1cf3bb75f1e359c55f1792b88349 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Sat, 14 Jan 2023 17:45:54 -0800
Subject: [PATCH 23/28] riscv signal: Save and restore of shadow stack for
signal
Save shadow stack pointer in sigcontext structure while delivering signal.
Restore shadow stack pointer from sigcontext on sigreturn.
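The save / restore token check can be sketched with a user-space model of a
shadow stack. This is a simplified illustration under stated assumptions:
`struct shstk_model`, `slot_at()`, `model_save()` and `model_restore()` are
hypothetical names, the token is assumed to be stored one entry below the
current pointer, and a plain read-then-clear stands in for the atomic swap
used by the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SHSTK_ENTRY_SIZE ((unsigned long)sizeof(unsigned long))
#define SHSTK_SLOTS 16

/* Hypothetical model: a small array standing in for shadow stack memory. */
struct shstk_model {
	unsigned long base;               /* lowest modelled address */
	unsigned long slots[SHSTK_SLOTS]; /* backing storage */
	unsigned long ssp;                /* active pointer, stack grows down */
};

/* Translate a modelled address into a slot, or NULL if it is out of range. */
static unsigned long *slot_at(struct shstk_model *ss, unsigned long addr)
{
	assert(ss != NULL);
	if (addr < ss->base || addr >= ss->base + sizeof(ss->slots))
		return NULL;
	if ((addr - ss->base) % SHSTK_ENTRY_SIZE)
		return NULL;
	return &ss->slots[(addr - ss->base) / SHSTK_ENTRY_SIZE];
}

/* Store the current pointer as a token just below it; report the token address. */
static int model_save(struct shstk_model *ss, unsigned long *saved)
{
	unsigned long token_addr = ss->ssp - SHSTK_ENTRY_SIZE;
	unsigned long *slot = slot_at(ss, token_addr);

	if (!slot)
		return -EFAULT;
	*slot = ss->ssp;
	*saved = token_addr;
	return 0;
}

/* Consume the token at ptr and switch to it only if it is self-consistent. */
static int model_restore(struct shstk_model *ss, unsigned long ptr)
{
	unsigned long *slot = slot_at(ss, ptr);
	unsigned long token;

	if (!slot)
		return -EFAULT;
	token = *slot;
	*slot = 0; /* models the swap with 0, so a token cannot be replayed */
	if (token - ptr != SHSTK_ENTRY_SIZE)
		return -EINVAL;
	ss->ssp = token;
	return 0;
}
```

A restore only succeeds for an address that holds a token pointing exactly
one entry above itself, and each token can be consumed once.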
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/asm/usercfi.h | 18 ++++++++++++
arch/riscv/kernel/signal.c | 45 ++++++++++++++++++++++++++++++
arch/riscv/kernel/usercfi.c | 47 ++++++++++++++++++++++++++++++++
3 files changed, 110 insertions(+)
diff --git a/arch/riscv/include/asm/usercfi.h b/arch/riscv/include/asm/usercfi.h
index 4bd10dcd48aa..28c67866ff6f 100644
--- a/arch/riscv/include/asm/usercfi.h
+++ b/arch/riscv/include/asm/usercfi.h
@@ -33,6 +33,9 @@ bool is_shstk_enabled(struct task_struct *task);
bool is_shstk_locked(struct task_struct *task);
bool is_indir_lp_enabled(struct task_struct *task);
bool is_indir_lp_locked(struct task_struct *task);
+unsigned long get_active_shstk(struct task_struct *task);
+int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr);
+int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr);
#define PR_SHADOW_STACK_SUPPORTED_STATUS_MASK (PR_SHADOW_STACK_ENABLE)
@@ -70,6 +73,16 @@ static inline bool is_shstk_locked(struct task_struct *task)
return false;
}
+static inline int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr)
+{
+ return -EINVAL;
+}
+
+static inline int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr)
+{
+ return -EINVAL;
+}
+
static inline bool is_indir_lp_enabled(struct task_struct *task)
{
return false;
@@ -81,6 +94,11 @@ static inline bool is_indir_lp_locked(struct task_struct *task)
return false;
}
+static inline unsigned long get_active_shstk(struct task_struct *task)
+{
+ return 0;
+}
+
#endif /* CONFIG_RISCV_USER_CFI */
#endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c
index 88b6220b2608..d1092f0a6363 100644
--- a/arch/riscv/kernel/signal.c
+++ b/arch/riscv/kernel/signal.c
@@ -22,6 +22,7 @@
#include <asm/vector.h>
#include <asm/csr.h>
#include <asm/cacheflush.h>
+#include <asm/usercfi.h>
unsigned long signal_minsigstksz __ro_after_init;
@@ -229,6 +230,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
struct pt_regs *regs = current_pt_regs();
struct rt_sigframe __user *frame;
struct task_struct *task;
+ unsigned long ss_ptr = 0;
sigset_t set;
size_t frame_size = get_rt_frame_size(false);
@@ -251,6 +253,26 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (restore_altstack(&frame->uc.uc_stack))
goto badframe;
+ /*
+ * Restore the shadow stack pointer by validating the token that was stored on
+ * the shadow stack itself at signal delivery.
+ * A token on the shadow stack gives the following properties:
+ * - Safe save and restore for shadow stack switching. Any save of a shadow stack
+ * pointer must have stored a token on the shadow stack. Similarly, any restore
+ * must check the token before restoring. Since writing the address of the shadow
+ * stack to the shadow stack itself is not easily possible, a restore without a
+ * prior save is quite difficult for an attacker to perform.
+ * - A natural break. A token on the shadow stack provides a natural break, so a
+ * single linear range can be bucketed into different shadow stack segments. Any
+ * sspopchk will detect the condition and fault to the kernel as a sw check exception.
+ */
+ if (__copy_from_user(&ss_ptr, &frame->uc.uc_mcontext.sc_cfi_state.ss_ptr,
+ sizeof(unsigned long)))
+ goto badframe;
+
+ if (is_shstk_enabled(current) && restore_user_shstk(current, ss_ptr))
+ goto badframe;
+
regs->cause = -1UL;
return regs->a0;
@@ -320,6 +342,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
struct rt_sigframe __user *frame;
long err = 0;
unsigned long __maybe_unused addr;
+ unsigned long ss_ptr = 0;
size_t frame_size = get_rt_frame_size(false);
frame = get_sigframe(ksig, regs, frame_size);
@@ -331,6 +354,23 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
/* Create the ucontext. */
err |= __put_user(0, &frame->uc.uc_flags);
err |= __put_user(NULL, &frame->uc.uc_link);
+ /*
+ * Save a pointer to the shadow stack on the shadow stack itself as a form of token.
+ * A token on the shadow stack gives the following properties:
+ * - Safe save and restore for shadow stack switching. Any save of a shadow stack
+ * pointer must have stored a token on the shadow stack. Similarly, any restore
+ * must check the token before restoring. Since writing the address of the shadow
+ * stack to the shadow stack itself is not easily possible, a restore without a
+ * prior save is quite difficult for an attacker to perform.
+ * - A natural break. A token on the shadow stack provides a natural break, so a
+ * single linear range can be bucketed into different shadow stack segments. Any
+ * sspopchk will detect the condition and fault to the kernel as a sw check exception.
+ */
+ if (is_shstk_enabled(current)) {
+ err |= save_user_shstk(current, &ss_ptr);
+ err |= __put_user(ss_ptr, &frame->uc.uc_mcontext.sc_cfi_state.ss_ptr);
+ }
+
err |= __save_altstack(&frame->uc.uc_stack, regs->sp);
err |= setup_sigcontext(frame, regs);
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
@@ -341,6 +381,11 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
#ifdef CONFIG_MMU
regs->ra = (unsigned long)VDSO_SYMBOL(
current->mm->context.vdso, rt_sigreturn);
+
+ /* if bcfi is enabled x1 (ra) and x5 (t0) must match. not sure if we need this? */
+ if (is_shstk_enabled(current))
+ regs->t0 = regs->ra;
+
#else
/*
* For the nommu case we don't have a VDSO. Instead we push two
diff --git a/arch/riscv/kernel/usercfi.c b/arch/riscv/kernel/usercfi.c
index af8cc8f4616c..f5eb0124571b 100644
--- a/arch/riscv/kernel/usercfi.c
+++ b/arch/riscv/kernel/usercfi.c
@@ -52,6 +52,11 @@ void set_active_shstk(struct task_struct *task, unsigned long shstk_addr)
task->thread_info.user_cfi_state.user_shdw_stk = shstk_addr;
}
+unsigned long get_active_shstk(struct task_struct *task)
+{
+ return task->thread_info.user_cfi_state.user_shdw_stk;
+}
+
void set_shstk_status(struct task_struct *task, bool enable)
{
task->thread_info.user_cfi_state.ubcfi_en = enable ? 1 : 0;
@@ -165,6 +170,48 @@ static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
return 0;
}
+/*
+ * Save user shadow stack pointer on shadow stack itself and return pointer to saved location
+ * returns -EFAULT if operation was unsuccessful
+ */
+int save_user_shstk(struct task_struct *tsk, unsigned long *saved_shstk_ptr)
+{
+ unsigned long ss_ptr = 0;
+ unsigned long token_loc = 0;
+ int ret = 0;
+
+ if (saved_shstk_ptr == NULL)
+ return -EINVAL;
+
+ ss_ptr = get_active_shstk(tsk);
+ ret = create_rstor_token(ss_ptr, &token_loc);
+
+ *saved_shstk_ptr = token_loc;
+ return ret;
+}
+
+/*
+ * Restores user shadow stack pointer from token on shadow stack for task `tsk`
+ * returns -EFAULT if operation was unsuccessful
+ */
+int restore_user_shstk(struct task_struct *tsk, unsigned long shstk_ptr)
+{
+ unsigned long token = 0;
+
+ token = amo_user_shstk((unsigned long __user *)shstk_ptr, 0);
+
+ if (token == -1)
+ return -EFAULT;
+
+ /* invalid token, return EINVAL */
+ if ((token - shstk_ptr) != SHSTK_ENTRY_SIZE)
+ return -EINVAL;
+
+ /* all checks passed, set active shstk and return success */
+ set_active_shstk(tsk, token);
+ return 0;
+}
+
static unsigned long allocate_shadow_stack(unsigned long addr, unsigned long size,
unsigned long token_offset,
bool set_tok)
--
2.43.0
From fe9611befe43cded39d708768e56ecaed14be37c Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Thu, 19 Jan 2023 10:47:15 -0800
Subject: [PATCH 24/28] riscv: select config for shadow stack and landing pad
instr support
This patch adds and selects the config option for shadow stack and landing
pad instr support. Both are hidden behind `CONFIG_RISCV_USER_CFI`. Selecting
`CONFIG_RISCV_USER_CFI` wires up the path to enumerate CPU support and, if
CPU support exists, the kernel will support CPU assisted user mode cfi.
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/Kconfig | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 9d386e9edc45..437b2f9abf3e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -163,6 +163,7 @@ config RISCV
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select TRACE_IRQFLAGS_SUPPORT
+ select RISCV_USER_CFI
select UACCESS_MEMCPY if !MMU
select ZONE_DMA32 if 64BIT
@@ -182,6 +183,20 @@ config HAVE_SHADOW_CALL_STACK
# https://github.com/riscv-non-isa/riscv-elf-psabi-doc/commit/a484e843e6eeb51…
depends on $(ld-option,--no-relax-gp)
+config RISCV_USER_CFI
+ bool "riscv userspace control flow integrity"
+ help
+ Provides CPU assisted control flow integrity to userspace tasks.
+ Control flow integrity is provided by implementing shadow stack for
+ backward edge and indirect branch tracking for forward edge in program.
+ Shadow stack protection is a hardware feature that detects function
+ return address corruption. This helps mitigate ROP attacks.
+ Indirect branch tracking enforces that all indirect branches must land
+ on a landing pad instruction else CPU will fault. This mitigates against
+ JOP / COP attacks. Applications must be enabled to use it, and old user-
+ space does not get protection "for free".
+ default y
+
config ARCH_MMAP_RND_BITS_MIN
default 18 if 64BIT
default 8
--
2.43.0
From 1b49d1810f8447ee58fc09c0f9ca467dff51181d Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 17 Jan 2024 10:27:13 -0800
Subject: [PATCH 25/28] riscv/ptrace: riscv cfi status and state via ptrace and
in core files
Expose a new register type NT_RISCV_USER_CFI for risc-v cfi status and
state. Landing pad and shadow stack status and state are intentionally
rolled into a single cfi state. Creating two different NT_RISCV_USER_XXX
note types would not be useful and would waste a note type. Enabling or
disabling of either feature is not allowed via the ptrace set interface.
However, setting the `elp` state or the shadow stack pointer is allowed via
the ptrace set interface. It is expected that `gdb` might need to fix up the
`elp` state or the `shadow stack` pointer.
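The set policy described above can be sketched as a user-space model. This is
an illustration of the intended rules only (enable / lock bits and reserved
bits are rejected, `elp` and the shadow stack pointer are applied when the
feature is enabled on the target); `struct cfi_request`, `struct task_model`
and `model_cfi_set()` are hypothetical names and do not mirror the uapi
layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, unpacked view of what a tracer asks for. */
struct cfi_request {
	bool lp_en, lp_lock, elp_state;
	bool shstk_en, shstk_lock;
	uint64_t rsvd;      /* reserved bits, expected to be zero */
	uint64_t shstk_ptr; /* requested shadow stack pointer */
};

/* Hypothetical view of the traced task's cfi state. */
struct task_model {
	bool lp_enabled, shstk_enabled;
	bool elp;
	uint64_t shstk_ptr;
};

static int model_cfi_set(struct task_model *t, const struct cfi_request *req)
{
	assert(t != NULL && req != NULL);

	/* Enabling / locking is not possible via ptrace; rsvd must stay zero. */
	if (req->lp_en || req->lp_lock || req->shstk_en || req->shstk_lock ||
	    req->rsvd)
		return -EINVAL;

	/* elp can only be changed if landing pads are enabled on the target. */
	if (t->lp_enabled)
		t->elp = req->elp_state;

	/* Likewise for the shadow stack pointer. */
	if (t->shstk_enabled)
		t->shstk_ptr = req->shstk_ptr;

	return 0;
}
```

Note that in this model a request that is valid but targets a disabled
feature succeeds without changing anything.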
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
arch/riscv/include/uapi/asm/ptrace.h | 18 ++++++
arch/riscv/kernel/ptrace.c | 83 ++++++++++++++++++++++++++++
include/uapi/linux/elf.h | 1 +
3 files changed, 102 insertions(+)
diff --git a/arch/riscv/include/uapi/asm/ptrace.h b/arch/riscv/include/uapi/asm/ptrace.h
index a38268b19c3d..512be06a8661 100644
--- a/arch/riscv/include/uapi/asm/ptrace.h
+++ b/arch/riscv/include/uapi/asm/ptrace.h
@@ -127,6 +127,24 @@ struct __riscv_v_regset_state {
*/
#define RISCV_MAX_VLENB (8192)
+struct __cfi_status {
+ /* indirect branch tracking state */
+ __u64 lp_en : 1;
+ __u64 lp_lock : 1;
+ __u64 elp_state : 1;
+
+ /* shadow stack status */
+ __u64 shstk_en : 1;
+ __u64 shstk_lock : 1;
+
+ __u64 rsvd : sizeof(__u64) * 8 - 5;
+};
+
+struct user_cfi_state {
+ struct __cfi_status cfi_status;
+ __u64 shstk_ptr;
+};
+
#endif /* __ASSEMBLY__ */
#endif /* _UAPI_ASM_RISCV_PTRACE_H */
diff --git a/arch/riscv/kernel/ptrace.c b/arch/riscv/kernel/ptrace.c
index 2afe460de16a..8ddd529bef0b 100644
--- a/arch/riscv/kernel/ptrace.c
+++ b/arch/riscv/kernel/ptrace.c
@@ -19,6 +19,7 @@
#include <linux/regset.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
+#include <asm/usercfi.h>
enum riscv_regset {
REGSET_X,
@@ -28,6 +29,9 @@ enum riscv_regset {
#ifdef CONFIG_RISCV_ISA_V
REGSET_V,
#endif
+#ifdef CONFIG_RISCV_USER_CFI
+ REGSET_CFI,
+#endif
};
static int riscv_gpr_get(struct task_struct *target,
@@ -149,6 +153,75 @@ static int riscv_vr_set(struct task_struct *target,
}
#endif
+#ifdef CONFIG_RISCV_USER_CFI
+static int riscv_cfi_get(struct task_struct *target,
+ const struct user_regset *regset,
+ struct membuf to)
+{
+ struct user_cfi_state user_cfi = {}; /* don't leak stack via rsvd bits */
+ struct pt_regs *regs;
+
+ regs = task_pt_regs(target);
+
+ user_cfi.cfi_status.lp_en = is_indir_lp_enabled(target);
+ user_cfi.cfi_status.lp_lock = is_indir_lp_locked(target);
+ user_cfi.cfi_status.elp_state = !!(regs->status & SR_ELP);
+
+ user_cfi.cfi_status.shstk_en = is_shstk_enabled(target);
+ user_cfi.cfi_status.shstk_lock = is_shstk_locked(target);
+ user_cfi.shstk_ptr = get_active_shstk(target);
+
+ return membuf_write(&to, &user_cfi, sizeof(user_cfi));
+}
+
+/*
+ * Does it make sense to allow enabling / disabling of cfi via ptrace?
+ * Not allowing enable / disable / locking control via ptrace for now.
+ * Setting the shadow stack pointer is allowed. GDB might use it to unwind or
+ * for some other fixup. Similarly, gdb might want to suppress elp and may want
+ * to reset the elp state.
+ */
+static int riscv_cfi_set(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ int ret;
+ struct user_cfi_state user_cfi;
+ struct pt_regs *regs;
+
+ regs = task_pt_regs(target);
+
+ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_cfi, 0, -1);
+ if (ret)
+ return ret;
+
+ /*
+ * Not allowing enabling or locking shadow stack or landing pad
+ * There is no disabling of shadow stack or landing pad via ptrace
+ * rsvd field must be zero so that those bits can be given a meaning in future
+ */
+ if (user_cfi.cfi_status.lp_en || user_cfi.cfi_status.lp_lock ||
+ user_cfi.cfi_status.shstk_en || user_cfi.cfi_status.shstk_lock ||
+ user_cfi.cfi_status.rsvd)
+ return -EINVAL;
+
+ /* If lpad is enabled on target and ptrace requests to set / clear elp, do that */
+ if (is_indir_lp_enabled(target)) {
+ if (user_cfi.cfi_status.elp_state) /* set elp state */
+ regs->status |= SR_ELP;
+ else
+ regs->status &= ~SR_ELP; /* clear elp state */
+ }
+
+ /* If shadow stack enabled on target, set new shadow stack pointer */
+ if (is_shstk_enabled(target))
+ set_active_shstk(target, user_cfi.shstk_ptr);
+
+ return 0;
+}
+#endif
+
static const struct user_regset riscv_user_regset[] = {
[REGSET_X] = {
.core_note_type = NT_PRSTATUS,
@@ -179,6 +252,16 @@ static const struct user_regset riscv_user_regset[] = {
.set = riscv_vr_set,
},
#endif
+#ifdef CONFIG_RISCV_USER_CFI
+ [REGSET_CFI] = {
+ .core_note_type = NT_RISCV_USER_CFI,
+ .align = sizeof(__u64),
+ .n = sizeof(struct user_cfi_state) / sizeof(__u64),
+ .size = sizeof(__u64),
+ .regset_get = riscv_cfi_get,
+ .set = riscv_cfi_set,
+ }
+#endif
};
static const struct user_regset_view riscv_user_native_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 9417309b7230..f60b2de66b1c 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -447,6 +447,7 @@ typedef struct elf64_shdr {
#define NT_MIPS_MSA 0x802 /* MIPS SIMD registers */
#define NT_RISCV_CSR 0x900 /* RISC-V Control and Status Registers */
#define NT_RISCV_VECTOR 0x901 /* RISC-V vector registers */
+#define NT_RISCV_USER_CFI 0x902 /* RISC-V user cfi status and state */
#define NT_LOONGARCH_CPUCFG 0xa00 /* LoongArch CPU config registers */
#define NT_LOONGARCH_CSR 0xa01 /* LoongArch control and status registers */
#define NT_LOONGARCH_LSX 0xa02 /* LoongArch Loongson SIMD Extension registers */
--
2.43.0
From 59a6070ad5808df72eb638b8886c29f4dd0e7290 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 12 Jan 2024 10:11:04 -0800
Subject: [PATCH 26/28] riscv: Documentation for landing pad / indirect branch
tracking
Add documentation on landing pad aka indirect branch tracking on riscv
and the kernel interfaces exposed so that user tasks can enable it.
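The landing pad rule that the document below describes can be sketched as a
user-space model. This is an illustration only: `struct landing_site` and
`check_indirect_landing()` are hypothetical names, and comparing the label
against bits 31:12 of x7 is an assumption of this sketch about where the
20 bit label sits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SW_CHECK_TVAL_FCFI 2 /* tval code for a landing pad violation */

/* Hypothetical description of the instruction an indirect branch lands on. */
struct landing_site {
	uint64_t pc;   /* target program counter */
	bool is_lpad;  /* true if the target instruction is `lpad` */
	uint32_t lpl;  /* 20 bit label embedded in `lpad`, 0 = no label check */
};

/*
 * Returns 0 if the indirect transfer is allowed, or SW_CHECK_TVAL_FCFI if it
 * would raise a software check exception. rs1 is the register number used by
 * the indirect jump.
 */
static int check_indirect_landing(unsigned int rs1, uint64_t x7,
				  const struct landing_site *target)
{
	assert(target != NULL);

	/* Returns (x1/x5) and software guarded jumps (x7) are not tracked. */
	if (rs1 == 1 || rs1 == 5 || rs1 == 7)
		return 0;

	/* The target must be an `lpad` on a 4 byte boundary. */
	if (!target->is_lpad || (target->pc & 3))
		return SW_CHECK_TVAL_FCFI;

	/* A non-zero label must match the label set up in x7 (assumed bits 31:12). */
	if (target->lpl != 0 && target->lpl != ((x7 >> 12) & 0xfffff))
		return SW_CHECK_TVAL_FCFI;

	return 0;
}
```

A label of 0 in `lpad` accepts any caller, which is what lets unlabeled
binaries run with tracking enabled at a coarser granularity.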
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
Documentation/arch/riscv/zicfilp.rst | 104 +++++++++++++++++++++++++++
1 file changed, 104 insertions(+)
create mode 100644 Documentation/arch/riscv/zicfilp.rst
diff --git a/Documentation/arch/riscv/zicfilp.rst b/Documentation/arch/riscv/zicfilp.rst
new file mode 100644
index 000000000000..3007c81f0465
--- /dev/null
+++ b/Documentation/arch/riscv/zicfilp.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Deepak Gupta <debug(a)rivosinc.com>
+:Date: 12 January 2024
+
+====================================================
+Tracking indirect control transfers on RISC-V Linux
+====================================================
+
+This document briefly describes the interface provided to userspace by Linux
+to enable indirect branch tracking for user mode applications on RISC-V.
+
+1. Feature Overview
+--------------------
+
+Memory corruption issues usually result in crashes. However, in the hands of a
+creative adversary they can result in a variety of security issues.
+
+One of those security issues is a code re-use attack, where an adversary corrupts
+function pointers and chains them together to perform jump oriented programming
+(JOP) or call oriented programming (COP), thus compromising the control flow
+integrity (CFI) of the program.
+
+Function pointers live in read-write memory and are thus susceptible to corruption,
+which allows an adversary to reach any program counter (PC) in the address space. On
+RISC-V, the zicfilp extension enforces a restriction on such indirect control transfers:
+
+ - indirect control transfers must land on a landing pad instruction `lpad`.
+ There are two exceptions to this rule:
+ - rs1 = x1 or rs1 = x5, i.e. a return from a function. Returns are
+ protected using shadow stack (see zicfiss.rst)
+
+ - rs1 = x7. On RISC-V the compiler usually emits the sequence below to reach a
+ function which is beyond the offset reachable by a J-type instruction.
+
+ "auipc x7, <imm>"
+ "jalr (x7)"
+
+ Such forms of indirect control transfer are still immutable and don't rely
+ on memory, thus rs1=x7 is exempted from tracking and such jumps are considered
+ software guarded jumps.
+
+The `lpad` instruction is a pseudo-op of `auipc x0, <imm_20bit>` and is a HINT nop. An
+`lpad` instruction must be aligned on a 4 byte boundary and compares its 20 bit immediate
+with x7. If `imm_20bit` == 0, the CPU doesn't perform any comparison with x7. If
+`imm_20bit` != 0, then `imm_20bit` must match x7, else the CPU will raise a
+`software check exception` (cause=18) with `*tval = 2`.
+
+The compiler can generate a hash over function signatures and set it up (truncated
+to 20 bits) in x7 at call sites, and function prologs can have an `lpad` with the same
+function hash. This further reduces the number of program counters a call site can
+reach.
+
+2. ELF and psABI
+-----------------
+
+Toolchain sets up `GNU_PROPERTY_RISCV_FEATURE_1_FCFI` for property
+`GNU_PROPERTY_RISCV_FEATURE_1_AND` in notes section of the object file.
+
+3. Linux enabling
+------------------
+
+A user space program can have multiple shared objects loaded in its address space,
+and it's a difficult task to make sure all the dependencies have been compiled
+with support for indirect branch tracking. Thus it's left to the dynamic loader to
+enable indirect branch tracking for the program.
+
+4. prctl() enabling
+--------------------
+
+`PR_SET_INDIR_BR_LP_STATUS` / `PR_GET_INDIR_BR_LP_STATUS` /
+`PR_LOCK_INDIR_BR_LP_STATUS` are three prctls added to manage indirect branch
+tracking. The prctls are arch agnostic and return -EINVAL on other arches.
+
+`PR_SET_INDIR_BR_LP_STATUS`: If arg1 is `PR_INDIR_BR_LP_ENABLE` and the CPU supports
+`zicfilp`, then the kernel will enable indirect branch tracking for the task.
+The dynamic loader can issue this `prctl` once it has determined that all the objects
+loaded in the address space support indirect branch tracking. Additionally, if there is
+a `dlopen` of an object which wasn't compiled with `zicfilp`, the dynamic loader can
+issue this prctl with arg1 set to 0 (i.e. `PR_INDIR_BR_LP_ENABLE` being clear).
+
+`PR_GET_INDIR_BR_LP_STATUS`: Returns the current status of indirect branch tracking.
+If enabled, it'll return `PR_INDIR_BR_LP_ENABLE`.
+
+`PR_LOCK_INDIR_BR_LP_STATUS`: Locks the current status of indirect branch tracking on
+the task. User space may want to run with a strict security posture and not allow
+loading of objects without `zicfilp` support, and thus would want to disallow
+disabling of indirect branch tracking. In that case user space can use this prctl
+to lock the current settings.
+
+5. Violations related to indirect branch tracking
+--------------------------------------------------
+
+Pertaining to indirect branch tracking, the CPU raises a software check exception
+in the following conditions:
+ - missing `lpad` after indirect call / jmp
+ - `lpad` not on 4 byte boundary
+ - `imm_20bit` embedded in `lpad` instruction doesn't match with `x7`
+
+In all 3 cases, `*tval = 2` is captured and a software check exception is raised
+(cause=18).
+
+The Linux kernel will treat this as `SIGSEGV` with code = `SEGV_CPERR` and follow
+the normal course of signal delivery.
--
2.43.0
From 31d400d5299f01b457fd082e32786c25ee4a9491 Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Fri, 12 Jan 2024 10:14:29 -0800
Subject: [PATCH 27/28] riscv: Documentation for shadow stack on riscv
Add documentation on shadow stack for user mode on riscv and the kernel
interfaces exposed so that user tasks can enable it.
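The push / pop-and-check behaviour that the document below describes can be
sketched as a user-space model. This is an illustration only: `struct shstk`,
`shstk_init()`, `model_sspush()` and `model_sspopchk()` are hypothetical
names, and an array index stands in for the shadow stack pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SW_CHECK_TVAL_BCFI 3 /* tval code for a shadow stack violation */
#define SHSTK_SLOTS 8

/* Hypothetical model of a shadow stack that grows towards lower indices. */
struct shstk {
	uint64_t slot[SHSTK_SLOTS];
	size_t ssp; /* index of the top entry, SHSTK_SLOTS when empty */
};

static void shstk_init(struct shstk *s)
{
	assert(s != NULL);
	s->ssp = SHSTK_SLOTS;
}

/* Models `sspush x1/x5`: store the return address on the shadow stack. */
static void model_sspush(struct shstk *s, uint64_t ra)
{
	assert(s != NULL && s->ssp > 0);
	s->slot[--s->ssp] = ra;
}

/*
 * Models `sspopchk x1/x5`: compare the top of the shadow stack with the
 * return address loaded from the regular stack. Returns 0 on a match, or
 * SW_CHECK_TVAL_BCFI if a software check exception would be raised. The
 * entry is only popped on a match.
 */
static int model_sspopchk(struct shstk *s, uint64_t ra)
{
	assert(s != NULL && s->ssp < SHSTK_SLOTS);
	if (s->slot[s->ssp] != ra)
		return SW_CHECK_TVAL_BCFI;
	s->ssp++;
	return 0;
}
```

A return address that was overwritten on the regular stack no longer matches
the copy kept on the shadow stack, so the check fails before the return is
taken.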
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
Documentation/arch/riscv/zicfiss.rst | 169 +++++++++++++++++++++++++++
1 file changed, 169 insertions(+)
create mode 100644 Documentation/arch/riscv/zicfiss.rst
diff --git a/Documentation/arch/riscv/zicfiss.rst b/Documentation/arch/riscv/zicfiss.rst
new file mode 100644
index 000000000000..f133b6af9c15
--- /dev/null
+++ b/Documentation/arch/riscv/zicfiss.rst
@@ -0,0 +1,169 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+:Author: Deepak Gupta <debug(a)rivosinc.com>
+:Date: 12 January 2024
+
+=========================================================
+Shadow stack to protect function returns on RISC-V Linux
+=========================================================
+
+This document briefly describes the interface provided to userspace by Linux
+to enable shadow stack for user mode applications on RISC-V.
+
+1. Feature Overview
+--------------------
+
+Memory corruption issues usually result in crashes. However, in the hands of a
+creative adversary they can result in a variety of security issues.
+
+One of those security issues is a code re-use attack, where an adversary corrupts
+return addresses present on the stack and chains them together to perform return
+oriented programming (ROP), thus compromising the control flow integrity (CFI)
+of the program.
+
+Return addresses live on the stack, and thus in read-write memory, so they are
+susceptible to corruption, which allows an adversary to reach any program counter
+(PC) in the address space. On RISC-V, the `zicfiss` extension provides an alternate
+stack, the `shadow stack`, on which return addresses can be safely placed in the prolog
+of the function and retrieved in the epilog. `zicfiss` makes the following changes:
+
+ - PTE encodings for shadow stack virtual memory
+ A previously reserved encoding in first stage translation, i.e.
+ PTE.R=0, PTE.W=1, PTE.X=0, becomes the PTE encoding for shadow stack pages.
+
+ - `sspush x1/x5` instruction pushes (stores) `x1/x5` to shadow stack.
+
+ - `sspopchk x1/x5` instruction pops (loads) from shadow stack and compares
+ with `x1/x5`; if they are unequal, the CPU raises a `software check exception`
+ with `*tval = 3`
+
+The compiler toolchain makes sure that function prologs have `sspush x1/x5` to save the
+return address on the shadow stack in addition to the regular stack. Similarly, function
+epilogs have `ld x5, offset(x2)`; `sspopchk x5` to ensure that the value popped from the
+regular stack matches the value popped from the shadow stack.
+
+2. Shadow stack protections and linux memory manager
+-----------------------------------------------------
+
+As mentioned earlier, shadow stacks get new page table encodings, and thus they and the
+instructions that operate on them have some special properties, as listed below:
+
+ - Regular stores to shadow stack memory raise store access faults.
+ This way shadow stack memory is protected from stray inadvertent
+ writes.
+
+ - Regular loads from shadow stack memory are allowed.
+ This allows stack trace utilities or backtrace functions to read the
+ true (untampered) callstack.
+
+ - Only shadow stack instructions can generate shadow stack load or
+ shadow stack store.
+
+ - Shadow stack load / shadow stack store on read-only memory raises
+ AMO/store page fault. Thus both `sspush x1/x5` and `sspopchk x1/x5`
+ will raise AMO/store page fault. This simplifies COW handling in the kernel.
+ During fork, the kernel can convert shadow stack pages into read-only
+ memory (as it does for regular read-write memory), and as soon as a
+ subsequent `sspush` or `sspopchk` in userspace is encountered, the
+ kernel can perform COW.
+
+ - Shadow stack load / shadow stack store on read-write or read-write-
+ execute memory raises an access fault. This is a fatal condition
+ because shadow stack should never be operating on read-write or read-
+ write-execute memory.
+
+3. ELF and psABI
+-----------------
+
+Toolchain sets up `GNU_PROPERTY_RISCV_FEATURE_1_BCFI` for property
+`GNU_PROPERTY_RISCV_FEATURE_1_AND` in notes section of the object file.
+
+4. Linux enabling
+------------------
+
+A user space program can have multiple shared objects loaded in its address space,
+and it's a difficult task to make sure all the dependencies have been compiled
+with support for shadow stack. Thus it's left to the dynamic loader to enable
+shadow stack for the program.
+
+5. prctl() enabling
+--------------------
+
+`PR_SET_SHADOW_STACK_STATUS` / `PR_GET_SHADOW_STACK_STATUS` /
+`PR_LOCK_SHADOW_STACK_STATUS` are three prctls added to manage shadow stack
+enabling for tasks. prctls are arch agnostic and returns -EINVAL on other arches.
+
+`PR_SET_SHADOW_STACK_STATUS`: If arg1 is `PR_SHADOW_STACK_ENABLE` and the CPU supports
+`zicfiss` then kernel will enable shadow stack for the task. Dynamic loader can
+issue this `prctl` once it has determined that all the objects loaded in address
+space have support for shadow stack. Additionally if there is a `dlopen` to an
+object which wasn't compiled with `zicfiss`, dynamic loader can issue this prctl
+with arg1 set to 0 (i.e. `PR_SHADOW_STACK_ENABLE` being clear).
+
+`PR_GET_SHADOW_STACK_STATUS`: Returns current status of shadow stack.
+If enabled it'll return `PR_SHADOW_STACK_ENABLE`.
+
+`PR_LOCK_SHADOW_STACK_STATUS`: Locks current status of shadow stack enabling on the
+task. User space may want to run with strict security posture and wouldn't want
+loading of objects without `zicfiss` support in it and thus would want to disallow
+disabling of shadow stack on current task. In that case user space can use this prctl
+to lock current settings.
+
+5. violations related to returns with shadow stack enabled
+-----------------------------------------------------------
+
+Pertaining to shadow stack, CPU raises software check exception in the following
+condition:
+
+ - On execution of `sspopchk x1/x5`, x1/x5 didn't match top of shadow stack.
+ If mismatch happens then cpu does `*tval = 3` and raises software check
+ exception.
+
+Linux kernel will treat this as `SIGSEGV` with code = `SEGV_CPERR` and follow
+normal course of signal delivery.
+
+6. Shadow stack tokens
+-----------------------
+Regular stores on shadow stacks are not allowed and thus shadow stacks can't be tampered
+with via arbitrary stray writes due to bugs. Pivoting / switching to a different shadow
+stack is done simply by writing to csr `CSR_SSP`, which changes the active shadow stack.
+This can be problematic because usually the value to be written to `CSR_SSP` will be loaded
+from somewhere in writeable memory, which allows an adversary to use a corruption bug in
+software to pivot to any address in shadow stack range. Shadow stack tokens can help
+mitigate this problem by making sure that:
+
+ - When software is switching away from a shadow stack, shadow stack pointer should be
+ saved on shadow stack itself. This saved value is called the `shadow stack token`.
+
+ - When software is switching to a shadow stack, it should read the `shadow stack token`
+ from shadow stack pointer and verify that the `shadow stack token` itself is a pointer
+ to the shadow stack itself.
+
+ - Once the token verification is done, software can perform the write to `CSR_SSP` to
+ switch shadow stack.
+
+Here software can be user mode task runtime itself which is managing various contexts
+as part of single thread. Software can be kernel as well when kernel has to deliver a
+signal to user task and must save shadow stack pointer. Kernel can perform similar
+procedure by saving a token on user shadow stack itself. This way whenever sigreturn
+happens, kernel can read and verify the token and then switch to shadow stack.
+Using this mechanism, kernel helps user task so that any corruption issue in user task
+is not exploited by adversary by arbitrarily using `sigreturn`. Adversary will have to
+make sure that there is a `shadow stack token` in addition to invoking `sigreturn`.
+
+7. Signal shadow stack
+-----------------------
+Following structure has been added to sigcontext for RISC-V. `rsvd` field has been kept
+in case we need some extra information in future for landing pads / indirect branch
+tracking. It has been kept today in order to allow backward compatibility in future.
+
+struct __sc_riscv_cfi_state {
+ unsigned long ss_ptr;
+ unsigned long rsvd;
+};
+
+As part of signal delivery, shadow stack token is saved on current shadow stack itself and
+updated pointer is saved away in `ss_ptr` field in `__sc_riscv_cfi_state` under `sigcontext`.
+Existing shadow stack allocation is used for signal delivery. During `sigreturn`, kernel will
+obtain `ss_ptr` from `sigcontext`, verify the saved token on shadow stack itself and switch
+shadow stack.
--
2.43.0
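A note on the token scheme from section 6 of the documentation above: the
save/verify/switch steps can be illustrated with a small user-space simulation.
This is only a sketch under stated assumptions, not the kernel's code. The
shadow stack is modelled as a plain array, assigning `ssp` stands in for the
write to `CSR_SSP`, and the token layout (a word holding the address just above
itself) is an assumption made for illustration. All `sim_*` names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_SS_WORDS 16 /* hypothetical shadow stack size, in words */

/* Simulated shadow stack: grows down, ssp points at the last pushed word. */
struct sim_shstk {
	uintptr_t mem[SIM_SS_WORDS];
	uintptr_t *ssp;
};

void sim_init(struct sim_shstk *s)
{
	memset(s->mem, 0, sizeof(s->mem));
	s->ssp = &s->mem[SIM_SS_WORDS]; /* empty stack */
}

/*
 * Switching away: push the current ssp onto the shadow stack itself.
 * Returns the token's address; that value may be kept in writable memory.
 */
uintptr_t sim_save_token(struct sim_shstk *s)
{
	uintptr_t old_ssp = (uintptr_t)s->ssp;

	s->ssp--;
	*s->ssp = old_ssp;
	return (uintptr_t)s->ssp;
}

/*
 * Switching to: accept addr only if it lies inside the shadow stack, is
 * word aligned, and holds a token that points just above itself.
 */
bool sim_restore_from_token(struct sim_shstk *s, uintptr_t addr)
{
	uintptr_t base = (uintptr_t)&s->mem[0];
	uintptr_t end = (uintptr_t)&s->mem[SIM_SS_WORDS];
	uintptr_t *slot;

	if (addr < base || addr >= end || (addr - base) % sizeof(uintptr_t))
		return false;

	slot = (uintptr_t *)addr;
	if (*slot != addr + sizeof(uintptr_t))
		return false;

	*slot = 0;         /* consume the token so it cannot be replayed */
	s->ssp = slot + 1; /* simulation of the CSR_SSP write */
	return true;
}
```

A caller would pair `sim_save_token()` with `sim_restore_from_token()`. An
address that was never written by `sim_save_token()` fails the check, which is
the property the tokens are meant to provide against a corrupted pointer.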
From 93a6c741f2881cddeeeba28f1bbd2f06ca070fcc Mon Sep 17 00:00:00 2001
From: Deepak Gupta <debug(a)rivosinc.com>
Date: Wed, 24 Jan 2024 08:50:40 -0800
Subject: [PATCH 28/28] kselftest/riscv: kselftest for user mode cfi
Adds kselftest for RISC-V control flow integrity implementation for user
mode. There is not a lot going on in kernel for enabling landing pad for
user mode. Thus kselftest simply enables landing pad for the binary and
a signal handler is registered for SIGSEGV. Any control flow violations are
reported as SIGSEGV with si_code = SEGV_CPERR. Test will fail on receiving
any SEGV_CPERR. Shadow stack part has more changes in kernel and thus there
are separate tests for that:
- enable and disable
- Exercise `map_shadow_stack` syscall
- `fork` test to make sure COW works for shadow stack pages
- gup tests
As of today kernel uses FOLL_FORCE when access happens to memory via
/proc/<pid>/mem. Not breaking that for shadow stack
- signal test. Make sure signal delivery results in token creation on
shadow stack and consumes (and verifies) token on sigreturn
- shadow stack protection test. attempts to write using regular store
instruction on shadow stack memory must result in access faults
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/Makefile | 10 +
.../testing/selftests/riscv/cfi/cfi_rv_test.h | 85 ++++
.../selftests/riscv/cfi/riscv_cfi_test.c | 91 +++++
.../testing/selftests/riscv/cfi/shadowstack.c | 376 ++++++++++++++++++
.../testing/selftests/riscv/cfi/shadowstack.h | 39 ++
6 files changed, 602 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/riscv/cfi/Makefile
create mode 100644 tools/testing/selftests/riscv/cfi/cfi_rv_test.h
create mode 100644 tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
create mode 100644 tools/testing/selftests/riscv/cfi/shadowstack.c
create mode 100644 tools/testing/selftests/riscv/cfi/shadowstack.h
diff --git a/tools/testing/selftests/riscv/Makefile b/tools/testing/selftests/riscv/Makefile
index 4a9ff515a3a0..867e5875b7ce 100644
--- a/tools/testing/selftests/riscv/Makefile
+++ b/tools/testing/selftests/riscv/Makefile
@@ -5,7 +5,7 @@
ARCH ?= $(shell uname -m 2>/dev/null || echo not)
ifneq (,$(filter $(ARCH),riscv))
-RISCV_SUBTARGETS ?= hwprobe vector mm
+RISCV_SUBTARGETS ?= hwprobe vector mm cfi
else
RISCV_SUBTARGETS :=
endif
diff --git a/tools/testing/selftests/riscv/cfi/Makefile b/tools/testing/selftests/riscv/cfi/Makefile
new file mode 100644
index 000000000000..77f12157fa29
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/Makefile
@@ -0,0 +1,10 @@
+CFLAGS += -I$(top_srcdir)/tools/include
+
+CFLAGS += -march=rv64gc_zicfilp_zicfiss
+
+TEST_GEN_PROGS := cfitests
+
+include ../../lib.mk
+
+$(OUTPUT)/cfitests: riscv_cfi_test.c shadowstack.c
+ $(CC) -static -o$@ $(CFLAGS) $(LDFLAGS) $^
diff --git a/tools/testing/selftests/riscv/cfi/cfi_rv_test.h b/tools/testing/selftests/riscv/cfi/cfi_rv_test.h
new file mode 100644
index 000000000000..27267a2e1008
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/cfi_rv_test.h
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SELFTEST_RISCV_CFI_H
+#define SELFTEST_RISCV_CFI_H
+#include <stddef.h>
+#include <sys/types.h>
+#include "shadowstack.h"
+
+#define RISCV_CFI_SELFTEST_COUNT RISCV_SHADOW_STACK_TESTS
+
+#define CHILD_EXIT_CODE_SSWRITE 10
+#define CHILD_EXIT_CODE_SIG_TEST 11
+
+#define BAD_POINTER (NULL)
+
+#define my_syscall5(num, arg1, arg2, arg3, arg4, arg5) \
+({ \
+ register long _num __asm__ ("a7") = (num); \
+ register long _arg1 __asm__ ("a0") = (long)(arg1); \
+ register long _arg2 __asm__ ("a1") = (long)(arg2); \
+ register long _arg3 __asm__ ("a2") = (long)(arg3); \
+ register long _arg4 __asm__ ("a3") = (long)(arg4); \
+ register long _arg5 __asm__ ("a4") = (long)(arg5); \
+ \
+ __asm__ volatile ( \
+ "ecall\n" \
+ : "+r"(_arg1) \
+ : "r"(_arg2), "r"(_arg3), "r"(_arg4), "r"(_arg5), \
+ "r"(_num) \
+ : "memory", "cc" \
+ ); \
+ _arg1; \
+})
+
+#define my_syscall3(num, arg1, arg2, arg3) \
+({ \
+ register long _num __asm__ ("a7") = (num); \
+ register long _arg1 __asm__ ("a0") = (long)(arg1); \
+ register long _arg2 __asm__ ("a1") = (long)(arg2); \
+ register long _arg3 __asm__ ("a2") = (long)(arg3); \
+ \
+ __asm__ volatile ( \
+ "ecall\n" \
+ : "+r"(_arg1) \
+ : "r"(_arg2), "r"(_arg3), \
+ "r"(_num) \
+ : "memory", "cc" \
+ ); \
+ _arg1; \
+})
+
+#ifndef __NR_prctl
+#define __NR_prctl 167
+#endif
+
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+#define CSR_SSP 0x011
+
+#ifdef __ASSEMBLY__
+#define __ASM_STR(x) x
+#else
+#define __ASM_STR(x) #x
+#endif
+
+#define csr_read(csr) \
+({ \
+ register unsigned long __v; \
+ __asm__ __volatile__ ("csrr %0, " __ASM_STR(csr) \
+ : "=r" (__v) : \
+ : "memory"); \
+ __v; \
+})
+
+#define csr_write(csr, val) \
+({ \
+ unsigned long __v = (unsigned long) (val); \
+ __asm__ __volatile__ ("csrw " __ASM_STR(csr) ", %0" \
+ : : "rK" (__v) \
+ : "memory"); \
+})
+
+#endif
diff --git a/tools/testing/selftests/riscv/cfi/riscv_cfi_test.c b/tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
new file mode 100644
index 000000000000..c116ae4bb358
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/riscv_cfi_test.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../../kselftest.h"
+#include <signal.h>
+#include <asm/ucontext.h>
+#include <linux/prctl.h>
+#include "cfi_rv_test.h"
+
+/* do not optimize cfi related test functions */
+#pragma GCC push_options
+#pragma GCC optimize("O0")
+
+#define SEGV_CPERR 10 /* control protection fault */
+
+void sigsegv_handler(int signum, siginfo_t *si, void *uc)
+{
+ struct ucontext *ctx = (struct ucontext *) uc;
+
+ if (si->si_code == SEGV_CPERR) {
+ printf("Control flow violation happened somewhere\n");
+ printf("pc where violation happened %lx\n", ctx->uc_mcontext.gregs[0]);
+ exit(-1);
+ }
+
+ /* null pointer deref */
+ if (si->si_addr == BAD_POINTER)
+ exit(CHILD_EXIT_CODE_NULL_PTR_DEREF);
+
+ /* shadow stack write case */
+ exit(CHILD_EXIT_CODE_SSWRITE);
+}
+
+int lpad_enable(void)
+{
+ int ret = 0;
+
+ ret = my_syscall5(__NR_prctl, PR_SET_INDIR_BR_LP_STATUS, PR_INDIR_BR_LP_ENABLE, 0, 0, 0);
+
+ return ret;
+}
+
+bool register_signal_handler(void)
+{
+ struct sigaction sa = {};
+
+ sa.sa_sigaction = sigsegv_handler;
+ sa.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSEGV, &sa, NULL)) {
+ printf("registering signal handler for landing pad violation failed\n");
+ return false;
+ }
+
+ return true;
+}
+
+int main(int argc, char *argv[])
+{
+ int ret = 0;
+ unsigned long lpad_status = 0;
+
+ ksft_print_header();
+
+ ksft_set_plan(RISCV_CFI_SELFTEST_COUNT);
+
+ ksft_print_msg("starting risc-v tests\n");
+
+ /*
+ * Landing pad test. Not a lot of kernel changes to support landing
+ * pad for user mode except lighting up a bit in senvcfg via a prctl
+ * Enable landing pad through out the execution of test binary
+ */
+ ret = my_syscall5(__NR_prctl, PR_GET_INDIR_BR_LP_STATUS, &lpad_status, 0, 0, 0);
+ if (ret)
+ ksft_exit_skip("Get landing pad status failed with %d\n", ret);
+
+ ret = lpad_enable();
+
+ if (ret)
+ ksft_exit_skip("Enabling landing pad failed with %d\n", ret);
+
+ if (!register_signal_handler())
+ ksft_exit_skip("registering signal handler for SIGSEGV failed\n");
+
+ ksft_print_msg("landing pad enabled for binary\n");
+ ksft_print_msg("starting risc-v shadow stack tests\n");
+ execute_shadow_stack_tests();
+
+ ksft_finished();
+}
+
+#pragma GCC pop_options
diff --git a/tools/testing/selftests/riscv/cfi/shadowstack.c b/tools/testing/selftests/riscv/cfi/shadowstack.c
new file mode 100644
index 000000000000..126654801bed
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/shadowstack.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../../kselftest.h"
+#include <sys/wait.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include "shadowstack.h"
+#include "cfi_rv_test.h"
+
+/* do not optimize shadow stack related test functions */
+#pragma GCC push_options
+#pragma GCC optimize("O0")
+
+void zar(void)
+{
+ unsigned long ssp = 0, swaped_val = 0;
+
+ ssp = csr_read(CSR_SSP);
+ printf("inside %s and shadow stack ptr is %lx\n", __func__, ssp);
+}
+
+void bar(void)
+{
+ printf("inside %s\n", __func__);
+ zar();
+}
+
+void foo(void)
+{
+ printf("inside %s\n", __func__);
+ bar();
+}
+
+void zar_child(void)
+{
+ unsigned long ssp = 0;
+
+ ssp = csr_read(CSR_SSP);
+ printf("inside %s and shadow stack ptr is %lx\n", __func__, ssp);
+}
+
+void bar_child(void)
+{
+ printf("inside %s\n", __func__);
+ zar_child();
+}
+
+void foo_child(void)
+{
+ printf("inside %s\n", __func__);
+ bar_child();
+}
+
+typedef void (call_func_ptr)(void);
+/*
+ * call couple of functions to test push pop.
+ */
+int shadow_stack_call_tests(call_func_ptr fn_ptr, bool parent)
+{
+ if (parent)
+ printf("call test for parent\n");
+ else
+ printf("call test for child\n");
+
+ (fn_ptr)();
+
+ return 0;
+}
+
+bool enable_disable_check(unsigned long test_num, void *ctx)
+{
+ int ret = 0;
+
+ if (!my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0)) {
+ printf("Shadow stack was enabled\n");
+ shadow_stack_call_tests(&foo, true);
+
+ ret = my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0);
+ if (ret)
+ ksft_test_result_fail("shadow stack disable failed\n");
+ } else {
+ ksft_test_result_fail("shadow stack enable failed\n");
+ ret = -EINVAL;
+ }
+
+ return ret ? false : true;
+}
+
+/* forks a thread, and ensure shadow stacks fork out */
+bool shadow_stack_fork_test(unsigned long test_num, void *ctx)
+{
+ int pid = 0, child_status = 0, parent_pid = 0;
+
+ printf("exercising shadow stack fork test\n");
+
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0)) {
+ printf("shadow stack enable prctl failed\n");
+ return false;
+ }
+
+ parent_pid = getpid();
+ pid = fork();
+
+ if (pid) {
+ printf("Parent pid %d and child pid %d\n", parent_pid, pid);
+ shadow_stack_call_tests(&foo, true);
+ } else
+ shadow_stack_call_tests(&foo_child, false);
+
+ if (pid) {
+ printf("waiting on child to finish\n");
+ wait(&child_status);
+ } else {
+ /* exit child gracefully */
+ exit(0);
+ }
+
+ if (pid && WIFSIGNALED(child_status)) {
+ printf("child faulted\n");
+ return false;
+ }
+
+ /* disable shadow stack again */
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0)) {
+ printf("shadow stack disable prctl failed\n");
+ return false;
+ }
+
+ return true;
+}
+
+/* exercise `map_shadow_stack`, pivot to it and call some functions to ensure it works */
+#define SHADOW_STACK_ALLOC_SIZE 4096
+bool shadow_stack_map_test(unsigned long test_num, void *ctx)
+{
+ unsigned long shdw_addr;
+ int ret = 0;
+
+ shdw_addr = my_syscall3(__NR_map_shadow_stack, NULL, SHADOW_STACK_ALLOC_SIZE, 0);
+
+ if (((long) shdw_addr) <= 0) {
+ printf("map_shadow_stack failed with error code %d\n", (int) shdw_addr);
+ return false;
+ }
+
+ ret = munmap((void *) shdw_addr, SHADOW_STACK_ALLOC_SIZE);
+
+ if (ret) {
+ printf("munmap failed with error code %d\n", ret);
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * shadow stack protection tests. map a shadow stack and
+ * validate all memory protections work on it
+ */
+bool shadow_stack_protection_test(unsigned long test_num, void *ctx)
+{
+ unsigned long shdw_addr;
+ unsigned long *write_addr = NULL;
+ int ret = 0, pid = 0, child_status = 0;
+
+ shdw_addr = my_syscall3(__NR_map_shadow_stack, NULL, SHADOW_STACK_ALLOC_SIZE, 0);
+
+ if (((long) shdw_addr) <= 0) {
+ printf("map_shadow_stack failed with error code %d\n", (int) shdw_addr);
+ return false;
+ }
+
+ write_addr = (unsigned long *) shdw_addr;
+ pid = fork();
+
+ /* no child was created, return false */
+ if (pid == -1)
+ return false;
+
+ /*
+ * try to perform a store from child on shadow stack memory
+ * it should result in SIGSEGV
+ */
+ if (!pid) {
+ /* below write must lead to SIGSEGV */
+ *write_addr = 0xdeadbeef;
+ } else {
+ wait(&child_status);
+ }
+
+ /* test fail, if 0xdeadbeef present on shadow stack address */
+ if (*write_addr == 0xdeadbeef) {
+ printf("write succeeded\n");
+ return false;
+ }
+
+ /* if child reached here, then fail */
+ if (!pid) {
+ printf("child reached unreachable state\n");
+ return false;
+ }
+
+ /* if child exited via signal handler but not for write on ss */
+ if (WIFEXITED(child_status) &&
+ WEXITSTATUS(child_status) != CHILD_EXIT_CODE_SSWRITE) {
+ printf("child wasn't signaled for write on shadow stack\n");
+ return false;
+ }
+
+ ret = munmap(write_addr, SHADOW_STACK_ALLOC_SIZE);
+ if (ret) {
+ printf("munmap failed with error code %d\n", ret);
+ return false;
+ }
+
+ return true;
+}
+
+#define SS_MAGIC_WRITE_VAL 0xbeefdead
+
+int gup_tests(int mem_fd, unsigned long *shdw_addr)
+{
+ unsigned long val = 0;
+
+ lseek(mem_fd, (unsigned long)shdw_addr, SEEK_SET);
+ if (read(mem_fd, &val, sizeof(val)) < 0) {
+ printf("reading shadow stack mem via gup failed\n");
+ return 1;
+ }
+
+ val = SS_MAGIC_WRITE_VAL;
+ lseek(mem_fd, (unsigned long)shdw_addr, SEEK_SET);
+ if (write(mem_fd, &val, sizeof(val)) < 0) {
+ printf("writing shadow stack mem via gup failed\n");
+ return 1;
+ }
+
+ if (*shdw_addr != SS_MAGIC_WRITE_VAL) {
+ printf("GUP write to shadow stack memory didn't happen\n");
+ return 1;
+ }
+
+ return 0;
+}
+
+bool shadow_stack_gup_tests(unsigned long test_num, void *ctx)
+{
+ unsigned long shdw_addr = 0;
+ unsigned long *write_addr = NULL;
+ int fd = 0;
+ bool ret = false;
+
+ shdw_addr = my_syscall3(__NR_map_shadow_stack, NULL, SHADOW_STACK_ALLOC_SIZE, 0);
+
+ if (((long) shdw_addr) <= 0) {
+ printf("map_shadow_stack failed with error code %d\n", (int) shdw_addr);
+ return false;
+ }
+
+ write_addr = (unsigned long *) shdw_addr;
+
+ fd = open("/proc/self/mem", O_RDWR);
+ if (fd == -1)
+ return false;
+
+ if (gup_tests(fd, write_addr)) {
+ printf("gup tests failed\n");
+ goto out;
+ }
+
+ ret = true;
+out:
+ if (shdw_addr && munmap(write_addr, SHADOW_STACK_ALLOC_SIZE)) {
+ printf("munmap failed with error code %d\n", errno);
+ ret = false;
+ }
+
+ return ret;
+}
+
+volatile bool break_loop;
+
+void sigusr1_handler(int signo)
+{
+ printf("In sigusr1 handler\n");
+ break_loop = true;
+}
+
+bool sigusr1_signal_test(void)
+{
+ if (signal(SIGUSR1, sigusr1_handler) == SIG_ERR) {
+ printf("registering sigusr1 handler failed\n");
+ return false;
+ }
+
+ return true;
+}
+/*
+ * shadow stack signal test. shadow stack must be enabled.
+ * register a signal, fork another thread which is waiting
+ * on signal. Send a signal from parent to child, verify
+ * that signal was received by child. If not test fails
+ */
+bool shadow_stack_signal_test(unsigned long test_num, void *ctx)
+{
+ int pid = 0, child_status = 0;
+ unsigned long ssp = 0;
+
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0)) {
+ printf("shadow stack enable prctl failed\n");
+ return false;
+ }
+
+ pid = fork();
+
+ if (pid == -1) {
+ printf("signal test: fork failed\n");
+ goto out;
+ }
+
+ if (pid == 0) {
+ /* this should be caught by signal handler and do an exit */
+ if (!sigusr1_signal_test()) {
+ printf("sigusr1_signal_test failed\n");
+ exit(-1);
+ }
+
+ while (!break_loop)
+ sleep(1);
+
+ exit(CHILD_EXIT_CODE_SIG_TEST);
+ /* child shouldn't go beyond here */
+ }
+ /* send SIGUSR1 to child */
+ kill(pid, SIGUSR1);
+ wait(&child_status);
+
+out:
+ if (my_syscall5(__NR_prctl, PR_SET_SHADOW_STACK_STATUS, 0, 0, 0, 0)) {
+ printf("shadow stack disable prctl failed\n");
+ return false;
+ }
+
+ return (WIFEXITED(child_status) &&
+ WEXITSTATUS(child_status) == CHILD_EXIT_CODE_SIG_TEST);
+}
+
+int execute_shadow_stack_tests(void)
+{
+ int ret = 0;
+ unsigned long test_count = 0;
+ unsigned long shstk_status = 0;
+
+ printf("Executing RISC-V shadow stack self tests\n");
+
+ ret = my_syscall5(__NR_prctl, PR_GET_SHADOW_STACK_STATUS, &shstk_status, 0, 0, 0);
+
+ if (ret != 0)
+ ksft_exit_skip("Get shadow stack status failed with %d\n", ret);
+
+ /*
+ * If we are here that means get shadow stack status succeeded and
+ * thus shadow stack support is baked in the kernel.
+ */
+ while (test_count < ARRAY_SIZE(shstk_tests)) {
+ ksft_test_result((*shstk_tests[test_count].t_func)(test_count, NULL),
+ shstk_tests[test_count].name);
+ test_count++;
+ }
+
+ return 0;
+}
+
+#pragma GCC pop_options
diff --git a/tools/testing/selftests/riscv/cfi/shadowstack.h b/tools/testing/selftests/riscv/cfi/shadowstack.h
new file mode 100644
index 000000000000..92cb0752238d
--- /dev/null
+++ b/tools/testing/selftests/riscv/cfi/shadowstack.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SELFTEST_SHADOWSTACK_TEST_H
+#define SELFTEST_SHADOWSTACK_TEST_H
+#include <stddef.h>
+#include <linux/prctl.h>
+
+/*
+ * a cfi test returns true for success or false for fail
+ * takes a number for test number to index into array and void pointer.
+ */
+typedef bool (*shstk_test_func)(unsigned long test_num, void *);
+
+struct shadow_stack_tests {
+ char *name;
+ shstk_test_func t_func;
+};
+
+bool enable_disable_check(unsigned long test_num, void *ctx);
+bool shadow_stack_fork_test(unsigned long test_num, void *ctx);
+bool shadow_stack_map_test(unsigned long test_num, void *ctx);
+bool shadow_stack_protection_test(unsigned long test_num, void *ctx);
+bool shadow_stack_gup_tests(unsigned long test_num, void *ctx);
+bool shadow_stack_signal_test(unsigned long test_num, void *ctx);
+
+static struct shadow_stack_tests shstk_tests[] = {
+ { "enable disable\n", enable_disable_check },
+ { "shstk fork test\n", shadow_stack_fork_test },
+ { "map shadow stack syscall\n", shadow_stack_map_test },
+ { "shadow stack gup tests\n", shadow_stack_gup_tests },
+ { "shadow stack signal tests\n", shadow_stack_signal_test},
+ { "memory protections of shadow stack memory\n", shadow_stack_protection_test }
+};
+
+#define RISCV_SHADOW_STACK_TESTS ARRAY_SIZE(shstk_tests)
+
+int execute_shadow_stack_tests(void);
+
+#endif
--
2.43.0
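For readers without zicfiss hardware, the pattern used by the shadow stack
protection test (fork, store to protected memory in the child, SIGSEGV handler
exits with a known code, parent inspects the wait status) can be tried with an
ordinary read-only mapping. The following is a sketch under that assumption: a
`PROT_READ` page stands in for shadow stack memory, and the function name and
`EXIT_CODE_FAULTED` are hypothetical. It also requires a normal exit with the
expected code, so a child killed by a signal counts as a failure.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define EXIT_CODE_FAULTED 10 /* mirrors CHILD_EXIT_CODE_SSWRITE */

static void segv_handler(int signum, siginfo_t *si, void *uc)
{
	(void)signum;
	(void)si;
	(void)uc;
	_exit(EXIT_CODE_FAULTED); /* _exit() is async-signal-safe */
}

/* Returns true if a store to a read-only page faulted in the child. */
bool store_to_protected_page_faults(void)
{
	struct sigaction sa = { 0 };
	volatile unsigned long *page;
	int status = 0;
	pid_t pid;
	bool ok;

	page = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return false;

	pid = fork();
	if (pid == 0) {
		sa.sa_sigaction = segv_handler;
		sa.sa_flags = SA_SIGINFO;
		sigemptyset(&sa.sa_mask);
		if (sigaction(SIGSEGV, &sa, NULL))
			_exit(1);
		*page = 0xdeadbeef; /* must fault */
		_exit(0);           /* reached only if the store succeeded */
	}

	/* Require a normal exit with the code set by the SIGSEGV handler. */
	ok = pid > 0 &&
	     waitpid(pid, &status, 0) == pid &&
	     WIFEXITED(status) &&
	     WEXITSTATUS(status) == EXIT_CODE_FAULTED;

	munmap((void *)page, 4096);
	return ok;
}
```

The child never returns into the caller, which avoids the "child reached
unreachable state" branch that the selftest has to handle.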
This test is missing a whole bunch of checks for interface
renaming and one ifup. Presumably it was only used on a system
with renaming disabled and NetworkManager running.
Fixes: 91f430b2c49d ("selftests: net: add a test for UDP tunnel info infra")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
.../selftests/drivers/net/netdevsim/udp_tunnel_nic.sh | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index 4855ef597a15..f98435c502f6 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -270,6 +270,7 @@ for port in 0 1; do
echo 1 > $NSIM_DEV_SYS/new_port
fi
NSIM_NETDEV=`get_netdev_name old_netdevs`
+ ifconfig $NSIM_NETDEV up
msg="new NIC device created"
exp0=( 0 0 0 0 )
@@ -431,6 +432,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
@@ -488,6 +490,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
overflow_table0 "overflow NIC table"
@@ -544,6 +547,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
overflow_table0 "destroy NIC"
@@ -573,6 +577,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
msg="create VxLANs v6"
@@ -633,6 +638,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
echo 110 > $NSIM_DEV_DFS/ports/$port/udp_ports_inject_error
@@ -688,6 +694,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
msg="create VxLANs v6"
@@ -747,6 +754,7 @@ for port in 0 1; do
fi
echo $port > $NSIM_DEV_SYS/new_port
+ NSIM_NETDEV=`get_netdev_name old_netdevs`
ifconfig $NSIM_NETDEV up
msg="create VxLANs v6"
@@ -877,6 +885,7 @@ msg="re-add a port"
echo 2 > $NSIM_DEV_SYS/del_port
echo 2 > $NSIM_DEV_SYS/new_port
+NSIM_NETDEV=`get_netdev_name old_netdevs`
check_tables
msg="replace VxLAN in overflow table"
--
2.43.0
If there are more than 32 CPUs the bitmask will start to contain
commas, leading to:
./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected
Remove the commas, bash doesn't interpret leading zeroes as octal
so that should be good enough. Switch to bash, Simon reports that
not all shells support this type of substitution.
Fixes: c12e0d5f267d ("self-tests: introduce self-tests for RPS default mask")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v3:
- switch to bash
v2: https://lore.kernel.org/all/20240120210256.3864747-1-kuba@kernel.org/
- remove all commas
v1: https://lore.kernel.org/all/20240119151248.3476897-1-kuba@kernel.org/
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/rps_default_mask.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/rps_default_mask.sh b/tools/testing/selftests/net/rps_default_mask.sh
index a26c5624429f..4287a8529890 100755
--- a/tools/testing/selftests/net/rps_default_mask.sh
+++ b/tools/testing/selftests/net/rps_default_mask.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
readonly ksft_skip=4
@@ -33,6 +33,10 @@ chk_rps() {
rps_mask=$($cmd /sys/class/net/$dev_name/queues/rx-0/rps_cpus)
printf "%-60s" "$msg"
+
+ # In case there are more than 32 CPUs we need to remove commas from masks
+ rps_mask=${rps_mask//,}
+ expected_rps_mask=${expected_rps_mask//,}
if [ $rps_mask -eq $expected_rps_mask ]; then
echo "[ ok ]"
else
--
2.43.0
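To illustrate the mask format this fix deals with, here is a small C sketch
that parses a comma-separated mask such as "00000000,00000003" by dropping the
commas, which is the same idea as the `${rps_mask//,}` substitution in the
patch. The function name is hypothetical, and the sketch assumes the mask is
hexadecimal and fits in 64 bits; wider masks are rejected rather than handled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a CPU mask such as "00000000,00000003" into a 64-bit value.
 * Returns false on malformed input or if the mask does not fit in 64 bits.
 */
bool parse_cpu_mask(const char *mask, unsigned long long *out)
{
	char digits[64];
	size_t n = 0;

	for (; *mask && *mask != '\n'; mask++) {
		if (*mask == ',')
			continue; /* same effect as ${rps_mask//,} */
		if (n == sizeof(digits) - 1)
			return false;
		digits[n++] = *mask;
	}
	digits[n] = '\0';

	/* Only plain hex digits are accepted: no sign, no 0x prefix. */
	if (n == 0 || strspn(digits, "0123456789abcdefABCDEF") != n)
		return false;

	errno = 0;
	*out = strtoull(digits, NULL, 16);
	return errno == 0; /* ERANGE means more than 64 bits were set */
}
```

Comparing two parsed values avoids depending on how many zero groups the
kernel prints for a given CPU count.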
By allowing the filter_glob parameter to be written to, it's possible to
tweak the testsuites that will be executed on new module loads. This
makes it easier to run specific tests without having to reload kunit and
provides a way to filter tests on real HW even if kunit is builtin.
Example for xe driver:
1) Run just 1 test
# echo -n xe_bo > /sys/module/kunit/parameters/filter_glob
# modprobe -r xe_live_test
# modprobe xe_live_test
# ls /sys/kernel/debug/kunit/
xe_bo
2) Run all tests
# echo \* > /sys/module/kunit/parameters/filter_glob
# modprobe -r xe_live_test
# modprobe xe_live_test
# ls /sys/kernel/debug/kunit/
xe_bo xe_dma_buf xe_migrate xe_mocs
For completeness and to cover other use cases, also change filter and
filter_action to rw.
Link: https://lore.kernel.org/intel-xe/dzacvbdditbneiu3e3fmstjmttcbne44yspumpkd6s…
Reviewed-by: Rae Moar <rmoar(a)google.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
Rae, I kept your r-b from v1 since the additions are just what we talked
about.
v2: also change filter_action and filter to rw, testing with the xe
module to see if filter=module=none filter_action=skip produces
the result expected by igt
lib/kunit/executor.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 1236b3cd2fbb..371ddcee7fb5 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -31,13 +31,13 @@ static char *filter_glob_param;
static char *filter_param;
static char *filter_action_param;
-module_param_named(filter_glob, filter_glob_param, charp, 0400);
+module_param_named(filter_glob, filter_glob_param, charp, 0600);
MODULE_PARM_DESC(filter_glob,
"Filter which KUnit test suites/tests run at boot-time, e.g. list* or list*.*del_test");
-module_param_named(filter, filter_param, charp, 0400);
+module_param_named(filter, filter_param, charp, 0600);
MODULE_PARM_DESC(filter,
"Filter which KUnit test suites/tests run at boot-time using attributes, e.g. speed>slow");
-module_param_named(filter_action, filter_action_param, charp, 0400);
+module_param_named(filter_action, filter_action_param, charp, 0600);
MODULE_PARM_DESC(filter_action,
"Changes behavior of filtered tests using attributes, valid values are:\n"
"<none>: do not run filtered tests as normal\n"
--
2.40.1
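The change above is entirely in the permission argument of
`module_param_named()`: 0400 becomes 0600. As a quick illustration of what
those octal values mean, the sketch below decodes the owner bits using the
standard `S_IRUSR` / `S_IWUSR` macros from user space. The helper names are
hypothetical and this is not KUnit code.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* True if the file owner (root for sysfs parameters) may read. */
bool owner_can_read(mode_t mode)
{
	return (mode & S_IRUSR) != 0;
}

/* True if the file owner may write. */
bool owner_can_write(mode_t mode)
{
	return (mode & S_IWUSR) != 0;
}

/* True if group or others have any access at all. */
bool others_have_access(mode_t mode)
{
	return (mode & (S_IRWXG | S_IRWXO)) != 0;
}
```

With 0400 only `owner_can_read()` holds; with 0600 `owner_can_write()` holds as
well, and in both cases group and others get no access.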
Patches 1 and 3 are fixes for tdc that were discovered when running it
using defconfig + tc-testing config and against the latest iproute2.
Patch 2 improves the taprio script that waits for scheduler changes.
Finally, Patch 4 enables all tdc tests.
Pedro Tammela (4):
selftests: tc-testing: add missing netfilter config
selftests: tc-testing: check if 'jq' is available in taprio script
selftests: tc-testing: adjust fq test to latest iproute2
selftests: tc-testing: enable all tdc tests
tools/testing/selftests/tc-testing/config | 1 +
.../selftests/tc-testing/scripts/taprio_wait_for_admin.sh | 5 +++++
tools/testing/selftests/tc-testing/tc-tests/qdiscs/fq.json | 2 +-
tools/testing/selftests/tc-testing/tdc.sh | 3 +--
4 files changed, 8 insertions(+), 3 deletions(-)
--
2.40.1
Hi,
Here are a few fixes for seccomp_bpf tests found when testing on
Android:
user_notification_sibling_pid_ns:
unshare(CLONE_NEWPID) can return EINVAL so have added a check for this.
KILL_THREAD:
This one is a bit more Android specific.
In Bionic pthread_create is calling prctl, this is causing the test to
fail as prctl is in the filter for this test and is killed when it is
called. I've just changed prctl to getpid in this case.
user_notification_addfd:
This test can fail if there are existing file descriptors when the test
starts. It expects the next file descriptor to always increase
sequentially which is not always the case.
Added a get_next_fd function to return the next expected file descriptor.
Regards,
Terry
Terry Tritton (3):
selftests/seccomp: Handle EINVAL on unshare(CLONE_NEWPID)
selftests/seccomp: Change the syscall used in KILL_THREAD test
selftests/seccomp: user_notification_addfd check nextfd is available
tools/testing/selftests/seccomp/seccomp_bpf.c | 41 ++++++++++++++-----
1 file changed, 31 insertions(+), 10 deletions(-)
--
2.43.0.429.g432eaa2c6b-goog
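On the user_notification_addfd point: one way to find the next descriptor the
kernel will hand out is to probe descriptors in order until one is unused. The
sketch below does that with `fcntl(fd, F_GETFD)`. It is an illustration of the
idea only, not necessarily how the get_next_fd helper in the series is
written; the name `next_free_fd` and the 1024 fallback are assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Return the lowest unused descriptor greater than prev_fd, or -1. */
int next_free_fd(int prev_fd)
{
	long limit = sysconf(_SC_OPEN_MAX);
	int fd;

	if (limit < 0)
		limit = 1024; /* hypothetical fallback when no limit is reported */

	for (fd = prev_fd + 1; fd < limit; fd++) {
		/* EBADF from F_GETFD means the descriptor is not open. */
		if (fcntl(fd, F_GETFD) == -1 && errno == EBADF)
			return fd;
	}

	return -1;
}
```

A test can then compare a newly received descriptor against
`next_free_fd(prev_fd)` instead of assuming `prev_fd + 1`.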
On Tue, 23 Jan 2024 10:55:09 +0100 Petr Machata wrote:
> > If you authored any net or drivers/net selftests, please look around
> > and see if they are passing. If not - send patches or LMK what I need
> > to do to make them pass on the runner.. Make sure to scroll down to
> > the "Not reporting to patchwork" section.
>
> A whole bunch of them fail because of no IPv6 support in the runner
> kernel. E.g. this from bridge-mdb.sh[0]:
Thanks a lot for investigating! I take it that you're looking at
forwarding? Please send a patch to add the missing configs to
tools/testing/selftests/net/forwarding/config
The runner uses that to configure the kernel on top of defconfig.
Unless I'm doing it wrong and the sub-directories are supposed to
inherit the parent directory's config? So net/forwarding/ should
be built with net/'s config? I could not find the info in docs,
does anyone know?
Arch maintainers, please ack/review patches.
This is a resend of a series from Frank last year[1]. I worked in Rob's
review comments to unconditionally call unflatten_device_tree() and
fixup/audit calls to of_have_populated_dt() so that behavior doesn't
change.
I need this series so I can add DT based tests in the clk framework.
Either I can merge it through the clk tree once everyone is happy, or
Rob can merge it through the DT tree and provide some branch so I can
base clk patches on it.
Changes from Frank's series[1]:
* Add a DTB loaded kunit test
* Make of_have_populated_dt() return false if the DTB isn't from the
bootloader
* Architecture calls made unconditional so that a root node is always
made
Frank Rowand (2):
of: Create of_root if no dtb provided by firmware
of: unittest: treat missing of_root as error instead of fixing up
Stephen Boyd (4):
arm64: Unconditionally call unflatten_device_tree()
um: Unconditionally call unflatten_device_tree()
of: Always unflatten in unflatten_and_copy_device_tree()
of: Add KUnit test to confirm DTB is loaded
arch/arm64/kernel/setup.c | 3 +-
arch/um/kernel/dtb.c | 14 +++---
drivers/of/.kunitconfig | 3 ++
drivers/of/Kconfig | 16 ++++++-
drivers/of/Makefile | 4 +-
drivers/of/empty_root.dts | 6 +++
drivers/of/fdt.c | 57 +++++++++++++++++------
drivers/of/of_test.c | 98 +++++++++++++++++++++++++++++++++++++++
drivers/of/platform.c | 3 --
drivers/of/unittest.c | 16 ++-----
include/linux/of.h | 17 +++++--
11 files changed, 191 insertions(+), 46 deletions(-)
create mode 100644 drivers/of/.kunitconfig
create mode 100644 drivers/of/empty_root.dts
create mode 100644 drivers/of/of_test.c
Cc: Anton Ivanov <anton.ivanov(a)cambridgegreys.com>
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Frank Rowand <frowand.list(a)gmail.com>
Cc: Johannes Berg <johannes(a)sipsolutions.net>
Cc: Richard Weinberger <richard(a)nod.at>
Cc: Rob Herring <robh+dt(a)kernel.org>
Cc: Will Deacon <will(a)kernel.org>
[1] https://lore.kernel.org/r/20230317053415.2254616-1-frowand.list@gmail.com
base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
https://git.kernel.org/pub/scm/linux/kernel/git/clk/linux.git/
https://git.kernel.org/pub/scm/linux/kernel/git/sboyd/spmi.git
This introduces signal->exec_bprm, which is used to
fix the case when at least one of the sibling threads
is traced, and therefore the tracer process may deadlock
in ptrace_attach, but de_thread will need to wait for the
tracer to continue execution.
The solution is to detect this situation and allow
ptrace_attach to continue by temporarily releasing the
cred_guard_mutex, while de_thread() is still waiting for
traced zombies to be eventually released by the tracer.
In the case of the thread group leader we only have to wait
for the thread to become a zombie, which may also need
co-operation from the tracer due to PTRACE_O_TRACEEXIT.
When a tracer wants to ptrace_attach a task that already
is in execve, we simply retry the ptrace_may_access
check while temporarily installing the new credentials
and dumpability which are about to be used after execve
completes. If the ptrace_attach happens on a thread that
is a sibling-thread of the thread doing execve, it is
sufficient to check against the old credentials, as this
thread will be waited for, before the new credentials are
installed.
Other threads die quickly since the cred_guard_mutex is
released, but a deadly signal is already pending. In case
the mutex_lock_killable misses the signal, the non-zero
current->signal->exec_bprm makes sure they release the
mutex immediately and return with -ERESTARTNOINTR.
This means there is no API change, unlike the previous
version of this patch which was discussed here:
https://lore.kernel.org/lkml/b6537ae6-31b1-5c50-f32b-8b8332ace882@hotmail.d…
See tools/testing/selftests/ptrace/vmaccess.c
for a test case that gets fixed by this change.
Note that since the test case was originally designed to
test the ptrace_attach returning an error in this situation,
the test expectation needed to be adjusted, to allow the
API to succeed at the first attempt.
Signed-off-by: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
---
fs/exec.c | 69 ++++++++++++++++-------
fs/proc/base.c | 6 ++
include/linux/cred.h | 1 +
include/linux/sched/signal.h | 18 ++++++
kernel/cred.c | 28 +++++++--
kernel/ptrace.c | 32 +++++++++++
kernel/seccomp.c | 12 +++-
tools/testing/selftests/ptrace/vmaccess.c | 23 +++++---
8 files changed, 155 insertions(+), 34 deletions(-)
v10: Changes to previous version, make the PTRACE_ATTACH
return -EAGAIN, instead of execve return -ERESTARTSYS.
Added some lessons learned to the description.
v11: Check old and new credentials in PTRACE_ATTACH again without
changing the API.
Note: I actually got one response from an automatic checker to the v11 patch,
https://lore.kernel.org/lkml/202107121344.wu68hEPF-lkp@intel.com/
which is complaining about:
>> kernel/ptrace.c:425:26: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct cred const *old_cred @@ got struct cred const [noderef] __rcu *real_cred @@
417 struct linux_binprm *bprm = task->signal->exec_bprm;
418 const struct cred *old_cred;
419 struct mm_struct *old_mm;
420
421 retval = down_write_killable(&task->signal->exec_update_lock);
422 if (retval)
423 goto unlock_creds;
424 task_lock(task);
> 425 old_cred = task->real_cred;
v12: Essentially identical to v11.
- Fixed a minor merge conflict in linux v5.17, and fixed the
above mentioned nit by adding __rcu to the declaration.
- re-tested the patch with all linux versions from v5.11 to v6.6
v10 was an alternative approach which did imply an API change.
But I would prefer to avoid such an API change.
The difficult part is getting the right dumpability flags assigned
before de_thread starts, hope you like this version.
If not, the v10 is of course also acceptable.
Thanks
Bernd.
diff --git a/fs/exec.c b/fs/exec.c
index 2f2b0acec4f0..902d3b230485 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1041,11 +1041,13 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
-static int de_thread(struct task_struct *tsk)
+static int de_thread(struct task_struct *tsk, struct linux_binprm *bprm)
{
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t = tsk;
+ bool unsafe_execve_in_progress = false;
if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1068,6 +1070,19 @@ static int de_thread(struct task_struct *tsk)
if (!thread_group_leader(tsk))
sig->notify_count--;
+ while_each_thread(tsk, t) {
+ if (unlikely(t->ptrace)
+ && (t != tsk->group_leader || !t->exit_state))
+ unsafe_execve_in_progress = true;
+ }
+
+ if (unlikely(unsafe_execve_in_progress)) {
+ spin_unlock_irq(lock);
+ sig->exec_bprm = bprm;
+ mutex_unlock(&sig->cred_guard_mutex);
+ spin_lock_irq(lock);
+ }
+
while (sig->notify_count) {
__set_current_state(TASK_KILLABLE);
spin_unlock_irq(lock);
@@ -1158,6 +1173,11 @@ static int de_thread(struct task_struct *tsk)
release_task(leader);
}
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
sig->group_exec_task = NULL;
sig->notify_count = 0;
@@ -1169,6 +1189,11 @@ static int de_thread(struct task_struct *tsk)
return 0;
killed:
+ if (unlikely(unsafe_execve_in_progress)) {
+ mutex_lock(&sig->cred_guard_mutex);
+ sig->exec_bprm = NULL;
+ }
+
/* protects against exit_notify() and __exit_signal() */
read_lock(&tasklist_lock);
sig->group_exec_task = NULL;
@@ -1253,6 +1278,24 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
+ /* If the binary is not readable then enforce mm->dumpable=0 */
+ would_dump(bprm, bprm->file);
+ if (bprm->have_execfd)
+ would_dump(bprm, bprm->executable);
+
+ /*
+ * Figure out dumpability. Note that this checking only of current
+ * is wrong, but userspace depends on it. This should be testing
+ * bprm->secureexec instead.
+ */
+ if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
+ is_dumpability_changed(current_cred(), bprm->cred) ||
+ !(uid_eq(current_euid(), current_uid()) &&
+ gid_eq(current_egid(), current_gid())))
+ set_dumpable(bprm->mm, suid_dumpable);
+ else
+ set_dumpable(bprm->mm, SUID_DUMP_USER);
+
/*
* Ensure all future errors are fatal.
*/
@@ -1261,7 +1304,7 @@ int begin_new_exec(struct linux_binprm * bprm)
/*
* Make this the only thread in the thread group.
*/
- retval = de_thread(me);
+ retval = de_thread(me, bprm);
if (retval)
goto out;
@@ -1284,11 +1327,6 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
goto out;
- /* If the binary is not readable then enforce mm->dumpable=0 */
- would_dump(bprm, bprm->file);
- if (bprm->have_execfd)
- would_dump(bprm, bprm->executable);
-
/*
* Release all of the old mmap stuff
*/
@@ -1350,18 +1388,6 @@ int begin_new_exec(struct linux_binprm * bprm)
me->sas_ss_sp = me->sas_ss_size = 0;
- /*
- * Figure out dumpability. Note that this checking only of current
- * is wrong, but userspace depends on it. This should be testing
- * bprm->secureexec instead.
- */
- if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
- !(uid_eq(current_euid(), current_uid()) &&
- gid_eq(current_egid(), current_gid())))
- set_dumpable(current->mm, suid_dumpable);
- else
- set_dumpable(current->mm, SUID_DUMP_USER);
-
perf_event_exec();
__set_task_comm(me, kbasename(bprm->filename), true);
@@ -1480,6 +1506,11 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
return -ERESTARTNOINTR;
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
bprm->cred = prepare_exec_creds();
if (likely(bprm->cred))
return 0;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ffd54617c354..0da9adfadb48 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2788,6 +2788,12 @@ static ssize_t proc_pid_attr_write(struct file * file, const char __user * buf,
if (rv < 0)
goto out_free;
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ rv = -ERESTARTNOINTR;
+ goto out_free;
+ }
+
rv = security_setprocattr(PROC_I(inode)->op.lsm,
file->f_path.dentry->d_name.name, page,
count);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f923528d5cc4..b01e309f5686 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -159,6 +159,7 @@ extern const struct cred *get_task_cred(struct task_struct *);
extern struct cred *cred_alloc_blank(void);
extern struct cred *prepare_creds(void);
extern struct cred *prepare_exec_creds(void);
+extern bool is_dumpability_changed(const struct cred *, const struct cred *);
extern int commit_creds(struct cred *);
extern void abort_creds(struct cred *);
extern const struct cred *override_creds(const struct cred *);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0014d3adaf84..14df7073a0a8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,9 +234,27 @@ struct signal_struct {
struct mm_struct *oom_mm; /* recorded mm when the thread group got
* killed by the oom killer */
+ struct linux_binprm *exec_bprm; /* Used to check ptrace_may_access
+ * against new credentials while
+ * de_thread is waiting for other
+ * traced threads to terminate.
+ * Set while de_thread is executing.
+ * The cred_guard_mutex is released
+ * after de_thread() has called
+ * zap_other_threads(), therefore
+ * a fatal signal is guaranteed to be
+ * already pending in the unlikely
+ * event, that
+ * current->signal->exec_bprm happens
+ * to be non-zero after the
+ * cred_guard_mutex was acquired.
+ */
+
struct mutex cred_guard_mutex; /* guard against foreign influences on
* credential calculations
* (notably. ptrace)
+ * Held while execve runs, except when
+ * a sibling thread is being traced.
* Deprecated do not use in new code.
* Use exec_update_lock instead.
*/
diff --git a/kernel/cred.c b/kernel/cred.c
index 98cb4eca23fb..586cb6c7cf6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -433,6 +433,28 @@ static bool cred_cap_issubset(const struct cred *set, const struct cred *subset)
return false;
}
+/**
+ * is_dumpability_changed - Will changing creds from old to new
+ * affect the dumpability in commit_creds?
+ *
+ * Return: false - dumpability will not be changed in commit_creds.
+ * Return: true - dumpability will be changed to non-dumpable.
+ *
+ * @old: The old credentials
+ * @new: The new credentials
+ */
+bool is_dumpability_changed(const struct cred *old, const struct cred *new)
+{
+ if (!uid_eq(old->euid, new->euid) ||
+ !gid_eq(old->egid, new->egid) ||
+ !uid_eq(old->fsuid, new->fsuid) ||
+ !gid_eq(old->fsgid, new->fsgid) ||
+ !cred_cap_issubset(old, new))
+ return true;
+
+ return false;
+}
+
/**
* commit_creds - Install new credentials upon the current task
* @new: The credentials to be assigned
@@ -467,11 +489,7 @@ int commit_creds(struct cred *new)
get_cred(new); /* we will require a ref for the subj creds too */
/* dumpability changes */
- if (!uid_eq(old->euid, new->euid) ||
- !gid_eq(old->egid, new->egid) ||
- !uid_eq(old->fsuid, new->fsuid) ||
- !gid_eq(old->fsgid, new->fsgid) ||
- !cred_cap_issubset(old, new)) {
+ if (is_dumpability_changed(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..eb1c450bb7d7 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -20,6 +20,7 @@
#include <linux/pagemap.h>
#include <linux/ptrace.h>
#include <linux/security.h>
+#include <linux/binfmts.h>
#include <linux/signal.h>
#include <linux/uio.h>
#include <linux/audit.h>
@@ -435,6 +436,28 @@ static int ptrace_attach(struct task_struct *task, long request,
if (retval)
goto unlock_creds;
+ if (unlikely(task->in_execve)) {
+ struct linux_binprm *bprm = task->signal->exec_bprm;
+ const struct cred __rcu *old_cred;
+ struct mm_struct *old_mm;
+
+ retval = down_write_killable(&task->signal->exec_update_lock);
+ if (retval)
+ goto unlock_creds;
+ task_lock(task);
+ old_cred = task->real_cred;
+ old_mm = task->mm;
+ rcu_assign_pointer(task->real_cred, bprm->cred);
+ task->mm = bprm->mm;
+ retval = __ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ rcu_assign_pointer(task->real_cred, old_cred);
+ task->mm = old_mm;
+ task_unlock(task);
+ up_write(&task->signal->exec_update_lock);
+ if (retval)
+ goto unlock_creds;
+ }
+
write_lock_irq(&tasklist_lock);
retval = -EPERM;
if (unlikely(task->exit_state))
@@ -508,6 +531,14 @@ static int ptrace_traceme(void)
{
int ret = -EPERM;
+ if (mutex_lock_interruptible(&current->signal->cred_guard_mutex))
+ return -ERESTARTNOINTR;
+
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ return -ERESTARTNOINTR;
+ }
+
write_lock_irq(&tasklist_lock);
/* Are we already being traced? */
if (!current->ptrace) {
@@ -523,6 +554,7 @@ static int ptrace_traceme(void)
}
}
write_unlock_irq(&tasklist_lock);
+ mutex_unlock(&current->signal->cred_guard_mutex);
return ret;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 255999ba9190..b29bbfa0b044 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1955,9 +1955,15 @@ static long seccomp_set_mode_filter(unsigned int flags,
* Make sure we cannot change seccomp or nnp state via TSYNC
* while another thread is in the middle of calling exec.
*/
- if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
- mutex_lock_killable(&current->signal->cred_guard_mutex))
- goto out_put_fd;
+ if (flags & SECCOMP_FILTER_FLAG_TSYNC) {
+ if (mutex_lock_killable(&current->signal->cred_guard_mutex))
+ goto out_put_fd;
+
+ if (unlikely(current->signal->exec_bprm)) {
+ mutex_unlock(&current->signal->cred_guard_mutex);
+ goto out_put_fd;
+ }
+ }
spin_lock_irq(&current->sighand->siglock);
diff --git a/tools/testing/selftests/ptrace/vmaccess.c b/tools/testing/selftests/ptrace/vmaccess.c
index 4db327b44586..3b7d81fb99bb 100644
--- a/tools/testing/selftests/ptrace/vmaccess.c
+++ b/tools/testing/selftests/ptrace/vmaccess.c
@@ -39,8 +39,15 @@ TEST(vmaccess)
f = open(mm, O_RDONLY);
ASSERT_GE(f, 0);
close(f);
- f = kill(pid, SIGCONT);
- ASSERT_EQ(f, 0);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_NE(f, -1);
+ ASSERT_NE(f, 0);
+ ASSERT_NE(f, pid);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, pid);
+ f = waitpid(-1, NULL, 0);
+ ASSERT_EQ(f, -1);
+ ASSERT_EQ(errno, ECHILD);
}
TEST(attach)
@@ -57,22 +64,24 @@ TEST(attach)
sleep(1);
k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
- ASSERT_EQ(errno, EAGAIN);
- ASSERT_EQ(k, -1);
+ ASSERT_EQ(k, 0);
k = waitpid(-1, &s, WNOHANG);
ASSERT_NE(k, -1);
ASSERT_NE(k, 0);
ASSERT_NE(k, pid);
ASSERT_EQ(WIFEXITED(s), 1);
ASSERT_EQ(WEXITSTATUS(s), 0);
- sleep(1);
- k = ptrace(PTRACE_ATTACH, pid, 0L, 0L);
+ k = waitpid(-1, &s, 0);
+ ASSERT_EQ(k, pid);
+ ASSERT_EQ(WIFSTOPPED(s), 1);
+ ASSERT_EQ(WSTOPSIG(s), SIGTRAP);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
ASSERT_EQ(WIFSTOPPED(s), 1);
ASSERT_EQ(WSTOPSIG(s), SIGSTOP);
- k = ptrace(PTRACE_DETACH, pid, 0L, 0L);
+ k = ptrace(PTRACE_CONT, pid, 0L, 0L);
ASSERT_EQ(k, 0);
k = waitpid(-1, &s, 0);
ASSERT_EQ(k, pid);
--
2.39.2
Currently the seccomp benchmark selftest produces non-standard output,
meaning that while it makes a number of checks of the performance it
observes this has to be parsed by humans. This means that automated
systems running this suite of tests are almost certainly ignoring the
results which isn't ideal for spotting problems. Let's rework things so
that each check that the program does is reported as a test result to
the framework.
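As a rough illustration of reporting each check as its own result, the sketch below formats one TAP-style line per expectation. It is a plain userspace stand-in, not the code from this series: demo_tap_result() and the check names in the usage note are hypothetical, and the actual patches use the kselftest output functions instead.

```c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a kselftest result helper: format one
 * TAP-style result line ("ok N name" or "not ok N name") so that a
 * harness can parse every expectation separately.
 */
static int demo_tap_result(char *buf, size_t len, unsigned int num,
			   bool passed, const char *name)
{
	return snprintf(buf, len, "%sok %u %s\n",
			passed ? "" : "not ", num, name);
}
```

Calling demo_tap_result(buf, sizeof(buf), 1, true, "first_check") fills buf with a single "ok" line, while a failed expectation produces a "not ok" line that automation can flag without a human reading the raw numbers.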
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Re-add signoff.
- Link to v2: https://lore.kernel.org/r/20240122-b4-kselftest-seccomp-benchmark-ktap-v2-0…
Changes in v2:
- Rebase onto v6.8-rc1.
- Link to v1: https://lore.kernel.org/r/20231219-b4-kselftest-seccomp-benchmark-ktap-v1-0…
---
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
.../testing/selftests/seccomp/seccomp_benchmark.c | 105 +++++++++++++--------
1 file changed, 65 insertions(+), 40 deletions(-)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20231219-b4-kselftest-seccomp-benchmark-ktap-357603823708
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This is part of an effort to improve detection of regressions impacting
device probe on all platforms. The recently merged DT kselftest [3]
detects probe issues for all devices described statically in the DT.
That leaves out devices discovered at run-time from discoverable buses.
This is where this test comes in. All of the devices that are connected
through discoverable buses (i.e. USB and PCI), and which are internal and
therefore always present, can be described based on their position in
the system topology in a per-platform YAML file so they can be checked
for. The test will check that the device has been instantiated and bound
to a driver.
Patch 1 introduces the test. Patches 2 and 3 add the device definitions
for the google,spherion machine (Acer Chromebook 514) and XPS 13 as
examples.
This is the output from the test running on Spherion:
TAP version 13
Using board file: boards/google,spherion.yaml
1..8
ok 1 /usb2-controller(a)11200000/1.4.1/camera.device
ok 2 /usb2-controller(a)11200000/1.4.1/camera.0.driver
ok 3 /usb2-controller(a)11200000/1.4.1/camera.1.driver
ok 4 /usb2-controller(a)11200000/1.4.2/bluetooth.device
ok 5 /usb2-controller(a)11200000/1.4.2/bluetooth.0.driver
ok 6 /usb2-controller(a)11200000/1.4.2/bluetooth.1.driver
ok 7 /pci-controller(a)11230000/0.0/0.0/wifi.device
ok 8 /pci-controller(a)11230000/0.0/0.0/wifi.driver
Totals: pass:8 fail:0 xfail:0 xpass:0 skip:0 error:0
[3] https://lore.kernel.org/all/20230828211424.2964562-1-nfraprado@collabora.co…
Changes in v4:
- Dropped RFC tag
- Fixed 'busses' misspelling
- Link to v3: https://lore.kernel.org/all/20231227123643.52348-1-nfraprado@collabora.com
Changes in v3:
- Reverted approach of encoding stable device reference in test file
from device match fields (from modalias) back to HW topology (from v1)
- Changed board file description to YAML
- Rewrote test script in python to handle YAML and support x86 platforms
- Link to v2: https://lore.kernel.org/all/20231127233558.868365-1-nfraprado@collabora.com
Changes in v2:
- Changed approach of encoding stable device reference in test file from
HW topology to device match fields (the ones from modalias)
- Better documented test format
- Link to v1: https://lore.kernel.org/all/20231024211818.365844-1-nfraprado@collabora.com
---
Nícolas F. R. A. Prado (3):
kselftest: Add test to verify probe of devices from discoverable buses
kselftest: devices: Add sample board file for google,spherion
kselftest: devices: Add sample board file for XPS 13 9300
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/devices/Makefile | 4 +
.../devices/boards/Dell Inc.,XPS 13 9300.yaml | 40 +++
.../selftests/devices/boards/google,spherion.yaml | 50 ++++
tools/testing/selftests/devices/ksft.py | 90 ++++++
.../selftests/devices/test_discoverable_devices.py | 318 +++++++++++++++++++++
6 files changed, 503 insertions(+)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240122-discoverable-devs-ksft-9d501e312688
Best regards,
--
Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
Add missing tests to run_vmtests.sh. The mm kselftests are run through
run_vmtests.sh. If a test isn't present in this script, it'll not run
with run_tests or `make -C tools/testing/selftests/mm run_tests`.
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Changes since v1:
- Copy the original scripts and the scripts they depend on to the install directory as well
---
tools/testing/selftests/mm/Makefile | 3 +++
tools/testing/selftests/mm/run_vmtests.sh | 3 +++
2 files changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 2453add65d12f..c9c8112a7262e 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -114,6 +114,9 @@ TEST_PROGS := run_vmtests.sh
TEST_FILES := test_vmalloc.sh
TEST_FILES += test_hmm.sh
TEST_FILES += va_high_addr_switch.sh
+TEST_FILES += charge_reserved_hugetlb.sh
+TEST_FILES += write_hugetlb_memory.sh
+TEST_FILES += hugetlb_reparenting_test.sh
include ../lib.mk
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f28..12754af00b39c 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -248,6 +248,9 @@ CATEGORY="hugetlb" run_test ./map_hugetlb
CATEGORY="hugetlb" run_test ./hugepage-mremap
CATEGORY="hugetlb" run_test ./hugepage-vmemmap
CATEGORY="hugetlb" run_test ./hugetlb-madvise
+CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh -cgroup-v2
+CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh -cgroup-v2
+CATEGORY="hugetlb" run_test ./hugetlb-read-hwpoison
nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
--
2.42.0
The busywait timeout value is in milliseconds, not seconds, so the
current setting of 2 is meaningless. Let's copy the WAIT_TIMEOUT from
forwarding/lib.sh and set a BUSYWAIT_TIMEOUT here.
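The unit mismatch can be illustrated with a small C sketch of a busy-wait whose timeout is given in milliseconds. This is only an analogy to the shell helper, under stated assumptions: every demo_* name is hypothetical, and the sketch uses the C11 timespec_get() wall clock for simplicity rather than a monotonic clock.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: convert a timeout in seconds to milliseconds. */
static long demo_seconds_to_ms(long seconds)
{
	return seconds * 1000L;
}

/* Current wall-clock time in milliseconds (C11 timespec_get). */
static long long demo_now_ms(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (long long)ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}

/* Example predicates used to exercise the busy-wait below. */
static bool demo_never_ready(void) { return false; }
static bool demo_always_ready(void) { return true; }

/*
 * Hypothetical busy-wait: poll check() until it succeeds or timeout_ms
 * milliseconds have elapsed. Returns true if check() succeeded in time.
 */
static bool demo_busywait_ms(long timeout_ms, bool (*check)(void))
{
	long long deadline = demo_now_ms() + timeout_ms;

	do {
		if (check())
			return true;
	} while (demo_now_ms() < deadline);

	return false;
}
```

With these definitions, demo_busywait_ms(2, ...) gives up after roughly two milliseconds, whereas demo_busywait_ms(demo_seconds_to_ms(20), ...) keeps polling for up to twenty seconds, which mirrors the difference between the old literal 2 and the new BUSYWAIT_TIMEOUT.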
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
Not sure if the default WAIT_TIMEOUT of 20s is too large, but since
we usually don't need to wait that long, I think it's OK to keep
the same value as forwarding/lib.sh. Please tell me if you think
a more suitable value is needed.
BTW, this doesn't look like a fix, but it's also not a feature, so I'm
just posting it to the net tree.
---
tools/testing/selftests/net/lib.sh | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index dca549443801..f9fe182dfbd4 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -4,6 +4,9 @@
##############################################################################
# Defines
+WAIT_TIMEOUT=${WAIT_TIMEOUT:=20}
+BUSYWAIT_TIMEOUT=$((WAIT_TIMEOUT * 1000)) # ms
+
# Kselftest framework requirement - SKIP code is 4.
ksft_skip=4
# namespace list created by setup_ns
@@ -48,7 +51,7 @@ cleanup_ns()
for ns in "$@"; do
ip netns delete "${ns}" &> /dev/null
- if ! busywait 2 ip netns list \| grep -vq "^$ns$" &> /dev/null; then
+ if ! busywait $BUSYWAIT_TIMEOUT ip netns list \| grep -vq "^$ns$" &> /dev/null; then
echo "Warn: Failed to remove namespace $ns"
ret=1
fi
--
2.43.0
Add missing tests to run_vmtests.sh. The mm kselftests are run through
run_vmtests.sh. If a test isn't present in this script, it'll not run
with run_tests or `make -C tools/testing/selftests/mm run_tests`.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/run_vmtests.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f2..a5e6ba8d3579 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -248,6 +248,9 @@ CATEGORY="hugetlb" run_test ./map_hugetlb
CATEGORY="hugetlb" run_test ./hugepage-mremap
CATEGORY="hugetlb" run_test ./hugepage-vmemmap
CATEGORY="hugetlb" run_test ./hugetlb-madvise
+CATEGORY="hugetlb" run_test ./charge_reserved_hugetlb.sh
+CATEGORY="hugetlb" run_test ./hugetlb_reparenting_test.sh
+CATEGORY="hugetlb" run_test ./hugetlb-read-hwpoison
nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
--
2.42.0
In an effort to separate intentional arithmetic wrap-around from
unexpected wrap-around, we need to refactor places that depend on this
kind of math. One of the most common code patterns of this is:
VAR + value < VAR
Notably, this is considered "undefined behavior" for signed and pointer
types, which the kernel works around by using the -fno-strict-overflow
option in the build[1] (which used to just be -fwrapv). Regardless, we
want to get the kernel source to the position where we can meaningfully
instrument arithmetic wrap-around conditions and catch them when they
are unexpected, regardless of whether they are signed[2], unsigned[3],
or pointer[4] types.
Refactor open-coded wrap-around addition test to use add_would_overflow().
This paves the way to enabling the wrap-around sanitizers in the future.
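For illustration, the open-coded test and a helper-based replacement can be compared in a small userspace sketch. The name demo_add_would_overflow() is hypothetical: it is a stand-in built on the GCC/Clang __builtin_add_overflow() builtin for unsigned long operands only, not the kernel's add_would_overflow() definition.

```c
#include <stdbool.h>
#include <limits.h>

/*
 * Hypothetical userspace stand-in (not the kernel helper): report whether
 * a + b wraps around for unsigned long operands, without relying on the
 * wrapped result itself.
 */
static bool demo_add_would_overflow(unsigned long a, unsigned long b)
{
	unsigned long sum;

	/* GCC/Clang builtin: returns true when the addition overflows. */
	return __builtin_add_overflow(a, b, &sum);
}

/* The open-coded idiom being replaced: "VAR + value < VAR". */
static bool open_coded_wraps(unsigned long var, unsigned long value)
{
	return var + value < var;
}
```

For unsigned operands both functions agree on every input; the difference is that the helper form states the intent explicitly, so an instrumented build can tell this deliberate check apart from accidental wrap-around.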
Link: https://git.kernel.org/linus/68df3755e383e6fecf2354a67b08f92f18536594 [1]
Link: https://github.com/KSPP/linux/issues/26 [2]
Link: https://github.com/KSPP/linux/issues/27 [3]
Link: https://github.com/KSPP/linux/issues/344 [4]
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-mm(a)kvack.org
Cc: linux-kselftest(a)vger.kernel.org
Signed-off-by: Kees Cook <keescook(a)chromium.org>
---
mm/memory.c | 4 ++--
mm/mmap.c | 2 +-
mm/mremap.c | 2 +-
mm/nommu.c | 4 ++--
mm/util.c | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..d47acdff7af3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2559,7 +2559,7 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
unsigned long vm_len, pfn, pages;
/* Check that the physical memory area passed in looks valid */
- if (start + len < start)
+ if (add_would_overflow(start, len))
return -EINVAL;
/*
* You *really* shouldn't map things that aren't page-aligned,
@@ -2569,7 +2569,7 @@ int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long
len += start & ~PAGE_MASK;
pfn = start >> PAGE_SHIFT;
pages = (len + ~PAGE_MASK) >> PAGE_SHIFT;
- if (pfn + pages < pfn)
+ if (add_would_overflow(pfn, pages))
return -EINVAL;
/* We start the mapping 'vm_pgoff' pages into the area */
diff --git a/mm/mmap.c b/mm/mmap.c
index b78e83d351d2..16501fcaf511 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3023,7 +3023,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
return ret;
/* Does pgoff wrap? */
- if (pgoff + (size >> PAGE_SHIFT) < pgoff)
+ if (add_would_overflow(pgoff, (size >> PAGE_SHIFT)))
return ret;
if (mmap_write_lock_killable(mm))
diff --git a/mm/mremap.c b/mm/mremap.c
index 38d98465f3d8..efa27019a05d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -848,7 +848,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
/* Need to be careful about a growing mapping */
pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
pgoff += vma->vm_pgoff;
- if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
+ if (add_would_overflow(pgoff, (new_len >> PAGE_SHIFT)))
return ERR_PTR(-EINVAL);
if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
diff --git a/mm/nommu.c b/mm/nommu.c
index b6dc558d3144..299bcfe19eed 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -202,7 +202,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
{
/* Don't allow overflow */
- if ((unsigned long) addr + count < count)
+ if (add_would_overflow(count, (unsigned long)addr))
count = -(unsigned long) addr;
return copy_to_iter(addr, count, iter);
@@ -1705,7 +1705,7 @@ int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, in
{
struct mm_struct *mm;
- if (addr + len < addr)
+ if (add_would_overflow(addr, len))
return 0;
mm = get_task_mm(tsk);
diff --git a/mm/util.c b/mm/util.c
index 5a6a9802583b..e6beeb23b48b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -567,7 +567,7 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
{
- if (unlikely(offset + PAGE_ALIGN(len) < offset))
+ if (unlikely(add_would_overflow(offset, PAGE_ALIGN(len))))
return -EINVAL;
if (unlikely(offset_in_page(offset)))
return -EINVAL;
--
2.34.1
6.1-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 3f47c1ebe5ca9c5883e596c7888dec4bec0176d8 ]
The GCC 13.2.0 compiler issued the following warning:
mixer-test.c: In function ‘ctl_value_index_valid’:
mixer-test.c:322:79: warning: format ‘%lld’ expects argument of type ‘long long int’, \
but argument 5 has type ‘long int’ [-Wformat=]
322 | ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
| ~~~^
| |
| long long int
| %ld
323 | ctl->name, index, int64_val,
324 | snd_ctl_elem_info_get_max(ctl->info));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| long int
Fixing the format specifier as advised by the compiler suggestion removes the
warning.
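A minimal sketch of the rule behind this warning, assuming a hypothetical helper named demo_format_limit() that is not part of mixer-test.c: each conversion specifier has to match the type of its argument, so a long long value pairs with %lld and a plain long pairs with %ld.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: format a 64-bit value next to a plain long maximum.
 * %lld pairs with long long and %ld pairs with long; mixing them up is
 * what -Wformat reports.
 */
static int demo_format_limit(char *buf, size_t len, long long value, long max)
{
	return snprintf(buf, len, "value %lld more than maximum %ld",
			value, max);
}
```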
Fixes: 3f48b137d88e7 ("kselftest: alsa: Factor out check that values meet constraints")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-3-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index d59910658c8c..9ad39db32d14 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -358,7 +358,7 @@ static bool ctl_value_index_valid(struct ctl_data *ctl,
}
if (int64_val > snd_ctl_elem_info_get_max64(ctl->info)) {
- ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
+ ksft_print_msg("%s.%d value %lld more than maximum %ld\n",
ctl->name, index, int64_val,
snd_ctl_elem_info_get_max(ctl->info));
return false;
--
2.43.0
6.1-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 8c51c13dc63d46e754c44215eabc0890a8bd9bfb ]
Minor fix in the number of arguments to error reporting function in the
test program as reported by GCC 13.2.0 warning.
mixer-test.c: In function ‘find_controls’:
mixer-test.c:169:44: warning: too many arguments for format [-Wformat-extra-args]
169 | ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of arguments in call to ksft_exit_fail_msg() doesn't correspond
to the format specifiers, so this is adjusted resembling the sibling calls
to the error function.
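The same point as a small sketch, using a hypothetical demo_format_poll_error() helper that is not part of mixer-test.c: the format string needs one conversion specifier per argument, otherwise the surplus argument is silently dropped from the message.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: one conversion specifier per argument. With only a
 * single %d in the format, passing both card and err would trigger
 * -Wformat-extra-args and the error code would never be printed.
 */
static int demo_format_poll_error(char *buf, size_t len, int card, int err)
{
	return snprintf(buf, len,
			"snd_ctl_poll_descriptors() failed for card %d: %d",
			card, err);
}
```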
Fixes: b1446bda56456 ("kselftest: alsa: Check for event generation when we write to controls")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-2-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 37da902545a4..d59910658c8c 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -205,7 +205,7 @@ static void find_controls(void)
err = snd_ctl_poll_descriptors(card_data->handle,
&card_data->pollfd, 1);
if (err != 1) {
- ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
+ ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for card %d: %d\n",
card, err);
}
--
2.43.0
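For illustration only (not part of the patch): a minimal userspace sketch of the rule the fix restores, namely one conversion specifier per argument. The helper name format_poll_error() is hypothetical, and snprintf() stands in for ksft_exit_fail_msg().

```c
#include <stdio.h>

/*
 * Hypothetical helper mirroring the corrected call: the format string
 * carries two %d specifiers, one for 'card' and one for 'err', so the
 * argument count matches and -Wformat-extra-args stays quiet.
 */
static int format_poll_error(char *buf, size_t size, int card, int err)
{
	return snprintf(buf, size,
			"snd_ctl_poll_descriptors() failed for card %d: %d\n",
			card, err);
}
```

With the original single-%d format, the trailing 'err' argument would be evaluated but never printed.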
6.6-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 3f47c1ebe5ca9c5883e596c7888dec4bec0176d8 ]
The GCC 13.2.0 compiler issued the following warning:
mixer-test.c: In function ‘ctl_value_index_valid’:
mixer-test.c:322:79: warning: format ‘%lld’ expects argument of type ‘long long int’, \
but argument 5 has type ‘long int’ [-Wformat=]
322 | ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
| ~~~^
| |
| long long int
| %ld
323 | ctl->name, index, int64_val,
324 | snd_ctl_elem_info_get_max(ctl->info));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| long int
Fixing the format specifier as advised by the compiler suggestion removes the
warning.
Fixes: 3f48b137d88e7 ("kselftest: alsa: Factor out check that values meet constraints")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-3-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 208c2170c074..df942149c6f6 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -319,7 +319,7 @@ static bool ctl_value_index_valid(struct ctl_data *ctl,
}
if (int64_val > snd_ctl_elem_info_get_max64(ctl->info)) {
- ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
+ ksft_print_msg("%s.%d value %lld more than maximum %ld\n",
ctl->name, index, int64_val,
snd_ctl_elem_info_get_max(ctl->info));
return false;
--
2.43.0
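For illustration only (not part of the patch): a small sketch of why the specifier has to follow the argument type. The helper name format_limit_msg() is hypothetical and snprintf() stands in for ksft_print_msg(). Here 'value' is a long long and takes %lld, while 'max' is a plain long (the type the warning reports for snd_ctl_elem_info_get_max()) and takes %ld.

```c
#include <stdio.h>

/*
 * Hypothetical helper: each specifier matches the type of its argument.
 * Passing a 'long' where %lld is expected is undefined behaviour on
 * ABIs where long and long long differ in size.
 */
static int format_limit_msg(char *buf, size_t size, const char *name,
			    int index, long long value, long max)
{
	return snprintf(buf, size,
			"%s.%d value %lld more than maximum %ld\n",
			name, index, value, max);
}
```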
6.6-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 8c51c13dc63d46e754c44215eabc0890a8bd9bfb ]
Minor fix in the number of arguments to error reporting function in the
test program as reported by GCC 13.2.0 warning.
mixer-test.c: In function ‘find_controls’:
mixer-test.c:169:44: warning: too many arguments for format [-Wformat-extra-args]
169 | ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of arguments in call to ksft_exit_fail_msg() doesn't correspond
to the format specifiers, so this is adjusted resembling the sibling calls
to the error function.
Fixes: b1446bda56456 ("kselftest: alsa: Check for event generation when we write to controls")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-2-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 23df154fcdd7..208c2170c074 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -166,7 +166,7 @@ static void find_controls(void)
err = snd_ctl_poll_descriptors(card_data->handle,
&card_data->pollfd, 1);
if (err != 1) {
- ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
+ ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for card %d: %d\n",
card, err);
}
--
2.43.0
6.7-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 3f47c1ebe5ca9c5883e596c7888dec4bec0176d8 ]
The GCC 13.2.0 compiler issued the following warning:
mixer-test.c: In function ‘ctl_value_index_valid’:
mixer-test.c:322:79: warning: format ‘%lld’ expects argument of type ‘long long int’, \
but argument 5 has type ‘long int’ [-Wformat=]
322 | ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
| ~~~^
| |
| long long int
| %ld
323 | ctl->name, index, int64_val,
324 | snd_ctl_elem_info_get_max(ctl->info));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| long int
Fixing the format specifier as advised by the compiler suggestion removes the
warning.
Fixes: 3f48b137d88e7 ("kselftest: alsa: Factor out check that values meet constraints")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-3-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 208c2170c074..df942149c6f6 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -319,7 +319,7 @@ static bool ctl_value_index_valid(struct ctl_data *ctl,
}
if (int64_val > snd_ctl_elem_info_get_max64(ctl->info)) {
- ksft_print_msg("%s.%d value %lld more than maximum %lld\n",
+ ksft_print_msg("%s.%d value %lld more than maximum %ld\n",
ctl->name, index, int64_val,
snd_ctl_elem_info_get_max(ctl->info));
return false;
--
2.43.0
6.7-stable review patch. If anyone has any objections, please let me know.
------------------
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 8c51c13dc63d46e754c44215eabc0890a8bd9bfb ]
Minor fix in the number of arguments to error reporting function in the
test program as reported by GCC 13.2.0 warning.
mixer-test.c: In function ‘find_controls’:
mixer-test.c:169:44: warning: too many arguments for format [-Wformat-extra-args]
169 | ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of arguments in call to ksft_exit_fail_msg() doesn't correspond
to the format specifiers, so this is adjusted resembling the sibling calls
to the error function.
Fixes: b1446bda56456 ("kselftest: alsa: Check for event generation when we write to controls")
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-sound(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Signed-off-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Acked-by: Mark Brown <broonie(a)kernel.org>
Link: https://lore.kernel.org/r/20240107173704.937824-2-mirsad.todorovac@alu.uniz…
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 23df154fcdd7..208c2170c074 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -166,7 +166,7 @@ static void find_controls(void)
err = snd_ctl_poll_descriptors(card_data->handle,
&card_data->pollfd, 1);
if (err != 1) {
- ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for %d\n",
+ ksft_exit_fail_msg("snd_ctl_poll_descriptors() failed for card %d: %d\n",
card, err);
}
--
2.43.0
Currently the seccomp benchmark selftest produces non-standard output,
meaning that while it makes a number of checks of the performance it
observes this has to be parsed by humans. This means that automated
systems running this suite of tests are almost certainly ignoring the
results which isn't ideal for spotting problems. Let's rework things so
that each check that the program does is reported as a test result to
the framework.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Rebase onto v6.8-rc1.
- Link to v1: https://lore.kernel.org/r/20231219-b4-kselftest-seccomp-benchmark-ktap-v1-0…
---
Mark Brown (2):
kselftest/seccomp: Use kselftest output functions for benchmark
kselftest/seccomp: Report each expectation we assert as a KTAP test
.../testing/selftests/seccomp/seccomp_benchmark.c | 105 +++++++++++++--------
1 file changed, 65 insertions(+), 40 deletions(-)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20231219-b4-kselftest-seccomp-benchmark-ktap-357603823708
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Make sv48 the default address space for mmap as some applications
currently depend on this assumption. Users can now select a
desired address space using a non-zero hint address to mmap. Previously,
requesting the default address space from mmap by passing zero as the hint
address would result in using the largest address space possible. Some
applications depend on empty bits in the virtual address space, like Go and
Java, so this patch provides more flexibility for application developers.
-Charlie
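A rough sketch of how an application might use the hint address, for illustration only. The helper name map_with_hint() is hypothetical and the exact address-space selection rules are those documented by this series, not by this snippet. Passing NULL asks for the default (sv48 with this series); a non-zero hint lets the caller request a different address space.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map 'len' bytes of anonymous memory and pass
 * 'hint' to mmap() as the address hint. Without MAP_FIXED the kernel
 * is free to pick another address, so callers should inspect the
 * returned pointer. Returns MAP_FAILED on error.
 */
static void *map_with_hint(void *hint, size_t len)
{
	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

An application that keeps tag bits in the upper part of a pointer would call map_with_hint(NULL, len) and stay within the default address space; one that wants a larger address space would pass a hint above it.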
---
v10:
- Move pgtable.h definitions into a no __ASSEMBLY__ region to resolve compilation
conflicts (pointed out by Conor)
- Will now compile with allmodconfig
v9:
- Raise the mmap_end default to STACK_TOP_MAX to allow the address space to grow
beyond the default of sv48 on sv57 machines as suggested by Alexandre
- Some of the mmap macros had unnecessary conditionals that I have removed
v8:
- Fix RV32 and the RV32 compat mode of RV64 (suggested by Conor)
- Extract out addr and base from the mmap macros (suggested by Alexandre)
v7:
- Changing RLIMIT_STACK inside of an executing program does not trigger
arch_pick_mmap_layout(), so rewrite tests to change RLIMIT_STACK from a
script before executing tests. RLIMIT_STACK of infinity forces bottomup
mmap allocation.
- Make arch_get_mmap_base macro more readable by extracting out the rnd
calculation.
- Use MMAP_MIN_VA_BITS in TASK_UNMAPPED_BASE to support case when mmap
attempts to allocate address smaller than DEFAULT_MAP_WINDOW.
- Fix incorrect wording in documentation.
v6:
- Rebase onto the correct base
v5:
- Minor wording change in documentation
- Change some parenthesis in arch_get_mmap_ macros
- Added case for addr==0 in arch_get_mmap_ because without this, programs would
crash if RLIMIT_STACK was modified before executing the program. This was
tested using the libhugetlbfs tests.
v4:
- Split testcases/document patch into test cases, in-code documentation, and
formal documentation patches
- Modified the mmap_base macro to be more legible and better represent memory
layout
- Fixed documentation to better reflect the implementation
- Renamed DEFAULT_VA_BITS to MMAP_VA_BITS
- Added additional test case for rlimit changes
---
Charlie Jenkins (4):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Add tests for RISC-V mm
RISC-V: mm: Update pgtable comment documentation
RISC-V: mm: Document mmap changes
Documentation/riscv/vm-layout.rst | 22 +++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 33 ++++++++--
arch/riscv/include/asm/processor.h | 52 +++++++++++++--
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/.gitignore | 2 +
tools/testing/selftests/riscv/mm/Makefile | 15 +++++
.../riscv/mm/testcases/mmap_bottomup.c | 35 ++++++++++
.../riscv/mm/testcases/mmap_default.c | 35 ++++++++++
.../selftests/riscv/mm/testcases/mmap_test.h | 64 +++++++++++++++++++
.../selftests/riscv/mm/testcases/run_mmap.sh | 12 ++++
11 files changed, 261 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_default.c
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_test.h
create mode 100755 tools/testing/selftests/riscv/mm/testcases/run_mmap.sh
--
2.34.1
Changes in v6:
- Rebased on top of 70d201a40823 (thanks Alexander Gordeev!)
- Resolved a conflict because of 43e8832fed08 being reverted
- Resolved a missing static declaration for lp_sys_getpid, since
-Wmissing-prototypes warning was enabled.
- Retested everything: running the livepatch selftests from the kernel
source, running from a directory where the tests were installed (Joe's
use case), and running from a gen_tar'ed directory. All of them
executed correctly.
- Added Petr review tags (Thanks!)
- Link to v5: https://lore.kernel.org/r/20240109-send-lp-kselftests-v5-0-364d59a69f12@sus…
Changes in v5:
* Fixed an issue found by Joe that copied Kbuild files along with the
test modules to the installation directory.
* Added Joe Lawrence's review tags.
Changes in v4:
* Documented how to compile the livepatch selftests without running the
tests (Joe)
* Removed the mention to lib/livepatch on MAINTAINERS file, reported by
checkpatch.
Changes in v3:
* Rebased on top of v6.6-rc5
* The commits messages were improved (Thanks Petr!)
* Created TEST_GEN_MODS_DIR variable to point to a directory that contains kernel
modules, and adapted selftests to build it before running the test.
* Moved test_klp-call_getpid out of test_programs, since the gen_tar
would just copy the generated test programs to the livepatches dir,
and so scripts relying on test_programs/test_klp-call_getpid will fail.
* Added a module_param for klp_pids, describing its usage.
* Simplified the call_getpid program to ignore the return of getpid syscall,
since we only want to make sure the process transitions correctly to the
patched state
* The test-syscall.sh now prints a log message showing the number of remaining
processes to transition into the livepatched state, and check_output expects it
to be 0.
* Added MODULE_AUTHOR and MODULE_DESCRIPTION to test_klp_syscall.c
- Link to v3: https://lore.kernel.org/r/20231031-send-lp-kselftests-v3-0-2b1655c2605f@sus…
- Link to v2: https://lore.kernel.org/linux-kselftest/20220630141226.2802-1-mpdesouza@sus…
This patchset moves the current kernel testing livepatch modules from
lib/livepatch to tools/testing/selftests/livepatch/test_modules, and compiles
them as out-of-tree modules before testing.
There is also a new test being added. This new test exercises multiple processes
calling a syscall while a livepatch patches the syscall.
Why this move is an improvement:
* The modules are now compiled as out-of-tree modules against the current
running kernel, making them capable of being tested on different systems with
newer or older kernels.
* This approach now needs the kernel-devel package to be installed, since the
modules are built out-of-tree. The package can be generated by running
"make rpm-pkg" in the kernel source.
What needs to be solved:
* Currently gen_tar only packages the resulting binaries of the tests, and not
the sources. For the current approach, the newly added modules would be
compiled and then packaged. It works when testing on a system with the same
kernel version. But it will fail when running on a machine with different kernel
version, since module was compiled against the kernel currently running.
This is not a new problem, just aligning the expectations. For the current
approach to be truly system agnostic gen_tar would need to include the module
and program sources to be compiled in the target systems.
Thanks in advance!
Marcos
Signed-off-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
---
Marcos Paulo de Souza (3):
kselftests: lib.mk: Add TEST_GEN_MODS_DIR variable
livepatch: Move tests from lib/livepatch to selftests/livepatch
selftests: livepatch: Test livepatching a heavily called syscall
Documentation/dev-tools/kselftest.rst | 4 +
MAINTAINERS | 1 -
arch/s390/configs/debug_defconfig | 1 -
arch/s390/configs/defconfig | 1 -
lib/Kconfig.debug | 22 ----
lib/Makefile | 2 -
lib/livepatch/Makefile | 14 ---
tools/testing/selftests/lib.mk | 26 ++++-
tools/testing/selftests/livepatch/Makefile | 5 +-
tools/testing/selftests/livepatch/README | 25 +++--
tools/testing/selftests/livepatch/config | 1 -
tools/testing/selftests/livepatch/functions.sh | 34 +++---
.../testing/selftests/livepatch/test-callbacks.sh | 50 ++++-----
tools/testing/selftests/livepatch/test-ftrace.sh | 6 +-
.../testing/selftests/livepatch/test-livepatch.sh | 10 +-
.../selftests/livepatch/test-shadow-vars.sh | 2 +-
tools/testing/selftests/livepatch/test-state.sh | 18 ++--
tools/testing/selftests/livepatch/test-syscall.sh | 53 ++++++++++
tools/testing/selftests/livepatch/test-sysfs.sh | 6 +-
.../selftests/livepatch/test_klp-call_getpid.c | 44 ++++++++
.../selftests/livepatch/test_modules/Makefile | 20 ++++
.../test_modules}/test_klp_atomic_replace.c | 0
.../test_modules}/test_klp_callbacks_busy.c | 0
.../test_modules}/test_klp_callbacks_demo.c | 0
.../test_modules}/test_klp_callbacks_demo2.c | 0
.../test_modules}/test_klp_callbacks_mod.c | 0
.../livepatch/test_modules}/test_klp_livepatch.c | 0
.../livepatch/test_modules}/test_klp_shadow_vars.c | 0
.../livepatch/test_modules}/test_klp_state.c | 0
.../livepatch/test_modules}/test_klp_state2.c | 0
.../livepatch/test_modules}/test_klp_state3.c | 0
.../livepatch/test_modules/test_klp_syscall.c | 116 +++++++++++++++++++++
32 files changed, 340 insertions(+), 121 deletions(-)
---
base-commit: 70d201a40823acba23899342d62bc2644051ad2e
change-id: 20231031-send-lp-kselftests-4c917dcd4565
Best regards,
--
Marcos Paulo de Souza <mpdesouza(a)suse.com>
On systems with 64k page size and 512M huge page sizes, the allocation
and test succeed but error out at the munmap. As the comment states,
munmap will fail if it's not HUGEPAGE aligned. This is due to the
length of the mapping being 1/2 the size of the hugepage causing the
munmap to not be hugepage aligned. Fix this by making the mapping length
the full hugepage if the hugepage is larger than the length of the
mapping.
Signed-off-by: Nico Pache <npache(a)redhat.com>
---
tools/testing/selftests/mm/map_hugetlb.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/tools/testing/selftests/mm/map_hugetlb.c b/tools/testing/selftests/mm/map_hugetlb.c
index 193281560b61..86e8f2048a40 100644
--- a/tools/testing/selftests/mm/map_hugetlb.c
+++ b/tools/testing/selftests/mm/map_hugetlb.c
@@ -15,6 +15,7 @@
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
+#include "vm_util.h"
#define LENGTH (256UL*1024*1024)
#define PROTECTION (PROT_READ | PROT_WRITE)
@@ -58,10 +59,16 @@ int main(int argc, char **argv)
{
void *addr;
int ret;
+ size_t hugepage_size;
size_t length = LENGTH;
int flags = FLAGS;
int shift = 0;
+ hugepage_size = default_huge_page_size();
+ /* munmap with fail if the length is not page aligned */
+ if (hugepage_size > length)
+ length = hugepage_size;
+
if (argc > 1)
length = atol(argv[1]) << 20;
if (argc > 2) {
--
2.43.0
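For illustration only (not part of the patch): the length adjustment above, pulled out into a standalone function. The name hugetlb_map_length() is hypothetical; in the selftest the huge page size comes from default_huge_page_size() in vm_util.h, whereas here it is simply a parameter.

```c
#include <stddef.h>

/*
 * Hypothetical helper: never map less than one full huge page, so the
 * later munmap() length is a multiple of the huge page size when the
 * huge page is larger than the requested length.
 */
static size_t hugetlb_map_length(size_t length, size_t hugepage_size)
{
	return hugepage_size > length ? hugepage_size : length;
}
```

With the 256M default LENGTH and 512M huge pages this yields 512M; with 2M huge pages the length stays at 256M.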
ksm_tests was previously mmapping a region of memory, aligning the
returned pointer to a PMD boundary, then setting MADV_HUGEPAGE, but was
setting it past the end of the mmapped area due to not taking the
pointer alignment into consideration. Fix this behaviour.
Up until commit efa7df3e3bb5 ("mm: align larger anonymous mappings on
THP boundaries"), this buggy behavior was (usually) masked because the
alignment difference was always less than PMD-size. But since the
mentioned commit, `ksm_tests -H -s 100` started failing.
Fixes: 325254899684 ("selftests: vm: add KSM huge pages merging time test")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
---
Applies on top of mm-unstable.
Thanks,
Ryan
tools/testing/selftests/mm/ksm_tests.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/ksm_tests.c b/tools/testing/selftests/mm/ksm_tests.c
index 380b691d3eb9..b748c48908d9 100644
--- a/tools/testing/selftests/mm/ksm_tests.c
+++ b/tools/testing/selftests/mm/ksm_tests.c
@@ -566,7 +566,7 @@ static int ksm_merge_hugepages_time(int merge_type, int mapping, int prot,
if (map_ptr_orig == MAP_FAILED)
err(2, "initial mmap");
- if (madvise(map_ptr, len + HPAGE_SIZE, MADV_HUGEPAGE))
+ if (madvise(map_ptr, len, MADV_HUGEPAGE))
err(2, "MADV_HUGEPAGE");
pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
--
2.25.1
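For illustration only (not part of the patch): a sketch of the bounds arithmetic behind the fix. The helper names align_up() and range_within() are hypothetical. The test maps len + HPAGE_SIZE bytes and then aligns the pointer upwards, so only 'len' bytes are guaranteed to remain after the aligned pointer; advising len + HPAGE_SIZE from there can run past the end of the mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Round 'ptr' up to the next multiple of 'align' (a power of two). */
static void *align_up(void *ptr, size_t align)
{
	uintptr_t p = (uintptr_t)ptr;

	return (void *)((p + align - 1) & ~((uintptr_t)align - 1));
}

/*
 * Return 1 if [start, start + len) lies inside [base, base + mapped),
 * 0 otherwise. Written to avoid overflow in the additions.
 */
static int range_within(const void *base, size_t mapped,
			const void *start, size_t len)
{
	uintptr_t b = (uintptr_t)base;
	uintptr_t s = (uintptr_t)start;

	return s >= b && len <= mapped && s - b <= mapped - len;
}
```

For an unaligned base, range_within(base, len + hpage, align_up(base, hpage), len) holds, while the same check with len + hpage as the advised length does not.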
Calling get_system_loc_code before checking devfd and errno fails the test
when the device is not available, where a SKIP is expected.
Change the order of 'SKIP_IF_MSG' to correctly SKIP when the /dev/papr-vpd device
is not available.
Without patch: Test FAILED on line 271
With patch: [SKIP] Test skipped on line 266: /dev/papr-vpd not present
Signed-off-by: R Nageswara Sastry <rnsastry(a)linux.ibm.com>
---
tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
index 98cbb9109ee6..505294da1b9f 100644
--- a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
+++ b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
@@ -263,10 +263,10 @@ static int papr_vpd_system_loc_code(void)
off_t size;
int fd;
- SKIP_IF_MSG(get_system_loc_code(&lc),
- "Cannot determine system location code");
SKIP_IF_MSG(devfd < 0 && errno == ENOENT,
DEVPATH " not present");
+ SKIP_IF_MSG(get_system_loc_code(&lc),
+ "Cannot determine system location code");
FAIL_IF(devfd < 0);
--
2.37.1 (Apple Git-137.1)
By allowing the filter_glob parameter to be written to, it's possible to
tweak the testsuites that will be executed on new module loads. This
makes it easier to run specific tests without having to reload kunit and
provides a way to filter tests on real HW even if kunit is builtin.
Example for xe driver:
1) Run just 1 test
# echo -n xe_bo > /sys/module/kunit/parameters/filter_glob
# modprobe -r xe_live_test
# modprobe xe_live_test
# ls /sys/kernel/debug/kunit/
xe_bo
2) Run all tests
# echo \* > /sys/module/kunit/parameters/filter_glob
# modprobe -r xe_live_test
# modprobe xe_live_test
# ls /sys/kernel/debug/kunit/
xe_bo xe_dma_buf xe_migrate xe_mocs
References: https://lore.kernel.org/intel-xe/dzacvbdditbneiu3e3fmstjmttcbne44yspumpkd6s…
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
lib/kunit/executor.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/kunit/executor.c b/lib/kunit/executor.c
index 1236b3cd2fbb..30ed9d321c19 100644
--- a/lib/kunit/executor.c
+++ b/lib/kunit/executor.c
@@ -31,7 +31,7 @@ static char *filter_glob_param;
static char *filter_param;
static char *filter_action_param;
-module_param_named(filter_glob, filter_glob_param, charp, 0400);
+module_param_named(filter_glob, filter_glob_param, charp, 0600);
MODULE_PARM_DESC(filter_glob,
"Filter which KUnit test suites/tests run at boot-time, e.g. list* or list*.*del_test");
module_param_named(filter, filter_param, charp, 0400);
--
2.40.1
If there are more than 32 CPUs the bitmask will start to contain
commas, leading to:
./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected
Remove the commas, bash doesn't interpret leading zeroes as octal
so that should be good enough.
Fixes: c12e0d5f267d ("self-tests: introduce self-tests for RPS default mask")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
- remove all commas
v1: https://lore.kernel.org/all/20240119151248.3476897-1-kuba@kernel.org/
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/rps_default_mask.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/rps_default_mask.sh b/tools/testing/selftests/net/rps_default_mask.sh
index a26c5624429f..4729e7026a73 100755
--- a/tools/testing/selftests/net/rps_default_mask.sh
+++ b/tools/testing/selftests/net/rps_default_mask.sh
@@ -33,6 +33,10 @@ chk_rps() {
rps_mask=$($cmd /sys/class/net/$dev_name/queues/rx-0/rps_cpus)
printf "%-60s" "$msg"
+
+ # In case there is more than 32 CPUs we need to remove commas from masks
+ rps_mask=${rps_mask//,}
+ expected_rps_mask=${expected_rps_mask//,}
if [ $rps_mask -eq $expected_rps_mask ]; then
echo "[ ok ]"
else
--
2.43.0
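For illustration only (not part of the patch, which is a shell fix): the same comma stripping expressed in C, for readers who parse these masks outside a shell script. The helper names strip_commas() and parse_cpu_mask() are hypothetical. The sketch assumes the mask is hexadecimal in comma-separated 32-bit groups and that it fits in 64 bits, so it is not suitable for systems with more than 64 CPUs.

```c
#include <stdlib.h>

/* Remove every ',' from 's' in place. */
static void strip_commas(char *s)
{
	char *dst = s;

	for (; *s; s++)
		if (*s != ',')
			*dst++ = *s;
	*dst = '\0';
}

/*
 * Hypothetical helper: strip the group separators, then parse the
 * remaining digits as one hexadecimal number. Modifies 'mask'.
 */
static unsigned long long parse_cpu_mask(char *mask)
{
	strip_commas(mask);
	return strtoull(mask, NULL, 16);
}
```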
From: Jeff Xu <jeffxu(a)chromium.org>
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
The new mseal() is an architecture-independent syscall with the
following signature:
mseal(void *addr, size_t len, unsigned long types, unsigned long flags)
addr/len: memory range. Must be contiguous, allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to documentation patch (mseal.rst) of this
patch set. Those are also fully covered by the selftest.
types: bit mask to specify the sealing types.
MM_SEAL_BASE
MM_SEAL_PROT_PKEY
MM_SEAL_DISCARD_RO_ANON
MM_SEAL_SEAL
The MM_SEAL_BASE:
The base package includes the features common to all VMA sealing
types. It blocks the following operations on sealed VMAs:
1> Unmapping, moving to another location, and shrinking the size, via
munmap() and mremap(). These can leave an empty space that could then be
filled by a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location, via mremap().
3> Modifying a sealed VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(). This does not appear to pose any specific
risks to sealed VMAs; it is blocked anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed
VMA.
We consider MM_SEAL_BASE the feature on which other sealing features
depend. For instance, it probably does not make sense to seal
PROT_PKEY without sealing the BASE, and the kernel will implicitly add
SEAL_BASE for SEAL_PROT_PKEY.
The MM_SEAL_PROT_PKEY:
Seal PROT and PKEY of the address range, i.e. mprotect() and
pkey_mprotect() will be denied if the memory is sealed with
MM_SEAL_PROT_PKEY.
The MM_SEAL_DISCARD_RO_ANON
Certain types of madvise() operations are destructive [6], such as
MADV_DONTNEED, which can effectively alter region contents by
discarding pages, especially when memory is anonymous. This blocks
such operations for anonymous memory which is not writable to the
user.
The MM_SEAL_SEAL:
MM_SEAL_SEAL denies adding a new seal for a VMA.
This is similar to F_SEAL_SEAL in fcntl.
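For illustration only: a toy userspace model of how the seal types described above combine. It is not the kernel implementation. The bit values, the helper name add_seals(), and the -1 error return are all hypothetical; only the two rules it encodes come from the text: MM_SEAL_PROT_PKEY implicitly adds MM_SEAL_BASE, and MM_SEAL_SEAL denies adding further seals.

```c
/* Hypothetical bit values, chosen only for this sketch. */
#define MM_SEAL_BASE		(1UL << 0)
#define MM_SEAL_PROT_PKEY	(1UL << 1)
#define MM_SEAL_DISCARD_RO_ANON	(1UL << 2)
#define MM_SEAL_SEAL		(1UL << 3)

/*
 * Toy model: compute the seal set of a VMA after a request.
 * 'existing' is the current seal set, 'requested' the new types.
 * Returns 0 and stores the new set in '*result', or -1 if the VMA
 * already carries MM_SEAL_SEAL and so accepts no new seals.
 */
static int add_seals(unsigned long existing, unsigned long requested,
		     unsigned long *result)
{
	if (existing & MM_SEAL_SEAL)
		return -1;

	/* Sealing PROT/PKEY only makes sense on top of the base seal. */
	if (requested & MM_SEAL_PROT_PKEY)
		requested |= MM_SEAL_BASE;

	*result = existing | requested;
	return 0;
}
```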
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
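For illustration only: a small userspace demonstration of why madvise(MADV_DONTNEED) counts as destructive on private anonymous memory. The helper name first_byte_after_discard() is hypothetical. It fills a mapping with a non-zero pattern, discards it, and reads the first byte back; on Linux the discarded pages are replaced by zero-filled pages on the next access.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: returns the value of the first byte of a
 * private anonymous mapping after its contents were written and then
 * discarded with MADV_DONTNEED, or -1 if a system call failed.
 */
static int first_byte_after_discard(size_t len)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int byte;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0x42, len);		/* stand-in for code or data */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	byte = p[0];			/* refaulted as a zero page */
	munmap(p, len);
	return byte;
}
```

The 0x42 pattern does not survive the discard, which is the effect the write-permission check is meant to gate.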
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
--------------------------------------------------------------------
Change history:
===============
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap(). (For details, see the new discussion topics)
- Add documentation - mseal.rst
Work in progress:
=================
- update man page for mseal() and mmap()
Open discussions:
=================
Several open discussions from V1/V2 were not incorporated into V3. I
believe these are worth more discussion for future versions of this
patch series.
1> mseal() vs mimmutable()
mseal() (bitmask for multiple seal types):
- Seal types: MM_SEAL_BASE, MM_SEAL_PROT_PKEY, MM_SEAL_DISCARD_RO_ANON.
- Applying PROT_PKEY implies BASE; the same holds for DISCARD_RO_ANON.
mimmutable() (OpenBSD):
- Equivalent to SEAL_BASE + SEAL_PROT_PKEY in mseal().
- Additionally allows a downgrade from W=>NW.
- Has no equivalent of MM_SEAL_DISCARD_RO_ANON.
mimmutable() is designed for memory sealing in libc, and mseal()
is designed for both Chrome browser and libc.
For the two memory ranges that the Chrome browser wants to seal, as
discussed previously, the allocator still needs to free or discard
memory within the sealed ranges. For performance reasons, we have
explored two solutions in the past: first, using PKEY-based munmap()
[7], and second, separating SEAL_MPROTECT (v1 of this patch set).
Recently, we have experimented with an alternative approach that
allows us to remove the separation of SEAL_MPROTECT. For those two
memory ranges, Chrome browser will use BASE + PROT_PKEY +
DISCARD_RO_ANON for all its sealing needs.
In the case of libc, the .text segment can be sealed with the BASE and
PROT_PKEY, and the RO data segments can be sealed with the BASE +
PROT_PKEY + DISCARD_RO_ANON.
From a flexibility standpoint, separating BASE out could be beneficial
for future extensions of sealing features. For instance, applications
might desire downgradable "prot" permissions (X=>NX, W=>NW, R=>NR),
which would conflict with SEAL_PROT_PKEY.
The more sealing features integrated into a single sealing type, the
fewer applications can utilize these features. For example, some
applications might programmatically require DISCARD_RO_ANON memory,
which would render such VMA unsuitable for sealing.
I'd like to get the community's input on this. From Chrome's
perspective, the separation isn't as important anymore, at least in
the short term. However, I prefer the multiple bits approach because
it's more extensible in the long term.
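The implication rule for the multiple-bits approach (PROT_PKEY and
DISCARD_RO_ANON each imply BASE) can be sketched as follows. The bit
positions and the helper name are assumptions made for illustration
only; they are not taken from the patch set.

```python
from enum import IntFlag


class Seal(IntFlag):
    # Bit positions are illustrative assumptions, not the patch's values.
    BASE = 1 << 0
    PROT_PKEY = 1 << 1
    DISCARD_RO_ANON = 1 << 2


def effective_seals(requested: Seal) -> Seal:
    """Expand a requested mask: any functional seal also implies BASE."""
    if requested & (Seal.PROT_PKEY | Seal.DISCARD_RO_ANON):
        requested |= Seal.BASE
    return requested


# What mimmutable() corresponds to in this model (no DISCARD_RO_ANON bit).
MIMMUTABLE_EQUIVALENT = Seal.BASE | Seal.PROT_PKEY
```

With separate bits, a caller that only needs the mimmutable()-like
behaviour passes PROT_PKEY and gets BASE implicitly, while Chrome's
ranges would add DISCARD_RO_ANON on top.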
2> mseal() vs mprotect() vs madvise() for setting the seal.
mprotect():
Uses the prot field, but prot supports unsetting bits. It is workable,
i.e. applications could carry the sealing type and set it in all
subsequent calls to mprotect(), but this feels like an extra thing to
care about.
madvise():
Uses an enum, so multiple sealing types might require multiple round
trips.
IMO: sealing is a major departure from other memory syscalls because
it takes away capabilities. The other memory APIs add behaviors or
change attributes, but sealing does the opposite: it reduces
capabilities. The name of the syscall, mseal(), can help emphasize the
"taking away" part.
My second choice would be madvise().
3> Other:
There is also a topic about ptrace, /proc/self/mem, and userfaultfd,
which I think can be followed up in the v1 thread, where it has the
most context.
New discussion topics:
======================
During the development of V3, I came up with new questions and thoughts
that I would like to discuss.
1> shm/aio
From reading the code, it seems to me that aio/shm can mmap/munmap
mappings on behalf of userspace, e.g. ksys_shmdt() in shm.c. The
lifetime of those mappings is not tied to the lifetime of the process.
If those mappings are sealed from userspace, then unmap will fail. This
isn’t a huge problem, since the memory will eventually be freed at exit
or exec. However, it feels like the solution is not complete, because
of the leaks in VMA address space during the lifetime of the process.
There could be two solutions to address this, which I will discuss
later.
2> Brk (heap/stack)
Currently, userspace applications can seal parts of the heap by
calling malloc() and mseal(). This raises the question of what the
expected behavior is when sealing the heap is attempted.
Let's assume the following calls from user space:
ptr = malloc(size);
mprotect(ptr, size, RO);
mseal(ptr, size, SEAL_PROT_PKEY);
free(ptr);
Technically, before mseal() is added, the user can change the
protection of the heap by calling mprotect(RO). As long as the user
changes the protection back to RW before free(), the memory can be
reused.
With mseal() in the picture, however, the heap is then partially
sealed. The user can still free it, but the memory remains RO, and the
result of a brk shrink is nondeterministic, depending on whether
munmap() tries to free the sealed memory (brk uses munmap() to shrink
the heap).
3> The two cases above lead to a third topic:
There are two options to address the problem mentioned above.
Option 1: A “MAP_SEALABLE” flag in mmap().
If a map is created without this flag, the mseal() operation will
fail. Applications that are not concerned with sealing will expect
their behavior to be unchanged. For those that are concerned, adding a
flag at mmap time to opt in is not difficult. For the short term, this
solves problems 1 and 2 above. The memory in shm/aio/brk will not have
the MAP_SEALABLE flag at mmap(), and the same is true for the heap.
Option 2: Add MM_SEAL_SEAL during mmap()
It is possible to defensively set MM_SEAL_SEAL for the selected mappings at
creation time. Specifically, we can find all the mmaps that we do not want to
seal, and add the MM_SEAL_SEAL flag in the mmap() call. The difference
between MAP_SEALABLE and MM_SEAL_SEAL is that the first option starts from a
small set of sealable mappings and grows it incrementally, while the second
option starts with everything sealable and excludes mappings one by one.
In my opinion, MAP_SEALABLE is the preferred option. Only a limited set of
mappings need to be sealed, and these are typically created by the runtime. For
the few dedicated applications that manage their own mappings, such as Chrome,
adding an extra flag at mmap() is not a difficult task. It is also a safer
option in terms of regression risk. This is the option included in this
version.
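The opt-in behaviour of MAP_SEALABLE can be modelled in a few lines.
This is a toy model under stated assumptions, not the kernel logic: the
class and function names are hypothetical, and failures are reduced to
a boolean because the errno values are not discussed here.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    sealable: bool = False  # models "created with MAP_SEALABLE"
    sealed: bool = False


def mseal(mapping: Mapping) -> bool:
    """Sealing succeeds only for mappings that opted in at mmap() time."""
    if not mapping.sealable:
        return False  # the real call would fail; the errno is not modelled
    mapping.sealed = True
    return True


def munmap(mapping: Mapping) -> bool:
    """Unmapping is refused once the mapping is sealed."""
    return not mapping.sealed
```

In this model a heap or shm/aio mapping, created without the flag, can
never be sealed and therefore can always be unmapped by brk or
ksys_shmdt(), which is how the option sidesteps problems 1 and 2.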
4>
I think it might be possible to seal the stack or other special
mappings created at runtime (vdso, vsyscall, vvar). This means we can
enforce and seal W^X for certain types of application. For instance,
the stack is typically used in read-write mode, but in some cases, it
can become executable. To defend against unintended addition of the
executable bit to the stack, we could let the application seal it.
Sealing the heap (for adding X) requires special handling, since the
heap can shrink, and shrink is implemented through munmap().
Indeed, it might be possible that all virtual memory accessible to user
space, regardless of its usage pattern, could be sealed. However, this
would require additional research and development work.
------------------------------------------------------------------------
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (11):
mseal: Add mseal syscall.
mseal: Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: add MM_SEAL_BASE
mseal: add MM_SEAL_PROT_PKEY
mseal: add sealing support for mmap
mseal: make sealed VMA mergeable.
mseal: add MM_SEAL_DISCARD_RO_ANON
mseal: add MAP_SEALABLE to mmap()
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/mseal.rst | 189 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/mips/kernel/vdso.c | 10 +-
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/userfaultfd.c | 8 +-
include/linux/mm.h | 178 +-
include/linux/mm_types.h | 8 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/mman-common.h | 16 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 5 +
kernel/sys_ni.c | 1 +
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/madvise.c | 14 +-
mm/mempolicy.c | 2 +-
mm/mlock.c | 2 +-
mm/mmap.c | 77 +-
mm/mprotect.c | 12 +-
mm/mremap.c | 44 +-
mm/mseal.c | 376 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++
41 files changed, 3091 insertions(+), 32 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.472.g3155946c3a-goog
Hi,
Compilation of lsm_cgroup.c will fail if the vmlinux.h comes from a
kernel that does _not_ have CONFIG_PACKET=y. The reason is that the
definition of struct sockaddr_ll is not present in vmlinux.h and the
compiler will complain that it has an incomplete type.
CLNG-BPF [test_maps] lsm_cgroup.bpf.o
progs/lsm_cgroup.c:105:21: error: variable has incomplete type 'struct sockaddr_ll'
105 | struct sockaddr_ll sa = {};
| ^
progs/lsm_cgroup.c:105:9: note: forward declaration of 'struct sockaddr_ll'
105 | struct sockaddr_ll sa = {};
| ^
1 error generated.
While including linux/if_packet.h somehow made the compilation work for
me, IIUC this isn't a proper solution because vmlinux.h and kernel
headers should not be used at the same time (and would lead to
redefinition error when the kernel is built with CONFIG_PACKET=y, e.g.
on BPF CI).
What would be the suggested way to work around this?
Thanks,
Shung-Hsi
---
tools/testing/selftests/bpf/progs/lsm_cgroup.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/bpf/progs/lsm_cgroup.c b/tools/testing/selftests/bpf/progs/lsm_cgroup.c
index 02c11d16b692..5394ec7ae1d8 100644
--- a/tools/testing/selftests/bpf/progs/lsm_cgroup.c
+++ b/tools/testing/selftests/bpf/progs/lsm_cgroup.c
@@ -2,6 +2,7 @@
#include "vmlinux.h"
#include "bpf_tracing_net.h"
+#include <linux/if_packet.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
If there are more than 32 CPUs the bitmask will start to contain
commas, leading to:
./rps_default_mask.sh: line 36: [: 00000000,00000000: integer expression expected
Remove the commas; bash doesn't interpret leading zeroes as octal,
so that should be good enough.
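For illustration, the mask format can be handled like this. It is a
sketch with a hypothetical helper name; it relies on the sysfs mask
being a list of comma-separated 32-bit hexadecimal groups, most
significant first, and parses it as hex, which is stricter than the
integer comparison the script performs.

```python
def parse_cpu_mask(text: str) -> int:
    """Parse a sysfs-style CPU mask such as '00000000,00000001' into an int."""
    # str.replace() drops every comma, so masks with three or more
    # groups (more than 64 CPUs) are handled as well.
    return int(text.strip().replace(",", ""), 16)
```

Removing all commas, not only the first, matters once the mask grows
beyond two groups.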
Fixes: c12e0d5f267d ("self-tests: introduce self-tests for RPS default mask")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: horms(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/rps_default_mask.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/rps_default_mask.sh b/tools/testing/selftests/net/rps_default_mask.sh
index a26c5624429f..f8e786e220b6 100755
--- a/tools/testing/selftests/net/rps_default_mask.sh
+++ b/tools/testing/selftests/net/rps_default_mask.sh
@@ -33,6 +33,10 @@ chk_rps() {
rps_mask=$($cmd /sys/class/net/$dev_name/queues/rx-0/rps_cpus)
printf "%-60s" "$msg"
+
+ # In case there are more than 32 CPUs we need to remove commas from masks
+ rps_mask=${rps_mask//,}
+ expected_rps_mask=${expected_rps_mask//,}
if [ $rps_mask -eq $expected_rps_mask ]; then
echo "[ ok ]"
else
--
2.43.0
From: Benjamin Poirier <benjamin.poirier(a)gmail.com>
Two small fixes for net selftests.
These patches were carved out of the following RFC series:
https://lore.kernel.org/netdev/20231222135836.992841-1-bpoirier@nvidia.com/
I'm planning to send the rest of the series to net-next after it opens up.
Benjamin Poirier (2):
selftests: bonding: Change script interpreter
selftests: forwarding: Remove executable bits from lib.sh
.../selftests/drivers/net/bonding/mode-1-recovery-updelay.sh | 2 +-
.../selftests/drivers/net/bonding/mode-2-recovery-updelay.sh | 2 +-
tools/testing/selftests/net/forwarding/lib.sh | 0
3 files changed, 2 insertions(+), 2 deletions(-)
mode change 100755 => 100644 tools/testing/selftests/net/forwarding/lib.sh
--
2.43.0
On systems with 64k page size and 512M huge page sizes, the allocation
and test succeed but error out at the munmap. As the comment states,
munmap will fail if it's not hugepage aligned. This is due to the
length of the mapping being 1/2 the size of the hugepage causing the
munmap to not be hugepage aligned. Fix this by making the mapping length
the full hugepage if the hugepage is larger than the length of the
mapping.
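The length adjustment amounts to taking the larger of the two sizes. A
small sketch, with a hypothetical helper name, and assuming a default
test length of 256 MiB (inferred from the "1/2 the size" of a 512M
hugepage stated above):

```python
MiB = 1024 * 1024


def mapping_length(requested: int, hugepage_size: int) -> int:
    """Length to map so that munmap() later sees at least one full hugepage."""
    # Mirrors the fix: only grow the length, never round arbitrary sizes.
    return max(requested, hugepage_size)
```

With 2 MiB hugepages the 256 MiB request is left unchanged; with 512 MiB
hugepages it grows to 512 MiB. This is not a general round-up to a
hugepage multiple.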
Signed-off-by: Nico Pache <npache(a)redhat.com>
---
tools/testing/selftests/mm/map_hugetlb.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/mm/map_hugetlb.c b/tools/testing/selftests/mm/map_hugetlb.c
index 193281560b61..dcb8095fcd45 100644
--- a/tools/testing/selftests/mm/map_hugetlb.c
+++ b/tools/testing/selftests/mm/map_hugetlb.c
@@ -58,10 +58,16 @@ int main(int argc, char **argv)
{
void *addr;
int ret;
+ size_t maplength;
size_t length = LENGTH;
int flags = FLAGS;
int shift = 0;
+ maplength = default_huge_page_size();
+ /* munmap will fail if the length is not hugepage aligned */
+ if (maplength > length)
+ length = maplength;
+
if (argc > 1)
length = atol(argv[1]) << 20;
if (argc > 2) {
--
2.43.0
As is done for the QEMU command, print the command used to run tests with UML.
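The printed line is meant to be pasted back into a shell, which is why
each parameter is passed through shlex.quote(). A standalone sketch of
that formatting follows; the function name and the sample arguments are
made up for illustration.

```python
import shlex


def format_command(binary, params):
    """Return a shell-pasteable command line; only the parameters are quoted."""
    return " ".join([binary] + [shlex.quote(p) for p in params])
```

Parameters made only of shell-safe characters, such as mem=1G, are
left untouched, while anything containing whitespace or shell
metacharacters is wrapped in single quotes.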
Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
Cc: David Gow <davidgow(a)google.com>
Signed-off-by: Mickaël Salaün <mic(a)digikod.net>
---
tools/testing/kunit/kunit_kernel.py | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index 0b6488efed47..7254c110ff23 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -146,6 +146,7 @@ class LinuxSourceTreeOperationsUml(LinuxSourceTreeOperations):
"""Runs the Linux UML binary. Must be named 'linux'."""
linux_bin = os.path.join(build_dir, 'linux')
params.extend(['mem=1G', 'console=tty', 'kunit_shutdown=halt'])
+ print('Running tests with:\n$', linux_bin, ' '.join(shlex.quote(arg) for arg in params))
return subprocess.Popen([linux_bin] + params,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
--
2.43.0
When tests are run by runner.sh, bond_options.sh gets killed before
it can complete:
make -C tools/testing/selftests run_tests TARGETS="drivers/net/bonding"
[...]
# timeout set to 120
# selftests: drivers/net/bonding: bond_options.sh
# TEST: prio (active-backup miimon primary_reselect 0) [ OK ]
# TEST: prio (active-backup miimon primary_reselect 1) [ OK ]
# TEST: prio (active-backup miimon primary_reselect 2) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 0) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 1) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 2) [ OK ]
#
not ok 7 selftests: drivers/net/bonding: bond_options.sh # TIMEOUT 120 seconds
This test includes many sleep statements, at least some of which are
related to timers in the operation of the bonding driver itself. Increase
the test timeout to allow the test to complete.
I ran the test in slightly different VMs (including one without HW
virtualization support) and got runtimes of 13m39.760s, 13m31.238s, and
13m2.956s. Use a ~1.5x "safety factor" and set the timeout to 1200s.
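The arithmetic behind the new value can be checked quickly. This is
only a worked calculation on the three runtimes quoted above; the
helper name is made up.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style duration such as '13m39.760s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


runs = ["13m39.760s", "13m31.238s", "13m2.956s"]
slowest = max(to_seconds(r) for r in runs)  # 819.76 seconds
scaled = slowest * 1.5                      # about 1229.64 seconds
```

The slowest run took about 820s, so 1200s corresponds to a factor of
roughly 1.46, which is the "~1.5x" rounded to a convenient number.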
Fixes: 42a8d4aaea84 ("selftests: bonding: add bonding prio option test")
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Closes: https://lore.kernel.org/netdev/20240116104402.1203850a@kernel.org/#t
Suggested-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Benjamin Poirier <bpoirier(a)nvidia.com>
---
tools/testing/selftests/drivers/net/bonding/settings | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/bonding/settings b/tools/testing/selftests/drivers/net/bonding/settings
index 6091b45d226b..79b65bdf05db 100644
--- a/tools/testing/selftests/drivers/net/bonding/settings
+++ b/tools/testing/selftests/drivers/net/bonding/settings
@@ -1 +1 @@
-timeout=120
+timeout=1200
--
2.43.0
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
cannot be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages; a data abort is generated if they are not.
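The BL/RET behaviour described above can be modelled in software. This
is a toy model for illustration only: the class names are invented, and
the hardware details (memory attributes, GCS pointer, exception
classes) are not represented.

```python
class GcsFault(Exception):
    """Stands in for the fault raised on a return-address mismatch."""


class Cpu:
    def __init__(self):
        self.lr = 0    # link register
        self.gcs = []  # guarded control stack, written only by bl()/ret()

    def bl(self, return_address):
        # BL: the return address goes to LR and is also pushed onto the GCS.
        self.lr = return_address
        self.gcs.append(return_address)

    def ret(self):
        # RET: pop the GCS and compare against LR before returning.
        expected = self.gcs.pop()
        if expected != self.lr:
            raise GcsFault(f"LR {self.lr:#x} != GCS {expected:#x}")
        return self.lr
```

An attacker who overwrites the saved LR on the normal stack changes
what RET would branch to, but not the copy on the GCS, so the
comparison fails and the return is not taken.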
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests. It does not enable use of GCS
by either EL1 or EL2; this will be implemented separately. Executables
are started without GCS and must use a prctl() to enable it. It is
expected that this will be done very early in application execution by
the dynamic linker or other startup code. For dynamic linking this will
be done by checking that everything in the executable is marked as GCS
compatible.
x86 has an equivalent feature called shadow stacks; this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible. As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.
The main divergence I am conscious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it. This is to avoid
races with things actively walking the GCS during a disable. We do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.
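The enable/disable policy described above can be summarised as a small
state model. It is a sketch of the stated behaviour only; the class is
hypothetical and does not reflect the prctl() interface or its error
codes.

```python
class GcsState:
    """Per-thread model: enable once, disable once, never re-enable."""

    def __init__(self):
        self.enabled = False
        self.allocated = False
        self.was_disabled = False

    def enable(self) -> bool:
        if self.was_disabled:
            return False  # re-enabling after a disable is refused
        self.enabled = True
        self.allocated = True
        return True

    def disable(self) -> None:
        if self.enabled:
            self.enabled = False
            self.was_disabled = True
            # self.allocated stays True: the GCS is kept so that anything
            # still walking it does not race with a free.
```

On x86, by contrast, the equivalent model would free the stack in
disable() and allow enable() to succeed again.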
x86 uses an arch_prctl() to manage enable and disable. Since only x86
and S/390 use arch_prctl(), a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V Zicfiss feature. I initially
adopted it fairly directly, but following review feedback it has been
revised quite a bit.
We currently maintain the x86 pattern of implicitly allocating a shadow
stack for threads started with shadow stack enabled. There has been some
discussion of removing this support and requiring the use of clone3()
with explicit allocation of shadow stacks instead. I have no strong
feelings either way: implicit allocation is not really consistent with
anything else we do and creates the potential for errors around thread
exit, but on the other hand it is existing ABI on x86 and minimises the
changes needed in userspace code.
There is an open issue with support for CRIU: on x86 this required the
ability to set the GCS mode via ptrace. This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.
The series depends on support for shadow stacks in clone3(); that series
includes the addition of ARCH_HAS_USER_SHADOW_STACK.
https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
It also depends on the addition of more waitpid() flags to nolibc:
https://lore.kernel.org/r/20231023-nolibc-waitpid-flags-v2-1-b09d096f091f@k…
You can see a branch with the full set of dependencies against Linus'
tree at:
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (39):
arm64/mm: Restructure arch_validate_flags() for extensibility
prctl: arch-agnostic prctl for shadow stack
mman: Add map_shadow_stack() flags
arm64: Document boot requirements for Guarded Control Stacks
arm64/gcs: Document the ABI for Guarded Control Stacks
arm64/sysreg: Add new system registers for GCS
arm64/sysreg: Add definitions for architected GCS caps
arm64/gcs: Add manual encodings of GCS instructions
arm64/gcs: Provide put_user_gcs()
arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
arm64/mm: Allocate PIE slots for EL0 guarded control stack
mm: Define VM_SHADOW_STACK for arm64 when we support GCS
arm64/mm: Map pages for guarded control stack
KVM: arm64: Manage GCS registers for guests
arm64/gcs: Allow GCS usage at EL0 and EL1
arm64/idreg: Add overrride for GCS
arm64/hwcap: Add hwcap for GCS
arm64/traps: Handle GCS exceptions
arm64/mm: Handle GCS data aborts
arm64/gcs: Context switch GCS state for EL0
arm64/gcs: Allocate a new GCS for threads with GCS enabled
arm64/gcs: Implement shadow stack prctl() interface
arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/signal: Expose GCS state in signal frames
arm64/ptrace: Expose GCS via ptrace and core files
arm64: Add Kconfig for Guarded Control Stack (GCS)
kselftest/arm64: Verify the GCS hwcap
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add test coverage for GCS mode locking
selftests/arm64: Add GCS signal tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Enable GCS for the FP stress tests
kselftest/clone3: Enable GCS in the clone3 selftests
Documentation/admin-guide/kernel-parameters.txt | 6 +
Documentation/arch/arm64/booting.rst | 22 +
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
Documentation/arch/arm64/gcs.rst | 233 +++++++
Documentation/arch/arm64/index.rst | 1 +
Documentation/filesystems/proc.rst | 2 +-
arch/arm64/Kconfig | 20 +
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 17 +
arch/arm64/include/asm/esr.h | 28 +-
arch/arm64/include/asm/exception.h | 2 +
arch/arm64/include/asm/gcs.h | 107 +++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 12 +
arch/arm64/include/asm/mman.h | 23 +-
arch/arm64/include/asm/pgtable-prot.h | 14 +-
arch/arm64/include/asm/processor.h | 7 +
arch/arm64/include/asm/sysreg.h | 20 +
arch/arm64/include/asm/uaccess.h | 40 ++
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/ptrace.h | 8 +
arch/arm64/include/uapi/asm/sigcontext.h | 9 +
arch/arm64/kernel/cpufeature.c | 19 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-common.c | 23 +
arch/arm64/kernel/idreg-override.c | 2 +
arch/arm64/kernel/process.c | 81 +++
arch/arm64/kernel/ptrace.c | 59 ++
arch/arm64/kernel/signal.c | 236 ++++++-
arch/arm64/kernel/traps.c | 11 +
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 +
arch/arm64/kvm/sys_regs.c | 22 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fault.c | 79 ++-
arch/arm64/mm/gcs.c | 259 +++++++
arch/arm64/mm/mmap.c | 13 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 55 ++
arch/x86/include/uapi/asm/mman.h | 3 -
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 16 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 22 +
kernel/sys.c | 30 +
tools/testing/selftests/arm64/Makefile | 2 +-
tools/testing/selftests/arm64/abi/hwcap.c | 19 +
tools/testing/selftests/arm64/fp/assembler.h | 15 +
tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
tools/testing/selftests/arm64/fp/sve-test.S | 2 +
tools/testing/selftests/arm64/fp/za-test.S | 2 +
tools/testing/selftests/arm64/fp/zt-test.S | 2 +
tools/testing/selftests/arm64/gcs/.gitignore | 5 +
tools/testing/selftests/arm64/gcs/Makefile | 24 +
tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
tools/testing/selftests/arm64/gcs/basic-gcs.c | 428 ++++++++++++
tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++
.../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++
tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++
tools/testing/selftests/arm64/gcs/libc-gcs.c | 742 +++++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.c | 17 +-
.../testing/selftests/arm64/signal/test_signals.h | 6 +
.../selftests/arm64/signal/test_signals_utils.c | 32 +-
.../selftests/arm64/signal/test_signals_utils.h | 39 ++
.../arm64/signal/testcases/gcs_exception_fault.c | 59 ++
.../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++
.../arm64/signal/testcases/gcs_write_fault.c | 67 ++
.../selftests/arm64/signal/testcases/testcases.c | 7 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
tools/testing/selftests/clone3/clone3.c | 37 +
73 files changed, 4234 insertions(+), 40 deletions(-)
---
base-commit: 3d0134d322380292c055454d9633738733992d61
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
A BPF application, e.g., a TCP congestion control, might benefit from or
even require precise (=hardware) packet timestamps. These timestamps are
already available through __sk_buff.hwtstamp and
bpf_sock_ops.skb_hwtstamp, but could not be requested: BPF programs were
not allowed to set SO_TIMESTAMPING* on sockets.
Enable BPF programs to actively request the generation of timestamps
from a stream socket. The ioctl(SIOCSHWTSTAMP) that is also required on
the network device must still be done separately, in user space.
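For comparison, this is roughly what the same request looks like from
user space. It is a sketch, not part of the patch: the option numbers
are the ones used in bpf_tracing_net.h in this patch (some
architectures use different values), the flag bits follow
linux/net_tstamp.h, and the helper name is made up. A BPF program would
pass the same level, option name and flags to bpf_setsockopt().

```python
import socket

# Option numbers as defined in this patch; not valid on every architecture.
SO_TIMESTAMPING_OLD = 37
SO_TIMESTAMPING_NEW = 65

# Flag bits from linux/net_tstamp.h.
SOF_TIMESTAMPING_RX_HARDWARE = 1 << 2
SOF_TIMESTAMPING_RAW_HARDWARE = 1 << 6

HW_RX_FLAGS = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE


def request_hw_rx_timestamps(sock, optname=SO_TIMESTAMPING_NEW):
    """Ask the stack to generate and report raw hardware RX timestamps."""
    sock.setsockopt(socket.SOL_SOCKET, optname, HW_RX_FLAGS)
```

Setting the socket option alone does not make the NIC produce
timestamps; the device-level configuration mentioned above is still
needed.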
This patch had previously been submitted in a two-part series (first
link below). The second patch has been independently applied in commit
7f6ca95d16b9 ("net: Implement missing getsockopt(SO_TIMESTAMPING_NEW)")
(second link below).
On the earlier submission, there was the open question whether to only
allow, thus enforce, SO_TIMESTAMPING_NEW in this patch:
For a BPF program, this won't make a difference: A timestamp, when
accessed through the fields mentioned above, is directly read from
skb_shared_info.hwtstamps, independent of the places where NEW/OLD is
relevant. See bpf_convert_ctx_access(), among others.
I am unsure, though, when it comes to the interconnection of user space
and BPF "space", when both are interested in the timestamps. I think it
would cause an unsolvable conflict when user space is bound to use
SO_TIMESTAMPING_OLD with a BPF program only allowed to set
SO_TIMESTAMPING_NEW *on the same socket*? Please correct me if I'm
mistaken.
Link: https://lore.kernel.org/lkml/20230703175048.151683-1-jthinz@mailbox.tu-berl…
Link: https://lore.kernel.org/all/20231221231901.67003-1-jthinz@mailbox.tu-berlin…
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Deepa Dinamani <deepa.kernel(a)gmail.com>
Cc: Willem de Bruijn <willemdebruijn.kernel(a)gmail.com>
Signed-off-by: Jörn-Thorben Hinz <j-t.hinz(a)alumni.tu-berlin.de>
---
include/uapi/linux/bpf.h | 3 ++-
net/core/filter.c | 2 ++
tools/include/uapi/linux/bpf.h | 3 ++-
tools/testing/selftests/bpf/progs/bpf_tracing_net.h | 2 ++
tools/testing/selftests/bpf/progs/setget_sockopt.c | 4 ++++
5 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 754e68ca8744..8825d0648efe 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2734,7 +2734,8 @@ union bpf_attr {
* **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**,
* **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**,
* **SO_BINDTODEVICE**, **SO_KEEPALIVE**, **SO_REUSEADDR**,
- * **SO_REUSEPORT**, **SO_BINDTOIFINDEX**, **SO_TXREHASH**.
+ * **SO_REUSEPORT**, **SO_BINDTOIFINDEX**, **SO_TXREHASH**,
+ * **SO_TIMESTAMPING_NEW**, **SO_TIMESTAMPING_OLD**.
* * **IPPROTO_TCP**, which supports the following *optname*\ s:
* **TCP_CONGESTION**, **TCP_BPF_IW**,
* **TCP_BPF_SNDCWND_CLAMP**, **TCP_SAVE_SYN**,
diff --git a/net/core/filter.c b/net/core/filter.c
index 8c9f67c81e22..4f5280874fd8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5144,6 +5144,8 @@ static int sol_socket_sockopt(struct sock *sk, int optname,
case SO_MAX_PACING_RATE:
case SO_BINDTOIFINDEX:
case SO_TXREHASH:
+ case SO_TIMESTAMPING_NEW:
+ case SO_TIMESTAMPING_OLD:
if (*optlen != sizeof(int))
return -EINVAL;
break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f24d898efbb..09eaafa6ab43 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2734,7 +2734,8 @@ union bpf_attr {
* **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**,
* **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**,
* **SO_BINDTODEVICE**, **SO_KEEPALIVE**, **SO_REUSEADDR**,
- * **SO_REUSEPORT**, **SO_BINDTOIFINDEX**, **SO_TXREHASH**.
+ * **SO_REUSEPORT**, **SO_BINDTOIFINDEX**, **SO_TXREHASH**,
+ * **SO_TIMESTAMPING_NEW**, **SO_TIMESTAMPING_OLD**.
* * **IPPROTO_TCP**, which supports the following *optname*\ s:
* **TCP_CONGESTION**, **TCP_BPF_IW**,
* **TCP_BPF_SNDCWND_CLAMP**, **TCP_SAVE_SYN**,
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 1bdc680b0e0e..95f5f169819e 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -15,8 +15,10 @@
#define SO_RCVLOWAT 18
#define SO_BINDTODEVICE 25
#define SO_MARK 36
+#define SO_TIMESTAMPING_OLD 37
#define SO_MAX_PACING_RATE 47
#define SO_BINDTOIFINDEX 62
+#define SO_TIMESTAMPING_NEW 65
#define SO_TXREHASH 74
#define __SO_ACCEPTCON (1 << 16)
diff --git a/tools/testing/selftests/bpf/progs/setget_sockopt.c b/tools/testing/selftests/bpf/progs/setget_sockopt.c
index 7a438600ae98..54205d10793c 100644
--- a/tools/testing/selftests/bpf/progs/setget_sockopt.c
+++ b/tools/testing/selftests/bpf/progs/setget_sockopt.c
@@ -48,6 +48,10 @@ static const struct sockopt_test sol_socket_tests[] = {
{ .opt = SO_MARK, .new = 0xeb9f, .expected = 0xeb9f, },
{ .opt = SO_MAX_PACING_RATE, .new = 0xeb9f, .expected = 0xeb9f, },
{ .opt = SO_TXREHASH, .flip = 1, },
+ { .opt = SO_TIMESTAMPING_NEW, .new = SOF_TIMESTAMPING_RX_HARDWARE,
+ .expected = SOF_TIMESTAMPING_RX_HARDWARE, },
+ { .opt = SO_TIMESTAMPING_OLD, .new = SOF_TIMESTAMPING_RX_HARDWARE,
+ .expected = SOF_TIMESTAMPING_RX_HARDWARE, },
{ .opt = 0, },
};
--
2.39.2
This extends the KVM RISC-V ONE_REG interface to report more ISA extensions,
namely: Zbc, scalar crypto, vector crypto, Zfh[min], Zihintntl, Zvfh[min],
and Zfa.
This series depends upon the "riscv: report more ISA extensions through
hwprobe" series from Clement.
(Link: https://lore.kernel.org/lkml/20231114141256.126749-1-cleger@rivosinc.com/)
To test these patches, use KVMTOOL from the riscv_more_exts_v1 branch at:
https://github.com/avpatel/kvmtool.git
These patches can also be found in the riscv_kvm_more_exts_v1 branch at:
https://github.com/avpatel/linux.git
Anup Patel (15):
KVM: riscv: selftests: Generate ISA extension reg_list using macros
RISC-V: KVM: Allow Zbc extension for Guest/VM
KVM: riscv: selftests: Add Zbc extension to get-reg-list test
RISC-V: KVM: Allow scalar crypto extensions for Guest/VM
KVM: riscv: selftests: Add scaler crypto extensions to get-reg-list
test
RISC-V: KVM: Allow vector crypto extensions for Guest/VM
KVM: riscv: selftests: Add vector crypto extensions to get-reg-list
test
RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM
KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test
RISC-V: KVM: Allow Zihintntl extension for Guest/VM
KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test
RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM
KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test
RISC-V: KVM: Allow Zfa extension for Guest/VM
KVM: riscv: selftests: Add Zfa extension to get-reg-list test
arch/riscv/include/uapi/asm/kvm.h | 27 ++
arch/riscv/kvm/vcpu_onereg.c | 54 +++
.../selftests/kvm/riscv/get-reg-list.c | 439 ++++++++----------
3 files changed, 265 insertions(+), 255 deletions(-)
--
2.34.1
As a followup to commit 03fb8565c880 ("selftests: bonding: add missing
build configs"), add more networking-specific config options which are
needed for bonding tests.
For testing, I used the minimal config generated by virtme-ng and I added
the options in the config file. All bonding tests passed.
Fixes: bbb774d921e2 ("net: Add tests for bonding and team address list management") # for ipv6
Fixes: 6cbe791c0f4e ("kselftest: bonding: add num_grat_arp test") # for tc options
Fixes: 222c94ec0ad4 ("selftests: bonding: add tests for ether type changes") # for nlmon
Suggested-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Benjamin Poirier <bpoirier(a)nvidia.com>
---
tools/testing/selftests/drivers/net/bonding/config | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/bonding/config b/tools/testing/selftests/drivers/net/bonding/config
index f85b16fc5128..899d7fb6ea8e 100644
--- a/tools/testing/selftests/drivers/net/bonding/config
+++ b/tools/testing/selftests/drivers/net/bonding/config
@@ -1,5 +1,10 @@
CONFIG_BONDING=y
CONFIG_BRIDGE=y
CONFIG_DUMMY=y
+CONFIG_IPV6=y
CONFIG_MACVLAN=y
+CONFIG_NET_ACT_GACT=y
+CONFIG_NET_CLS_FLOWER=y
+CONFIG_NET_SCH_INGRESS=y
+CONFIG_NLMON=y
CONFIG_VETH=y
--
2.43.0
The device is exported with a fuzz of 4, meaning that the `+ t` here
is removed by the fuzz algorithm, making those tests fail.
Not sure why, but when I ran this locally it passed, but not in the
VM.
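To illustrate why a step of 1 is swallowed while a step of 11 survives,
here is a minimal Python sketch of the fuzz filtering. It assumes the
filter behaves like the kernel's absolute-axis defuzz logic
(input_defuzz_abs_event() in drivers/input/input.c, as I understand it).
The helper name defuzz() and the sample coordinates are made up for this
example and are not part of the test suite.

```python
def defuzz(value, old_val, fuzz):
    """Return the value reported after fuzz filtering (illustrative sketch).

    Modelled on the kernel's absolute-axis defuzz step. Only positive
    coordinates are used here, so // matches C integer division.
    """
    if fuzz:
        # Very small moves are dropped: the old value is reported again.
        if old_val - fuzz // 2 < value < old_val + fuzz // 2:
            return old_val
        # Small moves are smoothed towards the old value.
        if old_val - fuzz < value < old_val + fuzz:
            return (old_val * 3 + value) // 4
        if old_val - fuzz * 2 < value < old_val + fuzz * 2:
            return (old_val + value) // 2
    return value


FUZZ = 4
contact_id = 0
x0 = 50 + 10 * contact_id                 # x position at t = 0

# Old formula: x = 50 + 10 * contact_id + t, with t = 1
old_step = defuzz(x0 + 1, x0, FUZZ)
# New formula: x = 50 + 10 * contact_id + t * 11, with t = 1
new_step = defuzz(x0 + 1 * 11, x0, FUZZ)

print(old_step, new_step)
```

With a fuzz of 4, the old one-unit step falls inside the innermost window
and is reported as the previous position, while the 11-unit step lies
outside all three windows and is passed through unchanged.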
Link: https://gitlab.freedesktop.org/bentiss/hid/-/jobs/53692957#L3315
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Over the break the test suite wasn't properly running on my runner,
and this small issue sneaked in.
---
tools/testing/selftests/hid/tests/test_wacom_generic.py | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/hid/tests/test_wacom_generic.py b/tools/testing/selftests/hid/tests/test_wacom_generic.py
index 352fc39f3c6c..b62c7dba6777 100644
--- a/tools/testing/selftests/hid/tests/test_wacom_generic.py
+++ b/tools/testing/selftests/hid/tests/test_wacom_generic.py
@@ -880,8 +880,8 @@ class TestDTH2452Tablet(test_multitouch.BaseTest.TestMultitouch, TouchTabletTest
does not overlap with other contacts. The value of `t` may be
incremented over time to move the point along a linear path.
"""
- x = 50 + 10 * contact_id + t
- y = 100 + 100 * contact_id + t
+ x = 50 + 10 * contact_id + t * 11
+ y = 100 + 100 * contact_id + t * 11
return test_multitouch.Touch(contact_id, x, y)
def make_contacts(self, n, t=0):
@@ -902,8 +902,8 @@ class TestDTH2452Tablet(test_multitouch.BaseTest.TestMultitouch, TouchTabletTest
tracking_id = contact_ids.tracking_id
slot_num = contact_ids.slot_num
- x = 50 + 10 * contact_id + t
- y = 100 + 100 * contact_id + t
+ x = 50 + 10 * contact_id + t * 11
+ y = 100 + 100 * contact_id + t * 11
# If the data isn't supposed to be stored in any slots, there is
# nothing we can check for in the evdev stream.
---
base-commit: 80d5a73edcfbd1d8d6a4c2b755873c5d63a1ebd7
change-id: 20240117-b4-wip-wacom-tests-fixes-298b50bea47f
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
On Wed, Jan 17, 2024 at 7:12 PM Jason Gerecke <killertofu(a)gmail.com> wrote:
>
> LGTM. Acked-by: Jason Gerecke <jason.gerecke(a)wacom.com>
Thanks!
I'll add a:
Fixes: b0fb904d074e ("HID: wacom: Add additional tests of confidence behavior")
And send to Linus in the next round for 6.8 so we also fix the future
for-6.9 branches
Cheers,
Benjamin
>
>
> Jason
> ---
> Now instead of four in the eights place /
> you’ve got three, ‘Cause you added one /
> (That is to say, eight) to the two, /
> But you can’t take seven from three, /
> So you look at the sixty-fours....
Hi Mohammad,
On 1/16/24 21:48, Mohammad Nassiri wrote:
> The end_server() function only operates in the server thread
> and always takes an accept socket instead of a listen socket as
> its input argument. To align with this, invert the boolean values
> used when calling verify_counters() within the end_server() function.
>
> Fixes: 3c3ead555648 ("selftests/net: Add TCP-AO key-management test")
> Signed-off-by: Mohammad Nassiri <mnassiri(a)ciena.com>
> Link: https://lore.kernel.org/all/934627c5-eebb-4626-be23-cfb134c01d1a@arista.com/
As I wrote to you off-list, the patch was probably not delivered to the
mailing lists because the SPF check did not pass. Please fix the send-email
setup when/if you want to send more patches.
Regarding this patch: I'm going to carry it and resend it together with 2
more patches, as this fix made 3 selftests fail and I've looked into that.
> ---
> tools/testing/selftests/net/tcp_ao/key-management.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/net/tcp_ao/key-management.c b/tools/testing/selftests/net/tcp_ao/key-management.c
> index c48b4970ca17..f6a9395e3cd7 100644
> --- a/tools/testing/selftests/net/tcp_ao/key-management.c
> +++ b/tools/testing/selftests/net/tcp_ao/key-management.c
> @@ -843,7 +843,7 @@ static void end_server(const char *tst_name, int sk,
> synchronize_threads(); /* 4: verified => closed */
> close(sk);
>
> - verify_counters(tst_name, true, false, begin, &end);
> + verify_counters(tst_name, false, true, begin, &end);
> synchronize_threads(); /* 5: counters */
> }
>
Thanks,
Dmitry
When running with CATEGORY= (thp | hugetlb) we see a large number of
tests failing. These failures are due to not being able to allocate a
hugepage and normally occur on memory-constrained systems or when using
large page sizes.
Drop caches and compact memory before the tests for a higher chance of a
successful hugepage allocation.
Signed-off-by: Nico Pache <npache(a)redhat.com>
---
tools/testing/selftests/mm/run_vmtests.sh | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 246d53a5d7f2..040f27e21f47 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -206,6 +206,15 @@ pretty_name() {
# Usage: run_test [test binary] [arbitrary test arguments...]
run_test() {
if test_selected ${CATEGORY}; then
+ # On memory constrained systems some tests can fail to allocate hugepages.
+ # Perform some cleanup before the test for a higher success rate.
+ if [ ${CATEGORY} == "thp" ] || [ ${CATEGORY} == "hugetlb" ]; then
+ echo 3 > /proc/sys/vm/drop_caches
+ sleep 2
+ echo 1 > /proc/sys/vm/compact_memory
+ sleep 2
+ fi
+
local test=$(pretty_name "$*")
local title="running $*"
local sep=$(echo -n "$title" | tr "[:graph:][:space:]" -)
--
2.43.0
hugetlb_madv_vs_map selftest was not part of the mm test-suite since we
didn't have a fix for the problem it found.
Now that the problem is already fixed (see previous commit), let's
enable this selftest in the default test-suite.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index a5e6ba8d3579..f41e1978e4d4 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -256,6 +256,7 @@ nr_hugepages_tmp=$(cat /proc/sys/vm/nr_hugepages)
# For this test, we need one and just one huge page
echo 1 > /proc/sys/vm/nr_hugepages
CATEGORY="hugetlb" run_test ./hugetlb_fault_after_madv
+CATEGORY="hugetlb" run_test ./hugetlb_madv_vs_map
# Restore the previous number of huge pages, since further tests rely on it
echo "$nr_hugepages_tmp" > /proc/sys/vm/nr_hugepages
--
2.34.1
From: Amit Cohen <amcohen(a)nvidia.com>
'qos_pfc' test checks PFC behavior. The idea is to limit the traffic
using a shaper somewhere in the flow of the packets. In this area, the
buffer is smaller than the buffer at the beginning of the flow, so it fills
up until there is no more space left. The test configures PFC there,
which is supposed to notice that the headroom is filling up and send PFC
Xoff to tell the transmitter to stop sending traffic for the priorities
sharing this PG.
The Xon/Xoff threshold is auto-configured and always equal to
2*(MTU rounded up to cell size). Even after sending the PFC Xoff packet,
traffic will keep arriving until the transmitter receives and processes
the PFC packet. This amount of traffic is known as the PFC delay allowance.
Currently the buffer for the delay traffic is configured as 100KB. The
MTU in the test is 10KB, therefore the threshold for Xoff is about 20KB.
This allows 80KB extra to be stored in this buffer.
8-lane ports use two buffers among which the configured buffer is split;
the Xoff threshold then applies to each buffer in parallel.
The test does not take the behavior of 8-lane ports into account: when the
ports are configured to 400Gbps with 8 lanes or 800Gbps with 8 lanes,
packets are dropped and the test fails.
Check if the relevant ports use 8 lanes, and if so double the size of
the buffer, as the headroom is split half-half.
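The arithmetic above can be sketched in a few lines of Python. This is
only a simplified model of the description in this commit message: the
helper delay_allowance(), the KB unit and the even split for 8-lane ports
are assumptions made for illustration, and rounding the MTU up to the cell
size is ignored.

```python
KB = 1000  # assumed unit; the test's _100KB constant may be defined differently


def delay_allowance(buffer_size, mtu, lanes):
    """Room left in one buffer for in-flight traffic after Xoff is sent."""
    xoff_threshold = 2 * mtu  # 2 * MTU, cell-size rounding ignored
    # 8-lane ports split the configured size across two buffers, and the
    # Xoff threshold applies to each of them.
    per_buffer = buffer_size // 2 if lanes == 8 else buffer_size
    return per_buffer - xoff_threshold


mtu = 10 * KB
pg1_size = 100 * KB

four_lane = delay_allowance(pg1_size, mtu, lanes=4)
eight_lane_before = delay_allowance(pg1_size, mtu, lanes=8)
eight_lane_after = delay_allowance(pg1_size * 2, mtu, lanes=8)

print(four_lane, eight_lane_before, eight_lane_after)
```

Under these assumptions an 8-lane port is left with only 30KB of delay
allowance per buffer when 100KB is configured, and doubling the size
restores the 80KB that the other ports get.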
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: linux-kselftest(a)vger.kernel.org
Fixes: bfa804784e32 ("selftests: mlxsw: Add a PFC test")
Signed-off-by: Amit Cohen <amcohen(a)nvidia.com>
Reviewed-by: Ido Schimmel <idosch(a)nvidia.com>
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
.../selftests/drivers/net/mlxsw/qos_pfc.sh | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_pfc.sh b/tools/testing/selftests/drivers/net/mlxsw/qos_pfc.sh
index 49bef76083b8..0f0f4f05807c 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_pfc.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_pfc.sh
@@ -119,6 +119,9 @@ h2_destroy()
switch_create()
{
+ local lanes_swp4
+ local pg1_size
+
# pools
# -----
@@ -228,7 +231,20 @@ switch_create()
dcb pfc set dev $swp4 prio-pfc all:off 1:on
# PG0 will get autoconfigured to Xoff, give PG1 arbitrarily 100K, which
# is (-2*MTU) about 80K of delay provision.
- dcb buffer set dev $swp4 buffer-size all:0 1:$_100KB
+ pg1_size=$_100KB
+
+ setup_wait_dev_with_timeout $swp4
+
+ lanes_swp4=$(ethtool $swp4 | grep 'Lanes:')
+ lanes_swp4=${lanes_swp4#*"Lanes: "}
+
+ # 8-lane ports use two buffers among which the configured buffer
+ # is split, so double the size to get twice (20K + 80K).
+ if [[ $lanes_swp4 -eq 8 ]]; then
+ pg1_size=$((pg1_size * 2))
+ fi
+
+ dcb buffer set dev $swp4 buffer-size all:0 1:$pg1_size
# bridges
# -------
--
2.42.0
A build issue comes up because dev_in_maps.c includes both <linux/mount.h> and <sys/mount.h>:
In file included from dev_in_maps.c:10:
/usr/include/sys/mount.h:35:3: error: expected identifier before numeric constant
35 | MS_RDONLY = 1, /* Mount read-only. */
| ^~~~~~~~~
In file included from dev_in_maps.c:13:
Removing one of them resolves the conflict, but then another error comes up:
dev_in_maps.c:170:6: error: implicit declaration of function ‘mount’ [-Werror=implicit-function-declaration]
170 | if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) == -1) {
| ^~~~~
cc1: all warnings being treated as errors
Adding a sys_mount() definition solves it.
With both changes, dev_in_maps.c builds correctly on my machine (gcc 10.2, glibc 2.32, kernel 5.10).
Signed-off-by: Hu Yadi <hu.yadi(a)h3c.com>
---
.../selftests/filesystems/overlayfs/dev_in_maps.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/filesystems/overlayfs/dev_in_maps.c b/tools/testing/selftests/filesystems/overlayfs/dev_in_maps.c
index e19ab0e85709..759f86e7d263 100644
--- a/tools/testing/selftests/filesystems/overlayfs/dev_in_maps.c
+++ b/tools/testing/selftests/filesystems/overlayfs/dev_in_maps.c
@@ -10,7 +10,6 @@
#include <linux/mount.h>
#include <sys/syscall.h>
#include <sys/stat.h>
-#include <sys/mount.h>
#include <sys/mman.h>
#include <sched.h>
#include <fcntl.h>
@@ -32,7 +31,11 @@ static int sys_fsmount(int fd, unsigned int flags, unsigned int attr_flags)
{
return syscall(__NR_fsmount, fd, flags, attr_flags);
}
-
+static int sys_mount(const char *src, const char *tgt, const char *fst,
+ unsigned long flags, const void *data)
+{
+ return syscall(__NR_mount, src, tgt, fst, flags, data);
+}
static int sys_move_mount(int from_dfd, const char *from_pathname,
int to_dfd, const char *to_pathname,
unsigned int flags)
@@ -166,8 +169,7 @@ int main(int argc, char **argv)
ksft_test_result_skip("unable to create a new mount namespace\n");
return 1;
}
-
- if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) == -1) {
+ if (sys_mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) == -1) {
pr_perror("mount");
return 1;
}
--
2.39.3
Running charge_reserved_hugetlb.sh generates errors if sh is set to
dash:
./charge_reserved_hugetlb.sh: 9: [[: not found
./charge_reserved_hugetlb.sh: 19: [[: not found
./charge_reserved_hugetlb.sh: 27: [[: not found
./charge_reserved_hugetlb.sh: 37: [[: not found
./charge_reserved_hugetlb.sh: 45: Syntax error: "(" unexpected
Switch to using /bin/bash instead of /bin/sh. Make the switch for
write_hugetlb_memory.sh as well which is called from
charge_reserved_hugetlb.sh.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/mm/charge_reserved_hugetlb.sh | 2 +-
tools/testing/selftests/mm/write_hugetlb_memory.sh | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
index 0899019a7fcb..e14bdd4455f2 100755
--- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
+++ b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
# Kselftest framework requirement - SKIP code is 4.
diff --git a/tools/testing/selftests/mm/write_hugetlb_memory.sh b/tools/testing/selftests/mm/write_hugetlb_memory.sh
index 70a02301f4c2..3d2d2eb9d6ff 100755
--- a/tools/testing/selftests/mm/write_hugetlb_memory.sh
+++ b/tools/testing/selftests/mm/write_hugetlb_memory.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
set -e
--
2.42.0
From: Jeff Xu <jeffxu(a)chromium.org>
This is V4 of the patch. It has improved significantly since V1 thanks
to diverse input. A few discussions remain open; please read them in the
"Open discussions" section under V4 of the change history.
-----------------------------------------------------------------
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the
following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space, therefore can
be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
In addition: mmap() has two related changes.
The PROT_SEAL bit in prot field of mmap(). When present, it marks
the map sealed since creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.
Applications that don't care about sealing will expect their behavior
unchanged. For those that need sealing support, opt-in by adding
MAP_SEALABLE in mmap().
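As a rough illustration of the rules listed above, here is a toy Python
model of a single mapping. It is not the kernel implementation and not a
binding to the syscall: the class ToyMapping, its method names, the
placeholder flag values and the plain 0 / -1 return codes are all
assumptions made for this sketch. Treating a PROT_SEAL mapping as sealable
is also an assumption, and PKEY permissions are not modelled.

```python
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4  # usual mmap() protection bits
PROT_SEAL = 0x10      # placeholder value for this sketch only
MAP_SEALABLE = 0x100  # placeholder value for this sketch only


class ToyMapping:
    """Toy model of one mapping and its sealing state (illustration only)."""

    def __init__(self, prot, flags=0):
        # PROT_SEAL marks the mapping as sealed since creation.
        self.sealed = bool(prot & PROT_SEAL)
        # MAP_SEALABLE opts in to a later mseal().
        self.sealable = bool(flags & MAP_SEALABLE) or self.sealed
        self.prot = prot & ~PROT_SEAL
        self.mapped = True

    def mseal(self):
        if not self.sealable:
            return -1  # created without MAP_SEALABLE: sealing fails
        self.sealed = True
        return 0

    def mprotect(self, prot):
        if self.sealed:
            return -1  # permission changes are blocked
        self.prot = prot
        return 0

    def munmap(self):
        if self.sealed:
            return -1  # unmapping is blocked
        self.mapped = False
        return 0

    def madvise_dontneed(self):
        # Destructive discard is blocked on sealed memory without write access.
        if self.sealed and not (self.prot & PROT_WRITE):
            return -1
        return 0


plain = ToyMapping(PROT_READ)
text = ToyMapping(PROT_READ | PROT_EXEC, MAP_SEALABLE)
born_sealed = ToyMapping(PROT_READ | PROT_WRITE | PROT_SEAL)

results = {
    "seal_plain": plain.mseal(),
    "unmap_plain": plain.munmap(),
    "seal_text": text.mseal(),
    "mprotect_text": text.mprotect(PROT_READ | PROT_WRITE),
    "unmap_text": text.munmap(),
    "discard_text": text.madvise_dontneed(),
    "discard_born_sealed": born_sealed.madvise_dontneed(),
}
print(results)
```

In this model a mapping created without MAP_SEALABLE behaves exactly as
before, which is the opt-in property the paragraph above relies on.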
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
--------------------------------------------------------------------
Change history:
===============
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
Open discussions:
=================
The discussions below were brought up in V3 and did not receive any input.
The one important to this patch is MAP_SEALABLE in mmap(), which is in the
current version of the patch; it is listed here for input/comments.
---------------------------------------------------------------------
During the development of V3, I had new questions and thoughts and
wished to discuss.
1> shm/aio
From reading the code, it seems to me that aio/shm can mmap/munmap
maps on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime
of those mappings is not tied to the lifetime of the process. If those
memories are sealed from userspace, then unmap will fail. This isn’t a
huge problem, since the memory will eventually be freed at exit or
exec. However, it feels like the solution is not complete, because of
the leaks in VMA address space during the lifetime of the process.
2> Brk (heap/stack)
Currently, userspace applications can seal parts of the heap by
calling malloc() and mseal(). This raises the question of what the
expected behavior is when sealing the heap is attempted.
Let's assume the following calls from user space:
ptr = malloc(size);
mprotect(ptr, size, RO);
mseal(ptr, size, SEAL_PROT_PKEY);
free(ptr);
Technically, before mseal() is added, the user can change the
protection of the heap by calling mprotect(RO). As long as the user
changes the protection back to RW before free(), the memory can be
reused.
With mseal() added to the picture, however, the heap is then partially
sealed. The user can still free it, but the memory remains RO,
and the result of brk-shrink is nondeterministic, depending on whether
munmap() will try to free the sealed memory (brk uses munmap to shrink
the heap).
3> Above two cases led to the third topic:
There is one option to address the problem mentioned above.
Option 1: A “MAP_SEALABLE” flag in mmap().
If a map is created without this flag, the mseal() operation will
fail. Applications that are not concerned with sealing will expect
their behavior to be unchanged. For those that are concerned, adding a
flag at mmap time to opt in is not difficult. For the short term, this
solves problems 1 and 2 above. The memory in shm/aio/brk will not have
the MAP_SEALABLE flag at mmap(), and the same is true for the heap.
If we choose not to go with this path, all mappings will be sealable by
default. We could document the above-mentioned limitations so devs are
more careful when choosing what memory to seal. I think
denial of service through mseal() by an attacker is probably not a concern:
if attackers have access to mseal() and unsealed memory, then they can
also do other harmful things to the memory, such as munmap, etc.
4>
I think it might be possible to seal the stack or other special
mappings created at runtime (vdso, vsyscall, vvar). This means we can
enforce and seal W^X for certain types of application. For instance,
the stack is typically used in read-write mode, but in some cases, it
can become executable. To defend against unintended addition of the
executable bit to the stack, we could let the application seal it.
Sealing the heap (for adding X) requires special handling, since the
heap can shrink, and shrink is implemented through munmap().
Indeed, it might be possible that all virtual memory accessible to user
space, regardless of its usage pattern, could be sealed. However, this
would require additional research and development work.
=====================================================================
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
Documentation/userspace-api/mseal.rst | 181 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/mm.h | 60 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/mman-common.h | 7 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/madvise.c | 12 +
mm/mmap.c | 27 +
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 330 ++++
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1971 +++++++++++++++++++
32 files changed, 2659 insertions(+), 2 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.43.0.195.gebba966016-goog