Attacks against vulnerable userspace applications with the purpose to break
ASLR or bypass canaries traditionally use some level of brute force with
the help of the fork system call. This is possible since when creating a
new process using fork its memory contents are the same as those of the
parent process (the process that called the fork system call). So, the
attacker can test the memory infinite times to find the correct memory
values or the correct memory addresses without worrying about crashing the
application.
Based on the above scenario it would be nice to have this detected and
mitigated, and this is the goal of this patch serie. Specifically the
following attacks are expected to be detected:
1.- Launching (fork()/exec()) a setuid/setgid process repeatedly until a
desirable memory layout is got (e.g. Stack Clash).
2.- Connecting to an exec()ing network daemon (e.g. xinetd) repeatedly
until a desirable memory layout is got (e.g. what CTFs do for simple
network service).
3.- Launching processes without exec() (e.g. Android Zygote) and exposing
state to attack a sibling.
4.- Connecting to a fork()ing network daemon (e.g. apache) repeatedly until
the previously shared memory layout of all the other children is
exposed (e.g. kind of related to HeartBleed).
In each case, a privilege boundary has been crossed:
Case 1: setuid/setgid process
Case 2: network to local
Case 3: privilege changes
Case 4: network to local
So, what will really be detected are fork/exec brute force attacks that
cross any of the commented bounds.
The implementation details and comparison against other existing
implementations can be found in the "Documentation" patch.
It is important to mention that the v8 and v7 versions have changed the
method used to track the information related to the application crashes.
Prior this versions, a pointer per process (in the task_struct structure)
held a reference to the shared statistical data. Or in other words, these
stats were shared by all the fork hierarchy processes. But this has an
important drawback: a brute force attack that happens through the execve
system call losts the faults info since these statistics are freed when the
fork hierarchy disappears. So, the solution adopted in the v6 version was
to use an upper fork hierarchy to track the info for this attack type. But,
as Valdis Kletnieks pointed out during this discussion [1], this method can
be easily bypassed using a double exec (well, this was the method used in
the kselftest to avoid the detection ;) ). So, in this version, to track
all the statistical data (info related with application crashes), the
extended attributes feature for the executable files are used. The xattr is
also used to mark the executables as "not allowed" when an attack is
detected. Then, the execve system call rely on this flag to avoid following
executions of this file.
[1] https://lore.kernel.org/kernelnewbies/20210330173459.GA3163@ubuntu/
Moreover, I think this solves another problem pointed out by Andi Kleen
during the v5 review [2] related to the possibility that a supervisor
respawns processes killed by the Brute LSM. He suggested adding some way so
a supervisor can know that a process has been killed by Brute and then
decide to respawn or not. So, now, the supervisor can read the brute xattr
of one executable and know if it is blocked by Brute and why (using the
statistical data).
[2] https://lore.kernel.org/kernel-hardening/878s78dnrm.fsf@linux.intel.com/
Although the xattr of the executable is accessible from userspace, in
complex daemons this file may not be visible directly by the supervisor as
it may be run through some wrapper. So, an extension to the waitid() system
call has been added in this version. This was suggested by Andi Kleen [3]
during the v7 review. (The case with supervisors using cgroups is not yet
tested).
[3] https://lore.kernel.org/kernel-hardening/19903478-52e0-3829-0515-3e17669108…
Knowing all this information I will explain now the different patches:
The 1/8 patch defines a new LSM hook to get the fatal signal of a task.
This will be useful during the attack detection phase.
The 2/8 patch defines a new LSM and the necessary sysctl attributes to fine
tuning the attack detection.
The 3/8 patch detects a fork/exec brute force attack and narrows the
possible cases taken into account the privilege boundary crossing.
The 4/8 patch mitigates a brute force attack.
The 5/8 patch adds the extension to the waitid system call to notify to
userspace that a task has been killed by Brute LSM when an attack is
mitigated.
The 6/8 patch adds self-tests to validate the Brute LSM expectations.
The 7/8 patch adds the documentation to explain this implementation.
The 8/8 patch updates the maintainers file.
This patch serie is a task of the KSPP [4] and can also be accessed from my
github tree [5] in the "brute_v8" branch.
[4] https://github.com/KSPP/linux/issues/39
[5] https://github.com/johwood/linux/
When I ran the "checkpatch" script I got the following errors, but I think
they are false positives as I follow the same coding style for the others
extended attributes suffixes.
----------------------------------------------------------------------------
../patches/brute_v8/v8-0003-security-brute-Detect-a-brute-force-attack.patch
----------------------------------------------------------------------------
ERROR: Macros with complex values should be enclosed in parentheses
89: FILE: include/uapi/linux/xattr.h:80:
+#define XATTR_NAME_BRUTE XATTR_SECURITY_PREFIX XATTR_BRUTE_SUFFIX
-----------------------------------------------------------------------------
../patches/brute_v8/v8-0006-selftests-brute-Add-tests-for-the-Brute-LSM.patch
-----------------------------------------------------------------------------
ERROR: Macros with complex values should be enclosed in parentheses
159: FILE: tools/testing/selftests/brute/rmxattr.c:18:
+#define XATTR_NAME_BRUTE XATTR_SECURITY_PREFIX XATTR_BRUTE_SUFFIX
When I ran the "kernel-doc" script with the following parameters:
./scripts/kernel-doc --none -v security/brute/brute.c
I got the following warning:
security/brute/brute.c:118: warning: contents before sections
But I don't understand why it is complaining. Could it be a false positive?
The previous versions can be found in:
RFC
https://lore.kernel.org/kernel-hardening/20200910202107.3799376-1-keescook@…
Version 2
https://lore.kernel.org/kernel-hardening/20201025134540.3770-1-john.wood@gm…
Version 3
https://lore.kernel.org/lkml/20210221154919.68050-1-john.wood@gmx.com/
Version 4
https://lore.kernel.org/lkml/20210227150956.6022-1-john.wood@gmx.com/
Version 5
https://lore.kernel.org/kernel-hardening/20210227153013.6747-1-john.wood@gm…
Version 6
https://lore.kernel.org/kernel-hardening/20210307113031.11671-1-john.wood@g…
Version 7
https://lore.kernel.org/kernel-hardening/20210521172414.69456-1-john.wood@g…
Changelog RFC -> v2
-------------------
- Rename this feature with a more suitable name (Jann Horn, Kees Cook).
- Convert the code to an LSM (Kees Cook).
- Add locking to avoid data races (Jann Horn).
- Add a new LSM hook to get the fatal signal of a task (Jann Horn, Kees
Cook).
- Add the last crashes timestamps list to avoid false positives in the
attack detection (Jann Horn).
- Use "period" instead of "rate" (Jann Horn).
- Other minor changes suggested (Jann Horn, Kees Cook).
Changelog v2 -> v3
------------------
- Compute the application crash period on an on-going basis (Kees Cook).
- Detect a brute force attack through the execve system call (Kees Cook).
- Detect an slow brute force attack (Randy Dunlap).
- Fine tuning the detection taken into account privilege boundary crossing
(Kees Cook).
- Taken into account only fatal signals delivered by the kernel (Kees
Cook).
- Remove the sysctl attributes to fine tuning the detection (Kees Cook).
- Remove the prctls to allow per process enabling/disabling (Kees Cook).
- Improve the documentation (Kees Cook).
- Fix some typos in the documentation (Randy Dunlap).
- Add self-test to validate the expectations (Kees Cook).
Changelog v3 -> v4
------------------
- Fix all the warnings shown by the tool "scripts/kernel-doc" (Randy
Dunlap).
Changelog v4 -> v5
------------------
- Fix some typos (Randy Dunlap).
Changelog v5 -> v6
------------------
- Fix a reported deadlock (kernel test robot).
- Add high level details to the documentation (Andi Kleen).
Changelog v6 -> v7
------------------
- Add the "Reviewed-by:" tag to the first patch.
- Rearrange the brute LSM between lockdown and yama (Kees Cook).
- Split subdir and obj in security/Makefile (Kees Cook).
- Reduce the number of header files included (Kees Cook).
- Print the pid when an attack is detected (Kees Cook).
- Use the socket_accept LSM hook instead of socket_sock_rcv_skb hook to
avoid running a hook on every incoming network packet (Kees Cook).
- Update the documentation and fix it to render it properly (Jonathan
Corbet).
- Manage correctly an exec brute force attack avoiding the bypass (Valdis
Kletnieks).
- Other minor changes and cleanups.
Changelog v7 -> v8
------------------
- Rebase against v5.13-rc4.
- Fix a build error if CONFIG_IPV6 and/or CONFIG_SECURITY_NETWORK is not
set (kernel test robot).
- Notify to userspace that a task has been killed by Brute LSM (Andi
Kleen).
- Add a new test to verify that the userspace notification is working.
- Update the documentation accordingly with this new feature.
- Other minor changes and cleanups.
Any constructive comments are welcome.
Thanks in advance.
John Wood (8):
security: Add LSM hook at the point where a task gets a fatal signal
security/brute: Define a LSM and add sysctl attributes
security/brute: Detect a brute force attack
security/brute: Mitigate a brute force attack
security/brute: Notify to userspace "task killed"
selftests/brute: Add tests for the Brute LSM
Documentation: Add documentation for the Brute LSM
MAINTAINERS: Add a new entry for the Brute LSM
Documentation/admin-guide/LSM/Brute.rst | 359 ++++++++++
Documentation/admin-guide/LSM/index.rst | 1 +
MAINTAINERS | 8 +
arch/x86/kernel/signal_compat.c | 2 +-
include/brute/brute.h | 16 +
include/linux/lsm_hook_defs.h | 1 +
include/linux/lsm_hooks.h | 4 +
include/linux/security.h | 4 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/xattr.h | 3 +
kernel/exit.c | 6 +-
kernel/signal.c | 5 +-
security/Kconfig | 11 +-
security/Makefile | 2 +
security/brute/Kconfig | 15 +
security/brute/Makefile | 2 +
security/brute/brute.c | 795 +++++++++++++++++++++++
security/security.c | 5 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/brute/.gitignore | 3 +
tools/testing/selftests/brute/Makefile | 5 +
tools/testing/selftests/brute/config | 1 +
tools/testing/selftests/brute/exec.c | 46 ++
tools/testing/selftests/brute/rmxattr.c | 34 +
tools/testing/selftests/brute/test.c | 507 +++++++++++++++
tools/testing/selftests/brute/test.sh | 269 ++++++++
26 files changed, 2099 insertions(+), 9 deletions(-)
create mode 100644 Documentation/admin-guide/LSM/Brute.rst
create mode 100644 include/brute/brute.h
create mode 100644 security/brute/Kconfig
create mode 100644 security/brute/Makefile
create mode 100644 security/brute/brute.c
create mode 100644 tools/testing/selftests/brute/.gitignore
create mode 100644 tools/testing/selftests/brute/Makefile
create mode 100644 tools/testing/selftests/brute/config
create mode 100644 tools/testing/selftests/brute/exec.c
create mode 100644 tools/testing/selftests/brute/rmxattr.c
create mode 100644 tools/testing/selftests/brute/test.c
create mode 100755 tools/testing/selftests/brute/test.sh
--
2.25.1
This patchset provides a file descriptor for every VM and VCPU to read
KVM statistics data in binary format.
It is meant to provide a lightweight, flexible, scalable and efficient
lock-free solution for user space telemetry applications to pull the
statistics data periodically for large scale systems. The pulling
frequency could be as high as a few times per second.
In this patchset, every statistics data are treated to have some
attributes as below:
* architecture dependent or generic
* VM statistics data or VCPU statistics data
* type: cumulative, instantaneous,
* unit: none for simple counter, nanosecond, microsecond,
millisecond, second, Byte, KiByte, MiByte, GiByte. Clock Cycles
Since no lock/synchronization is used, the consistency between all
the statistics data is not guaranteed. That means not all statistics
data are read out at the exact same time, since the statistics date
are still being updated by KVM subsystems while they are read out.
---
* v7-> v8
- Rebase to kvm/queue, commit c1dc20e254b4 ("KVM: switch per-VM
stats to u64")
- Revise code to reflect the per-VM stats type from ulong to u64
- Addressed some other nits
* v6 -> v7
- Improve file descriptor allocation function by Krish suggestion
- Use "generic stats" instead of "common stats" as Krish suggested
- Addressed some other nits from Krish and David Matlack
* v5 -> v6
- Use designated initializers for STATS_DESC
- Change KVM_STATS_SCALE... to KVM_STATS_BASE...
- Use a common function for kvm_[vm|vcpu]_stats_read
- Fix some documentation errors/missings
- Use TEST_ASSERT in selftest
- Use a common function for [vm|vcpu]_stats_test in selftest
* v4 -> v5
- Rebase to kvm/queue, commit a4345a7cecfb ("Merge tag
'kvmarm-fixes-5.13-1'")
- Change maximum stats name length to 48
- Replace VM_STATS_COMMON/VCPU_STATS_COMMON macros with stats
descriptor definition macros.
- Fixed some errors/warnings reported by checkpatch.pl
* v3 -> v4
- Rebase to kvm/queue, commit 9f242010c3b4 ("KVM: avoid "deadlock"
between install_new_memslots and MMU notifier")
- Use C-stype comments in the whole patch
- Fix wrong count for x86 VCPU stats descriptors
- Fix KVM stats data size counting and validity check in selftest
* v2 -> v3
- Rebase to kvm/queue, commit edf408f5257b ("KVM: avoid "deadlock"
between install_new_memslots and MMU notifier")
- Resolve some nitpicks about format
* v1 -> v2
- Use ARRAY_SIZE to count the number of stats descriptors
- Fix missing `size` field initialization in macro STATS_DESC
[1] https://lore.kernel.org/kvm/20210402224359.2297157-1-jingzhangos@google.com
[2] https://lore.kernel.org/kvm/20210415151741.1607806-1-jingzhangos@google.com
[3] https://lore.kernel.org/kvm/20210423181727.596466-1-jingzhangos@google.com
[4] https://lore.kernel.org/kvm/20210429203740.1935629-1-jingzhangos@google.com
[5] https://lore.kernel.org/kvm/20210517145314.157626-1-jingzhangos@google.com
[6] https://lore.kernel.org/kvm/20210524151828.4113777-1-jingzhangos@google.com
[7] https://lore.kernel.org/kvm/20210603211426.790093-1-jingzhangos@google.com
---
Jing Zhang (4):
KVM: stats: Separate generic stats from architecture specific ones
KVM: stats: Add fd-based API to read binary stats data
KVM: stats: Add documentation for statistics data binary interface
KVM: selftests: Add selftest for KVM statistics data binary interface
Documentation/virt/kvm/api.rst | 174 +++++++++++++-
arch/arm64/include/asm/kvm_host.h | 9 +-
arch/arm64/kvm/guest.c | 46 +++-
arch/mips/include/asm/kvm_host.h | 9 +-
arch/mips/kvm/mips.c | 71 +++++-
arch/powerpc/include/asm/kvm_host.h | 9 +-
arch/powerpc/kvm/book3s.c | 72 +++++-
arch/powerpc/kvm/book3s_hv.c | 12 +-
arch/powerpc/kvm/book3s_pr.c | 2 +-
arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
arch/powerpc/kvm/booke.c | 67 +++++-
arch/s390/include/asm/kvm_host.h | 9 +-
arch/s390/kvm/kvm-s390.c | 137 ++++++++++-
arch/x86/include/asm/kvm_host.h | 9 +-
arch/x86/kvm/x86.c | 75 +++++-
include/linux/kvm_host.h | 138 ++++++++++-
include/linux/kvm_types.h | 12 +
include/uapi/linux/kvm.h | 46 ++++
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 3 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/kvm_binary_stats_test.c | 218 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +
virt/kvm/kvm_main.c | 157 ++++++++++++-
24 files changed, 1202 insertions(+), 91 deletions(-)
create mode 100644 tools/testing/selftests/kvm/kvm_binary_stats_test.c
base-commit: c1dc20e254b421a2463da7f053b37d822788224a
--
2.32.0.272.g935e593368-goog
A kernel module + userspace driver to estimate the wakeup latency
caused by going into stop states. The motivation behind this program is
to find significant deviations behind advertised latency and residency
values.
The patchset measures latencies for two kinds of events. IPIs and Timers
As this is a software-only mechanism, there will additional latencies of
the kernel-firmware-hardware interactions. To account for that, the
program also measures a baseline latency on a 100 percent loaded CPU
and the latencies achieved must be in view relative to that.
To achieve this, we introduce a kernel module and expose its control
knobs through the debugfs interface that the selftests can engage with.
The kernel module provides the following interfaces within
/sys/kernel/debug/latency_test/ for,
IPI test:
ipi_cpu_dest = Destination CPU for the IPI
ipi_cpu_src = Origin of the IPI
ipi_latency_ns = Measured latency time in ns
Timeout test:
timeout_cpu_src = CPU on which the timer to be queued
timeout_expected_ns = Timer duration
timeout_diff_ns = Difference of actual duration vs expected timer
Sample output on a POWER9 system is as follows:
# --IPI Latency Test---
# Baseline Average IPI latency(ns): 3114
# Observed Average IPI latency(ns) - Snooze: 3265
# Observed Average IPI latency(ns) - Stop0_lite: 3507
# Observed Average IPI latency(ns) - Stop0: 3739
# Observed Average IPI latency(ns) - Stop2: 3807
# Observed Average IPI latency(ns) - Stop4: 17070
# Observed Average IPI latency(ns) - Stop5: 1038174
#
# --Timeout Latency Test--
# Baseline Average timeout diff(ns): 1420
# Observed Average timeout diff(ns) - Snooze: 1640
# Observed Average timeout diff(ns) - Stop0_lite: 1764
# Observed Average timeout diff(ns) - Stop0: 1715
# Observed Average timeout diff(ns) - Stop2: 1845
# Observed Average timeout diff(ns) - Stop4: 16581
# Observed Average timeout diff(ns) - Stop5: 939977
Pratik R. Sampat (2):
powerpc/cpuidle: Extract IPI based and timer based wakeup latency from
idle states
powerpc/selftest: Add support for cpuidle latency measurement
arch/powerpc/kernel/Makefile | 1 +
arch/powerpc/kernel/test-cpuidle_latency.c | 157 +++++++
lib/Kconfig.debug | 10 +
tools/testing/selftests/powerpc/Makefile | 1 +
.../powerpc/cpuidle_latency/.gitignore | 2 +
.../powerpc/cpuidle_latency/Makefile | 6 +
.../cpuidle_latency/cpuidle_latency.sh | 419 ++++++++++++++++++
.../powerpc/cpuidle_latency/settings | 1 +
8 files changed, 597 insertions(+)
create mode 100644 arch/powerpc/kernel/test-cpuidle_latency.c
create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/.gitignore
create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/Makefile
create mode 100755 tools/testing/selftests/powerpc/cpuidle_latency/cpuidle_latency.sh
create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/settings
--
2.17.1
Use ARRAY_SIZE instead of dividing sizeof array with sizeof an
element.
Clean up the following coccicheck warning:
./tools/testing/selftests/x86/syscall_numbering.c:316:35-36: WARNING:
Use ARRAY_SIZE.
Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
---
Changes in v2:
-Add ARRAY_SIZE definition.
tools/testing/selftests/x86/syscall_numbering.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/syscall_numbering.c b/tools/testing/selftests/x86/syscall_numbering.c
index 9915917..ef30218 100644
--- a/tools/testing/selftests/x86/syscall_numbering.c
+++ b/tools/testing/selftests/x86/syscall_numbering.c
@@ -40,6 +40,7 @@
#define X32_WRITEV 516
#define X32_BIT 0x40000000
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
static int nullfd = -1; /* File descriptor for /dev/null */
static bool with_x32; /* x32 supported on this kernel? */
@@ -313,7 +314,7 @@ static void test_syscall_numbering(void)
* The MSB is supposed to be ignored, so we loop over a few
* to test that out.
*/
- for (size_t i = 0; i < sizeof(msbs)/sizeof(msbs[0]); i++) {
+ for (size_t i = 0; i < ARRAY_SIZE(msbs); i++) {
int msb = msbs[i];
run("Checking system calls with msb = %d (0x%x)\n",
msb, msb);
--
1.8.3.1
Hi,
This patch converts existing UUID runtime test to use KUnit framework.
Below, there's a comparison between the old output format and the new
one. Keep in mind that even if KUnit seems very verbose, this is the
corner case where _every_ test has failed.
* This is how the current output looks like in success:
test_uuid: all 18 tests passed
* And when it fails:
test_uuid: conversion test #1 failed on LE data: 'c33f4995-3701-450e-9fbf-206a2e98e576'
test_uuid: cmp test #2 failed on LE data: 'c33f4995-3701-450e-9fbf-206a2e98e576'
test_uuid: cmp test #2 actual data: 'c33f4995-3701-450e-9fbf-206a2e98e576'
test_uuid: conversion test #3 failed on BE data: 'c33f4995-3701-450e-9fbf-206a2e98e576'
test_uuid: cmp test #4 failed on BE data: 'c33f4995-3701-450e-9fbf-206a2e98e576'
test_uuid: cmp test #4 actual data: 'c33f4995-3701-450e-9fbf-206a2e98e576'
test_uuid: conversion test #5 failed on LE data: '64b4371c-77c1-48f9-8221-29f054fc023b'
test_uuid: cmp test #6 failed on LE data: '64b4371c-77c1-48f9-8221-29f054fc023b'
test_uuid: cmp test #6 actual data: '64b4371c-77c1-48f9-8221-29f054fc023b'
test_uuid: conversion test #7 failed on BE data: '64b4371c-77c1-48f9-8221-29f054fc023b'
test_uuid: cmp test #8 failed on BE data: '64b4371c-77c1-48f9-8221-29f054fc023b'
test_uuid: cmp test #8 actual data: '64b4371c-77c1-48f9-8221-29f054fc023b'
test_uuid: conversion test #9 failed on LE data: '0cb4ddff-a545-4401-9d06-688af53e7f84'
test_uuid: cmp test #10 failed on LE data: '0cb4ddff-a545-4401-9d06-688af53e7f84'
test_uuid: cmp test #10 actual data: '0cb4ddff-a545-4401-9d06-688af53e7f84'
test_uuid: conversion test #11 failed on BE data: '0cb4ddff-a545-4401-9d06-688af53e7f84'
test_uuid: cmp test #12 failed on BE data: '0cb4ddff-a545-4401-9d06-688af53e7f84'
test_uuid: cmp test #12 actual data: '0cb4ddff-a545-4401-9d06-688af53e7f84'
test_uuid: negative test #13 passed on wrong LE data: 'c33f4995-3701-450e-9fbf206a2e98e576 '
test_uuid: negative test #14 passed on wrong BE data: 'c33f4995-3701-450e-9fbf206a2e98e576 '
test_uuid: negative test #15 passed on wrong LE data: '64b4371c-77c1-48f9-8221-29f054XX023b'
test_uuid: negative test #16 passed on wrong BE data: '64b4371c-77c1-48f9-8221-29f054XX023b'
test_uuid: negative test #17 passed on wrong LE data: '0cb4ddff-a545-4401-9d06-688af53e'
test_uuid: negative test #18 passed on wrong BE data: '0cb4ddff-a545-4401-9d06-688af53e'
test_uuid: failed 18 out of 18 tests
* Now, here's how it looks like with KUnit:
======== [PASSED] uuid ========
[PASSED] uuid_correct_be
[PASSED] uuid_correct_le
[PASSED] uuid_wrong_be
[PASSED] uuid_wrong_le
* And if every test fail with KUnit:
======== [FAILED] uuid ========
[FAILED] uuid_correct_be
# uuid_correct_be: ASSERTION FAILED at lib/test_uuid.c:57
Expected uuid_parse(data->uuid, &be) == 1, but
uuid_parse(data->uuid, &be) == 0
failed to parse 'c33f4995-3701-450e-9fbf-206a2e98e576'
# uuid_correct_be: not ok 1 - c33f4995-3701-450e-9fbf-206a2e98e576
# uuid_correct_be: ASSERTION FAILED at lib/test_uuid.c:57
Expected uuid_parse(data->uuid, &be) == 1, but
uuid_parse(data->uuid, &be) == 0
failed to parse '64b4371c-77c1-48f9-8221-29f054fc023b'
# uuid_correct_be: not ok 2 - 64b4371c-77c1-48f9-8221-29f054fc023b
# uuid_correct_be: ASSERTION FAILED at lib/test_uuid.c:57
Expected uuid_parse(data->uuid, &be) == 1, but
uuid_parse(data->uuid, &be) == 0
failed to parse '0cb4ddff-a545-4401-9d06-688af53e7f84'
# uuid_correct_be: not ok 3 - 0cb4ddff-a545-4401-9d06-688af53e7f84
not ok 1 - uuid_correct_be
[FAILED] uuid_correct_le
# uuid_correct_le: ASSERTION FAILED at lib/test_uuid.c:46
Expected guid_parse(data->uuid, &le) == 1, but
guid_parse(data->uuid, &le) == 0
failed to parse 'c33f4995-3701-450e-9fbf-206a2e98e576'
# uuid_correct_le: not ok 1 - c33f4995-3701-450e-9fbf-206a2e98e576
# uuid_correct_le: ASSERTION FAILED at lib/test_uuid.c:46
Expected guid_parse(data->uuid, &le) == 1, but
guid_parse(data->uuid, &le) == 0
failed to parse '64b4371c-77c1-48f9-8221-29f054fc023b'
# uuid_correct_le: not ok 2 - 64b4371c-77c1-48f9-8221-29f054fc023b
# uuid_correct_le: ASSERTION FAILED at lib/test_uuid.c:46
Expected guid_parse(data->uuid, &le) == 1, but
guid_parse(data->uuid, &le) == 0
failed to parse '0cb4ddff-a545-4401-9d06-688af53e7f84'
# uuid_correct_le: not ok 3 - 0cb4ddff-a545-4401-9d06-688af53e7f84
not ok 2 - uuid_correct_le
[FAILED] uuid_wrong_be
# uuid_wrong_be: ASSERTION FAILED at lib/test_uuid.c:77
Expected uuid_parse(*data, &be) == 0, but
uuid_parse(*data, &be) == -22
parsing of 'c33f4995-3701-450e-9fbf206a2e98e576 ' should've failed
# uuid_wrong_be: not ok 1 - c33f4995-3701-450e-9fbf206a2e98e576
# uuid_wrong_be: ASSERTION FAILED at lib/test_uuid.c:77
Expected uuid_parse(*data, &be) == 0, but
uuid_parse(*data, &be) == -22
parsing of '64b4371c-77c1-48f9-8221-29f054XX023b' should've failed
# uuid_wrong_be: not ok 2 - 64b4371c-77c1-48f9-8221-29f054XX023b
# uuid_wrong_be: ASSERTION FAILED at lib/test_uuid.c:77
Expected uuid_parse(*data, &be) == 0, but
uuid_parse(*data, &be) == -22
parsing of '0cb4ddff-a545-4401-9d06-688af53e' should've failed
# uuid_wrong_be: not ok 3 - 0cb4ddff-a545-4401-9d06-688af53e
not ok 3 - uuid_wrong_be
[FAILED] uuid_wrong_le
# uuid_wrong_le: ASSERTION FAILED at lib/test_uuid.c:68
Expected guid_parse(*data, &le) == 0, but
guid_parse(*data, &le) == -22
parsing of 'c33f4995-3701-450e-9fbf206a2e98e576 ' should've failed
# uuid_wrong_le: not ok 1 - c33f4995-3701-450e-9fbf206a2e98e576
# uuid_wrong_le: ASSERTION FAILED at lib/test_uuid.c:68
Expected guid_parse(*data, &le) == 0, but
guid_parse(*data, &le) == -22
parsing of '64b4371c-77c1-48f9-8221-29f054XX023b' should've failed
# uuid_wrong_le: not ok 2 - 64b4371c-77c1-48f9-8221-29f054XX023b
# uuid_wrong_le: ASSERTION FAILED at lib/test_uuid.c:68
Expected guid_parse(*data, &le) == 0, but
guid_parse(*data, &le) == -22
parsing of '0cb4ddff-a545-4401-9d06-688af53e' should've failed
# uuid_wrong_le: not ok 3 - 0cb4ddff-a545-4401-9d06-688af53e
not ok 4 - uuid_wrong_le
Changes from v1:
- Test suite name: uuid_test -> uuid
- Config name: TEST_UUID -> UUID_KUNIT_TEST
- Config entry in the Kconfig file left where it is
- Converted tests to use _MSG variant
André Almeida (1):
lib: Convert UUID runtime test to KUnit
lib/Kconfig.debug | 11 +++-
lib/Makefile | 2 +-
lib/test_uuid.c | 137 +++++++++++++++++++---------------------------
3 files changed, 67 insertions(+), 83 deletions(-)
--
2.31.1
Hi,
This patch series introduces the futex2 syscalls.
* What happened to the current futex()?
For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept then, given the
multiplexed interface full of legacy features and tricky to do big
changes. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable
way. Considering that, it's not possible to merge new features into the
current futex.
** The NUMA problem
At the current implementation, all futex kernel side infrastructure is
stored on a single node. Given that, all futex() calls issued by
processors that aren't located on that node will have a memory access
penalty when doing it.
** The 32bit sized futex problem
Futexes are used to implement atomic operations in userspace.
Supporting 8, 16, 32 and 64 bit sized futexes allows user libraries to
implement all those sizes in a performant way. Thanks Boost devs for
feedback: https://lists.boost.org/Archives/boost/2021/05/251508.php
Embedded systems or anything with memory constrains could benefit of
using smaller sizes for the futex userspace integer.
** The wait on multiple problem
The use case lies in the Wine implementation of the Windows NT interface
WaitMultipleObjects. This Windows API function allows a thread to sleep
waiting on the first of a set of event sources (mutexes, timers, signal,
console input, etc) to signal. Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side, and quickly go to sleep on the
consumer side is essential for good performance of those running over Wine.
[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.…
* The solution
As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
is required to solve this, which must be designed with those features in
mind. futex2() is that interface. As opposed to the current multiplexed
interface, the new one should have one syscall per operation. This will
allow the maintainability of the API if it gets extended, and will help
users with type checking of arguments.
In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.
[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-…
* The interface
The new interface can be seen in details in the following patches, but
this is a high level summary of what the interface can do:
- Supports wake/wait semantics, as in futex()
- Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with
individual flags for each address
- Supports waiting for a vector of futexes, using a new syscall named
futex_waitv()
- Supports variable sized futexes (8bits, 16bits, 32bits and 64bits)
- Supports NUMA-awareness operations, where the user can specify on
which memory node would like to operate
* Implementation
The internal implementation follows a similar design to the original futex.
Given that we want to replicate the same external behavior of current
futex, this should be somewhat expected. For some functions, like the
init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so instead of exposing the
original function as a public function since in that way we can freely
modify our implementation if required, without any impact on old futex.
Also, the comments precisely describes the details and corner cases of
the implementation.
Each patch contains a brief description of implementation, but patch 7
"docs: locking: futex2: Add documentation" adds a more complete document
about it.
* The patchset
This patchset can be also found at my git tree:
https://gitlab.collabora.com/tonyk/linux/-/tree/futex2-dev
- Patch 1: Implements wait/wake, and the basics foundations of futex2
- Patches 2-5: Implement the remaining features (shared, waitv,
requeue, sizes).
- Patch 6: Adds the x86_x32 ABI handling. I kept it in a separated
patch since I'm not sure if x86_x32 is still a thing, or if it should
return -ENOSYS.
- Patch 7: Add a documentation file which details the interface and
the internal implementation.
- Patches 8-14: Selftests for all operations along with perf
support for futex2.
- Patch 15: While working on porting glibc for futex2, I found out
that there's a futex_wake() call at the user thread exit path, if
that thread was created with clone(..., CLONE_CHILD_SETTID, ...). In
order to make pthreads work with futex2, it was required to add
this patch. Note that this is more a proof-of-concept of what we
will need to do in future, rather than part of the interface and
shouldn't be merged as it is.
* Testing:
This patchset provides selftests for each operation and their flags.
Along with that, the following work was done:
** Stability
To stress the interface in "real world scenarios":
- glibc[4]: nptl's low level locking was modified to use futex2 API
(except for robust and PI things). All relevant nptl/ tests passed.
- Wine[5]: Proton/Wine was modified in order to use futex2() for the
emulation of Windows NT sync mechanisms based on futex, called "fsync".
Triple-A games with huge CPU's loads and tons of parallel jobs worked
as expected when compared with the previous FUTEX_WAIT_MULTIPLE
implementation at futex(). Some games issue 42k futex2() calls
per second.
- Full GNU/Linux distro: I installed the modified glibc in my host
machine, so all pthread's programs would use futex2(). After tweaking
systemd[6] to allow futex2() calls at seccomp, everything worked as
expected (web browsers do some syscall sandboxing and need some
configuration as well).
- perf: The perf benchmarks tests can also be used to stress the
interface, and they can be found in this patchset.
** Performance
- For comparing futex() and futex2() performance, I used the artificial
benchmarks implemented at perf (wake, wake-parallel, hash and
requeue). The setup was 200 runs for each test and using 8, 80, 800,
8000 for the number of threads, Note that for this test, I'm not using
patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
at "The patchset" section.
- For the first three ones, I measured an average of 4% gain in
performance. This is not a big step, but it shows that the new
interface is at least comparable in performance with the current one.
- For requeue, I measured an average of 21% decrease in performance
compared to the original futex implementation. This is expected given
the new design with individual flags. The performance trade-offs are
explained at patch 4 ("futex2: Implement requeue operation").
[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
[6] https://gitlab.collabora.com/tonyk/systemd
* FAQ
** "Where's the code for NUMA?"
NUMA will be implemented in future versions of this patch, and like the
size feature, it will require work with users of futex to get feedback
about it.
** "Where's the PI/robust stuff?"
As said by Peter Zijlstra at [3], all those new features are related to
the "simple" futex interface, that doesn't use PI or robust. Do we want
to have this complexity at futex2() and if so, should it be part of
this patchset or can it be future work?
Thanks,
André
* Changelog
Changes from v3:
- Implemented variable sized futexes
v3: https://lore.kernel.org/lkml/20210427231248.220501-1-andrealmeid@collabora.…
Changes from v2:
- API now supports 64bit futexes, in addition to 8, 16 and 32.
- This API change will break the glibc[4] and Proton[5] ports for now.
- Refactored futex2_wait and futex2_waitv selftests
v2: https://lore.kernel.org/lkml/20210304004219.134051-1-andrealmeid@collabora.…
Changes from v1:
- Unified futex_set_timer_and_wait and __futex_wait code
- Dropped _carefull from linked list function calls
- Fixed typos on docs patch
- uAPI flags are now added as features are introduced, instead of all flags
in patch 1
- Removed struct futex_single_waiter in favor of an anon struct
v1: https://lore.kernel.org/lkml/20210215152404.250281-1-andrealmeid@collabora.…
André Almeida (15):
futex2: Implement wait and wake functions
futex2: Add support for shared futexes
futex2: Implement vectorized wait
futex2: Implement requeue operation
futex2: Implement support for different futex sizes
futex2: Add compatibility entry point for x86_x32 ABI
docs: locking: futex2: Add documentation
selftests: futex2: Add wake/wait test
selftests: futex2: Add timeout test
selftests: futex2: Add wouldblock test
selftests: futex2: Add waitv test
selftests: futex2: Add requeue test
selftests: futex2: Add futex sizes test
perf bench: Add futex2 benchmark tests
kernel: Enable waitpid() for futex2
Documentation/locking/futex2.rst | 198 +++
Documentation/locking/index.rst | 1 +
MAINTAINERS | 2 +-
arch/arm/tools/syscall.tbl | 4 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 8 +
arch/x86/entry/syscalls/syscall_32.tbl | 4 +
arch/x86/entry/syscalls/syscall_64.tbl | 4 +
fs/inode.c | 1 +
include/linux/compat.h | 26 +
include/linux/fs.h | 1 +
include/linux/syscalls.h | 17 +
include/uapi/asm-generic/unistd.h | 14 +-
include/uapi/linux/futex.h | 34 +
init/Kconfig | 7 +
kernel/Makefile | 1 +
kernel/fork.c | 2 +
kernel/futex2.c | 1289 +++++++++++++++++
kernel/sys_ni.c | 9 +
tools/arch/x86/include/asm/unistd_64.h | 12 +
tools/include/uapi/asm-generic/unistd.h | 11 +-
.../arch/x86/entry/syscalls/syscall_64.tbl | 4 +
tools/perf/bench/bench.h | 4 +
tools/perf/bench/futex-hash.c | 24 +-
tools/perf/bench/futex-requeue.c | 57 +-
tools/perf/bench/futex-wake-parallel.c | 41 +-
tools/perf/bench/futex-wake.c | 37 +-
tools/perf/bench/futex.h | 47 +
tools/perf/builtin-bench.c | 18 +-
.../selftests/futex/functional/.gitignore | 4 +
.../selftests/futex/functional/Makefile | 7 +-
.../futex/functional/futex2_requeue.c | 164 +++
.../selftests/futex/functional/futex2_sizes.c | 146 ++
.../selftests/futex/functional/futex2_wait.c | 195 +++
.../selftests/futex/functional/futex2_waitv.c | 154 ++
.../futex/functional/futex_wait_timeout.c | 58 +-
.../futex/functional/futex_wait_wouldblock.c | 33 +-
.../testing/selftests/futex/functional/run.sh | 6 +
.../selftests/futex/include/futex2test.h | 113 ++
39 files changed, 2707 insertions(+), 52 deletions(-)
create mode 100644 Documentation/locking/futex2.rst
create mode 100644 kernel/futex2.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_sizes.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
create mode 100644 tools/testing/selftests/futex/include/futex2test.h
--
2.31.1