From: SeongJae Park <sjpark(a)amazon.de>
When running a test program, 'run_one()' checks if the program has the
execution permission and fails if it doesn't. However, it's easy to
mistakenly missing the permission, as some common tools like 'diff'
don't support the permission change well. Compared to that, making
mistakes in the test program's path would only rare, as those are
explicitly listed in 'TEST_PROGS'. Therefore, it might make more sense
to resolve the situation on our own and run the program.
For the reason, this commit makes the test program runner function to
still print the warning message but try parsing the interpreter of the
program and explicitly run it with the interpreter, in the case.
Suggested-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: SeongJae Park <sjpark(a)amazon.de>
Changes from v1
- Parse and use the interpreter instead of changing the file
tools/testing/selftests/kselftest/runner.sh | 28 +++++++++++++--------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/kselftest/runner.sh b/tools/testing/selftests/kselftest/runner.sh
index cc9c846585f0..a9ba782d8ca0 100644
@@ -33,9 +33,9 @@ tap_timeout()
# Make sure tests will time out if utility is available.
if [ -x /usr/bin/timeout ] ; then
- /usr/bin/timeout --foreground "$kselftest_timeout" "$1"
+ /usr/bin/timeout --foreground "$kselftest_timeout" $1
@@ -65,17 +65,25 @@ run_one()
TEST_HDR_MSG="selftests: $DIR: $BASENAME_TEST"
echo "# $TEST_HDR_MSG"
- if [ ! -x "$TEST" ]; then
- echo -n "# Warning: file $TEST is "
- if [ ! -e "$TEST" ]; then
- echo "missing!"
- echo "not executable, correct this."
+ if [ ! -e "$TEST" ]; then
+ echo "# Warning: file $TEST is missing!"
echo "not ok $test_num $TEST_HDR_MSG"
+ if [ ! -x "$TEST" ]; then
+ echo "# Warning: file $TEST is not executable"
+ if [ $(head -n 1 "$TEST" | cut -c -2) = "#!" ]
+ interpreter=$(head -n 1 "$TEST" | cut -c 3-)
+ cmd="$interpreter ./$BASENAME_TEST"
+ echo "not ok $test_num $TEST_HDR_MSG"
cd `dirname $TEST` > /dev/null
- ((((( tap_timeout ./$BASENAME_TEST 2>&1; echo $? >&3) |
+ ((((( tap_timeout "$cmd" 2>&1; echo $? >&3) |
tap_prefix >&4) 3>&1) |
(read xs; exit $xs)) 4>>"$logfile" &&
echo "ok $test_num $TEST_HDR_MSG") ||
Real-time setups try hard to ensure proper isolation between time
critical applications and e.g. network processing performed by the
network stack in softirq and RPS is used to move the softirq
activity away from the isolated core.
If the network configuration is dynamic, with netns and devices
routinely created at run-time, enforcing the correct RPS setting
on each newly created device allowing to transient bad configuration
These series try to address the above, introducing a new
sysctl knob: rps_default_mask. The new sysctl entry allows
configuring a systemwide RPS mask, to be enforced since receive
queue creation time without any fourther per device configuration
Additionally, a simple self-test is introduced to check the
v1 -> v2:
- fix sparse warning in patch 2/3
Paolo Abeni (3):
net/sysctl: factor-out netdev_rx_queue_set_rps_mask() helper
net/core: introduce default_rps_mask netns attribute
self-tests: introduce self-tests for RPS default mask
Documentation/admin-guide/sysctl/net.rst | 6 ++
include/linux/netdevice.h | 1 +
net/core/net-sysfs.c | 73 +++++++++++--------
net/core/sysctl_net_core.c | 58 +++++++++++++++
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 3 +
.../testing/selftests/net/rps_default_mask.sh | 57 +++++++++++++++
7 files changed, 169 insertions(+), 30 deletions(-)
create mode 100755 tools/testing/selftests/net/rps_default_mask.sh
From: Vincent Cheng <vincent.cheng.xh(a)renesas.com>
This series adds adjust phase to the PTP Hardware Clock device interface.
Some PTP hardware clocks have a write phase mode that has
a built-in hardware filtering capability. The write phase mode
utilizes a phase offset control word instead of a frequency offset
control word. Add adjust phase function to take advantage of this
Changes since v1:
- As suggested by Richard Cochran:
1. ops->adjphase is new so need to check for non-null function pointer.
2. Kernel coding style uses lower_case_underscores.
3. Use existing PTP clock API for delayed worker.
Vincent Cheng (3):
ptp: Add adjphase function to support phase offset control.
ptp: Add adjust_phase to ptp_clock_caps capability.
ptp: ptp_clockmatrix: Add adjphase() to support PHC write phase mode.
drivers/ptp/ptp_chardev.c | 1 +
drivers/ptp/ptp_clock.c | 3 ++
drivers/ptp/ptp_clockmatrix.c | 92 +++++++++++++++++++++++++++++++++++
drivers/ptp/ptp_clockmatrix.h | 8 ++-
include/linux/ptp_clock_kernel.h | 6 ++-
include/uapi/linux/ptp_clock.h | 4 +-
tools/testing/selftests/ptp/testptp.c | 6 ++-
7 files changed, 114 insertions(+), 6 deletions(-)
Since the RC1 of kernel 5.13, -smp 2 and -smp 4 don't work with a
virtual e5500 QEMU KVM-HV machine anymore. 
I see in the serial console, that the uImage doesn't load. I use the
following QEMU command for booting:
qemu-system-ppc64 -M ppce500 -cpu e5500 -enable-kvm -m 1024 -kernel
uImage -drive format=raw,file=MintPPC32-X5000.img,index=0,if=virtio
-netdev user,id=mynet0 -device virtio-net,netdev=mynet0 -append "rw
root=/dev/vda" -device virtio-vga -device virtio-mouse-pci -device
virtio-keyboard-pci -device pci-ohci,id=newusb -device
usb-audio,bus=newusb.0 -smp 4
The kernels boot without KVM-HV.
Summary for KVM-HV:
-smp 1 -> works
-smp 2 -> doesn't work
-smp 3 -> works
-smp 4 -> doesn't work
I used -smp 4 before the RC1 of kernel 5.13 because my FSL P5040 BookE
machine  has 4 cores.
Does this patch solve this issue? 
I would like to publish two debug features which were needed for other stuff
I work on.
One is the reworked lx-symbols script which now actually works on at least
gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel
for some reason, not related to this patch) and upstream qemu.
The other feature is the ability to trap all guest exceptions (on SVM for now)
and see them in kvmtrace prior to potential merge to double/triple fault.
This can be very useful and I already had to manually patch KVM a few
times for this.
I will, once time permits, implement this feature on Intel as well.
* Some more refactoring and workarounds for lx-symbols script
* added KVM_GUESTDBG_BLOCKIRQ flag to enable 'block interrupts on
single step' together with KVM_CAP_SET_GUEST_DEBUG2 capability
to indicate which guest debug flags are supported.
This is a replacement for unconditional block of interrupts on single
step that was done in previous version of this patch set.
Patches to qemu to use that feature will be sent soon.
* Reworked the the 'intercept all exceptions for debug' feature according
to the review feedback:
- renamed the parameter that enables the feature and
moved it to common kvm module.
(only SVM part is currently implemented though)
- disable the feature for SEV guests as was suggested during the review
- made the vmexit table const again, as was suggested in the review as well.
* Modified a selftest to cover the KVM_GUESTDBG_BLOCKIRQ
* Rebased on kvm/queue
Maxim Levitsky (6):
KVM: SVM: split svm_handle_invalid_exit
KVM: x86: add force_intercept_exceptions_mask
KVM: SVM: implement force_intercept_exceptions_mask
scripts/gdb: rework lx-symbols gdb script
KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ
KVM: selftests: test KVM_GUESTDBG_BLOCKIRQ
Documentation/virt/kvm/api.rst | 1 +
arch/x86/include/asm/kvm_host.h | 5 +-
arch/x86/include/uapi/asm/kvm.h | 1 +
arch/x86/kvm/svm/svm.c | 87 +++++++-
arch/x86/kvm/svm/svm.h | 6 +-
arch/x86/kvm/x86.c | 12 +-
arch/x86/kvm/x86.h | 2 +
kernel/module.c | 8 +-
scripts/gdb/linux/symbols.py | 203 ++++++++++++------
.../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++-
10 files changed, 266 insertions(+), 83 deletions(-)
We are looking to further standardise the output format used by kernel
test frameworks like kselftest and KUnit. Thus far we have used the
TAP (Test Anything Protocol) specification, but it has been extended
in many different ways, so we would like to agree on a common "Kernel
TAP" (KTAP) format to resolve these differences. Thus, below is a
draft of a specification of KTAP. Note that this specification is
largely based on the current format of test results for KUnit tests.
Additionally, this specification was heavily inspired by the KTAP
specification draft by Tim Bird
However, there are some notable differences to his specification. One
such difference is the format of nested tests is more fully specified
in the following specification. However, they are specified in a way
which may not be compatible with many kselftest nested tests.
Specification of KTAP
TAP, or the Test Anything Protocol is a format for specifying test
results used by a number of projects. It's website and specification
are found at: https://testanything.org/. The Linux Kernel uses TAP
output for test results. However, KUnit (and other Kernel testing
frameworks such as kselftest) have some special needs for test results
which don't gel perfectly with the original TAP specification. Thus, a
"Kernel TAP" (KTAP) format is specified to extend and alter TAP to
support these use-cases.
KTAP Output consists of 5 major elements (all line-based):
- The version line
- Plan lines
- Test case result lines
- Diagnostic lines
- A bail out line
An important component in this specification of KTAP is the
specification of the format of nested tests. This can be found in the
section on nested tests below.
The version line
The first line of KTAP output must be the version line. As this
specification documents the first version of KTAP, the recommended
version line is "KTAP version 1". However, since all kernel testing
frameworks use TAP version lines, "TAP version 14" and "TAP version
13" are all acceptable version lines. Version lines with other
versions of TAP or KTAP will not cause the parsing of the test results
to fail but it will produce an error.
Plan lines must follow the format of "1..N" where N is the number of
subtests. The second line of KTAP output must be a plan line, which
indicates the number of tests at the highest level, such that the
tests do not have a parent. Also, in the instance of a test having
subtests, the second line of the test after the subtest header must be
a plan line which indicates the number of subtests within that test.
Test case result lines
Test case result lines must have the format:
<result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
The result can be either "ok", which indicates the test case passed,
or "not ok", which indicates that the test case failed.
The number represents the number of the test case or suite being
performed. The first test case or suite must have the number 1 and the
number must increase by 1 for each additional test case or result at
the same level and within the same testing suite.
The "-" character is optional.
The description is a description of the test, generally the name of
the test, and can be any string of words (can't include #). The
description is optional.
The directive is used to indicate if a test was skipped. The format
for the directive is: "# SKIP [<skip_description>]". The
skip_description is optional and can be any string of words to
describe why the test was skipped. The result of the test case result
line can be either "ok" or "not ok" if the skip directive is used.
Finally, note that TAP 14 specification includes TODO directives but
these are not supported for KTAP.
Examples of test case result lines:
ok 1 - test_case_name
Test was skipped:
not ok 1 - test_case_name # SKIP test_case_name should be skipped
not_ok 1 - test_case_name
Diagnostic lines are used for description of testing operations.
Diagnostic lines are generally formatted as "#
<diagnostic_description>", where the description can be any string.
However, in practice, diagnostic lines are all lines that don't follow
the format of any other KTAP line format. Diagnostic lines can be
anywhere in the test output after the first two lines. There are a few
special diagnostic lines. Diagnostic lines of the format "# Subtest:
<test_name>" indicate the start of a test with subtests. Also,
diagnostic lines of the format "# <test_name>: <description>" refer to
a specific test and tend to occur before the test result line of that
test but are optional.
Bail out line
A bail out line can occur anywhere in the KTAP output and will
indicate that a test has crashed. The format of a bail out line is
"Bail out! [<description>]", where the description can give
information on why the bail out occurred and can be any string.
The new specification for KTAP will support an arbitrary number of
nested subtests. Thus, tests can now have subtests and those subtests
can have subtests. This can be useful to further categorize tests and
organize test results.
The new required format for a test with subtests consists of: a
subtest header line, a plan line, all subtests, and a final test
The first line of the test must be the subtest header line with the
format: "# Subtest: <test_name>".
The second line of the test must be the plan line, which is formatted
as "1..N", where N is the number of subtests.
Following the plan line, all lines pertaining to the subtests will follow.
Finally, the last line of the test is a final test result line with
the format: "(ok|not ok) <number> [-] [<description>] [<directive>]
[<diagnostic data>]", which follows the same format as the general
test result lines described in this section. The result line should
indicate the result of the subtests. Thus, if one of the subtests
fail, the test should fail. The description in the final test result
line should match the name of the test in the subtest header.
An example format:
KTAP version 1
# Subtest: test_suite
ok 1 - test_1
ok 2 - test_2
ok 1 - test_suite
An example format with multiple levels of nested testing:
KTAP version 1
# Subtest: test_suite
# Subtest: sub_test_suite
ok 1 - test_1
ok 2 test_2
ok 1 - sub_test_suite
ok 2 - test
ok 1 - test_suite
In the instance that the plan line is missing, the end of the test
will be denoted by the final result line containing a description that
matches the name of the test given in the subtest header. Note that
thus, if the plan line is missing and one of the subtests have a
matching name to the test suite this will cause errors.
Lastly, indentation is also recommended for improved readability.
Major differences between TAP 14 and KTAP specification
Note that the major differences between TAP 14 and KTAP specification:
- yaml and json are not allowed in diagnostic messages
- TODO directive not allowed
- KTAP allows for an arbitrary number of tests to be nested with
specified nested test format
Example of KTAP
KTAP version 1
# Subtest: test_suite
# Subtest: sub_test_suite
ok 1 - test_1
ok 2 test_2
ok 1 - sub_test_suite
ok 1 - test_suite
Note on incompatibilities with kselftests
To my knowledge, the above specification seems to generally accept the
TAP format of many non-nested test results of kselftests.
An example of a common kselftests TAP format for non-nested test
results that are accepted by the above specification:
TAP version 13
# selftests: vDSO: vdso_test_gettimeofday
# The time is 1628024856.096879
ok 1 selftests: vDSO: vdso_test_gettimeofday
# selftests: vDSO: vdso_test_getcpu
# Could not find __vdso_getcpu
ok 2 selftests: vDSO: vdso_test_getcpu # SKIP
However, one major difference noted with kselftests is the use of more
directives than the "# SKIP" directive. kselftest also supports XPASS
and XFAIL directives. Some additional examples found in kselftests:
not ok 5 selftests: netfilter: nft_concat_range.sh # TIMEOUT 45 seconds
not ok 45 selftests: kvm: kvm_binary_stats_test # exit=127
Should the specification be expanded to include these directives?
However, the general format for kselftests with nested test results
seems to differ from the above specification. It seems that a general
format for nested tests is as follows:
TAP version 13
# selftests: membarrier: membarrier_test_single_thread
# TAP version 13
# ok 1 sys_membarrier available
# ok 2 sys membarrier invalid command test: command = -1, flags = 0,
errno = 22. Failed as expected
ok 1 selftests: membarrier: membarrier_test_single_thread
# selftests: membarrier: membarrier_test_multi_thread
# TAP version 13
# ok 1 sys_membarrier available
# ok 2 sys membarrier invalid command test: command = -1, flags = 0,
errno = 22. Failed as expected
ok 2 selftests: membarrier: membarrier_test_multi_thread
The major differences here, that do not match the above specification,
are use of "# " as an indentation and using a TAP version line to
denote a new test with subtests rather than the subtest header line
described above. If these are widely utilized formats in kselftests,
should we include both versions in the specification or should we
attempt to agree on a single format for nested tests? I personally
believe we should try to agree on a single format for nested tests.
This would allow for a cleaner specification of KTAP and would reduce
So what do people think about the above specification?
How should we handle the differences with kselftests?
If this specification is accepted, where should the specification be documented?
User Interrupts Introduction
User Interrupts (Uintr) is a hardware technology that enables delivering
interrupts directly to user space.
Today, virtually all communication across privilege boundaries happens by going
through the kernel. These include signals, pipes, remote procedure calls and
hardware interrupt based notifications. User interrupts provide the foundation
for more efficient (low latency and low CPU utilization) versions of these
common operations by avoiding transitions through the kernel.
In the User Interrupts hardware architecture, a receiver is always expected to
be a user space task. However, a user interrupt can be sent by another user
space task, kernel or an external source (like a device).
In addition to the general infrastructure to receive user interrupts, this
series introduces a single source: interrupts from another user task. These
are referred to as User IPIs.
The first implementation of User IPIs will be in the Intel processor code-named
Sapphire Rapids. Refer Chapter 11 of the Intel Architecture instruction set
extensions for details of the hardware architecture .
Series-reviewed-by: Tony Luck <tony.luck(a)intel.com>
Main goals of this RFC
- Introduce this upcoming technology to the community.
This cover letter includes a hardware architecture summary along with the
software architecture and kernel design choices. This post is a bit long as a
result. Hopefully, it helps answer more questions than it creates :) I am also
planning to talk about User Interrupts next week at the LPC Kernel summit.
- Discuss potential use cases.
We are starting to look at actual usages and libraries (like libevent and
liburing) that can take advantage of this technology. Unfortunately, we
don't have much to share on this right now. We need some help from the
community to identify usages that can benefit from this. We would like to make
sure the proposed APIs work for the eventual consumers.
- Get early feedback on the software architecture.
We are hoping to get some feedback on the direction of overall software
architecture - starting with User IPI, extending it for kernel-to-user
interrupt notifications and external interrupts in the future.
- Discuss some of the main architecture opens.
There is lot of work that still needs to happen to enable this technology. We
are looking for some input on future patches that would be of interest. Here
are some of the big opens that we are looking to resolve.
* Should Uintr interrupt all blocking system calls like sleep(), read(),
poll(), etc? If so, should we implement an SA_RESTART type of mechanism
similar to signals? - Refer Blocking for interrupts section below.
* Should the User Interrupt Target table (UITT) be shared between threads of a
multi-threaded application or maybe even across processes? - Refer Sharing
the UITT section below.
Why care about this? - Micro benchmark performance
There is a ~9x or higher performance improvement using User IPI over other IPC
mechanisms for event signaling.
Below is the average normalized latency for a 1M ping-pong IPC notifications
with message size=1.
| IPC type | Relative Latency |
| |(normalized to User IPI) |
| User IPI | 1.0 |
| Signal | 14.8 |
| Eventfd | 9.7 |
| Pipe | 16.3 |
| Domain | 17.3 |
Results have been estimated based on tests on internal hardware with Linux
v5.14 + User IPI patches.
Original benchmark: https://github.com/goldsborough/ipc-bench
Updated benchmark: https://github.com/intel/uintr-ipc-bench/tree/linux-rfc-v1
*Performance varies by use, configuration and other factors.
How it works underneath? - Hardware Summary
User Interrupts is a posted interrupt delivery mechanism. The interrupts are
first posted to a memory location and then delivered to the receiver when they
are running with CPL=3.
Kernel managed architectural data structures
UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector
information and notification state (like an ongoing notification, suppressed
UITT: User Interrupt Target Table - Stores UPID pointer and vector information
for interrupt routing on the sender side. Referred by the senduipi instruction.
The interrupt state of each task is referenced via MSRs which are saved and
restored by the kernel during context switch.
senduipi <index> - send a user IPI to a target task based on the UITT index.
clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
stui - Unmask user interrupts by setting UIF.
testui - Test current value of UIF.
uiret - return from a user interrupt handler.
When a User IPI sender executes 'senduipi <index>', the hardware refers the
UITT table entry pointed by the index and posts the interrupt vector (63-0)
into the receiver's UPID.
If the receiver is running (CPL=3), the sender cpu would send a physical IPI to
the receiver's cpu. On the receiver side this IPI is detected as a User
Interrupt. The User Interrupt handler for the receiver is invoked and the
vector number (63-0) is pushed onto the stack.
Upon execution of 'uiret' in the interrupt handler, the control is transferred
back to instruction that was interrupted.
Refer Chapter 11 of the Intel Architecture instruction set extensions  for
Application interface - Software Architecture
User Interrupts (Uintr) is an opt-in feature (unlike signals). Applications
wanting to use Uintr are expected to register themselves with the kernel using
the Uintr related system calls. A Uintr receiver is always a userspace task. A
Uintr sender can be another userspace task, kernel or a device.
1) A receiver can register/unregister an interrupt handler using the Uintr
receiver related syscalls.
2) A syscall also allows a receiver to register a vector and create a user
interrupt file descriptor - uintr_fd.
uintr_fd = uintr_create_fd(vector, flags)
Uintr can be useful in some of the usages where eventfd or signals are used for
frequent userspace event notifications. The semantics of uintr_fd are somewhat
similar to an eventfd() or the write end of a pipe.
3) Any sender with access to uintr_fd can use it to deliver events (in this
case - interrupts) to a receiver. A sender task can manage its connection with
the receiver using the sender related syscalls based on uintr_fd.
uipi_index = uintr_register_sender(uintr_fd, flags)
Using an FD abstraction provides a secure mechanism to connect with a receiver.
The FD sharing and isolation mechanisms put in place by the kernel would extend
to Uintr as well.
4a) After the initial setup, a sender task can use the SENDUIPI instruction
along with the uipi_index to generate user IPIs without any kernel
If the receiver is running (CPL=3), then the user interrupt is delivered
directly without a kernel transition. If the receiver isn't running the
interrupt is delivered when the receiver gets context switched back. If the
receiver is blocked in the kernel, the user interrupt is delivered to the
kernel which then unblocks the intended receiver to deliver the interrupt.
4b) If the sender is the kernel or a device, the uintr_fd can be passed onto
the related kernel entity to allow them to setup a connection and then generate
a user interrupt for event delivery. <The exact details of this API are still
being worked upon.>
For details of the user interface and associated system calls refer the Uintr
We have also included the same content as patch 1 of this series to make it
easier to review.
Refer the Uintr compiler programming guide  for details on Uintr integration
with GCC and Binutils.
Kernel design choices
Here are some of the reasons and trade-offs for the current design of the APIs.
System call interface
Why a system call interface?: The 2 options we considered are using a char
device at /dev or use system calls (current approach). A syscall approach
avoids exposing a core cpu feature through a driver model. Also, we want to
have a user interrupt FD per vector and share a single common interrupt handler
among all vectors. This seems easier for the kernel and userspace to accomplish
using a syscall based approach.
Data sharing using user interrupts: Uintr doesn't include a mechanism to
share/transmit data. The expectation is applications use existing data sharing
mechanisms to share data and use Uintr only for signaling.
An FD for each vector: A uintr_fd is assigned to each vector to allow fine
grained priority and event management by the receiver. The alternative we
considered was to allocate an FD to the interrupt handler and having that
shared with the sender. However, that approach relies on the sender selecting
the vector and moves the vector priority management to the sender. Also, if
multiple senders want to send unique user interrupts they would need to
coordinate the vector selection amongst them.
Extending the APIs: Currently, the system calls are only extendable using the
flags argument. We can add a variable size struct to some of the syscalls if
Extending existing mechanisms
Uintr can be beneficial in some of the usages where eventfd() or signals are
used. Since Uintr is hardware-dependent, thread-specific and bypasses the
kernel in the fast path, it makes extending existing mechanisms harder.
Main issues with extending signals:
Signal handlers are defined significantly differently than a User interrupt
handler. An application needs to save/restore registers in a user interrupt
handler and call uiret to return from it. Also, signals can be process directed
(or thread directed) but user interrupts are always thread directed.
Comparison of signals with User Interrupts:
| | Signals | User Interrupts |
| Stacks | Has alt stacks | Uses application stack |
| | | (alternate stack option |
| | | not yet enabled) |
| Registers state | Kernel manages incl. | App responsible (Use GCC |
| | FPU/XSTATE area | 'interrupt' attribute for |
| | | general purpose registers)|
| Blocking/Masking | sigprocmask(2)/sa_mask | CLUI instruction (No per |
| | | vector masking) |
| Direction | Uni-directional | Uni-directional |
| Post event | kill(), signal(), | SENDUIPI <index> - index |
| | sigqueue(), etc. | derived from uintr_fd |
| Target | Process-directed or | Thread-directed |
| | thread-directed | |
| Fork/inheritance | Empty signal set | Nothing is inherited |
| Execv | Pending signals preserved | Nothing is inherited |
| Order of delivery | Undetermined | High to low vector numbers|
| for multiple signals| | |
| Handler re-entry | All signals except the | No interrupts can cause |
| | one being handled | handler re-entry. |
| Delivery feedback | 0 or -1 based on whether | No feedback on whether the|
| | the signal was sent | interrupt was sent or |
| | | received. |
Main issues with extending eventfd():
eventfd() has a counter value that is core to the API. User interrupts can't
have an associated counter since the signaling happens at the user level and
the hardware doesn't have a memory counter mechanism. Also, eventfd can be used
for bi-directional signaling where as uintr_fd is uni-directional.
Comparison of eventfd with uintr_fd:
| | Eventfd | uintr_fd (User Interrupt FD) |
| Object | Counter - uint64 | Receiver vector information |
| Post event | write() to eventfd | SENDUIPI <index> - index |
| | | derived from uintr_fd |
| Receive event | read() on eventfd | Implicit - Handler is |
| | | invoked with associated |
| | | vector. |
| Direction | Bi-directional | Uni-directional |
| Data transmitted | Counter - uint64 | None |
| Waiting for events | Poll() family of | No per vector wait. |
| | syscalls | uintr_wait() allows waiting |
| | | for all user interrupts |
User Interrupts is designed as an opt-in feature (unlike signals). The security
model for user interrupts is intended to be similar to eventfd(). The general
idea is that any sender with access to uintr_fd would be able to generate the
associated interrupt vector for the receiver task that created the fd.
The current implementation expects only trusted and cooperating processes to
communicate using user interrupts. Coordination is expected between processes
for a connection teardown. In situations where coordination doesn't happen
(say, due to abrupt process exit), the kernel would end up keeping shared
resources (like UPID) allocated to avoid faults.
Currently, a sender can easily cause a denial of service for the receiver by
generating a storm of user interrupts. A user interrupt handler is invoked with
interrupts disabled, but upon execution of uiret, interrupts get enabled again
by the hardware. This can lead to the handler being invoked again before normal
execution can resume. There isn't a hardware mechanism to mask specific
To enable untrusted processes to communicate, we need to add a per-vector
masking option through another syscall (or maybe IOCTL). However, this can add
some complexity to the kernel code. A vector can only be masked by modifying
the UITT entries at the source. We need to be careful about races while
removing and restoring the UPID from the UITT.
The maximum number of receiver-sender connections would be limited by the
maximum number of open file descriptors and the size of the UITT.
The UITT size is chosen as 4kB fixed size arbitrarily right now. We plan to
make it dynamic and configurable in size. RLIMIT_MEMLOCK or ENOMEM should be
triggered when the size limits have been hit.
Blocking for interrupts
User interrupts are delivered to applications immediately if they are running
in userspace. If a receiver task has blocked in the kernel using the placeholder
uintr_wait() syscall, the task would be woken up to deliver the user interrupt.
However, if the task is blocked due to any other blocking calls like read(),
sleep(), etc; the interrupt will only get delivered when the application gets
scheduled again. We need to consider if applications need to receive User
Interrupts as soon as they are posted (similar to signals) when they are
blocked due to some other reason. Adding this capability would likely make the
kernel implementation more complex.
Interrupting system calls using User Interrupts would also mean we need to
consider an SA_RESTART type of mechanism. We also need to evaluate if some of
the signal handler related semantics in the kernel can be reused for User
Sharing the User Interrupt Target Table (UITT)
The current implementation assigns a unique UITT to each task. This assumes
that User interrupts are used for point-to-point communication between 2 tasks.
Also, this keeps the kernel implementation relatively simple.
However, there are of benefits to sharing the UITT between threads of a
multi-threaded application. One, they would see a consistent view of the UITT.
i.e. SENDUIPI <index> would mean the same on all threads of the application.
Also, each thread doesn't have to register itself using the common uintr_fd.
This would simplify the userspace setup and make efficient use of kernel
memory. The potential downside is that the kernel implementation to allocate,
modify, expand and free the UITT would be more complex.
A similar argument can be made for a set of processes that do a lot of IPC
amongst them. They would prefer to have a shared UITT that lets them target any
process from any process. With the current file descriptor based approach, the
connection setup can be time consuming and somewhat cumbersome. We need to
evaluate if this can be made simpler as well.
Kernel page table isolation (KPTI)
SENDUIPI is a special ring-3 instruction that makes a supervisor mode memory
access to the UPID and UITT memory. The current patches need KPTI to be
disabled for User IPIs to work. To make User IPI work with KPTI, we need to
allocate these structures from a special memory region that has supervisor
access but it is mapped into userspace. The plan is to implement a mechanism
similar to LDT.
Processors that support user interrupts are not affected by Meltdown so the
auto mode of KPTI will default to off. Users who want to force enable KPTI will
need to wait for a later version of this patch series to use user interrupts.
Please let us know if you want the development of these patches to be
prioritized (or deprioritized).
Q: What happens if a process is "surprised" by a user interrupt?
A: For tasks that haven't registered with the kernel and requested for user
interrupts aren't expected or able to receive to user interrupts.
Q: Do user interrupts affect kernel scheduling?
A: No. If a task is blocked waiting for user interrupts, when the kernel
receives a notification on behalf of that task we only put it back on the
runqueue. Delivery of a user interrupt in no way changes the scheduling
priorities of a task.
Q: Does the sender get to know if the interrupt was delivered?
A: No. User interrupts only provides a posted interrupt delivery mechanism. If
applications need to rely on whether the interrupt was delivered they should
consider a userspace mechanism for feedback (like a shared memory counter or a
user interrupt back to the sender).
Q: Why is there no feedback on interrupt delivery?
A: Being a posted interrupt delivery mechanism, the interrupt delivery
happens in 2 steps:
1) The interrupt information is stored in a memory location (UPID).
2) The physical interrupt is delivered to the interrupt receiver.
The 2nd step could happen immediately, after an extended period, or it might
never happen based on the state of the receiver after step 1. (The receiver
could have disabled interrupts, have been context switched out or it might have
crashed during that time.) This makes it very hard for the hardware to reliably
provide feedback upon execution of SENDUIPI.
Q: Can user interrupts be nested?
A: Yes. Using STUI instruction in the interrupt handler would allow new user
interrupts to be delivered. However, there no TPR(thread priority register)
like mechanism to allow only higher priority interrupts. Any user interrupt can
be taken when nesting is enabled.
Q: Can a task receive all pending user interrupts in one go?
A: No. The hardware allows only one vector to be processed at a time. If a task
is interested in knowing all the interrupts that are pending then we could add
a syscall that provides the pending interrupts information.
Q: Do the processes need to be pinned to a cpu?
A: No. User interrupts will be routed correctly to whichever cpu the receiver
is running on. The kernel updates the cpu information in the UPID during
Q: Why are UPID and UITT allocated by the kernel?
A: If allocated by user space, applications could misuse the UPID and UITT to
write to unauthorized memory and generate interrupts on any cpu. The UPID and
UITT are allocated by the kernel and accessed by the hardware with supervisor
Patch structure for this series
- Man-pages and Kernel documentation (patch 1,2)
- Hardware enumeration (patch 3, 4)
- User IPI kernel vector reservation (patch 5)
- Syscall interface for interrupt receiver, sender and vector
management(uintr_fd) (patch 6-12)
- Basic selftests (patch 13)
Along with the patches in this RFC, there are additional tests and samples that
are available at:
Sohil Mehta (13):
x86/uintr/man-page: Include man pages draft for reference
Documentation/x86: Add documentation for User Interrupts
x86/cpu: Enumerate User Interrupts support
x86/fpu/xstate: Enumerate User Interrupts supervisor state
x86/irq: Reserve a user IPI notification vector
x86/uintr: Introduce uintr receiver syscalls
x86/process/64: Add uintr task context switch support
x86/process/64: Clean up uintr task fork and exit paths
x86/uintr: Introduce vector registration and uintr_fd syscall
x86/uintr: Introduce user IPI sender syscalls
x86/uintr: Introduce uintr_wait() syscall
x86/uintr: Wire up the user interrupt syscalls
selftests/x86: Add basic tests for User IPI
.../admin-guide/kernel-parameters.txt | 2 +
Documentation/x86/index.rst | 1 +
Documentation/x86/user-interrupts.rst | 107 +++
arch/x86/Kconfig | 12 +
arch/x86/entry/syscalls/syscall_32.tbl | 6 +
arch/x86/entry/syscalls/syscall_64.tbl | 6 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/entry-common.h | 4 +
arch/x86/include/asm/fpu/types.h | 20 +-
arch/x86/include/asm/fpu/xstate.h | 3 +-
arch/x86/include/asm/hardirq.h | 4 +
arch/x86/include/asm/idtentry.h | 5 +
arch/x86/include/asm/irq_vectors.h | 6 +-
arch/x86/include/asm/msr-index.h | 8 +
arch/x86/include/asm/processor.h | 8 +
arch/x86/include/asm/uintr.h | 76 ++
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 61 ++
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/fpu/core.c | 17 +
arch/x86/kernel/fpu/xstate.c | 20 +-
arch/x86/kernel/idt.c | 4 +
arch/x86/kernel/irq.c | 51 +
arch/x86/kernel/process.c | 10 +
arch/x86/kernel/process_64.c | 4 +
arch/x86/kernel/uintr_core.c | 880 ++++++++++++++++++
arch/x86/kernel/uintr_fd.c | 300 ++++++
include/linux/syscalls.h | 8 +
include/uapi/asm-generic/unistd.h | 15 +-
kernel/sys_ni.c | 8 +
scripts/checksyscalls.sh | 6 +
tools/testing/selftests/x86/Makefile | 10 +
tools/testing/selftests/x86/uintr.c | 147 +++
tools/uintr/manpages/0_overview.txt | 265 ++++++
tools/uintr/manpages/1_register_receiver.txt | 122 +++
.../uintr/manpages/2_unregister_receiver.txt | 62 ++
tools/uintr/manpages/3_create_fd.txt | 104 +++
tools/uintr/manpages/4_register_sender.txt | 121 +++
tools/uintr/manpages/5_unregister_sender.txt | 79 ++
tools/uintr/manpages/6_wait.txt | 59 ++
42 files changed, 2626 insertions(+), 8 deletions(-)
create mode 100644 Documentation/x86/user-interrupts.rst
create mode 100644 arch/x86/include/asm/uintr.h
create mode 100644 arch/x86/kernel/uintr_core.c
create mode 100644 arch/x86/kernel/uintr_fd.c
create mode 100644 tools/testing/selftests/x86/uintr.c
create mode 100644 tools/uintr/manpages/0_overview.txt
create mode 100644 tools/uintr/manpages/1_register_receiver.txt
create mode 100644 tools/uintr/manpages/2_unregister_receiver.txt
create mode 100644 tools/uintr/manpages/3_create_fd.txt
create mode 100644 tools/uintr/manpages/4_register_sender.txt
create mode 100644 tools/uintr/manpages/5_unregister_sender.txt
create mode 100644 tools/uintr/manpages/6_wait.txt
This is a follow up to my v7 series of fixes for the zram driver 
which ended up uncovering a generic deadlock issue with sysfs and module
removal. I've reported this issue and proposed a few patches first since
March 2021 . At the end of this email you will find an itemized list
of changes since that v1 series, you can also find these changes on my
branch 20210927-sysfs-generic-deadlock-fix  which is based on
linux-next tag next-20210927.
Just a heads up, I'm goin on vacation in two days, won't be back until
Monday October 11th.
On this v8 I incorporate feedback from the v7 series, namely:
- Tejun requested I move the struct module to the last attribute when
- As per discussion with Tejun, trimmed and clarified the commit log
and documentation on the generic fix on patch 7
- As requested by Bart Van Assche, I simplied the setting of the
struct test_config *config into one line instead of two on many
places on patch 3 which adds the new sysfs selftest
- Dan Williams had some questions about patch 7, and so clarified these
questions using a more elaborate example on the commit log to show
where the lock call was happening.
- Trimmed the Cc list considerably as it was way too long before
- Rebased onto linux-next tag next-20210927
Below a list of changes of this patch set since its inception:
- Open coded the sysfs deadlock race to only be localized by the zram
Changes on v2:
- used bdgrab() as well for another race which was speculated by
- improved documentation of fixes
Changes on v3:
- used a localized zram macros for the sysfs attributes instead of
open coding on each routine
- replaced bdget() stuff for a generic get_device() and bus_get() on
dev_attr_show() / dev_attr_store() for the issue speculated by
Changes on v4:
- Cosmetic fixes on the zram fixes as requested by Greg
- Split out the driver core fix as requested by Greg for the
issue speculated by Michan. This fix ended up getting up to its 4th
patch iteration  and eventually hit linux-next. We got a 0day
0day suspend stres fail for this patch 
Changes on v5:
- I ended up writing a test_sysfs driver and with it I ended up
proving that the issue speculated by Michen was not possible and
so I asked Greg to drop the patch from his queue titled
"sysfs: fix kobject refcount to address races with kobject removal"
- checkpatch fixes for the zram changes
Changes on v6:
- I submitted my test_sysfs driver for inclusion upstream which easily
abstracted the deadlock issue in a driver generically 
- I rebased the zram fixes and added also a new patch for zram to use
ATTRIBUTE_GROUPS As per Minchen I sent the patches to be merged
through Andrew Morton.
- Greg ended up NACK'ing the patchset because he was not sure the fix
was correct still
Changes on v7:
- Formalizes the original proposed generic sysfs fix intead of using
macro helpers to work around the issue
- I decided it is best to merge all the effort together into
one patch set because communication was being lost when I split the
patches up. This was not helping in any way to either fix the zram
issues or come to consensus on a generic solution. The patches are
also merged now because they are all related now.
- Running checkpatch exposed that S_IRWXUGO and S_IRWXU|S_IRUGO|S_IXUGO
should be replaced, so I did that in this series in two new patches
- Adds a try_module_get() documentation extension with tribal
knowledge and new information I don't think some folks still believe
in. The new test_sysfs selftest however proves this information to
be correct, the same selftest can be used to try to prove that
- Because the fix is now generic zram's deadlock can easily be fixed
now by just making it use ATTRIBUTE_GROUPS().
Luis Chamberlain (12):
LICENSES: Add the copyleft-next-0.3.1 license
testing: use the copyleft-next-0.3.1 SPDX tag
selftests: add tests_sysfs module
kernfs: add initial failure injection support
test_sysfs: add support to use kernfs failure injection
kernel/module: add documentation for try_module_get()
fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on
fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755
sysfs: fix deadlock race with module removal
test_sysfs: enable deadlock tests by default
zram: fix crashes with cpu hotplug multistate
zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
.../fault-injection/fault-injection.rst | 22 +
LICENSES/dual/copyleft-next-0.3.1 | 237 +++
MAINTAINERS | 9 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
drivers/block/zram/zram_drv.c | 74 +-
fs/kernfs/Makefile | 1 +
fs/kernfs/dir.c | 44 +-
fs/kernfs/failure-injection.c | 91 ++
fs/kernfs/file.c | 19 +-
fs/kernfs/kernfs-internal.h | 75 +-
fs/kernfs/symlink.c | 4 +-
fs/sysfs/dir.c | 5 +-
fs/sysfs/file.c | 6 +-
fs/sysfs/group.c | 3 +-
include/linux/kernfs.h | 19 +-
include/linux/module.h | 34 +-
include/linux/sysfs.h | 52 +-
kernel/cgroup/cgroup.c | 2 +-
lib/Kconfig.debug | 25 +
lib/Makefile | 1 +
lib/test_kmod.c | 12 +-
lib/test_sysctl.c | 12 +-
lib/test_sysfs.c | 952 ++++++++++++
tools/testing/selftests/kmod/kmod.sh | 13 +-
tools/testing/selftests/sysctl/sysctl.sh | 12 +-
tools/testing/selftests/sysfs/Makefile | 12 +
tools/testing/selftests/sysfs/config | 5 +
tools/testing/selftests/sysfs/sysfs.sh | 1383 +++++++++++++++++
28 files changed, 3026 insertions(+), 102 deletions(-)
create mode 100644 LICENSES/dual/copyleft-next-0.3.1
create mode 100644 fs/kernfs/failure-injection.c
create mode 100644 lib/test_sysfs.c
create mode 100644 tools/testing/selftests/sysfs/Makefile
create mode 100644 tools/testing/selftests/sysfs/config
create mode 100755 tools/testing/selftests/sysfs/sysfs.sh
The Testing & Fuzzing Micro-Conference at Linux Plumbers 2021 will
remain open to new proposals for talks and discussion topics until the
end of next week (Friday 10th Sept). Please feel free to submit yours
with the "Submit new proposal" form on this page:
The MC is currently scheduled for Wednesday 22nd. This is where the
timetable will appear as submissions get accepted:
Last year's edition was very effective in spite of being fully online
rather than in-person. Topics around testing were mentioned in many
other tracks too, such as real-time and toolchains. See also the
related KernelCI blog post with community notes. We're looking
forward to having an equally good virtual experience this time again.
The Testing and Fuzzing microconference focuses on advancing the current
state of testing of the Linux kernel. We aim to create connections
between folks working on similar projects, and help individual projects
We ask that any topic discussions will focus on issues/problems they are
facing and possible alternatives to resolving them. The Microconference
is open to all topics related to testing & fuzzing on Linux, not
necessarily in the kernel space.
KernelCI: Extending coverage and improving user experience.
Growing KCIDB, integrating more sources.
Better sanitizers: KFENCE, improving KCSAN.
Using Clang for better testing coverage.
How to spread KUnit throughout the kernel?
Testing in-kernel Rust code.
Sasha Levin <sashal(a)kernel.org>
Guillaume Tucker <guillaume.tucker(a)collabora.com>
We refactored the lib/test_hash.c file into KUnit as part of the student
group LKCAMP  introductory hackathon for kernel development.
This test was pointed to our group by Daniel Latypov , so its full
conversion into a pure KUnit test was our goal in this patch series, but
we ran into many problems relating to it not being split as unit tests,
which complicated matters a bit, as the reasoning behind the original
tests is quite cryptic for those unfamiliar with hash implementations.
Some interesting developments we'd like to highlight are:
- In patch 1/5 we noticed that there was an unused define directive that
could be removed.
- In patch 4/5 we noticed how stringhash and hash tests are all under
the lib/test_hash.c file, which might cause some confusion, and we
also broke those kernel config entries up.
Overall KUnit developments have been made in the other patches in this
In patches 2/5, 3/5 and 5/5 we refactored the lib/test_hash.c
file so as to make it more compatible with the KUnit style, whilst
preserving the original idea of the maintainer who designed it (i.e.
George Spelvin), which might be undesirable for unit tests, but we
assume it is enough for a first patch.
This is our first patch series so we hope our contributions are
interesting and also hope to get some useful criticism from the
Changes since V1:
- Fixed compilation on parisc and m68k.
- Fixed whitespace mistakes.
- Renamed a few functions.
- Refactored globals into struct for test function params, thus removing
- Reworded some commit messages.
 - https://lkcamp.dev/
 - https://lore.kernel.org/linux-kselftest/CAGS_qxojszgM19u=3HLwFgKX5bm5Khywvs…
Isabella Basso (5):
hash.h: remove unused define directive
test_hash.c: split test_int_hash into arch-specific functions
test_hash.c: split test_hash_init
lib/Kconfig.debug: properly split hash test kernel entries
test_hash.c: refactor into kunit
include/linux/hash.h | 5 +-
lib/Kconfig.debug | 28 ++++-
lib/Makefile | 3 +-
lib/test_hash.c | 247 +++++++++++++++++--------------------
tools/include/linux/hash.h | 5 +-
5 files changed, 139 insertions(+), 149 deletions(-)