Calling get_system_loc_code before checking devfd and errno - fails the test
when the device is not available, expected a SKIP.
Change the order of 'SKIP_IF_MSG' correctly SKIP when the /dev/papr-vpd device
is not available.
with out patch: Test FAILED on line 271
with patch: [SKIP] Test skipped on line 266: /dev/papr-vpd not present
Signed-off-by: R Nageswara Sastry <rnsastry(a)linux.ibm.com>
---
tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
index 98cbb9109ee6..505294da1b9f 100644
--- a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
+++ b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
@@ -263,10 +263,10 @@ static int papr_vpd_system_loc_code(void)
off_t size;
int fd;
- SKIP_IF_MSG(get_system_loc_code(&lc),
- "Cannot determine system location code");
SKIP_IF_MSG(devfd < 0 && errno == ENOENT,
DEVPATH " not present");
+ SKIP_IF_MSG(get_system_loc_code(&lc),
+ "Cannot determine system location code");
FAIL_IF(devfd < 0);
--
2.37.1 (Apple Git-137.1)
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). The user must provide a shadow stack
address and size, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with a shadow stack token at the top of the
stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET avaible to me, I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks, this is
still under discussion.
Please further note that the token consumption done by clone3() is not
currently implemented in an atomic fashion, Rick indicated that he would
look into fixing this if people are OK with the implementation.
A new architecture feature Kconfig option for shadow stacks is added as
here, this was suggested as part of the review comments for the arm64
GCS series and since we need to detect if shadow stacks are supported it
seemed sensible to roll it in here.
[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (7):
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 41 +++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 91 +++++++---
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 2 +
include/uapi/linux/sched.h | 13 +-
kernel/fork.c | 61 +++++--
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 211 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 8 +
tools/testing/selftests/ksft_shstk.h | 63 +++++++
15 files changed, 430 insertions(+), 85 deletions(-)
---
base-commit: 41bccc98fb7931d63d03f326a746ac4d429c1dd3
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi Linus,
Please pull the following KUnit fixes update for Linux 6.8-rc5.
This KUnit update for Linux 6.8-rc5 consists of one important fix
to unregister kunit_bus when KUnit module is unloaded. Not doing
so causes an error when KUnit module tries to re-register the bus
when it gets reloaded.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 1a9f2c776d1416c4ea6cb0d0b9917778c41a1a7d:
Documentation: KUnit: Update the instructions on how to test static functions (2024-01-22 07:59:03 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-fixes-6.8-rc5
for you to fetch changes up to 829388b725f8d266ccec32a2f446717d8693eaba:
kunit: device: Unregister the kunit_bus on shutdown (2024-02-06 17:07:37 -0700)
----------------------------------------------------------------
linux_kselftest-kunit-fixes-6.8-rc5
This KUnit update for Linux 6.8-rc5 consists of one important fix
to unregister kunit_bus when KUnit module is unloaded. Not doing
so causes an error when KUnit module tries to re-register the bus
when it gets reloaded.
----------------------------------------------------------------
David Gow (1):
kunit: device: Unregister the kunit_bus on shutdown
lib/kunit/device-impl.h | 2 ++
lib/kunit/device.c | 14 ++++++++++++++
lib/kunit/test.c | 3 +++
3 files changed, 19 insertions(+)
----------------------------------------------------------------
[Putting this as a RFC because I'm pretty sure I'm not doing the things
correctly at the BPF level.]
[Also using bpf-next as the base tree as there will be conflicting
changes otherwise]
Ideally I'd like to have something similar to bpf_timers, but not
in soft IRQ context. So I'm emulating this with a sleepable
bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
workqueue").
Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
Sometimes we receive an out of proximity event, but the device can not
be trusted enough, and we need to ensure that we won't receive another
one in the following n milliseconds. So we need to wait those n
milliseconds, and eventually re-inject that event in the stack.
2. inject new events in reaction to one given event:
We might want to transform one given event into several. This is the
case for macro keys where a single key press is supposed to send
a sequence of key presses. But this could also be used to patch a
faulty behavior, if a device forgets to send a release event.
3. communicate with the device in reaction to one event:
We might want to communicate back to the device after a given event.
For example a device might send us an event saying that it came back
from sleeping state and needs to be re-initialized.
Currently we can achieve that by keeping a userspace program around,
raise a bpf event, and let that userspace program inject the events and
commands.
However, we are just keeping that program alive as a daemon for just
scheduling commands. There is no logic in it, so it doesn't really justify
an actual userspace wakeup. So a kernel workqueue seems simpler to handle.
Besides the "this is insane to reimplement the wheel", the other part I'm
not sure is whether we can say that BPF maps of type prog_array and of
type queue/stack can be used in sleepable context.
I don't see any warning when running the test programs, but that's probably
not a guarantee I'm doing the things properly :)
Cheers,
Benjamin
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Benjamin Tissoires (9):
bpf: allow more maps in sleepable bpf programs
HID: bpf/dispatch: regroup kfuncs definitions
HID: bpf: export hid_hw_output_report as a BPF kfunc
selftests/hid: Add test for hid_bpf_hw_output_report
HID: bpf: allow to inject HID event from BPF
selftests/hid: add tests for hid_bpf_input_report
HID: bpf: allow to defer work in a delayed workqueue
selftests/hid: add test for hid_bpf_schedule_delayed_work
selftests/hid: add another set of delayed work tests
Documentation/hid/hid-bpf.rst | 26 +-
drivers/hid/bpf/entrypoints/entrypoints.bpf.c | 8 +
drivers/hid/bpf/entrypoints/entrypoints.lskel.h | 236 +++++++++------
drivers/hid/bpf/hid_bpf_dispatch.c | 315 ++++++++++++++++-----
drivers/hid/bpf/hid_bpf_jmp_table.c | 34 ++-
drivers/hid/hid-core.c | 2 +
include/linux/hid_bpf.h | 10 +
kernel/bpf/verifier.c | 3 +
tools/testing/selftests/hid/hid_bpf.c | 286 ++++++++++++++++++-
tools/testing/selftests/hid/progs/hid.c | 205 ++++++++++++++
.../testing/selftests/hid/progs/hid_bpf_helpers.h | 8 +
11 files changed, 962 insertions(+), 171 deletions(-)
---
base-commit: 4f7a05917237b006ceae760507b3d15305769ade
change-id: 20240205-hid-bpf-sleepable-c01260fd91c4
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
From: Paul Durrant <pdurrant(a)amazon.com>
This series has one small fix to what was in v11 [1]:
* KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set
The v11 patch failed to set the return code of the ioctl if the mode
was not actually changed, leading to a spurious failure.
This version of the series also contains a new bug-fix to the pfncache
code from David Woodhouse.
[1] https://lore.kernel.org/kvm/20231219161109.1318-1-paul@xen.org/
David Woodhouse (1):
KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues
Paul Durrant (19):
KVM: pfncache: Add a map helper function
KVM: pfncache: remove unnecessary exports
KVM: xen: mark guest pages dirty with the pfncache lock held
KVM: pfncache: add a mark-dirty helper
KVM: pfncache: remove KVM_GUEST_USES_PFN usage
KVM: pfncache: stop open-coding offset_in_page()
KVM: pfncache: include page offset in uhva and use it consistently
KVM: pfncache: allow a cache to be activated with a fixed (userspace)
HVA
KVM: xen: separate initialization of shared_info cache and content
KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set
KVM: xen: allow shared_info to be mapped by fixed HVA
KVM: xen: allow vcpu_info to be mapped by fixed HVA
KVM: selftests / xen: map shared_info using HVA rather than GFN
KVM: selftests / xen: re-map vcpu_info using HVA rather than GPA
KVM: xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
KVM: xen: split up kvm_xen_set_evtchn_fast()
KVM: xen: don't block on pfncache locks in kvm_xen_set_evtchn_fast()
KVM: pfncache: check the need for invalidation under read lock first
KVM: xen: allow vcpu_info content to be 'safely' copied
Documentation/virt/kvm/api.rst | 53 ++-
arch/x86/kvm/x86.c | 7 +-
arch/x86/kvm/xen.c | 356 +++++++++++-------
include/linux/kvm_host.h | 40 +-
include/linux/kvm_types.h | 8 -
include/uapi/linux/kvm.h | 9 +-
.../selftests/kvm/x86_64/xen_shinfo_test.c | 59 ++-
virt/kvm/pfncache.c | 340 +++++++++--------
8 files changed, 535 insertions(+), 337 deletions(-)
base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.39.2
Major changes in RFC v5:
------------------------
1. Rebased on top of 'Abstract page from net stack' series and used the
new netmem type to refer to LSB set pointers instead of re-using
struct page.
2. Downgraded this series back to RFC and called it RFC v5. This is
because this series is now dependent on 'Abstract page from net
stack'[1] and the queue API. Both are removed from the series to
pre-requisite work.
3. Reworked the page_pool devmem support to use netmem and for some
more unified handling.
4. Reworked the reference counting of net_iov (renamed from
page_pool_iov) to use pp_ref_count for refcounting.
The full changes including the dependent series and GVE page pool
support is here:
https://github.com/mina/linux/commits/tcpdevmem-rfcv5/
[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774
Major changes in v1:
--------------------
1. Implemented MVP queue API ndos to remove the userspace-visible
driver reset.
2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
3. Removed RFC tag.
Many smaller addressed comments across all the patches (patches have
individual change log).
Full tree including the rest of the GVE driver changes:
https://github.com/mina/linux/commits/tcpdevmem-v1
Changes in RFC v3:
------------------
1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the
series reviewable and mergable.
2. Implemented multi-rx-queue binding which was a todo in v2.
3. Fix to cmsg handling.
The sticking point in RFC v2[2] was the device reset required to refill
the device rx-queues after the dmabuf bind/unbind. The solution
suggested as I understand is a subset of the per-queue management ops
Jakub suggested or similar:
https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
This is not addressed in this revision, because:
1. This point was discussed at netconf & netdev and there is openness to
using the current approach of requiring a device reset.
2. Implementing individual queue resetting seems to be difficult for my
test bed with GVE. My prototype to test this ran into issues with the
rx-queues not coming back up properly if reset individually. At the
moment I'm unsure if it's a mistake in the POC or a genuine issue in
the virtualization stack behind GVE, which currently doesn't test
individual rx-queue restart.
3. Our usecases are not bothered by requiring a device reset to refill
the buffer queues, and we'd like to support NICs that run into this
limitation with resetting individual queues.
My thought is that drivers that have trouble with per-queue configs can
use the support in this series, while drivers that support new netdev
ops to reset individual queues can automatically reset the queue as
part of the dma-buf bind/unbind.
The same approach with device resets is presented again for consideration
with other sticking points addressed.
This proposal includes the rx devmem path only proposed for merge. For a
snapshot of my entire tree which includes the GVE POC page pool support &
device memory support:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
[1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.…
[2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4…
Changes in RFC v2:
------------------
The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].
Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
implementing the TX path with scatterlist approach, but leaving
out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
perf testing yet. I'm not sure there are regressions, but I removed
perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.
Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:
1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
the page pool (Patch 6).
[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c…
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX…
----------------------
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.
* Problem:
A large amount of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the volume of such transfers.
Some examples include:
- ML accelerators transferring large amounts of training data from storage into
GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
TPU compute time, improving data transfer throughput & efficiency can help
improving GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts,
exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with
remote SSDs, much of this data does not require host processing.
Today, the majority of the Device-to-Device data transfers the network are
implemented as the following low level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers, and can
put significant strains on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel’s lack of semantics to express device to network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through the
root complex.
* Patch overview:
** Part 1: netlink API
Gives user ability to bind dma-buf to an RX queue.
** Part 2: scatterlist support
Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, networking stack (skbs, drivers, and
page pool) operate on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.
** part 3: page pool support
We piggy back on page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
** part 4: support for unreadable skb frags
Page pool iovs are not accessible by the host; we implement changes
throughput the networking stack to correctly handle skbs with unreadable
frags.
** Part 5: recvmsg() APIs
We define user APIs for the user to send and receive device memory.
Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem
This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP require the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using device memory
for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support,
i.e. the NIC's ability to steer flows into certain rx queues. This allows the
sysadmin to enable devmem TCP on a subset of the rx queues, and steer
devmem TCP traffic onto these queues and non devmem TCP elsewhere.
The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that show a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.
** Test Setup
Kernel: net-next with this series and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Cc: Pavel Begunkov <asml.silence(a)gmail.com>
Cc: David Wei <dw(a)davidwei.uk>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Yunsheng Lin <linyunsheng(a)huawei.com>
Cc: Shailend Chand <shailend(a)google.com>
Cc: Harshitha Ramamurthy <hramamurthy(a)google.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Jeroen de Borst <jeroendb(a)google.com>
Cc: Praveen Kaligineedi <pkaligineedi(a)google.com>
Jakub Kicinski (1):
net: page_pool: create hooks for custom page providers
Mina Almasry (13):
net: page_pool: factor out page_pool recycle check
net: netdev netlink api to bind dma-buf to a net device
netdev: support binding dma-buf to netdevice
netdev: netdevice devmem allocator
page_pool: convert to use netmem
page_pool: devmem support
memory-provider: dmabuf devmem memory provider
net: support non paged skb frags
net: add support for skbs with unreadable frags
tcp: RX path for devmem TCP
net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
net: add devmem TCP documentation
selftests: add ncdevmem, netcat for devmem TCP
Documentation/netlink/specs/netdev.yaml | 52 +++
Documentation/networking/devmem.rst | 271 +++++++++++++
Documentation/networking/index.rst | 1 +
arch/alpha/include/uapi/asm/socket.h | 8 +-
arch/mips/include/uapi/asm/socket.h | 6 +
arch/parisc/include/uapi/asm/socket.h | 6 +
arch/sparc/include/uapi/asm/socket.h | 6 +
include/linux/skbuff.h | 68 +++-
include/linux/socket.h | 1 +
include/net/devmem.h | 127 ++++++
include/net/netdev_rx_queue.h | 1 +
include/net/netmem.h | 223 ++++++++++-
include/net/page_pool/helpers.h | 117 ++++--
include/net/page_pool/types.h | 27 +-
include/net/sock.h | 2 +
include/net/tcp.h | 5 +-
include/trace/events/page_pool.h | 28 +-
include/uapi/asm-generic/socket.h | 6 +
include/uapi/linux/netdev.h | 19 +
include/uapi/linux/uio.h | 14 +
net/bpf/test_run.c | 5 +-
net/core/datagram.c | 6 +
net/core/dev.c | 314 ++++++++++++++-
net/core/gro.c | 7 +-
net/core/netdev-genl-gen.c | 19 +
net/core/netdev-genl-gen.h | 2 +
net/core/netdev-genl.c | 123 ++++++
net/core/page_pool.c | 419 +++++++++++++-------
net/core/skbuff.c | 92 ++++-
net/core/sock.c | 45 +++
net/ipv4/tcp.c | 196 +++++++++-
net/ipv4/tcp_input.c | 13 +-
net/ipv4/tcp_ipv4.c | 9 +
net/ipv4/tcp_output.c | 5 +-
net/packet/af_packet.c | 4 +-
tools/include/uapi/linux/netdev.h | 19 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 5 +
tools/testing/selftests/net/ncdevmem.c | 489 ++++++++++++++++++++++++
39 files changed, 2527 insertions(+), 234 deletions(-)
create mode 100644 Documentation/networking/devmem.rst
create mode 100644 include/net/devmem.h
create mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.43.0.472.g3155946c3a-goog
From: Jeff Xu <jeffxu(a)chromium.org>
This is V9 version, with removing MAP_SEALABLE and PROT_SEAL
from mmap(), adding perfrmance benchmark and a test
to demo the sealing of read-only memory segment of elf mapping.
------------------------------------------------------------------
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is an syscall on 64 bit CPU, and with
following signature:
int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(), can leave an empty space, therefore can
be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.
Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable, the lifetime of those mappings
are tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings are not tied to the lifetime of the process, therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory. For example, with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
MM perf benchmarks
==================
This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
check the VMAs’ sealing flag, so that no partial update can be made,
when any segment within the given memory range is sealed.
To measure the performance impact of this loop, two tests are developed.
[8]
The first is measuring the time taken for a particular system call,
by using clock_gettime(CLOCK_MONOTONIC). The second is using
PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
similar results.
The tests have roughly below sequence:
for (i = 0; i < 1000, i++)
create 1000 mappings (1 page per VMA)
start the sampling
for (j = 0; j < 1000, j++)
mprotect one mapping
stop and save the sample
delete 1000 mappings
calculates all samples.
Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
4G memory, Chromebook.
Based on the latest upstream code:
The first test (measuring time)
syscall__ vmas t t_mseal delta_ns per_vma %
munmap__ 1 909 944 35 35 104%
munmap__ 2 1398 1502 104 52 107%
munmap__ 4 2444 2594 149 37 106%
munmap__ 8 4029 4323 293 37 107%
munmap__ 16 6647 6935 288 18 104%
munmap__ 32 11811 12398 587 18 105%
mprotect 1 439 465 26 26 106%
mprotect 2 1659 1745 86 43 105%
mprotect 4 3747 3889 142 36 104%
mprotect 8 6755 6969 215 27 103%
mprotect 16 13748 14144 396 25 103%
mprotect 32 27827 28969 1142 36 104%
madvise_ 1 240 262 22 22 109%
madvise_ 2 366 442 76 38 121%
madvise_ 4 623 751 128 32 121%
madvise_ 8 1110 1324 215 27 119%
madvise_ 16 2127 2451 324 20 115%
madvise_ 32 4109 4642 534 17 113%
The second test (measuring cpu cycle)
syscall__ vmas cpu cmseal delta_cpu per_vma %
munmap__ 1 1790 1890 100 100 106%
munmap__ 2 2819 3033 214 107 108%
munmap__ 4 4959 5271 312 78 106%
munmap__ 8 8262 8745 483 60 106%
munmap__ 16 13099 14116 1017 64 108%
munmap__ 32 23221 24785 1565 49 107%
mprotect 1 906 967 62 62 107%
mprotect 2 3019 3203 184 92 106%
mprotect 4 6149 6569 420 105 107%
mprotect 8 9978 10524 545 68 105%
mprotect 16 20448 21427 979 61 105%
mprotect 32 40972 42935 1963 61 105%
madvise_ 1 434 497 63 63 115%
madvise_ 2 752 899 147 74 120%
madvise_ 4 1313 1513 200 50 115%
madvise_ 8 2271 2627 356 44 116%
madvise_ 16 4312 4883 571 36 113%
madvise_ 32 8376 9319 943 29 111%
Based on the result, for 6.8 kernel, sealing check adds
20-40 nano seconds, or around 50-100 CPU cycles, per VMA.
In addition, I applied the sealing to 5.10 kernel:
The first test (measuring time)
syscall__ vmas t tmseal delta_ns per_vma %
munmap__ 1 357 390 33 33 109%
munmap__ 2 442 463 21 11 105%
munmap__ 4 614 634 20 5 103%
munmap__ 8 1017 1137 120 15 112%
munmap__ 16 1889 2153 263 16 114%
munmap__ 32 4109 4088 -21 -1 99%
mprotect 1 235 227 -7 -7 97%
mprotect 2 495 464 -30 -15 94%
mprotect 4 741 764 24 6 103%
mprotect 8 1434 1437 2 0 100%
mprotect 16 2958 2991 33 2 101%
mprotect 32 6431 6608 177 6 103%
madvise_ 1 191 208 16 16 109%
madvise_ 2 300 324 24 12 108%
madvise_ 4 450 473 23 6 105%
madvise_ 8 753 806 53 7 107%
madvise_ 16 1467 1592 125 8 108%
madvise_ 32 2795 3405 610 19 122%
The second test (measuring cpu cycle)
syscall__ nbr_vma cpu cmseal delta_cpu per_vma %
munmap__ 1 684 715 31 31 105%
munmap__ 2 861 898 38 19 104%
munmap__ 4 1183 1235 51 13 104%
munmap__ 8 1999 2045 46 6 102%
munmap__ 16 3839 3816 -23 -1 99%
munmap__ 32 7672 7887 216 7 103%
mprotect 1 397 443 46 46 112%
mprotect 2 738 788 50 25 107%
mprotect 4 1221 1256 35 9 103%
mprotect 8 2356 2429 72 9 103%
mprotect 16 4961 4935 -26 -2 99%
mprotect 32 9882 10172 291 9 103%
madvise_ 1 351 380 29 29 108%
madvise_ 2 565 615 49 25 109%
madvise_ 4 872 933 61 15 107%
madvise_ 8 1508 1640 132 16 109%
madvise_ 16 3078 3323 245 15 108%
madvise_ 32 5893 6704 811 25 114%
For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30
CPU cycles, there is even decrease in some cases.
It might be interesting to compare 5.10 and 6.8 kernel
The first test (measuring time)
syscall__ vmas t_5_10 t_6_8 delta_ns per_vma %
munmap__ 1 357 909 552 552 254%
munmap__ 2 442 1398 956 478 316%
munmap__ 4 614 2444 1830 458 398%
munmap__ 8 1017 4029 3012 377 396%
munmap__ 16 1889 6647 4758 297 352%
munmap__ 32 4109 11811 7702 241 287%
mprotect 1 235 439 204 204 187%
mprotect 2 495 1659 1164 582 335%
mprotect 4 741 3747 3006 752 506%
mprotect 8 1434 6755 5320 665 471%
mprotect 16 2958 13748 10790 674 465%
mprotect 32 6431 27827 21397 669 433%
madvise_ 1 191 240 49 49 125%
madvise_ 2 300 366 67 33 122%
madvise_ 4 450 623 173 43 138%
madvise_ 8 753 1110 357 45 147%
madvise_ 16 1467 2127 660 41 145%
madvise_ 32 2795 4109 1314 41 147%
The second test (measuring cpu cycle)
syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma %
munmap__ 1 684 1790 1106 1106 262%
munmap__ 2 861 2819 1958 979 327%
munmap__ 4 1183 4959 3776 944 419%
munmap__ 8 1999 8262 6263 783 413%
munmap__ 16 3839 13099 9260 579 341%
munmap__ 32 7672 23221 15549 486 303%
mprotect 1 397 906 509 509 228%
mprotect 2 738 3019 2281 1140 409%
mprotect 4 1221 6149 4929 1232 504%
mprotect 8 2356 9978 7622 953 423%
mprotect 16 4961 20448 15487 968 412%
mprotect 32 9882 40972 31091 972 415%
madvise_ 1 351 434 82 82 123%
madvise_ 2 565 752 186 93 133%
madvise_ 4 872 1313 442 110 151%
madvise_ 8 1508 2271 763 95 151%
madvise_ 16 3078 4312 1234 77 140%
madvise_ 32 5893 8376 2483 78 142%
From 5.10 to 6.8
munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma.
mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma.
madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma.
In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the
increase from 5.10 to 6.8 is significantly larger, approximately ten
times greater for munmap and mprotect.
When I discuss the mm performance with Brian Makin, an engineer worked
on performance, it was brought to my attention that such a performance
benchmarks, which measuring millions of mm syscall in a tight loop, may
not accurately reflect real-world scenarios, such as that of a database
service. Also this is tested using a single HW and ChromeOS, the data
from another HW or distribution might be different. It might be best
to take this data with a grain of salt.
Change history:
===============
V9:
- remove mmap(PROT_SEAL) and mmap(MAP_SEALABLE) (Linus, Theo de Raadt)
- Update mseal_test to check for prot bit (Liam R. Howlett)
- Update documentation to give more detail on sealing check (Liam R. Howlett)
- Add seal_elf test.
- Add performance measure data.
- mseal_test: fix arm build.
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
https://lore.kernel.org/lkml/20240131175027.3287009-1-jeffxu@chromium.org/
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.o…
V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall
(Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
destructive operations of madvise. (Suggested by Jann Horn and
Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.o…
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426Fkcgnf…
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
[8] https://github.com/peaktocreek/mmperf
Jeff Xu (5):
mseal: Wire up mseal syscall
mseal: add mseal syscall
selftest mm/mseal memory sealing
mseal:add documentation
selftest mm/mseal read-only elf memory segment
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/mseal.rst | 199 ++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 1 +
mm/Makefile | 4 +
mm/internal.h | 37 +
mm/madvise.c | 12 +
mm/mmap.c | 31 +-
mm/mprotect.c | 10 +
mm/mremap.c | 31 +
mm/mseal.c | 307 ++++
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 2 +
tools/testing/selftests/mm/mseal_test.c | 1836 +++++++++++++++++++
tools/testing/selftests/mm/seal_elf.c | 183 ++
33 files changed, 2678 insertions(+), 3 deletions(-)
create mode 100644 Documentation/userspace-api/mseal.rst
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
create mode 100644 tools/testing/selftests/mm/seal_elf.c
--
2.43.0.687.g38aa6559b0-goog
When opening yama/ptrace_scope we unconditionally use sudo to ensure we
are running as root, resulting in failures if running in a minimal root
filesystem where sudo is not installed. Since automated test systems will
typically just run all of kselftest as root (and many kselftests rely on
this for full functionality) add a check to see if we're already root and
only invoke sudo if not.
Since I am unclear what the intended effect of the command being run is I
have not added any error handling for the case where we fail to obtain
root.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/mm/run_vmtests.sh | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index fe140a9f4f9d..c8ca830dba93 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -248,6 +248,17 @@ run_test() {
echo "TAP version 13" | tap_output
+HAVE_ROOT=0
+if [ "$(id -u)" = "0" ]; then
+ AS_ROOT=
+ HAVE_ROOT=1
+elif [ "$(command -v sudo)" != "" ]; then
+ AS_ROOT=sudo
+ HAVE_ROOT=1
+else
+ echo # WARNING: Unable to run as root
+fi
+
CATEGORY="hugetlb" run_test ./hugepage-mmap
shmmax=$(cat /proc/sys/kernel/shmmax)
@@ -363,7 +374,8 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
# MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
CATEGORY="madv_populate" run_test ./madv_populate
-(echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
+# FIXME: What if we can't get root?
+(echo 0 | ${AS_ROOT} tee /proc/sys/kernel/yama/ptrace_scope 2>&1) | tap_prefix
CATEGORY="memfd_secret" run_test ./memfd_secret
# KSM KSM_MERGE_TIME_HUGE_PAGES test with size of 100
---
base-commit: 445a555e0623387fa9b94e68e61681717e70200a
change-id: 20240209-kselftest-mm-check-deps-01a825e5fed4
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series aims to keep the git status clean after building the
selftests by adding some missing .gitignore files and object inclusion
in existing .gitignore files. This is one of the requirements listed in
the selftests documentation for new tests, but it is not always followed
as desired.
After adding these .gitignore files and including the generated objects,
the working tree appears clean again.
The new version includes a missing entry fot the .gitignore in damon,
which was reported by Bernd Edlinger <bernd.edlinger(a)hotmail.de>, who
also proposed a patch for it as well as for other missing .gitignore
files covered by v1. Bernd has been added to the corresponding patch as
the reporter. If a different tag is desired, I am fine with it.
To: Shuah Khan <shuah(a)kernel.org>
To: SeongJae Park <sj(a)kernel.org>
To: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: damon(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Changes in v3:
- General: base on mm-unstable to avoid conflicts.
- damon: add missing Closes tag.
- Link to v2: https://lore.kernel.org/r/20240212-selftest_gitignore-v2-0-75f0de50a178@gma…
Changes in v2:
- Remove patch for netfilter (not relevant anymore).
- Add patch for damon (missing binary in .gitignore).
- Link to v1: https://lore.kernel.org/r/20240101-selftest_gitignore-v1-0-eb61b09adb05@gma…
---
Javier Carrasco (4):
selftests: uevent: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: thermal: intel: workload_hint: add missing gitignore
selftests: damon: add access_memory to .gitignore
tools/testing/selftests/damon/.gitignore | 1 +
tools/testing/selftests/thermal/intel/power_floor/.gitignore | 1 +
tools/testing/selftests/thermal/intel/workload_hint/.gitignore | 1 +
tools/testing/selftests/uevent/.gitignore | 1 +
4 files changed, 4 insertions(+)
---
base-commit: 7e56cf9a7f108e8129d75cea0dabc9488fb4defa
change-id: 20240101-selftest_gitignore-7da2c503766e
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
The mentioned test is still flaky, unusally enough in 'fast'
environments.
Patch 2/2 [try to] address the existing issues, while patch 1/2
introduces more strict tests for the existing net helpers, to hopefully
prevent future pain.
Paolo Abeni (2):
selftests: net: more strict check in net_helper
selftests: net: more pmtu.sh fixes
tools/testing/selftests/net/net_helper.sh | 11 +++++++----
tools/testing/selftests/net/pmtu.sh | 4 ++--
2 files changed, 9 insertions(+), 6 deletions(-)
--
2.43.0
The gro self-tests sends the packets to be aggregated with
multiple write operations.
When running is slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue explicitly ignoring failures for such case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
v2 -> v3:
- dump the error suppression message only on actual errors (Jakub)
v1 -> v2:
- replace the '-a' operator with '&&' - Mattbe
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
---
tools/testing/selftests/net/gro.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..02c21ff4ca81 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,11 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [[ ${test} == "large" && -n "${KSFT_MACHINE_SLOW}" && \
+ ${exit_code} -ne 0 ]]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0
Hi Christian, Janosch, Heiko,
Here is a pair of patches to replace the RFC I recently sent [1].
Patch 1 performs the host/guest access register swap that Christian
suggested (instead of a full sync_reg/store_reg process).
Patch 2 provides a selftest patch that exercises this scenario.
Applying patch 2 without patch 1 fails in the following way:
[eric@host linux]# ./tools/testing/selftests/kvm/s390x/memop
TAP version 13
1..16
ok 1 simple copy
ok 2 generic error checks
ok 3 copy with storage keys
ok 4 cmpxchg with storage keys
ok 5 concurrently cmpxchg with storage keys
ok 6 copy with key storage protection override
ok 7 copy with key fetch protection
ok 8 copy with key fetch protection override
==== Test Assertion Failure ====
s390x/memop.c:186: !r
pid=5720 tid=5720 errno=4 - Interrupted system call
1 0x00000000010042af: memop_ioctl at memop.c:186 (discriminator 3)
2 0x0000000001006697: test_copy_access_register at memop.c:400 (discriminator 2)
3 0x0000000001002aaf: main at memop.c:1181
4 0x000003ffaec33a5b: ?? ??:0
5 0x000003ffaec33b5d: ?? ??:0
6 0x0000000001002ba9: _start at ??:?
KVM_S390_MEM_OP failed, rc: 40 errno: 4 (Interrupted system call)
Thoughts on this approach?
Thanks,
Eric
[1] https://lore.kernel.org/r/20240131205832.2179029-1-farman@linux.ibm.com/
Eric Farman (2):
KVM: s390: load guest access registers in MEM_OP ioctl
KVM: s390: selftests: memop: add a simple AR test
arch/s390/kvm/kvm-s390.c | 6 +++++
tools/testing/selftests/kvm/s390x/memop.c | 28 +++++++++++++++++++++++
2 files changed, 34 insertions(+)
--
2.40.1
From: Zi Yan <ziy(a)nvidia.com>
Hi all,
File folio supports any order and people would like to support flexible orders
for anonymous folio[1] too. Currently, split_huge_page() only splits a huge
page to order-0 pages, but splitting to orders higher than 0 is also useful.
This patchset adds support for splitting a huge page to any lower order pages
and uses it during file folio truncate operations.
The patchset is on top of mm-everything-2023-03-27-21-20.
Changelog
===
Since v2
---
1. Fixed an issue in __split_page_owner() introduced during my rebase
Since v1
---
1. Changed split_page_memcg() and split_page_owner() parameter to use order
2. Used folio_test_pmd_mappable() in place of the equivalent code
Details
===
* Patch 1 changes split_page_memcg() to use order instead of nr_pages
* Patch 2 changes split_page_owner() to use order instead of nr_pages
* Patch 3 and 4 add new_order parameter split_page_memcg() and
split_page_owner() and prepare for upcoming changes.
* Patch 5 adds split_huge_page_to_list_to_order() to split a huge page
to any lower order. The original split_huge_page_to_list() calls
split_huge_page_to_list_to_order() with new_order = 0.
* Patch 6 uses split_huge_page_to_list_to_order() in large pagecache folio
truncation instead of split the large folio all the way down to order-0.
* Patch 7 adds a test API to debugfs and test cases in
split_huge_page_test selftests.
Comments and/or suggestions are welcome.
[1] https://lore.kernel.org/linux-mm/Y%2FblF0GIunm+pRIC@casper.infradead.org/
Zi Yan (7):
mm/memcg: use order instead of nr in split_page_memcg()
mm/page_owner: use order instead of nr in split_page_owner()
mm: memcg: make memcg huge page split support any order split.
mm: page_owner: add support for splitting to any order in split
page_owner.
mm: thp: split huge page to any lower order pages.
mm: truncate: split huge page cache page to a non-zero order if
possible.
mm: huge_memory: enable debugfs to split huge pages to any order.
include/linux/huge_mm.h | 10 +-
include/linux/memcontrol.h | 4 +-
include/linux/page_owner.h | 10 +-
mm/huge_memory.c | 137 ++++++++---
mm/memcontrol.c | 10 +-
mm/page_alloc.c | 8 +-
mm/page_owner.c | 8 +-
mm/truncate.c | 21 +-
.../selftests/mm/split_huge_page_test.c | 225 +++++++++++++++++-
9 files changed, 365 insertions(+), 68 deletions(-)
--
2.39.2
This series aims to keep the git status clean after building the
selftests by adding some missing .gitignore files and object inclusion
in existing .gitignore files. This is one of the requirements listed in
the selftests documentation for new tests, but it is not always followed
as desired.
After adding these .gitignore files and including the generated objects,
the working tree appears clean again.
The new version includes a missing entry fot the .gitignore in damon,
which was reported by Bernd Edlinger <bernd.edlinger(a)hotmail.de>, who
also proposed a patch for it as well as for other missing .gitignore
files covered by v1. Bernd has been added to the corresponding patch as
the reporter. If a different tag is desired, I am fine with it.
To: Shuah Khan <shuah(a)kernel.org>
To: SeongJae Park <sj(a)kernel.org>
To: Bernd Edlinger <bernd.edlinger(a)hotmail.de>
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: damon(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Changes in v2:
- Remove patch for netfilter (not relevant anymore).
- Add patch for damon (missing binary in .gitignore).
- Link to v1: https://lore.kernel.org/r/20240101-selftest_gitignore-v1-0-eb61b09adb05@gma…
---
Javier Carrasco (4):
selftests: uevent: add missing gitignore
selftests: thermal: intel: power_floor: add missing gitignore
selftests: thermal: intel: workload_hint: add missing gitignore
selftests: damon: add access_memory to .gitignore
tools/testing/selftests/damon/.gitignore | 1 +
tools/testing/selftests/thermal/intel/power_floor/.gitignore | 1 +
tools/testing/selftests/thermal/intel/workload_hint/.gitignore | 1 +
tools/testing/selftests/uevent/.gitignore | 1 +
4 files changed, 4 insertions(+)
---
base-commit: 716f4aaa7b48a55c73d632d0657b35342b1fefd7
change-id: 20240101-selftest_gitignore-7da2c503766e
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Non-contiguous CBM support for Intel CAT has been merged into the kernel
with Commit 0e3cd31f6e90 ("x86/resctrl: Enable non-contiguous CBMs in
Intel CAT") but there is no selftest that would validate if this feature
works correctly. The selftest needs to verify if writing non-contiguous
CBMs to the schemata file behaves as expected in comparison to the
information about non-contiguous CBMs support.
The patch series is based on a rework of resctrl selftests that's
currently in review [1]. The patch also implements a similar
functionality presented in the bash script included in the cover letter
of the original non-contiguous CBMs in Intel CAT series [3].
Changelog v5:
- Add a few reviewed-by tags.
- Make some minor text changes in patch messages and comments.
- Redo resctrl_mon_feature_exists() to be more generic and fix some of
my errors in refactoring feature checking.
Changelog v4:
- Changes to error failure return values in non-contiguous test.
- Some minor text refactoring without functional changes.
Changelog v3:
- Rebase onto v4 of Ilpo's series [1].
- Split old patch 3/4 into two parts. One doing refactoring and one
adding a new function.
- Some changes to all the patches after Reinette's review.
Changelog v2:
- Rebase onto v4 of Ilpo's series [2].
- Add two patches that prepare helpers for the new test.
- Move Ilpo's patch that adds test grouping to this series.
- Apply Ilpo's suggestion to the patch that adds a new test.
[1] https://lore.kernel.org/all/20231215150515.36983-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20231211121826.14392-1-ilpo.jarvinen@linux.inte…
[3] https://lore.kernel.org/all/cover.1696934091.git.maciej.wieczor-retman@inte…
Older versions of this series:
[v1] https://lore.kernel.org/all/20231109112847.432687-1-maciej.wieczor-retman@i…
[v2] https://lore.kernel.org/all/cover.1702392177.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1706180726.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1707130307.git.maciej.wieczor-retman@inte…
Ilpo Järvinen (1):
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
Maciej Wieczor-Retman (4):
selftests/resctrl: Add a helper for the non-contiguous test
selftests/resctrl: Split validate_resctrl_feature_request()
selftests/resctrl: Add resource_info_file_exists()
selftests/resctrl: Add non-contiguous CBMs CAT test
tools/testing/selftests/resctrl/cat_test.c | 84 ++++++++++++++++-
tools/testing/selftests/resctrl/cmt_test.c | 2 +-
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 6 +-
tools/testing/selftests/resctrl/resctrl.h | 10 +-
.../testing/selftests/resctrl/resctrl_tests.c | 18 +++-
tools/testing/selftests/resctrl/resctrlfs.c | 94 +++++++++++++++++--
7 files changed, 194 insertions(+), 22 deletions(-)
--
2.43.0
Title under line too short
Signed-off-by: Meng Li <li.meng(a)amd.com>
---
Documentation/admin-guide/pm/amd-pstate.rst | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
index 0a3aa6b8ffd5..1e0d101b020a 100644
--- a/Documentation/admin-guide/pm/amd-pstate.rst
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -381,7 +381,7 @@ driver receives a message with the highest performance change, it will
update the core ranking and set the cpu's priority.
``amd-pstate`` Preferred Core Switch
-=================================
+=====================================
Kernel Parameters
-----------------
--
2.34.1
The gro self-tests sends the packets to be aggregated with
multiple write operations.
When running is slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue explicitly ignoring failures for such case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
v1 -> v2:
- replace the '-a' operator with '&&' - Mattbe
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
I experimented with a largish refactory replacing the multiple writes
with a single GSO packet, but exhausted by time budget before reaching
any good result.
---
tools/testing/selftests/net/gro.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..3190b41e8bfc 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,10 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [[ ${test} == "large" && -n "${KSFT_MACHINE_SLOW}" ]]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0
The mentioned test is failing in slow environments:
# SO_TXTIME ipv4 clock monotonic
# ./so_txtime: recv: timeout: Resource temporarily unavailable
not ok 1 selftests: net: so_txtime.sh # exit=1
The receiver is started in background and the sender could end-up
transmitting the packet before the receiver is ready, so that the
later recv times out.
Address the issue explcitly waiting for the socket being bound to
the relevant port.
Fixes: af5136f95045 ("selftests/net: SO_TXTIME with ETF and FQ")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that to really cope with slow env the mentioned self-tests also
need net-next commit c41dfb0dfbec ("selftests/net: ignore timing
errors in so_txtime if KSFT_MACHINE_SLOW"), so this could be applied to
net-next, too
---
tools/testing/selftests/net/so_txtime.sh | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/net/so_txtime.sh b/tools/testing/selftests/net/so_txtime.sh
index 3f06f4d286a9..ade0e5755099 100755
--- a/tools/testing/selftests/net/so_txtime.sh
+++ b/tools/testing/selftests/net/so_txtime.sh
@@ -5,6 +5,8 @@
set -e
+source net_helper.sh
+
readonly DEV="veth0"
readonly BIN="./so_txtime"
@@ -51,13 +53,16 @@ do_test() {
local readonly CLOCK="$2"
local readonly TXARGS="$3"
local readonly RXARGS="$4"
+ local PROTO
if [[ "${IP}" == "4" ]]; then
local readonly SADDR="${SADDR4}"
local readonly DADDR="${DADDR4}"
+ PROTO=udp
elif [[ "${IP}" == "6" ]]; then
local readonly SADDR="${SADDR6}"
local readonly DADDR="${DADDR6}"
+ PROTO=udp6
else
echo "Invalid IP version ${IP}"
exit 1
@@ -65,6 +70,7 @@ do_test() {
local readonly START="$(date +%s%N --date="+ 0.1 seconds")"
ip netns exec "${NS2}" "${BIN}" -"${IP}" -c "${CLOCK}" -t "${START}" -S "${SADDR}" -D "${DADDR}" "${RXARGS}" -r &
+ wait_local_port_listen "${NS2}" 8000 "${PROTO}"
ip netns exec "${NS1}" "${BIN}" -"${IP}" -c "${CLOCK}" -t "${START}" -S "${SADDR}" -D "${DADDR}" "${TXARGS}"
wait "$!"
}
--
2.43.0
This patch series is based on the RFC patch from Frederic [1]. Instead
of offering RCU_NOCB as a separate option, it is now lumped into a
root-only cpuset.cpus.isolation_full flag that will enable all the
additional CPU isolation capabilities available for isolated partitions
if set. RCU_NOCB is just the first one to this party. Additional dynamic
CPU isolation capabilities will be added in the future.
The first 2 patches are adopted from Federic with minor twists to fix
merge conflicts and compilation issue. The rests are for implementing
the new cpuset.cpus.isolation_full interface which is essentially a flag
to globally enable or disable full CPU isolation on isolated partitions.
On read, it also shows the CPU isolation capabilities that are currently
enabled. RCU_NOCB requires that the rcu_nocbs option be present in
the kernel boot command line. Without that, the rcu_nocb functionality
cannot be enabled even if the isolation_full flag is set. So we allow
users to check the isolation_full file to verify that if the desired
CPU isolation capability is enabled or not.
Only sanity checking has been done so far. More testing, especially on
the RCU side, will be needed.
[1] https://lore.kernel.org/lkml/20220525221055.1152307-1-frederic@kernel.org/
Frederic Weisbecker (2):
rcu/nocb: Pass a cpumask instead of a single CPU to offload/deoffload
rcu/nocb: Prepare to change nocb cpumask from CPU-hotplug protected
cpuset caller
Waiman Long (6):
rcu/no_cb: Add rcu_nocb_enabled() to expose the rcu_nocb state
cgroup/cpuset: Better tracking of addition/deletion of isolated CPUs
cgroup/cpuset: Add cpuset.cpus.isolation_full
cgroup/cpuset: Enable dynamic rcu_nocb mode on isolated CPUs
cgroup/cpuset: Document the new cpuset.cpus.isolation_full control
file
cgroup/cpuset: Update test_cpuset_prs.sh to handle
cpuset.cpus.isolation_full
Documentation/admin-guide/cgroup-v2.rst | 24 ++
include/linux/rcupdate.h | 15 +-
kernel/cgroup/cpuset.c | 237 ++++++++++++++----
kernel/rcu/rcutorture.c | 6 +-
kernel/rcu/tree_nocb.h | 118 ++++++---
.../selftests/cgroup/test_cpuset_prs.sh | 23 +-
6 files changed, 337 insertions(+), 86 deletions(-)
--
2.39.3
The altnames test uses the forwarding/lib.sh and that dependency
currently causes failures when running the test after install:
make -C tools/testing/selftests/ TARGETS=net install
./tools/testing/selftests/kselftest_install/run_kselftest.sh \
-t net:altnames.sh
# ...
# ./altnames.sh: line 8: ./forwarding/lib.sh: No such file or directory
# RTNETLINK answers: Operation not permitted
# ./altnames.sh: line 73: tests_run: command not found
# ./altnames.sh: line 65: pre_cleanup: command not found
Address the issue leveraging the TEST_INCLUDES infrastructure
provided by commit 2a0683be5b4c ("selftests: Introduce Makefile variable
to list shared bash scripts")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
tools/testing/selftests/net/Makefile | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 211753756bde..7b6918d5f4af 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -97,6 +97,8 @@ TEST_PROGS += vlan_hw_filter.sh
TEST_FILES := settings
TEST_FILES += in_netns.sh lib.sh net_helper.sh setup_loopback.sh setup_veth.sh
+TEST_INCLUDES := forwarding/lib.sh
+
include ../lib.mk
$(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
--
2.43.0
Open vSwitch module accepts actions as a list from the netlink socket
and then creates a copy which it uses in the action set processing.
During processing of the action list on a packet, the module keeps a
count of the execution depth and exits processing if the action depth
goes too high.
However, during netlink processing the recursion depth isn't checked
anywhere, and the copy trusts that kernel has large enough stack to
accommodate it. The OVS sample action was the original action which
could perform this kinds of recursion, and it originally checked that
it didn't exceed the sample depth limit. However, when sample became
optimized to provide the clone() semantics, the recursion limit was
dropped.
This series adds a depth limit during the __ovs_nla_copy_actions() call
that will ensure we don't exceed the max that the OVS userspace could
generate for a clone().
Additionally, this series provides a selftest in 2/2 that can be used to
determine if the OVS module is allowing unbounded access. It can be
safely omitted where the ovs selftest framework isn't available.
Aaron Conole (2):
net: openvswitch: limit the number of recursions from action sets
selftests: openvswitch: Add validation for the recursion test
net/openvswitch/flow_netlink.c | 49 ++++++++-----
.../selftests/net/openvswitch/openvswitch.sh | 13 ++++
.../selftests/net/openvswitch/ovs-dpctl.py | 71 +++++++++++++++----
3 files changed, 102 insertions(+), 31 deletions(-)
--
2.41.0
Fix various problems in the forwarding selftests so that they will pass
in the netdev CI instead of being ignored. See commit messages for
details.
Ido Schimmel (4):
selftests: forwarding: Fix layer 2 miss test flakiness
selftests: forwarding: Fix bridge MDB test flakiness
selftests: forwarding: Suppress grep warnings
selftests: forwarding: Fix bridge locked port test flakiness
.../selftests/net/forwarding/bridge_locked_port.sh | 4 ++--
.../testing/selftests/net/forwarding/bridge_mdb.sh | 14 +++++++++-----
.../selftests/net/forwarding/tc_flower_l2_miss.sh | 8 ++++++--
3 files changed, 17 insertions(+), 9 deletions(-)
--
2.43.0
The two tests that make use of multicast routig (router.sh and
router_multicast.sh) are currently failing in the netdev CI because the
kernel is missing multicast routing support.
Fix by adding the required config entries.
Fixes: 6d4efada3b82 ("selftests: forwarding: Add multicast routing test")
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
---
Targeting at net-next because this is where 4acf4e62cd57 ("selftests:
forwarding: Add missing config entries") was applied to, but you can
apply to net as well.
---
tools/testing/selftests/net/forwarding/config | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/tools/testing/selftests/net/forwarding/config b/tools/testing/selftests/net/forwarding/config
index ba2343514582..f59083d8c59d 100644
--- a/tools/testing/selftests/net/forwarding/config
+++ b/tools/testing/selftests/net/forwarding/config
@@ -2,7 +2,14 @@ CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_NET_L3_MASTER_DEV=y
+CONFIG_IP_MROUTE=y
+CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
+CONFIG_IP_PIMSM_V1=y
+CONFIG_IP_PIMSM_V2=y
+CONFIG_IPV6_MROUTE=y
+CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_MULTIPLE_TABLES=y
+CONFIG_IPV6_PIMSM_V2=y
CONFIG_NET_VRF=m
CONFIG_BPF_SYSCALL=y
CONFIG_CGROUP_BPF=y
--
2.43.0
A couple of small updates for the check_compaction selftest which make
it play more nicely with test automation systems.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
selftests/mm: Log skipped compaction test as a skip
selftests/mm: Log a consistent test name for check_compaction
tools/testing/selftests/mm/compaction_test.c | 37 +++++++++++++++-------------
1 file changed, 20 insertions(+), 17 deletions(-)
---
base-commit: b1d3a0e70c3881d2f8cf6692ccf7c2a4fb2d030d
change-id: 20240208-kselftest-mm-cleanup-30edd2e567cb
Best regards,
--
Mark Brown <broonie(a)kernel.org>
I encountered the following build errors while compiling the selftests net
test cases on Linux next-20240208 tag with clang toolchain.
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
selftests/net/ip_local_port_range
ip_local_port_range.c:152:17: error: use of undeclared identifier
'IPPROTO_MPTCP'
152 | .so_protocol = IPPROTO_MPTCP,
| ^
ip_local_port_range.c:176:17: error: use of undeclared identifier
'IPPROTO_MPTCP'
176 | .so_protocol = IPPROTO_MPTCP,
| ^
2 errors generated.
Build link,
- https://storage.tuxsuite.com/public/linaro/lkft/builds/2c4LtUoRSYhdGbErOY8h…
--
Linaro LKFT
https://lkft.linaro.org
The upcoming RISC-V Ssdtso specification introduces a bit in the senvcfg
CSR to switch the memory consistency model of user mode at run-time from
RVWMO to TSO. The active consistency model can therefore be switched on a
per-hart base and managed by the kernel on a per-process base.
This patchset implements basic Ssdtso support and adds a prctl API on top
so that user-space processes can switch to a stronger memory consistency
model (than the kernel was written for) at run-time.
The patchset also comes with a short documentation of the prctl API.
This series is based on the third draft of the Ssdtso specification
which can be found here:
https://github.com/riscv/riscv-ssdtso/releases/tag/v1.0-draft3
Note, that the Ssdtso specification is in development state
(i.e., not frozen or even ratified) which is also the reason
why this series is marked as RFC.
This series saw the following changes since v1:
* Reordered/restructured patches
* Fixed build issues
* Addressed typos
* Removed ability to switch TSO->WMO
* Moved the state from per-thread to per-process
* Reschedule all CPUs after switching
* Some cleanups in the documentation
* Adding compatibility with Ztso (spec change in draft 3)
This patchset can also be found in this GitHub branch:
https://github.com/cmuellner/linux/tree/ssdtso-v2
A QEMU implementation of DTSO can be found in this GitHub branch:
https://github.com/cmuellner/qemu/tree/ssdtso-v2
Christoph Müllner (6):
mm: Add dynamic memory consistency model switching
uapi: prctl: Add new prctl call to set/get the memory consistency
model
RISC-V: Enable dynamic memory consistency model support with Ssdtso
RISC-V: Implement prctl call to set/get the memory consistency model
RISC-V: Expose Ssdtso via hwprobe API
RISC-V: selftests: Add DTSO tests
Documentation/arch/riscv/hwprobe.rst | 3 +
.../mm/dynamic-memory-consistency-model.rst | 86 ++++++++++++++++
Documentation/mm/index.rst | 1 +
arch/Kconfig | 14 +++
arch/riscv/Kconfig | 11 +++
arch/riscv/include/asm/csr.h | 1 +
arch/riscv/include/asm/dtso.h | 97 +++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/processor.h | 7 ++
arch/riscv/include/asm/switch_to.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 3 +
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/dtso.c | 67 +++++++++++++
arch/riscv/kernel/sys_hwprobe.c | 2 +
include/linux/sched.h | 5 +
include/uapi/linux/prctl.h | 5 +
kernel/sys.c | 12 +++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/dtso/.gitignore | 1 +
tools/testing/selftests/riscv/dtso/Makefile | 11 +++
tools/testing/selftests/riscv/dtso/dtso.c | 82 ++++++++++++++++
23 files changed, 416 insertions(+), 1 deletion(-)
create mode 100644 Documentation/mm/dynamic-memory-consistency-model.rst
create mode 100644 arch/riscv/include/asm/dtso.h
create mode 100644 arch/riscv/kernel/dtso.c
create mode 100644 tools/testing/selftests/riscv/dtso/.gitignore
create mode 100644 tools/testing/selftests/riscv/dtso/Makefile
create mode 100644 tools/testing/selftests/riscv/dtso/dtso.c
--
2.43.0
The reuseport_addr_any.sh is currently skipping DCCP tests and
pmtu.sh is skipping all the FOU/GUE related cases: add the missing
options.
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that this does not include the - still missing - OVS-related
option and pmtu.sh is will keep skipping such cases. Such tests
will still fail in the virtme environment even with the relevant
kernel options enabled, as they have an hard to solve dependency
on systemd/dbus.
The longer term plan is to move such test cases in the openvswitch
directory. One short term option to avoid skips in selftests results
while retaining the potential code coverage would be making the ovs
tests disabled by default but reachable via pmtu.sh command line
arguments.
---
tools/testing/selftests/net/config | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 3b749addd364..54d21e2911a9 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -13,6 +13,10 @@ CONFIG_IPV6=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_VETH=y
CONFIG_NET_IPVTI=y
+CONFIG_NET_FOU=y
+CONFIG_NET_FOU_IP_TUNNELS=y
+CONFIG_NET_IPIP=y
+CONFIG_IPV6_SIT=y
CONFIG_IPV6_VTI=y
CONFIG_DUMMY=y
CONFIG_BRIDGE_VLAN_FILTERING=y
@@ -24,6 +28,7 @@ CONFIG_IFB=y
CONFIG_INET_DIAG=y
CONFIG_INET_ESP=y
CONFIG_INET_ESP_OFFLOAD=y
+CONFIG_IP_DCCP=m
CONFIG_IP_GRE=m
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
--
2.43.0
Here's a follow-up from my RFC series last year:
https://lore.kernel.org/lkml/20221004093131.40392-1-thuth@redhat.com/T/
and from v1 earlier this year:
https://lore.kernel.org/kvm/20230712075910.22480-1-thuth@redhat.com/
Basic idea of this series is now to use the kselftest_harness.h
framework to get TAP output in the tests, so that it is easier
for the user to see what is going on, and e.g. to be able to
detect whether a certain test is part of the test binary or not
(which is useful when tests get extended in the course of time).
v2:
- Dropped the "Rename the ASSERT_EQ macro" patch (already merged)
- Split the fixes in the sync_regs_test into separate patches
(see the first two patches)
- Introduce the KVM_ONE_VCPU_TEST_SUITE() macro as suggested
by Sean (see third patch) and use it in the following patches
- Add a new patch to convert vmx_pmu_caps_test.c, too
Thomas Huth (7):
KVM: selftests: x86: sync_regs_test: Use vcpu_run() where appropriate
KVM: selftests: x86: sync_regs_test: Get regs structure before
modifying it
KVM: selftests: Add a macro to define a test with one vcpu
KVM: selftests: x86: Use TAP interface in the sync_regs test
KVM: selftests: x86: Use TAP interface in the fix_hypercall test
KVM: selftests: x86: Use TAP interface in the vmx_pmu_caps test
KVM: selftests: x86: Use TAP interface in the userspace_msr_exit test
.../selftests/kvm/include/kvm_test_harness.h | 35 +++++
.../selftests/kvm/x86_64/fix_hypercall_test.c | 27 ++--
.../selftests/kvm/x86_64/sync_regs_test.c | 121 +++++++++++++-----
.../kvm/x86_64/userspace_msr_exit_test.c | 19 +--
.../selftests/kvm/x86_64/vmx_pmu_caps_test.c | 50 ++------
5 files changed, 160 insertions(+), 92 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/kvm_test_harness.h
--
2.41.0
On Fri, Nov 24, 2023 at 12:04:09PM +0100, Jonas Oberhauser wrote:
> Unfortunately, at least last time I checked RISC-V still hadn't gotten such
> instructions.
> What they have is the *semantics* of the instructions, but no actual opcodes
> to encode them.
> I argued for them in the RISC-V memory group, but it was considered to be
> outside the scope of that group.
(Sorry for the late, late reply; just recalled this thread...)
That's right. AFAICT, the discussion about the native load-acquire
and store-release instructions was revived somewhere last year within
the RVI community, culminating in the so called Zalasr-proposal [1];
Brendan, Hans and Andrew (+ Cc) might be able to provide more up-to-
date information about the status/plans for that proposal.
(Remark that RISC-V did introduce LR/SCs and AMOs instructions with
acquire/release semantics separately, cf. the so called A-extension.)
Andrea
[1] https://github.com/mehnadnerd/riscv-zalasr
The gro self-tests sends the packets to be aggregated with
multiple write operations.
When running is slow environment, it's hard to guarantee that
the GRO engine will wait for the last packet in an intended
train.
The above causes almost deterministic failures in our CI for
the 'large' test-case.
Address the issue explicitly ignoring failures for such case
in slow environments (KSFT_MACHINE_SLOW==true).
Fixes: 7d1575014a63 ("selftests/net: GRO coalesce test")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
Note that the fixes tag is there mainly to justify targeting the net
tree, and this is aiming at net to hopefully make the test more stable
ASAP for both trees.
I experimented with a largish refactory replacing the multiple writes
with a single GSO packet, but exhausted by time budget before reaching
any good result.
---
tools/testing/selftests/net/gro.sh | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/net/gro.sh b/tools/testing/selftests/net/gro.sh
index 19352f106c1d..114b5281a3f5 100755
--- a/tools/testing/selftests/net/gro.sh
+++ b/tools/testing/selftests/net/gro.sh
@@ -31,6 +31,10 @@ run_test() {
1>>log.txt
wait "${server_pid}"
exit_code=$?
+ if [ ${test} == "large" -a -n "${KSFT_MACHINE_SLOW}" ]; then
+ echo "Ignoring errors due to slow environment" 1>&2
+ exit_code=0
+ fi
if [[ "${exit_code}" -eq 0 ]]; then
break;
fi
--
2.43.0