Recently we committed a fix to allow processes to receive notifications for
non-zero exits via the process connector module. Commit is a4c9a56e6a2c.
However, when a thread calls pthread_exit(&exit_status), the kernel is not
aware of the exit status passed to pthread_exit(). The status is delivered
by the child thread to the parent process only if the parent is waiting in
pthread_join(). Hence, for a thread exiting abnormally, the kernel cannot
send notifications to any listening processes.
The exception to this is if the thread is sent a signal which it has not
handled, and dies along with its process as a result, e.g. SIGSEGV or
SIGKILL. In this case, the kernel is aware of the non-zero exit and sends a
notification for it.
For our use case, we cannot have the parent wait in pthread_join(). One of
the main reasons is that we do not want to track normal pthread_exit()
calls, of which there could be a very large number. We only want to be
notified of any abnormal exits. Hence, threads are created with
pthread_attr_t set to PTHREAD_CREATE_DETACHED.
To fix this problem, we add a new type PROC_CN_MCAST_NOTIFY to the proc
connector API, which allows a thread to send its exit status to the kernel
either when it needs to call pthread_exit() with a non-zero value to
indicate some error, or from a signal handler before pthread_exit().
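For readers less familiar with detached threads, here is a minimal userspace
sketch of the situation described above. It only shows a thread created with
PTHREAD_CREATE_DETACHED that exits with a non-zero status nobody can collect;
it does not use the new connector message type. The names worker(),
spawn_detached() and wait_for_worker() are hypothetical and exist only for
this illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Set by the worker so the caller can tell that it ran (illustration only). */
atomic_int worker_ran;

/* Thread body: reports an error through pthread_exit(). */
void *worker(void *arg)
{
	static int exit_status = 1;	/* non-zero: abnormal exit */

	(void)arg;
	atomic_store(&worker_ran, 1);
	/* The thread is detached, so no pthread_join() can collect this. */
	pthread_exit(&exit_status);
}

/* Start fn(arg) as a detached thread; returns 0 or an error number. */
int spawn_detached(void *(*fn)(void *), void *arg)
{
	pthread_attr_t attr;
	pthread_t tid;
	int ret;

	ret = pthread_attr_init(&attr);
	if (ret)
		return ret;

	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
	if (!ret)
		ret = pthread_create(&tid, &attr, fn, arg);

	pthread_attr_destroy(&attr);
	return ret;
}

/* Poll until the worker has run or timeout_ms expires; returns 1 if it ran. */
int wait_for_worker(int timeout_ms)
{
	while (timeout_ms-- > 0 && !atomic_load(&worker_ran))
		usleep(1000);
	return atomic_load(&worker_ran);
}
```

Calling spawn_detached(worker, NULL) starts the thread; its exit status of 1
is then lost to both the parent and the kernel, which is the gap this series
is meant to close.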
v1->v2 changes:
- Handled comment by Peter Zijlstra to remove locking for PF_EXIT_NOTIFY
task->flags.
- Added error handling in thread.c
v->v1 changes:
- Handled comment by Simon Horman to remove unused err in cn_proc.c
- Handled comment by Simon Horman to make adata and key_display static
in cn_hash_test.c
Anjali Kulkarni (3):
connector/cn_proc: Add hash table for threads
connector/cn_proc: Kunit tests for threads hash table
connector/cn_proc: Selftest for threads
drivers/connector/Makefile | 2 +-
drivers/connector/cn_hash.c | 240 ++++++++++++++++++
drivers/connector/cn_proc.c | 55 +++-
drivers/connector/connector.c | 96 ++++++-
include/linux/connector.h | 47 ++++
include/linux/sched.h | 2 +-
include/uapi/linux/cn_proc.h | 4 +-
lib/Kconfig.debug | 17 ++
lib/Makefile | 1 +
lib/cn_hash_test.c | 167 ++++++++++++
lib/cn_hash_test.h | 12 +
tools/testing/selftests/connector/Makefile | 23 +-
.../testing/selftests/connector/proc_filter.c | 5 +
tools/testing/selftests/connector/thread.c | 116 +++++++++
.../selftests/connector/thread_filter.c | 96 +++++++
15 files changed, 873 insertions(+), 10 deletions(-)
create mode 100644 drivers/connector/cn_hash.c
create mode 100644 lib/cn_hash_test.c
create mode 100644 lib/cn_hash_test.h
create mode 100644 tools/testing/selftests/connector/thread.c
create mode 100644 tools/testing/selftests/connector/thread_filter.c
--
2.46.0
The GCS stress test program currently uses the PID of the threads it
creates in the test names it reports, resulting in unstable test names
between runs. Fix this by using a thread number instead.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/gcs/gcs-stress.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/arm64/gcs/gcs-stress.c b/tools/testing/selftests/arm64/gcs/gcs-stress.c
index bdec7ee8cfd5..03222c36c436 100644
--- a/tools/testing/selftests/arm64/gcs/gcs-stress.c
+++ b/tools/testing/selftests/arm64/gcs/gcs-stress.c
@@ -56,7 +56,7 @@ static int num_processors(void)
return nproc;
}
-static void start_thread(struct child_data *child)
+static void start_thread(struct child_data *child, int id)
{
int ret, pipefd[2], i;
struct epoll_event ev;
@@ -132,7 +132,7 @@ static void start_thread(struct child_data *child)
ev.events = EPOLLIN | EPOLLHUP;
ev.data.ptr = child;
- ret = asprintf(&child->name, "Thread-%d", child->pid);
+ ret = asprintf(&child->name, "Thread-%d", id);
if (ret == -1)
ksft_exit_fail_msg("asprintf() failed\n");
@@ -437,7 +437,7 @@ int main(int argc, char **argv)
tests);
for (i = 0; i < gcs_threads; i++)
- start_thread(&children[i]);
+ start_thread(&children[i], i);
/*
* All children started, close the startup pipe and let them
---
base-commit: bb9ae1a66c85eeb626864efd812c62026e126ec0
change-id: 20241011-arm64-gcs-stress-stable-name-8550519fe152
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Feng Zhou <zhoufeng.zf(a)bytedance.com>
When TCP runs over IPv4 via the INET6 API, sk->sk_family is AF_INET6 but the
packets are IPv4. inet_csk(sk)->icsk_af_ops is ipv6_mapped and uses
ip_queue_xmit. As a result, some sockopts, such as tos, did not take effect.
0001: Use sk_is_inet helper to fix it.
0002: Setget_sockopt add a test for tcp over ipv4 via ipv6.
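As an illustration of what "TCP over IPv4 via the INET6 API" means from
userspace, the sketch below builds an IPv4-mapped IPv6 address
(::ffff:a.b.c.d). Connecting an AF_INET6 TCP socket to such an address gives a
socket whose family is AF_INET6 while the traffic is IPv4. The helper name
make_v4mapped() is hypothetical and is not taken from the selftest.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill *out with the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of ipv4.
 * Returns 0 on success, -1 if ipv4 is not a valid dotted-quad string.
 */
int make_v4mapped(const char *ipv4, uint16_t port, struct sockaddr_in6 *out)
{
	struct in_addr v4;

	if (inet_pton(AF_INET, ipv4, &v4) != 1)
		return -1;

	memset(out, 0, sizeof(*out));
	out->sin6_family = AF_INET6;
	out->sin6_port = htons(port);
	/* Bytes 0-9 stay zero, bytes 10-11 are 0xff, bytes 12-15 hold IPv4. */
	out->sin6_addr.s6_addr[10] = 0xff;
	out->sin6_addr.s6_addr[11] = 0xff;
	memcpy(&out->sin6_addr.s6_addr[12], &v4, sizeof(v4));
	return 0;
}
```

The result can be passed to connect() on a socket(AF_INET6, SOCK_STREAM, 0)
descriptor, provided IPV6_V6ONLY is not set on that socket.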
Changelog:
v2->v3: Addressed comments from Eric Dumazet
- Use sk_is_inet() helper
Details in here:
https://lore.kernel.org/bpf/CANn89i+9GmBLCdgsfH=WWe-tyFYpiO27wONyxaxiU6aOBC…
v1->v2: Addressed comments from kernel test robot
- Fix compilation error
Details in here:
https://lore.kernel.org/bpf/202408152058.YXAnhLgZ-lkp@intel.com/T/
Feng Zhou (2):
bpf: Fix bpf_get/setsockopt to tos not take effect when TCP over IPv4
via INET6 API
selftests/bpf: Setget_sockopt add a test for tcp over ipv4 via ipv6
net/core/filter.c | 7 +++-
.../selftests/bpf/prog_tests/setget_sockopt.c | 33 +++++++++++++++++++
.../selftests/bpf/progs/setget_sockopt.c | 13 ++++++--
3 files changed, 49 insertions(+), 4 deletions(-)
--
2.30.2
From: Jeff Xu <jeffxu(a)chromium.org>
Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces
the can_modify_mm() function with an in-loop check, necessitates an update
to the mseal.rst documentation to reflect this change.
Furthermore, the document has received offline comments regarding the code
sample and suggestions for sentence clarification to enhance reader
comprehension.
[1] https://lore.kernel.org/linux-mm/20240817-mseal-depessimize-v3-0-d8d2e037df…
History:
V3: update according to Randy Dunlap's comment
V2: update according to Randy Dunlap's comments.
https://lore.kernel.org/all/20241001002628.2239032-1-jeffxu@chromium.org/
V1: initial version
https://lore.kernel.org/all/20240927185211.729207-1-jeffxu@chromium.org/
Jeff Xu (1):
mseal: update mseal.rst
Documentation/userspace-api/mseal.rst | 307 +++++++++++++-------------
1 file changed, 148 insertions(+), 159 deletions(-)
--
2.47.0.rc0.187.ge670bccf7e-goog
This series introduces a new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT. It
allows hypervisors to inhibit remote TLB flushing of a vCPU coming from
Hyper-V hyper-calls (namely HvFlushVirtualAddressSpace(Ex) and
HvFlushVirtualAddressList(Ex)). It is required to implement the
HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
emulate VSM within KVM and QEMU. The hyper-call requires several new KVM
APIs, one of which is KVM_HYPERV_SET_TLB_FLUSH_INHIBIT.
Once the inhibit flag is set, any processor attempting to flush the TLB on
the marked vCPU, with a HyperV hyper-call, will be suspended until the
flag is cleared again. During the suspension the vCPU will not run at all,
neither receiving events nor running other code. It will wake up from
suspension once the vCPU it is waiting on clears the inhibit flag. This
behaviour is specified in Microsoft's "Hypervisor Top Level Functional
Specification" (TLFS).
The vCPU will block execution during the suspension, making it transparent
to the hypervisor. An alternative design to what is proposed here would be
to exit from the Hyper-V hypercall upon finding an inhibited vCPU. We
decided against it, to allow for a simpler and more performant
implementation. Exiting to user space would create an additional
synchronisation burden and make the resulting code more complex.
Additionally, since the suspension is specific to HyperV events, it
wouldn't provide any functional benefits.
The TLFS specifies that the instruction pointer is not moved during the
suspension, so upon unsuspending the hyper-call is re-executed. This
means that, if the vCPU encounters another inhibited TLB and is
resuspended, any pending events and interrupts are still executed. This is
identical to the vCPU receiving such events right before the hyper-call.
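The retry semantics can be pictured with a small standalone model. This is a
toy sketch, not KVM code: struct model_vcpu, model_flush() and the enum are
hypothetical names, and the choice to flush nothing when any target is
inhibited is an assumption made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these names come from KVM. */
struct model_vcpu {
	bool flush_inhibited;	/* stands in for the per-vCPU inhibit flag */
	unsigned int flushes;	/* number of TLB flushes applied so far */
};

enum flush_result { FLUSH_DONE, FLUSH_SUSPENDED };

/*
 * Model one execution of a flush hyper-call against targets[0..n).
 * If any target is inhibited, the caller is suspended and (by assumption)
 * nothing is flushed. Because the instruction pointer does not advance,
 * the caller simply executes the same call again after waking up.
 */
enum flush_result model_flush(struct model_vcpu *targets, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (targets[i].flush_inhibited)
			return FLUSH_SUSPENDED;

	for (i = 0; i < n; i++)
		targets[i].flushes++;
	return FLUSH_DONE;
}
```

A caller that gets FLUSH_SUSPENDED would wait until the flag is cleared and
then call model_flush() again with the same arguments, which mirrors the
re-execution described above.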
This inhibiting of TLB flushes is necessary, to securely implement
intercepts. These allow a higher "Virtual Trust Level" (VTL) to react to
a lower VTL accessing restricted memory. In such an intercept the VTL may
want to emulate a memory access in software, however, if another processor
flushes the TLB during that operation, incorrect behaviour can result.
The patch series includes basic testing of the ioctl and suspension state.
All previously passing KVM selftests and KVM unit tests still pass.
Series overview:
- 1: Document the new ioctl
- 2: Implement the suspension state
- 3: Update TLB flush hyper-call in preparation
- 4-5: Implement the ioctl
- 6: Add traces
- 7: Implement testing
As the suspension state is transparent to the hypervisor, testing is
complicated. The current version makes use of a set time interval to give
the vCPU time to enter the hyper-call and get suspended. Ideas for
improvement on this are very welcome.
This series, alongside my series [1] implementing KVM_TRANSLATE2, the
series by Nicolas Saenz Julienne [2] implementing the core building blocks
for VSM and the accompanying QEMU implementation [3], is capable of
booting Windows Server 2019 with VSM/CredentialGuard enabled.
All three series are also available on GitHub [4].
[1] https://lore.kernel.org/linux-kernel/20240910152207.38974-1-nikwip@amazon.d…
[2] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.c…
[3] https://github.com/vianpl/qemu/tree/vsm/next
[4] https://github.com/vianpl/linux/tree/vsm/next
Best,
Nikolas
Nikolas Wipper (7):
KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
KVM: x86: Implement Hyper-V's vCPU suspended state
KVM: x86: Check vCPUs before enqueuing TLB flushes in
kvm_hv_flush_tlb()
KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
KVM: x86: Add trace events to track Hyper-V suspensions
KVM: selftests: Add tests for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
Documentation/virt/kvm/api.rst | 41 +++
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/kvm/hyperv.c | 86 +++++-
arch/x86/kvm/hyperv.h | 17 ++
arch/x86/kvm/trace.h | 39 +++
arch/x86/kvm/x86.c | 41 ++-
include/uapi/linux/kvm.h | 15 +
tools/testing/selftests/kvm/Makefile | 1 +
.../kvm/x86_64/hyperv_tlb_flush_inhibit.c | 274 ++++++++++++++++++
9 files changed, 503 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/hyperv_tlb_flush_inhibit.c
--
2.40.1
MPTCP connection requests toward a listening socket created by the
in-kernel PM for a port based signal endpoint will never be accepted,
they need to be explicitly rejected.
- Patch 1: Explicitly reject such requests. A fix for >= v5.12.
- Patch 2: Cover this case in the MPTCP selftests to avoid regressions.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v2:
- This new version fixes the root cause for the issue Cong Wang sent a
patch for a few weeks ago, see the v1, and the explanations below. The
new version is very different from the v1, from a different author.
Thanks to Cong Wang for the first analysis, and to Paolo for having
spotted the root cause and sent a fix for it.
- Link to v1: https://lore.kernel.org/r/20240908180620.822579-1-xiyou.wangcong@gmail.com
- Link: https://lore.kernel.org/r/a5289a0d-2557-40b8-9575-6f1a0bbf06e4@redhat.com
---
Paolo Abeni (2):
mptcp: prevent MPC handshake on port-based signal endpoints
selftests: mptcp: join: test for prohibited MPC to port-based endp
net/mptcp/mib.c | 1 +
net/mptcp/mib.h | 1 +
net/mptcp/pm_netlink.c | 1 +
net/mptcp/protocol.h | 1 +
net/mptcp/subflow.c | 11 +++
tools/testing/selftests/net/mptcp/mptcp_join.sh | 117 +++++++++++++++++-------
6 files changed, 101 insertions(+), 31 deletions(-)
---
base-commit: 174714f0e505070a16be6fbede30d32b81df789f
change-id: 20241014-net-mptcp-mpc-port-endp-4f88bd428ec7
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
DAMON debugfs interface was the only user interface of DAMON at the
beginning[1]. However, it turned out the interface would not be good
enough for long-term flexibility and stability.
In Feb 2022[2], we therefore introduced DAMON sysfs interface as an
alternative user interface that aims for long-term flexibility and
stability. With its introduction, DAMON debugfs interface was announced
to be deprecated in the near future.
In Feb 2023[3], we announced the official deprecation of DAMON debugfs
interface. In Jan 2024[4], we further made the deprecation difficult to
ignore.
And as of this writing (2024-10-14), no problems or concerns about the
deprecation have been reported. Apparently users have already moved to the
alternative, or made good plans for the change.
Remove the DAMON debugfs interface code from the tree. Given the past
timeline and the absence of reported problems or concerns, it is safe
enough to do so. That said, we will not drop the RFC tag of this
patch series at least until the end of this year, to use this as the
real last call for users.
[1] https://lore.kernel.org/20210716081449.22187-1-sj38.park@gmail.com
[2] https://lore.kernel.org/20220228081314.5770-1-sj@kernel.org
[3] https://lore.kernel.org/20230209192009.7885-1-sj@kernel.org
[4] https://lore.kernel.org/20240130013549.89538-1-sj@kernel.org
SeongJae Park (7):
Docs/admin-guide/mm/damon/usage: remove DAMON debugfs interface
documentation
Docs/mm/damon/design: update for removal of DAMON debugfs interface
selftests/damon/config: remove configs for DAMON debugfs interface
selftests
selftests/damon: remove tests for DAMON debugfs interface
kunit: configs: remove configs for DAMON debugfs interface tests
mm/damon: remove DAMON debugfs interface kunit tests
mm/damon: remove DAMON debugfs interface
Documentation/admin-guide/mm/damon/usage.rst | 309 -----
Documentation/mm/damon/design.rst | 23 +-
mm/damon/Kconfig | 30 -
mm/damon/Makefile | 1 -
mm/damon/dbgfs.c | 1148 -----------------
mm/damon/tests/.kunitconfig | 7 -
mm/damon/tests/dbgfs-kunit.h | 173 ---
tools/testing/kunit/configs/all_tests.config | 3 -
tools/testing/selftests/damon/.gitignore | 3 -
tools/testing/selftests/damon/Makefile | 11 +-
tools/testing/selftests/damon/config | 1 -
.../testing/selftests/damon/debugfs_attrs.sh | 17 -
.../debugfs_duplicate_context_creation.sh | 27 -
.../selftests/damon/debugfs_empty_targets.sh | 21 -
.../damon/debugfs_huge_count_read_write.sh | 22 -
.../damon/debugfs_rm_non_contexts.sh | 19 -
.../selftests/damon/debugfs_schemes.sh | 19 -
.../selftests/damon/debugfs_target_ids.sh | 19 -
.../damon/debugfs_target_ids_pid_leak.c | 68 -
.../damon/debugfs_target_ids_pid_leak.sh | 22 -
...fs_target_ids_read_before_terminate_race.c | 80 --
...s_target_ids_read_before_terminate_race.sh | 14 -
.../selftests/damon/huge_count_read_write.c | 48 -
23 files changed, 11 insertions(+), 2074 deletions(-)
delete mode 100644 mm/damon/dbgfs.c
delete mode 100644 mm/damon/tests/dbgfs-kunit.h
delete mode 100755 tools/testing/selftests/damon/debugfs_attrs.sh
delete mode 100755 tools/testing/selftests/damon/debugfs_duplicate_context_creation.sh
delete mode 100755 tools/testing/selftests/damon/debugfs_empty_targets.sh
delete mode 100755 tools/testing/selftests/damon/debugfs_huge_count_read_write.sh
delete mode 100755 tools/testing/selftests/damon/debugfs_rm_non_contexts.sh
delete mode 100755 tools/testing/selftests/damon/debugfs_schemes.sh
delete mode 100755 tools/testing/selftests/damon/debugfs_target_ids.sh
delete mode 100644 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.c
delete mode 100755 tools/testing/selftests/damon/debugfs_target_ids_pid_leak.sh
delete mode 100644 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.c
delete mode 100755 tools/testing/selftests/damon/debugfs_target_ids_read_before_terminate_race.sh
delete mode 100644 tools/testing/selftests/damon/huge_count_read_write.c
base-commit: 5ef943709a1b88304aa6e8cb8683a25bf81874f0
--
2.39.5
A PACKET socket can retain its fanout membership through link down and up,
and can leave a fanout while closed regardless of link state.
However, a socket was forbidden from joining a fanout while it was not
RUNNING.
This scenario was identified while studying DPDK pmd_af_packet_drv.
Since sockets are only created during initialization, there is no reason
to fail the initialization if a single link is temporarily down.
This patch allows a PACKET socket to join a fanout while not RUNNING.
Selftest psock_fanout is extended to test this "fanout while link down"
scenario.
Selftest psock_fanout is also extended to test fanout create/join by
socket that did not bind or specified a protocol, which carries an
implicit bind.
This is the only test that was performed.
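For context, joining a fanout from userspace is a single setsockopt() call
whose integer argument carries the group id in the low 16 bits and the fanout
type and flags in the high 16 bits. The sketch below shows that encoding. The
helper names packet_fanout_arg() and join_fanout() are hypothetical and are
not taken from psock_fanout.c.

```c
#include <linux/if_packet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Encode the PACKET_FANOUT option value: group id low, type/flags high. */
int packet_fanout_arg(uint16_t group_id, uint16_t type_flags)
{
	return (int)(((uint32_t)type_flags << 16) | group_id);
}

/* Join fd to a fanout group; returns 0, or -1 with errno set. */
int join_fanout(int fd, uint16_t group_id, uint16_t type_flags)
{
	int arg = packet_fanout_arg(group_id, type_flags);

	return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
}
```

With this series applied, a call such as join_fanout(fd, 42,
PACKET_FANOUT_HASH) on a bound PACKET socket is expected to succeed even while
the underlying link is down. Creating the socket itself still requires
CAP_NET_RAW.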
Changes:
V04:
* Minimized code change.
* Removed test of ifindex. A socket that went through bind "unlisted" race can
join a fanout.
V03: https://lore.kernel.org/netdev/cover.1728555449.git.gur.stavi@huawei.com
* psock_fanout: add test for joining fanout with unbound socket.
* Test that socket can receive packets before adding it to a fanout match.
This kind of replaces the RUNNING test that was removed.
* Initialize po->ifindex in packet_create. To -1 if no protocol is specified
and add an explicit initialization to 0 if protocol is specified.
* Refactor relevant code in fanout_add within bind_lock, as a sequence of
if {} else if {}, in order to reduce indentation of nested if statements and
provide specific error codes.
V02: https://lore.kernel.org/netdev/cover.1728382839.git.gur.stavi@huawei.com
* psock_fanout: use explicit loopback up/down instead of toggle.
* psock_fanout: don't try to restore loopback state on failure.
* Rephrase commit message about "leaving a fanout".
V01: https://lore.kernel.org/netdev/cover.1728303615.git.gur.stavi@huawei.com/
Gur Stavi (3):
af_packet: allow fanout_add when socket is not RUNNING
selftests: net/psock_fanout: socket joins fanout when link is down
selftests: net/psock_fanout: unbound socket fanout
net/packet/af_packet.c | 9 +--
tools/testing/selftests/net/psock_fanout.c | 78 +++++++++++++++++++++-
2 files changed, 80 insertions(+), 7 deletions(-)
base-commit: c531f2269a53db5cf64b24baf785ccbcda52970f
--
2.45.2
Recently we committed a fix to allow processes to receive notifications for
non-zero exits via the process connector module. Commit is a4c9a56e6a2c.
However, when a thread calls pthread_exit(&exit_status), the kernel is not
aware of the exit status passed to pthread_exit(). The status is delivered
by the child thread to the parent process only if the parent is waiting in
pthread_join(). Hence, for a thread exiting abnormally, the kernel cannot
send notifications to any listening processes.
The exception to this is if the thread is sent a signal which it has not
handled, and dies along with its process as a result, e.g. SIGSEGV or
SIGKILL. In this case, the kernel is aware of the non-zero exit and sends a
notification for it.
For our use case, we cannot have the parent wait in pthread_join(). One of
the main reasons is that we do not want to track normal pthread_exit()
calls, of which there could be a very large number. We only want to be
notified of any abnormal exits. Hence, threads are created with
pthread_attr_t set to PTHREAD_CREATE_DETACHED.
To fix this problem, we add a new type PROC_CN_MCAST_NOTIFY to the proc
connector API, which allows a thread to send its exit status to the kernel
either when it needs to call pthread_exit() with a non-zero value to
indicate some error, or from a signal handler before pthread_exit().
v->v1 changes:
- Handled comment by Simon Horman to remove unused err in cn_proc.c
- Handled comment by Simon Horman to make adata and key_display static
in cn_hash_test.c
Anjali Kulkarni (3):
connector/cn_proc: Add hash table for threads
connector/cn_proc: Kunit tests for threads hash table
connector/cn_proc: Selftest for threads
drivers/connector/Makefile | 2 +-
drivers/connector/cn_hash.c | 240 ++++++++++++++++++
drivers/connector/cn_proc.c | 58 ++++-
drivers/connector/connector.c | 96 ++++++-
include/linux/connector.h | 47 ++++
include/linux/sched.h | 2 +-
include/uapi/linux/cn_proc.h | 4 +-
lib/Kconfig.debug | 17 ++
lib/Makefile | 1 +
lib/cn_hash_test.c | 167 ++++++++++++
lib/cn_hash_test.h | 12 +
tools/testing/selftests/connector/Makefile | 23 +-
.../testing/selftests/connector/proc_filter.c | 5 +
tools/testing/selftests/connector/thread.c | 90 +++++++
.../selftests/connector/thread_filter.c | 93 +++++++
15 files changed, 847 insertions(+), 10 deletions(-)
create mode 100644 drivers/connector/cn_hash.c
create mode 100644 lib/cn_hash_test.c
create mode 100644 lib/cn_hash_test.h
create mode 100644 tools/testing/selftests/connector/thread.c
create mode 100644 tools/testing/selftests/connector/thread_filter.c
--
2.46.0
This splits the preparation work for the iommu core and the Intel iommu
driver out from the iommufd pasid attach/replace series. [1]
To support domain replacement, the definition of the set_dev_pasid op
needs to be enhanced. Meanwhile, the existing set_dev_pasid callbacks
should be extended as well to suit the new definition.
This series first prepares the Intel iommu set_dev_pasid op for the new
definition, adds the missing set_dev_pasid support for nested domains, makes
the ARM SMMUv3 set_dev_pasid op suit the new definition, and in the end
enhances the definition of the set_dev_pasid op. To meet the new definition,
the AMD set_dev_pasid callback is extended to fail if the caller tries to do
domain replacement. The AMD iommu driver would support it later per
Vasant [2].
[1] https://lore.kernel.org/linux-iommu/20240412081516.31168-1-yi.l.liu@intel.c…
[2] https://lore.kernel.org/linux-iommu/fa9c4fc3-9365-465e-8926-b4d2d6361b9c@am…
v2:
- Make ARM SMMUv3 set_dev_pasid op support domain replacement (Jason)
- Drop patch 03 of v1 (Kevin)
- Multiple tweaks in VT-d driver (Kevin)
v1: https://lore.kernel.org/linux-iommu/20240628085538.47049-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Jason Gunthorpe (1):
iommu/arm-smmu-v3: Make smmuv3 set_dev_pasid() op support replace
Lu Baolu (1):
iommu/vt-d: Add set_dev_pasid callback for nested domain
Yi Liu (4):
iommu: Pass old domain to set_dev_pasid op
iommu/vt-d: Move intel_drain_pasid_prq() into
intel_pasid_tear_down_entry()
iommu/vt-d: Make intel_iommu_set_dev_pasid() to handle domain
replacement
iommu: Make set_dev_pasid op support domain replacement
drivers/iommu/amd/amd_iommu.h | 3 +-
drivers/iommu/amd/pasid.c | 6 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 5 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +-
drivers/iommu/intel/iommu.c | 122 ++++++++++++------
drivers/iommu/intel/iommu.h | 3 +
drivers/iommu/intel/nested.c | 1 +
drivers/iommu/intel/pasid.c | 13 +-
drivers/iommu/intel/pasid.h | 8 +-
drivers/iommu/intel/svm.c | 6 +-
drivers/iommu/iommu.c | 3 +-
include/linux/iommu.h | 5 +-
13 files changed, 129 insertions(+), 56 deletions(-)
--
2.34.1
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
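The BL/RET behaviour described above can be pictured with a small software
model. This is only a sketch of the check, not of the hardware or of the
kernel code in this series; struct model_gcs, model_bl() and model_ret() are
hypothetical names, and the fixed depth is arbitrary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_GCS_DEPTH 64	/* arbitrary size for the sketch */

struct model_gcs {
	uint64_t slots[MODEL_GCS_DEPTH];
	size_t top;		/* number of return addresses stored */
};

/* Model of BL: the return address placed in LR is also pushed. */
bool model_bl(struct model_gcs *gcs, uint64_t lr)
{
	if (gcs->top == MODEL_GCS_DEPTH)
		return false;	/* the sketch just reports failure when full */
	gcs->slots[gcs->top++] = lr;
	return true;
}

/* Model of RET: pop the top entry and compare it with LR. */
bool model_ret(struct model_gcs *gcs, uint64_t lr)
{
	if (gcs->top == 0)
		return false;	/* nothing to pop: treated as a fault */
	return gcs->slots[--gcs->top] == lr;	/* false means a fault */
}
```

A return address that was overwritten on the normal stack no longer matches
the copy recorded at call time, so model_ret() returns false, which
corresponds to the fault raised by the hardware.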
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests. It does not enable use of GCS
by either EL1 or EL2, this will be implemented separately. Executables
are started without GCS and must use a prctl() to enable it, it is
expected that this will be done very early in application execution by
the dynamic linker or other startup code. For dynamic linking this will
be done by checking that everything in the executable is marked as GCS
compatible.
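As a rough illustration of the prctl() interface, the sketch below only
queries the current shadow stack status of the calling thread. Enabling GCS is
deliberately not shown, since it has to happen in early startup code. The
fallback numeric values are assumptions taken from the generic shadow stack
prctl() proposal and should be checked against the installed linux/prctl.h.
The helper name shadow_stack_enabled() is hypothetical.

```c
#include <sys/prctl.h>

/* Assumed values, used only if the installed headers do not define them. */
#ifndef PR_GET_SHADOW_STACK_STATUS
#define PR_GET_SHADOW_STACK_STATUS 74
#endif
#ifndef PR_SHADOW_STACK_ENABLE
#define PR_SHADOW_STACK_ENABLE (1UL << 0)
#endif

/*
 * Returns 1 if a shadow stack (GCS on arm64) is enabled for the calling
 * thread, 0 if it is not, and -1 if the kernel rejects the prctl(), for
 * example because the feature is not supported.
 */
int shadow_stack_enabled(void)
{
	unsigned long mode = 0;

	if (prctl(PR_GET_SHADOW_STACK_STATUS, &mode, 0, 0, 0) != 0)
		return -1;
	return (mode & PR_SHADOW_STACK_ENABLE) ? 1 : 0;
}
```

On a kernel or CPU without support the call is expected to fail, so callers
should treat -1 as "not available" rather than as an error worth aborting on.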
x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible. As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.
The main divergence I am conscious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it. This is to avoid
races with things actively walking the GCS during a disable, we do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.
x86 uses an arch_prctl() to manage enable and disable, since only x86
and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V Zicfiss feature which I initially
adopted fairly directly but following review feedback has been revised
quite a bit.
We currently maintain the x86 pattern of implicitly allocating a shadow
stack for threads started with shadow stack enabled, there has been some
discussion of removing this support and requiring the use of clone3()
with explicit allocation of shadow stacks instead. I have no strong
feelings either way, implicit allocation is not really consistent with
anything else we do and creates the potential for errors around thread
exit but on the other hand it is existing ABI on x86 and minimises the
changes needed in userspace code.
glibc and bionic changes using this ABI have been implemented and
tested. Headless Android systems have been validated and Ross Burton
has used this code to bring up a Yocto system with GCS enabled as
standard, a test implementation of V8 support has also been done.
uprobes are not currently supported, missing emulation was identified
late in review.
There is an open issue with support for CRIU, on x86 this required the
ability to set the GCS mode via ptrace. This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.
It is likely that we could relax some of the barriers added here with
some more targeted placements, this is left for further study.
There is an in process series adding clone3() support for shadow stacks:
https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Previous versions of this series depended on that, this dependency has
been removed in order to make merging easier.
[1] https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v13:
- Rebase onto v6.12-rc1.
- Allocate VM_HIGH_ARCH_6 since protection keys used all the existing
bits.
- Implement mm_release() and free transparently allocated GCSs there.
- Use bit 32 of AT_HWCAP for GCS due to AT_HWCAP2 being filled.
- Since we now only set GCSCRE0_EL1 on change ensure that it is
initialised with GCSPR_EL0 accessible to EL0.
- Fix OOM handling on thread copy.
- Link to v12: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec947436a@kernel.org
Changes in v12:
- Clarify and simplify the signal handling code so we work with the
register state.
- When checking for write aborts to shadow stack pages ensure the fault
is a data abort.
- Depend on !UPROBES.
- Comment cleanups.
- Link to v11: https://lore.kernel.org/r/20240822-arm64-gcs-v11-0-41b81947ecb5@kernel.org
Changes in v11:
- Remove the dependency on the addition of clone3() support for shadow
stacks, rebasing onto v6.11-rc3.
- Make ID_AA64PFR1_EL1.GCS writeable in KVM.
- Hide GCS registers when GCS is not enabled for KVM guests.
- Require HCRX_EL2.GCSEn if booting at EL1.
- Require that GCSCR_EL1 and GCSCRE0_EL1 be initialised regardless of
if we boot at EL2 or EL1.
- Remove some stray use of bit 63 in signal cap tokens.
- Warn if we see a GCS with VM_SHARED.
- Remove redundant check for VM_WRITE in fault handling.
- Cleanups and clarifications in the ABI document.
- Clean up and improve documentation of some sync placement.
- Only set the EL0 GCS mode if it's actually changed.
- Various minor fixes and tweaks.
- Link to v10: https://lore.kernel.org/r/20240801-arm64-gcs-v10-0-699e2bd2190b@kernel.org
Changes in v10:
- Fix issues with THP.
- Tighten up requirements for initialising GCSCR*.
- Only generate GCS signal frames for threads using GCS.
- Only context switch EL1 GCS registers if S1PIE is enabled.
- Move context switch of GCSCRE0_EL1 to EL0 context switch.
- Make GCS registers unconditionally visible to userspace.
- Use FHU infrastructure.
- Don't change writability of ID_AA64PFR1_EL1 for KVM.
- Remove unused arguments from alloc_gcs().
- Typo fixes.
- Link to v9: https://lore.kernel.org/r/20240625-arm64-gcs-v9-0-0f634469b8f0@kernel.org
Changes in v9:
- Rebase onto v6.10-rc3.
- Restructure and clarify memory management fault handling.
- Fix up basic-gcs for the latest clone3() changes.
- Convert to newly merged KVM ID register based feature configuration.
- Fixes for NV traps.
- Link to v8: https://lore.kernel.org/r/20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org
Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not
work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org
Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (40):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
mm: Define VM_HIGH_ARCH_6
arm64/mm: Restructure arch_validate_flags() for extensibility
prctl: arch-agnostic prctl for shadow stack
mman: Add map_shadow_stack() flags
arm64: Document boot requirements for Guarded Control Stacks
arm64/gcs: Document the ABI for Guarded Control Stacks
arm64/sysreg: Add definitions for architected GCS caps
arm64/gcs: Add manual encodings of GCS instructions
arm64/gcs: Provide put_user_gcs()
arm64/gcs: Provide basic EL2 setup to allow GCS usage at EL0 and EL1
arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
arm64/mm: Allocate PIE slots for EL0 guarded control stack
mm: Define VM_SHADOW_STACK for arm64 when we support GCS
arm64/mm: Map pages for guarded control stack
KVM: arm64: Manage GCS access and registers for guests
arm64/idreg: Add overrride for GCS
arm64/hwcap: Add hwcap for GCS
arm64/traps: Handle GCS exceptions
arm64/mm: Handle GCS data aborts
arm64/gcs: Context switch GCS state for EL0
arm64/gcs: Ensure that new threads have a GCS
arm64/gcs: Implement shadow stack prctl() interface
arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/signal: Expose GCS state in signal frames
arm64/ptrace: Expose GCS via ptrace and core files
arm64: Add Kconfig for Guarded Control Stack (GCS)
kselftest/arm64: Verify the GCS hwcap
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add test coverage for GCS mode locking
kselftest/arm64: Add GCS signal tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Enable GCS for the FP stress tests
KVM: selftests: arm64: Add GCS registers to get-reg-list
Documentation/admin-guide/kernel-parameters.txt | 3 +
Documentation/arch/arm64/booting.rst | 32 +
Documentation/arch/arm64/elf_hwcaps.rst | 4 +
Documentation/arch/arm64/gcs.rst | 230 +++++++
Documentation/arch/arm64/index.rst | 1 +
Documentation/filesystems/proc.rst | 2 +-
arch/arm64/Kconfig | 21 +
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 30 +
arch/arm64/include/asm/esr.h | 28 +-
arch/arm64/include/asm/exception.h | 2 +
arch/arm64/include/asm/gcs.h | 107 +++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_host.h | 12 +
arch/arm64/include/asm/mman.h | 23 +-
arch/arm64/include/asm/mmu_context.h | 9 +
arch/arm64/include/asm/pgtable-prot.h | 14 +-
arch/arm64/include/asm/processor.h | 7 +
arch/arm64/include/asm/sysreg.h | 20 +
arch/arm64/include/asm/uaccess.h | 40 ++
arch/arm64/include/asm/vncr_mapping.h | 2 +
arch/arm64/include/uapi/asm/hwcap.h | 3 +-
arch/arm64/include/uapi/asm/ptrace.h | 8 +
arch/arm64/include/uapi/asm/sigcontext.h | 9 +
arch/arm64/kernel/cpufeature.c | 23 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-common.c | 23 +
arch/arm64/kernel/pi/idreg-override.c | 2 +
arch/arm64/kernel/process.c | 94 +++
arch/arm64/kernel/ptrace.c | 62 +-
arch/arm64/kernel/signal.c | 227 ++++++-
arch/arm64/kernel/traps.c | 11 +
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 31 +
arch/arm64/kvm/sys_regs.c | 27 +-
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fault.c | 40 ++
arch/arm64/mm/gcs.c | 254 +++++++
arch/arm64/mm/mmap.c | 9 +-
arch/arm64/tools/cpucaps | 1 +
arch/x86/Kconfig | 1 +
arch/x86/include/uapi/asm/mman.h | 3 -
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 18 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 22 +
kernel/sys.c | 30 +
mm/Kconfig | 6 +
tools/testing/selftests/arm64/Makefile | 2 +-
tools/testing/selftests/arm64/abi/hwcap.c | 19 +
tools/testing/selftests/arm64/fp/assembler.h | 15 +
tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
tools/testing/selftests/arm64/fp/sve-test.S | 2 +
tools/testing/selftests/arm64/fp/za-test.S | 2 +
tools/testing/selftests/arm64/fp/zt-test.S | 2 +
tools/testing/selftests/arm64/gcs/.gitignore | 5 +
tools/testing/selftests/arm64/gcs/Makefile | 24 +
tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
tools/testing/selftests/arm64/gcs/basic-gcs.c | 357 ++++++++++
tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++
.../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++
tools/testing/selftests/arm64/gcs/gcs-stress.c | 530 +++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++
tools/testing/selftests/arm64/gcs/libc-gcs.c | 728 +++++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.c | 17 +-
.../testing/selftests/arm64/signal/test_signals.h | 6 +
.../selftests/arm64/signal/test_signals_utils.c | 32 +-
.../selftests/arm64/signal/test_signals_utils.h | 39 ++
.../arm64/signal/testcases/gcs_exception_fault.c | 62 ++
.../selftests/arm64/signal/testcases/gcs_frame.c | 88 +++
.../arm64/signal/testcases/gcs_write_fault.c | 67 ++
.../selftests/arm64/signal/testcases/testcases.c | 7 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
tools/testing/selftests/kvm/aarch64/get-reg-list.c | 28 +
75 files changed, 4120 insertions(+), 34 deletions(-)
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Recently, a defer helper was added to Python selftests. The idea is to keep
cleanup commands close to their dirtying counterparts, thereby making it
more transparent what is cleaning up what, making it harder to miss a
cleanup, and making the whole cleanup business exception safe. All these
benefits are applicable to bash as well, where exception safety can be
interpreted in terms of safety vs. a SIGINT.
This patchset therefore introduces a framework of several helpers that
serve to schedule cleanups in bash selftests.
- Patch #1 has more details about the primitives being introduced.
- Patch #2 adds a fallback cleanup() function to lib.sh, because ideally
selftests wouldn't need to introduce a dedicated cleanup function at all.
- Patch #3 adds a parameter to stop_traffic(), which makes it possible to
start other background processes after the traffic is started without
confusing the cleanup.
- Patches #4 to #10 convert a number of selftests.
The goal was to convert all tests that use start_traffic / stop_traffic
to the defer framework. Leftover traffic generators are a particularly
painful sort of a missed cleanup. Normal unfinished cleanups can usually
be cleaned up simply by rerunning the test and interrupting it early to
let the cleanups run again / in full. This does not work with
stop_traffic, because it is only issued at the end of the test case that
starts the traffic. At the same time, leftover traffic generators
influence follow-up test runs, and are hard to notice.
The tests were however converted whole-sale, not just their traffic bits.
Thus they form a proof of concept of the defer framework.
v1 (from the RFC):
- Patch #1:
- Added the priority defer track
- Dropped defer_scoped_fn, added in_defer_scope
- Extracted to a separate independent module
- Patch #2:
- Moved this bit to a separate patch
- Patch #3:
- New patch
- Patch #4 (RED):
- Squashed the individual RED-related patches into one
- Converted the SW datapath RED selftest as well
- Patch #5 (TBF):
- Fully converted the selftest, not just stop_traffic
- Patches #6, #7, #8, #9, #10:
- New patches
Petr Machata (10):
selftests: net: lib: Introduce deferred commands
selftests: forwarding: Add a fallback cleanup()
selftests: forwarding: lib: Allow passing PID to stop_traffic()
selftests: RED: Use defer for test cleanup
selftests: TBF: Use defer for test cleanup
selftests: ETS: Use defer for test cleanup
selftests: mlxsw: qos_mc_aware: Use defer for test cleanup
selftests: mlxsw: qos_ets_strict: Use defer for test cleanup
selftests: mlxsw: qos_max_descriptors: Use defer for test cleanup
selftests: mlxsw: devlink_trap_police: Use defer for test cleanup
.../drivers/net/mlxsw/devlink_trap_policer.sh | 85 ++++-----
.../drivers/net/mlxsw/qos_ets_strict.sh | 167 ++++++++---------
.../drivers/net/mlxsw/qos_max_descriptors.sh | 118 +++++-------
.../drivers/net/mlxsw/qos_mc_aware.sh | 146 +++++++--------
.../selftests/drivers/net/mlxsw/sch_ets.sh | 26 ++-
.../drivers/net/mlxsw/sch_red_core.sh | 171 +++++++++---------
.../drivers/net/mlxsw/sch_red_ets.sh | 24 +--
.../drivers/net/mlxsw/sch_red_root.sh | 18 +-
tools/testing/selftests/net/forwarding/lib.sh | 13 +-
.../selftests/net/forwarding/sch_ets.sh | 7 +-
.../selftests/net/forwarding/sch_ets_core.sh | 81 +++------
.../selftests/net/forwarding/sch_ets_tests.sh | 14 +-
.../selftests/net/forwarding/sch_red.sh | 103 ++++-------
.../selftests/net/forwarding/sch_tbf_core.sh | 91 +++-------
.../net/forwarding/sch_tbf_etsprio.sh | 7 +-
.../selftests/net/forwarding/sch_tbf_root.sh | 3 +-
tools/testing/selftests/net/lib.sh | 3 +
tools/testing/selftests/net/lib/Makefile | 2 +-
tools/testing/selftests/net/lib/sh/defer.sh | 115 ++++++++++++
19 files changed, 587 insertions(+), 607 deletions(-)
create mode 100644 tools/testing/selftests/net/lib/sh/defer.sh
--
2.45.0
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.
The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics but incurs the overhead of
a VMA for each such region.
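For reference, the existing PROT_NONE approach described above can be sketched in userspace roughly as follows. This is a minimal illustration only: alloc_with_guard() is a hypothetical helper name, not an existing API, and the layout (one guard page below the usable area) is an assumption made for the example. Because the guard page and the usable area end up with different protections, the kernel has to track them as separate VMAs, which is the per-region cost the series aims to avoid.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `usable` bytes of read/write memory preceded by one inaccessible
 * guard page. Returns a pointer to the usable area, or NULL on failure.
 * alloc_with_guard() is a hypothetical helper used only for illustration.
 */
static void *alloc_with_guard(size_t usable)
{
	long page = sysconf(_SC_PAGESIZE);
	char *base;

	if (page <= 0)
		return NULL;

	/* Reserve the whole range with no access permissions at all. */
	base = mmap(NULL, (size_t)page + usable, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/*
	 * Open up everything except the first page. The first page stays
	 * PROT_NONE, so any access to it raises a fatal signal (SIGSEGV).
	 */
	if (mprotect(base + page, usable, PROT_READ | PROT_WRITE) != 0) {
		munmap(base, (size_t)page + usable);
		return NULL;
	}

	return base + page;
}
```

A caller would use the returned pointer like ordinary memory; touching the page immediately below it faults. On Linux the split between the two protections should be visible as two adjacent entries in /proc/self/maps.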
With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlastimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.
After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.
Any poisoned region of memory can also be 'unpoisoned', that is, have
its poison markers removed.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
poisoning.
Poisoning can be performed across multiple VMAs and any existing mappings
will be cleared, that is zapped, before installing the poisoned page table
mappings.
There is no concept of 'nested' poisoning, multiple attempts to poison a
range will, after the first poisoning, have no effect.
Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
memory, so a user can safely unpoison a range of memory and clear only
poison page table mappings leaving the rest intact.
The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.
We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).
We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).
It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the poisoning if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.
An extensive set of self-tests is provided, which ensures behaviour is as
expected and additionally self-documents the expected behaviour of poisoned
ranges.
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Suggested-by: Jann Horn <jannh(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Lorenzo Stoakes (4):
mm: pagewalk: add the ability to install PTEs
mm: add PTE_MARKER_GUARD PTE marker
mm: madvise: implement lightweight guard page mechanism
selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
include/linux/mm_inline.h | 2 +-
include/linux/pagewalk.h | 18 +-
include/linux/swapops.h | 26 +-
include/uapi/asm-generic/mman-common.h | 3 +
mm/hugetlb.c | 3 +
mm/internal.h | 6 +
mm/madvise.c | 158 +++
mm/memory.c | 18 +-
mm/mprotect.c | 3 +-
mm/mseal.c | 1 +
mm/pagewalk.c | 174 ++--
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
18 files changed, 1525 insertions(+), 69 deletions(-)
create mode 100644 tools/testing/selftests/mm/guard-pages.c
--
2.46.2
From: Xiu Jianfeng <xiujianfeng(a)huawei.com>
When compiling the cgroup selftests with the following command:
make -C tools/testing/selftests/cgroup/
the compiler complains as below:
test_cpu.c: In function ‘test_cpucg_nice’:
test_cpu.c:284:39: error: incompatible type for argument 2 of ‘hog_cpus_timed’
284 | hog_cpus_timed(cpucg, param);
| ^~~~~
| |
| struct cpu_hog_func_param
test_cpu.c:132:53: note: expected ‘void *’ but argument is of type ‘struct cpu_hog_func_param’
132 | static int hog_cpus_timed(const char *cgroup, void *arg)
| ~~~~~~^~~
Fix it by passing the address of param to hog_cpus_timed().
Fixes: 2e82c0d4562a ("cgroup/rstat: Selftests for niced CPU statistics")
Signed-off-by: Xiu Jianfeng <xiujianfeng(a)huawei.com>
---
tools/testing/selftests/cgroup/test_cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index 201ce14cb422..a2b50af8e9ee 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -281,7 +281,7 @@ static int test_cpucg_nice(const char *root)
/* Try to keep niced CPU usage as constrained to hog_cpu as possible */
nice(1);
- hog_cpus_timed(cpucg, param);
+ hog_cpus_timed(cpucg, &param);
exit(0);
} else {
waitpid(pid, &status, 0);
--
2.34.1
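The type error fixed in the cgroup selftest patch above can be shown in isolation. The sketch below is not the selftest code: struct hog_param, hog_total() and run_hog() are hypothetical stand-ins, chosen only to illustrate why a function taking a void * argument must be given the address of the struct rather than the struct itself.

```c
/* Hypothetical stand-in for the selftest's parameter struct. */
struct hog_param {
	int nprocs;
	long seconds;
};

/* Callback-style signature: the argument arrives as an untyped pointer. */
static long hog_total(const char *name, void *arg)
{
	const struct hog_param *p = arg;	/* recover the real type */

	(void)name;
	return p->nprocs * p->seconds;
}

static long run_hog(void)
{
	struct hog_param param = { .nprocs = 4, .seconds = 10 };

	/*
	 * hog_total("cg", param) would be rejected by the compiler with an
	 * "incompatible type" error, because a struct value cannot be
	 * converted to void *. Passing its address is the fix.
	 */
	return hog_total("cg", &param);
}
```

Note that the pointer is only valid while param is in scope, which holds here because the callee uses it synchronously.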
Hello all,
This patch series offers improvements to the way .BTF_ids section data is
created and later patched by resolve_btfids.
Patch #1 simplifies the byte-order translation in resolve_btfids while
making it more resilient to future .BTF_ids encoding updates.
Patch #2 makes sure all BTF ID data is 4-byte aligned, and not only the
.BTF_ids used for vmlinux.
Patch #3 syncs the above changes in btf_ids.h to tools/include, obviating
a previous alignment fix in selftests/bpf.
Feedback and suggestions are welcome!
Best regards,
Tony
Tony Ambardar (3):
tools/resolve_btfids: Simplify handling cross-endian compilation
bpf: btf: Ensure natural alignment of .BTF_ids section
tools/bpf, selftests/bpf : Sync btf_ids.h to tools
include/linux/btf_ids.h | 1 +
tools/bpf/resolve_btfids/main.c | 60 +++++---------
tools/include/linux/btf_ids.h | 80 +++++++++++++++++--
.../selftests/bpf/prog_tests/resolve_btfids.c | 6 --
4 files changed, 97 insertions(+), 50 deletions(-)
--
2.34.1
From: Jeff Xu <jeffxu(a)chromium.org>
Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces
the can_modify_mm() function with an in-loop check, necessitates an update
to the mseal.rst documentation to reflect this change.
Furthermore, the document has received offline comments regarding the code
sample and suggestions for sentence clarification to enhance reader
comprehension.
[1] https://lore.kernel.org/linux-mm/20240817-mseal-depessimize-v3-0-d8d2e037df…
Jeff Xu (1):
mseal: update mseal.rst
Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
1 file changed, 136 insertions(+), 154 deletions(-)
--
2.46.1.824.gd892dcdcdd-goog
Hi
Note for V12:
There was a small conflict between the Intel PT changes in
"KVM: x86: Fix Intel PT Host/Guest mode when host tracing" and the
changes in this patch set, so I have put the patch sets together,
along with outstanding fix "perf/x86/intel/pt: Fix buffer full but
size is 0 case"
Cover letter for KVM changes (patches 2 to 4):
There is a long-standing problem whereby running Intel PT on host and guest
in Host/Guest mode causes VM-Entry failure.
The motivation for this patch set is to provide a fix for stable kernels
prior to the advent of the "Mediated Passthrough vPMU" patch set:
https://lore.kernel.org/kvm/20240801045907.4010984-1-mizhang@google.com/
which would render a large part of the fix unnecessary but likely not be
suitable for backport to stable due to its size and complexity.
Ideally, this patch set would be applied before "Mediated Passthrough vPMU".
Note that the fix does not conflict with "Mediated Passthrough vPMU", it
is just that "Mediated Passthrough vPMU" will make the code to stop and
restart Intel PT unnecessary.
Note for V11:
Moving aux_paused into a union within struct hw_perf_event caused
a regression because aux_paused was being written unconditionally
even though it is valid only for AUX (e.g. Intel PT) PMUs.
That is fixed in V11.
Hardware traces, such as instruction traces, can produce a vast amount of
trace data, so being able to reduce tracing to more specific circumstances
can be useful.
The ability to pause or resume tracing when another event happens can do
that.
These patches add such a facility and show how it would work for Intel
Processor Trace.
Maintainers of other AUX area tracing implementations are requested to
consider if this is something they might employ and then whether or not
the ABI would work for them. Note, thank you to James Clark (ARM) for
evaluating the API for Coresight. Suzuki K Poulose (ARM) also responded
positively to the RFC.
Changes to perf tools are now (since V4) fleshed out.
Please note, Intel® Architecture Instruction Set Extensions and Future
Features Programming Reference March 2024 319433-052, currently:
https://cdrdv2.intel.com/v1/dl/getContent/671368
introduces hardware pause / resume for Intel PT in a feature named
Intel PT Trigger Tracing.
For that, more fields in perf_event_attr will be necessary. The main
differences are:
- it can be applied not just to overflows, but optionally to
every event
- a packet is emitted into the trace, optionally with IP
information
- no PMI
- works with PMC and DR (breakpoint) events only
Here are the proposed additions to perf_event_attr, please comment:
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 0c557f0a17b3..05dcc43f11bb 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -369,6 +369,22 @@ enum perf_event_read_format {
PERF_FORMAT_MAX = 1U << 5, /* non-ABI */
};
+enum {
+ PERF_AUX_ACTION_START_PAUSED = 1U << 0,
+ PERF_AUX_ACTION_PAUSE = 1U << 1,
+ PERF_AUX_ACTION_RESUME = 1U << 2,
+ PERF_AUX_ACTION_EMIT = 1U << 3,
+ PERF_AUX_ACTION_NR = 0x1f << 4,
+ PERF_AUX_ACTION_NO_IP = 1U << 9,
+ PERF_AUX_ACTION_PAUSE_ON_EVT = 1U << 10,
+ PERF_AUX_ACTION_RESUME_ON_EVT = 1U << 11,
+ PERF_AUX_ACTION_EMIT_ON_EVT = 1U << 12,
+ PERF_AUX_ACTION_NR_ON_EVT = 0x1f << 13,
+ PERF_AUX_ACTION_NO_IP_ON_EVT = 1U << 18,
+ PERF_AUX_ACTION_MASK = ~PERF_AUX_ACTION_START_PAUSED,
+ PERF_AUX_PAUSE_RESUME_MASK = PERF_AUX_ACTION_PAUSE | PERF_AUX_ACTION_RESUME,
+};
+
#define PERF_ATTR_SIZE_VER0 64 /* sizeof first published struct */
#define PERF_ATTR_SIZE_VER1 72 /* add: config2 */
#define PERF_ATTR_SIZE_VER2 80 /* add: branch_sample_type */
@@ -515,10 +531,19 @@ struct perf_event_attr {
union {
__u32 aux_action;
struct {
- __u32 aux_start_paused : 1, /* start AUX area tracing paused */
- aux_pause : 1, /* on overflow, pause AUX area tracing */
- aux_resume : 1, /* on overflow, resume AUX area tracing */
- __reserved_3 : 29;
+ __u32 aux_start_paused : 1, /* start AUX area tracing paused */
+ aux_pause : 1, /* on overflow, pause AUX area tracing */
+ aux_resume : 1, /* on overflow, resume AUX area tracing */
+ aux_emit : 1, /* generate AUX records instead of events */
+ aux_nr : 5, /* AUX area tracing reference number */
+ aux_no_ip : 1, /* suppress IP in AUX records */
+ /* Following apply to event occurrence not overflows */
+ aux_pause_on_evt : 1, /* on event, pause AUX area tracing */
+ aux_resume_on_evt : 1, /* on event, resume AUX area tracing */
+ aux_emit_on_evt : 1, /* generate AUX records instead of events */
+ aux_nr_on_evt : 5, /* AUX area tracing reference number */
+ aux_no_ip_on_evt : 1, /* suppress IP in AUX records */
+ __reserved_3 : 13;
};
};
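To make the proposed layout above a little more concrete, here is a small sketch of how the 5-bit reference number could be packed into and read back from the 32-bit aux_action word using the proposed mask values. It is only an illustration under stated assumptions: the mask values are copied from the proposed enum (they are not in any released UAPI header), while the SKETCH_ prefix, the shift constant and the two helper functions are hypothetical names invented for this example.

```c
#include <stdint.h>

/* Values copied from the proposed enum; the SKETCH_ names are hypothetical. */
#define SKETCH_AUX_ACTION_START_PAUSED	(1U << 0)
#define SKETCH_AUX_ACTION_PAUSE		(1U << 1)
#define SKETCH_AUX_ACTION_RESUME	(1U << 2)
#define SKETCH_AUX_ACTION_NR_SHIFT	4		/* assumed helper constant */
#define SKETCH_AUX_ACTION_NR		(0x1fU << SKETCH_AUX_ACTION_NR_SHIFT)

/* Store a reference number in bits 4..8, leaving all other bits untouched. */
static uint32_t aux_action_set_nr(uint32_t action, uint32_t nr)
{
	action &= ~SKETCH_AUX_ACTION_NR;
	action |= (nr << SKETCH_AUX_ACTION_NR_SHIFT) & SKETCH_AUX_ACTION_NR;
	return action;
}

/* Read the reference number back out of the action word. */
static uint32_t aux_action_get_nr(uint32_t action)
{
	return (action & SKETCH_AUX_ACTION_NR) >> SKETCH_AUX_ACTION_NR_SHIFT;
}
```

Values wider than 5 bits are truncated by the mask rather than spilling into the neighbouring aux_no_ip bit. The same pattern would apply to the *_ON_EVT reference number with a shift of 13.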
Changes in V12:
Add previously sent patch "perf/x86/intel/pt: Fix buffer full
but size is 0 case"
Add previously sent patch set "KVM: x86: Fix Intel PT Host/Guest
mode when host tracing"
Rebase on current tip plus patch set "KVM: x86: Fix Intel PT Host/Guest
mode when host tracing"
Changes in V11:
perf/core: Add aux_pause, aux_resume, aux_start_paused
Make assignment to event->hw.aux_paused conditional on
(pmu->capabilities & PERF_PMU_CAP_AUX_PAUSE).
perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
Remove definition of has_aux_action() because it has
already been added as an inline function.
perf/x86/intel/pt: Fix sampling synchronization
perf tools: Enable evsel__is_aux_event() to work for ARM/ARM64
perf tools: Enable evsel__is_aux_event() to work for S390_CPUMSF
Dropped because they have already been applied
Changes in V10:
perf/core: Add aux_pause, aux_resume, aux_start_paused
Move aux_paused into a union within struct hw_perf_event.
Additional comment wrt PERF_EF_PAUSE/PERF_EF_RESUME.
Factor out has_aux_action() as an inline function.
Use scoped_guard for irqsave.
Move calls of perf_event_aux_pause() from __perf_event_output()
to __perf_event_overflow().
Changes in V9:
perf/x86/intel/pt: Fix sampling synchronization
New patch
perf/core: Add aux_pause, aux_resume, aux_start_paused
Move aux_paused to struct hw_perf_event
perf/x86/intel/pt: Add support for pause / resume
Add more comments and barriers for resume_allowed and
pause_allowed
Always use WRITE_ONCE with resume_allowed
Changes in V8:
perf tools: Parse aux-action
Fix clang warning:
util/auxtrace.c:821:7: error: missing field 'aux_action' initializer [-Werror,-Wmissing-field-initializers]
821 | {NULL},
| ^
Changes in V7:
Add Andi's Reviewed-by for patches 2-12
Re-base
Changes in V6:
perf/core: Add aux_pause, aux_resume, aux_start_paused
Removed READ/WRITE_ONCE from __perf_event_aux_pause()
Expanded comment about guarding against NMI
Changes in V5:
perf/core: Add aux_pause, aux_resume, aux_start_paused
Added James' Ack
perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
New patch
perf tools
Added Ian's Ack
Changes in V4:
perf/core: Add aux_pause, aux_resume, aux_start_paused
Rename aux_output_cfg -> aux_action
Reorder aux_action bits from:
aux_pause, aux_resume, aux_start_paused
to:
aux_start_paused, aux_pause, aux_resume
Fix aux_action bits __u64 -> __u32
coresight: Have a stab at support for pause / resume
Dropped
perf tools
All new patches
Changes in RFC V3:
coresight: Have a stab at support for pause / resume
'mode' -> 'flags' so it at least compiles
Changes in RFC V2:
Use ->stop() / ->start() instead of ->pause_resume()
Move aux_start_paused bit into aux_output_cfg
Tighten up when Intel PT pause / resume is allowed
Add an example of how it might work for CoreSight
Adrian Hunter (14):
perf/x86/intel/pt: Fix buffer full but size is 0 case
KVM: x86: Fix Intel PT IA32_RTIT_CTL MSR validation
KVM: x86: Fix Intel PT Host/Guest mode when host tracing also
KVM: selftests: Add guest Intel PT test
perf/core: Add aux_pause, aux_resume, aux_start_paused
perf/x86/intel/pt: Add support for pause / resume
perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
perf tools: Add aux_start_paused, aux_pause and aux_resume
perf tools: Add aux-action config term
perf tools: Parse aux-action
perf tools: Add missing_features for aux_start_paused, aux_pause, aux_resume
perf intel-pt: Improve man page format
perf intel-pt: Add documentation for pause / resume
perf intel-pt: Add a test for pause / resume
arch/x86/events/intel/core.c | 4 +-
arch/x86/events/intel/pt.c | 209 +++++++-
arch/x86/events/intel/pt.h | 16 +
arch/x86/include/asm/intel_pt.h | 4 +
arch/x86/kvm/vmx/vmx.c | 26 +-
arch/x86/kvm/vmx/vmx.h | 1 -
include/linux/perf_event.h | 28 +
include/uapi/linux/perf_event.h | 11 +-
kernel/events/core.c | 72 ++-
kernel/events/internal.h | 1 +
tools/include/uapi/linux/perf_event.h | 11 +-
tools/perf/Documentation/perf-intel-pt.txt | 596 +++++++++++++--------
tools/perf/Documentation/perf-record.txt | 4 +
tools/perf/builtin-record.c | 4 +-
tools/perf/tests/shell/test_intel_pt.sh | 28 +
tools/perf/util/auxtrace.c | 67 ++-
tools/perf/util/auxtrace.h | 6 +-
tools/perf/util/evsel.c | 13 +-
tools/perf/util/evsel.h | 1 +
tools/perf/util/evsel_config.h | 1 +
tools/perf/util/parse-events.c | 10 +
tools/perf/util/parse-events.h | 1 +
tools/perf/util/parse-events.l | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 3 +
tools/perf/util/pmu.c | 1 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/include/x86_64/processor.h | 1 +
tools/testing/selftests/kvm/x86_64/intel_pt.c | 381 +++++++++++++
28 files changed, 1238 insertions(+), 264 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/intel_pt.c
Regards
Adrian
Recently we committed a fix to allow processes to receive notifications for
non-zero exits via the process connector module. Commit is a4c9a56e6a2c.
However, when a thread calls pthread_exit(&exit_status), the
kernel is not aware of the exit status with which pthread_exit is called.
It is sent by the child thread to the parent process, if it is waiting in
pthread_join(). Hence, for a thread exiting abnormally, the kernel cannot
send notifications to any listening processes.
The exception to this is if the thread is sent a signal which it has not
handled, and dies along with its process as a result; e.g. SIGSEGV or
SIGKILL. In this case, the kernel is aware of the non-zero exit and sends a
notification for it.
For our use case, we cannot have parent wait in pthread_join, one of the
main reasons for this being that we do not want to track normal
pthread_exit(), which could be a very large number. We only want to be
notified of any abnormal exits. Hence, threads are created with
pthread_attr_t set to PTHREAD_CREATE_DETACHED.
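The situation described above can be sketched with standard POSIX calls. This is an illustration only and does not use the proposed connector interface: worker(), spawn_detached(), the semaphore and reported_status are hypothetical names, and the shared variable merely stands in for some side channel. The point is that the value handed to pthread_exit() by a detached thread is never collected, because nothing can call pthread_join() on it.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t done;
static int reported_status;	/* hypothetical side channel for the status */

static void *worker(void *arg)
{
	static int exit_status = 1;	/* non-zero stands for an abnormal exit */

	(void)arg;
	reported_status = exit_status;
	sem_post(&done);

	/* Nobody can join a detached thread, so this value is never read. */
	pthread_exit(&exit_status);
}

/* Returns 0 on success, or an error code from the failing call. */
static int spawn_detached(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	int err;

	if (sem_init(&done, 0, 0) != 0)
		return errno;

	err = pthread_attr_init(&attr);
	if (err)
		return err;

	err = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
	if (!err)
		err = pthread_create(&tid, &attr, worker, NULL);
	pthread_attr_destroy(&attr);
	if (err)
		return err;

	/* Wait on the semaphore, since pthread_join() is not an option. */
	while (sem_wait(&done) != 0) {
		if (errno != EINTR)
			return errno;
	}
	return 0;
}
```

On older C libraries this needs to be linked with -pthread. After spawn_detached() returns, reported_status holds the value the thread published itself; the pointer passed to pthread_exit() is simply dropped.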
To fix this problem, we add a new type PROC_CN_MCAST_NOTIFY to the proc connector
API, which allows a thread to send its exit status to the kernel either when
it needs to call pthread_exit() with a non-zero value to indicate some
error or from a signal handler before pthread_exit().
Anjali Kulkarni (3):
connector/cn_proc: Add hash table for threads
connector/cn_proc: Kunit tests for threads hash table
connector/cn_proc: Selftest for threads
drivers/connector/Makefile | 2 +-
drivers/connector/cn_hash.c | 240 ++++++++++++++++++
drivers/connector/cn_proc.c | 59 ++++-
drivers/connector/connector.c | 96 ++++++-
include/linux/connector.h | 47 ++++
include/linux/sched.h | 2 +-
include/uapi/linux/cn_proc.h | 4 +-
lib/Kconfig.debug | 17 ++
lib/Makefile | 1 +
lib/cn_hash_test.c | 167 ++++++++++++
lib/cn_hash_test.h | 12 +
tools/testing/selftests/connector/Makefile | 23 +-
.../testing/selftests/connector/proc_filter.c | 5 +
tools/testing/selftests/connector/thread.c | 90 +++++++
.../selftests/connector/thread_filter.c | 93 +++++++
15 files changed, 848 insertions(+), 10 deletions(-)
create mode 100644 drivers/connector/cn_hash.c
create mode 100644 lib/cn_hash_test.c
create mode 100644 lib/cn_hash_test.h
create mode 100644 tools/testing/selftests/connector/thread.c
create mode 100644 tools/testing/selftests/connector/thread_filter.c
--
2.46.0
We now have two kdevops proof of concepts with kernel-patches-daemon [0],
one for Linux kernel modules testing [1] and the other with radix tree
testing (xarray, maple tree) [2]. These trees just contain the required
.github/workflows/* files used to trigger a github self-hosted runner
to run kdevops since evaluation shows that using github hosted runners
will just not work or scale for Linux kernel testing [3]. The way this
works with KPD is that KPD has an app in the linux-kdevops organization
which is in charge of taking patch series posted to your respective
subsystem patchwork (you can have dedicated filters on a mailing list
for only specific files if you don't have a dedicated mailing list), it
creates a git tree branch using your configured KPD main development
tree source, and pushes it out to a respective test tree under github
for you. For example, in the case of development for Linux modules
it pushes out a branch with a delta onto the linux-modules-kpd tree [4]
and in it, it will also merge the latest kdevops-ci-modules [1] work,
which is where the github runner work gets developed. For the radix tree
we currently do not yet have a patchwork instance defined but we *could*,
and the way it would work is that KPD would push out a branch into
the linux-radix-tree-kpd [5] tree with the github actions defined in its
respective kdevops-ci-radix-tree [2] tree.
What these PoCs show is that the way kdevops has designed testing
selftests is that we actually only need to differ in *one* single line
of code on the github actions runner to test either of these two Linux
kernel subsystems: the defconfig used.
To be able to *share* the *same* Linux kernel github actions runner
code development between the Linux kernel module tests and the radix
tree, all we need to do then is use the git tree onto which a delta
was pushed onto as the source for the defconfig. So all we have to do
now is just add a symlink of the respective development test tree onto
its corresponding defconfig.
Add the respective defconfig then for linux-modules-kpd by symlinking it
to the seltests-kmod-cli defconfig. This will let us later share *one*
github development action runner code for self-hosted runners for *all*
Linux kernel selftests we define in *one* development tree which KPD
could leverage.
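As a rough illustration of the symlink approach described above, the
following sketch builds a throwaway defconfigs/ directory and resolves a
test tree name to its defconfig. The helper defconfig_for_tree() and the
file contents are hypothetical; this is not how kdevops itself consumes
defconfigs, only a model of the name-to-file mapping.

```python
import os
import tempfile

# Throwaway directory standing in for the kdevops defconfigs/ directory.
root = tempfile.mkdtemp()
defconfigs = os.path.join(root, "defconfigs")
os.makedirs(defconfigs)

# Stand-in for the real defconfig; the contents here are made up.
with open(os.path.join(defconfigs, "seltests-kmod-cli"), "w") as f:
    f.write("CONFIG_EXAMPLE=y\n")

# Equivalent of the patch: a relative symlink named after the KPD test tree.
os.symlink("seltests-kmod-cli", os.path.join(defconfigs, "linux-modules-kpd"))


def defconfig_for_tree(tree_name):
    """Hypothetical helper: return the defconfig a tree name resolves to."""
    path = os.path.join(defconfigs, tree_name)
    return os.path.basename(os.path.realpath(path))


resolved = defconfig_for_tree("linux-modules-kpd")
```

With this mapping in place, runner code can stay identical across test
trees and only needs the tree name to find the right defconfig.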
Now that we have locked down the linux-kdevops github organization to
only allow respective developers to be able to trigger pushes or PRs,
this also allows us to add dedicated self-hosted runners per target
test development repository so we can scale our testing as we need with
security in mind. The only thing left to do here now is to evaluate
if we want an allow check for whose patches we want to enable automatic
testing for through KPD.
[0] https://github.com/facebookincubator/kernel-patches-daemon
[1] https://github.com/linux-kdevops/kdevops-ci-modules
[2] https://github.com/linux-kdevops/kdevops-ci-radix-tree
[3] https://lore.kernel.org/kdevops/CAB=NE6VKWSkv1JZ_Z2LKq4o7+JBkKc6u8Wa1zxxBnG…
[4] https://github.com/linux-kdevops/linux-modules-kpd
[5] https://github.com/linux-kdevops/linux-radix-tree-kpd
Signed-off-by: Luis Chamberlain <mcgrof(a)kernel.org>
---
defconfigs/linux-modules-kpd | 1 +
1 file changed, 1 insertion(+)
create mode 120000 defconfigs/linux-modules-kpd
diff --git a/defconfigs/linux-modules-kpd b/defconfigs/linux-modules-kpd
new file mode 120000
index 000000000000..e61fd7f687b0
--- /dev/null
+++ b/defconfigs/linux-modules-kpd
@@ -0,0 +1 @@
+seltests-kmod-cli
\ No newline at end of file
--
2.43.0
Add KUnit tests for the kernel's implementation of the standard CRC-16
algorithm (<linux/crc16.h>). The test data consists of 100
randomly-generated test cases, validated against a naive CRC-16
implementation.
This test follows roughly the same logic as lib/crc32test.c, but
without the performance measurements.
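For readers who want to try the validation idea outside the kernel, here
is a minimal userspace sketch in Python of the same bit-by-bit CRC-16
(reflected polynomial 0xA001) that the patch uses as its naive
reference. The function names are illustrative and not part of the
patch; the split check at the end mirrors what crc16_test_combine does
with the kernel's crc16().

```python
def crc16_naive_byte(crc, byte):
    """Fold one byte into the CRC, one bit at a time (polynomial 0xA001)."""
    crc ^= byte
    for _ in range(8):
        if crc & 0x01:
            crc = (crc >> 1) ^ 0xA001
        else:
            crc >>= 1
    return crc


def crc16_naive(crc, buffer):
    """Run the byte-wise update over a whole buffer, starting from crc."""
    for byte in buffer:
        crc = crc16_naive_byte(crc, byte)
    return crc


data = b"123456789"
whole = crc16_naive(0, data)

# Feeding the CRC of a prefix in as the initial value for the remainder
# should give the same result as one pass over the whole buffer.
split_results = [
    crc16_naive(crc16_naive(0, data[:j]), data[j:])
    for j in range(len(data) + 1)
]

# Largest start + length the test data generator can produce.
max_end = 0x7FF + 0x7FF
```

For the conventional check string "123456789" and an initial value of 0,
this CRC-16 variant yields 0xBB3D, and max_end (0xFFE) stays below the
4096-byte buffer size.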
Signed-off-by: Vinicius Peixoto <vpeixoto(a)lkcamp.dev>
Co-developed-by: Enzo Bertoloti <ebertoloti(a)lkcamp.dev>
Signed-off-by: Enzo Bertoloti <ebertoloti(a)lkcamp.dev>
Co-developed-by: Fabricio Gasperin <fgasperin(a)lkcamp.dev>
Signed-off-by: Fabricio Gasperin <fgasperin(a)lkcamp.dev>
Suggested-by: David Laight <David.Laight(a)ACULAB.COM>
---
This patch was developed during a hackathon organized by LKCAMP [1],
with the objective of writing KUnit tests, both to introduce people to
the kernel development process and to learn about different subsystems
(with the positive side effect of improving the kernel test coverage, of
course).
We noticed there were tests for CRC32 in lib/crc32test.c and thought it
would be nice to have something similar for CRC16, since it seems to be
widely used in network drivers (as well as in some ext4 code).
We would really appreciate any feedback/suggestions on how to improve
this. Thanks! :-)
Changes in v2 (suggested by David Laight):
- Use the PRNG from include/linux/prandom.h to generate pseudorandom
data/test cases instead of having them hardcoded as large static
arrays
- Add a naive CRC16 implementation used to validate the kernel's
implementation (instead of having the test case results be hard-coded)
- Link to v1: https://lore.kernel.org/linux-kselftest/20240922232643.535329-1-vpeixoto@lk…
Changes in v3:
- Fix compilation warnings about function documentation
- Link to v2: https://lore.kernel.org/r/20241003-crc16-kunit-v2-1-5fe74b113e1e@lkcamp.dev
[1] https://lkcamp.dev/about
---
lib/Kconfig.debug | 9 ++++
lib/Makefile | 1 +
lib/crc16_kunit.c | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 165 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7315f643817ae1021f1e4b3dd27b424f49e3f761..f9617e3054948ce43090f524dc67650e9549cee8 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2850,6 +2850,15 @@ config USERCOPY_KUNIT_TEST
on the copy_to/from_user infrastructure, making sure basic
user/kernel boundary testing is working.
+config CRC16_KUNIT_TEST
+ tristate "KUnit tests for CRC16"
+ depends on KUNIT
+ default KUNIT_ALL_TESTS
+ select CRC16
+ help
+ Enable this option to run unit tests for the kernel's CRC16
+ implementation (<linux/crc16.h>).
+
config TEST_UDELAY
tristate "udelay test driver"
help
diff --git a/lib/Makefile b/lib/Makefile
index 773adf88af41665b2419202e5427e0513c6becae..1faed6414a85fd366b4966a00e8ba231d7546e14 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -389,6 +389,7 @@ CFLAGS_fortify_kunit.o += $(DISABLE_STRUCTLEAK_PLUGIN)
obj-$(CONFIG_FORTIFY_KUNIT_TEST) += fortify_kunit.o
obj-$(CONFIG_SIPHASH_KUNIT_TEST) += siphash_kunit.o
obj-$(CONFIG_USERCOPY_KUNIT_TEST) += usercopy_kunit.o
+obj-$(CONFIG_CRC16_KUNIT_TEST) += crc16_kunit.o
obj-$(CONFIG_GENERIC_LIB_DEVMEM_IS_ALLOWED) += devmem_is_allowed.o
diff --git a/lib/crc16_kunit.c b/lib/crc16_kunit.c
new file mode 100644
index 0000000000000000000000000000000000000000..0918c98a96d26f4e795e3eb92923db7c549ac01f
--- /dev/null
+++ b/lib/crc16_kunit.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for CRC16.
+ *
+ * Copyright (C) 2024, LKCAMP
+ * Author: Vinicius Peixoto <vpeixoto(a)lkcamp.dev>
+ * Author: Fabricio Gasperin <fgasperin(a)lkcamp.dev>
+ * Author: Enzo Bertoloti <ebertoloti(a)lkcamp.dev>
+ */
+#include <kunit/test.h>
+#include <linux/crc16.h>
+#include <linux/prandom.h>
+
+#define CRC16_KUNIT_DATA_SIZE 4096
+#define CRC16_KUNIT_TEST_SIZE 100
+#define CRC16_KUNIT_SEED 0x12345678
+
+/**
+ * struct crc16_test - CRC16 test data
+ * @crc: initial input value to CRC16
+ * @start: Start index within the data buffer
+ * @length: Length of the data
+ */
+static struct crc16_test {
+ u16 crc;
+ u16 start;
+ u16 length;
+} tests[CRC16_KUNIT_TEST_SIZE];
+
+static u8 data[CRC16_KUNIT_DATA_SIZE];
+
+
+/* Naive implementation of CRC16 for validation purposes */
+static inline u16 _crc16_naive_byte(u16 crc, u8 data)
+{
+ u8 i = 0;
+
+ crc ^= (u16) data;
+ for (i = 0; i < 8; i++) {
+ if (crc & 0x01)
+ crc = (crc >> 1) ^ 0xa001;
+ else
+ crc = crc >> 1;
+ }
+
+ return crc;
+}
+
+
+static inline u16 _crc16_naive(u16 crc, u8 *buffer, size_t len)
+{
+ while (len--)
+ crc = _crc16_naive_byte(crc, *buffer++);
+ return crc;
+}
+
+
+/* Small helper for generating pseudorandom 16-bit data */
+static inline u16 _rand16(void)
+{
+ static u32 rand = CRC16_KUNIT_SEED;
+
+ rand = next_pseudo_random32(rand);
+ return rand & 0xFFFF;
+}
+
+
+static int crc16_init_test_data(struct kunit_suite *suite)
+{
+ size_t i;
+
+ /* Fill the data buffer with random bytes */
+ for (i = 0; i < CRC16_KUNIT_DATA_SIZE; i++)
+ data[i] = _rand16() & 0xFF;
+
+ /* Generate random test data while ensuring the random
+ * start + length values won't overflow the 4096-byte
+ * buffer (0x7FF * 2 = 0xFFE < 0x1000)
+ */
+ for (i = 0; i < CRC16_KUNIT_TEST_SIZE; i++) {
+ tests[i].crc = _rand16();
+ tests[i].start = _rand16() & 0x7FF;
+ tests[i].length = _rand16() & 0x7FF;
+ }
+
+ return 0;
+}
+
+static void crc16_test_empty(struct kunit *test)
+{
+ u16 crc;
+
+ /* The result for empty data should be the same as the
+ * initial crc
+ */
+ crc = crc16(0x00, data, 0);
+ KUNIT_EXPECT_EQ(test, crc, 0);
+ crc = crc16(0xFF, data, 0);
+ KUNIT_EXPECT_EQ(test, crc, 0xFF);
+}
+
+static void crc16_test_correctness(struct kunit *test)
+{
+ size_t i;
+ u16 crc, crc_naive;
+
+ for (i = 0; i < CRC16_KUNIT_TEST_SIZE; i++) {
+ /* Compare results with the naive crc16 implementation */
+ crc = crc16(tests[i].crc, data + tests[i].start,
+ tests[i].length);
+ crc_naive = _crc16_naive(tests[i].crc, data + tests[i].start,
+ tests[i].length);
+ KUNIT_EXPECT_EQ(test, crc, crc_naive);
+ }
+}
+
+
+static void crc16_test_combine(struct kunit *test)
+{
+ size_t i, j;
+ u16 crc, crc_naive;
+
+ /* Make sure that combining two consecutive crc16 calculations
+ * yields the same result as calculating the crc16 for the whole thing
+ */
+ for (i = 0; i < CRC16_KUNIT_TEST_SIZE; i++) {
+ crc_naive = crc16(tests[i].crc, data + tests[i].start, tests[i].length);
+ for (j = 0; j < tests[i].length; j++) {
+ crc = crc16(tests[i].crc, data + tests[i].start, j);
+ crc = crc16(crc, data + tests[i].start + j, tests[i].length - j);
+ KUNIT_EXPECT_EQ(test, crc, crc_naive);
+ }
+ }
+}
+
+
+static struct kunit_case crc16_test_cases[] = {
+ KUNIT_CASE(crc16_test_empty),
+ KUNIT_CASE(crc16_test_combine),
+ KUNIT_CASE(crc16_test_correctness),
+ {},
+};
+
+static struct kunit_suite crc16_test_suite = {
+ .name = "crc16",
+ .test_cases = crc16_test_cases,
+ .suite_init = crc16_init_test_data,
+};
+kunit_test_suite(crc16_test_suite);
+
+MODULE_AUTHOR("Fabricio Gasperin <fgasperin(a)lkcamp.dev>");
+MODULE_AUTHOR("Vinicius Peixoto <vpeixoto(a)lkcamp.dev>");
+MODULE_AUTHOR("Enzo Bertoloti <ebertoloti(a)lkcamp.dev>");
+MODULE_DESCRIPTION("Unit tests for crc16");
+MODULE_LICENSE("GPL");
---
base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
change-id: 20241003-crc16-kunit-127a4dc2b72c
Best regards,
--
Vinicius Peixoto <vpeixoto(a)lkcamp.dev>
PASID (Process Address Space ID) is a PCIe extension to tag the DMA
transactions out of a physical device, and most modern IOMMU hardware
supports PASID-granular address translation. So a PASID-capable
device can be attached to multiple hwpts (a.k.a. domains), and each
attachment is tagged with a pasid.
This series is based on the preparation series [1] [2]. It first adds a
missing iommu API to replace the domain for a pasid. Based on the iommu
pasid attach/replace/detach APIs, this series adds iommufd APIs for device
drivers to attach/replace/detach pasid to/from hwpt per userspace's request,
and adds a selftest to validate the iommufd APIs.
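To make the attach/replace/detach semantics easier to picture, here is a
toy Python model of one device's pasid-to-hwpt mapping. It is only a
sketch: the class and method names are invented for illustration and are
not the iommufd or iommu core API. The rule that a failed replace keeps
the old attachment follows the v3 change log below; rejecting a second
attach on an already-attached pasid is an assumption of this model.

```python
class ToyPasidTable:
    """Toy model: maps each pasid of one device to the hwpt it is attached to."""

    def __init__(self, max_pasids):
        self.max_pasids = max_pasids
        self.attached = {}  # pasid -> hwpt name

    def attach(self, pasid, hwpt):
        if not 0 <= pasid < self.max_pasids:
            raise ValueError("pasid out of range")
        if pasid in self.attached:
            # Assumption of this model: a second attach must go via replace().
            raise RuntimeError("pasid already attached, use replace()")
        self.attached[pasid] = hwpt

    def replace(self, pasid, new_hwpt, attach_ok=True):
        if pasid not in self.attached:
            raise RuntimeError("nothing attached to this pasid")
        if not attach_ok:
            # Simulated failure: the old attachment is left in place.
            raise RuntimeError("replace failed, old hwpt kept")
        self.attached[pasid] = new_hwpt

    def detach(self, pasid):
        self.attached.pop(pasid, None)


table = ToyPasidTable(max_pasids=8)
table.attach(1, "hwpt_a")
table.attach(2, "hwpt_b")    # same device, second hwpt, different pasid
table.replace(1, "hwpt_c")   # pasid 1 now points at hwpt_c
try:
    table.replace(2, "hwpt_d", attach_ok=False)
except RuntimeError:
    pass                     # pasid 2 is still attached to hwpt_b
table.detach(1)
```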
This series is still missing one part: enforcing that a domain is
allocated with a special flag if it will be used with PASID [3]. This is
due to special requirements from AMD. Since this is still under discussion
on the mailing list [4], it is only noted here. Once it is finalized, this
series needs to enforce the domain flag check to ensure the AMD pasid
support is not broken from day one.
The complete code can be found at the link below [5]. Heads up! The existing
iommufd selftest is broken. There is a fix [6] for it, but it has not been
upstreamed yet. If you want to run the iommufd selftest, please apply that fix.
Sorry for the inconvenience.
[1] https://lore.kernel.org/linux-iommu/20240912130427.10119-1-yi.l.liu@intel.c…
[2] https://lore.kernel.org/linux-iommu/20240912130653.11028-1-yi.l.liu@intel.c…
[3] https://lore.kernel.org/linux-iommu/20240822124433.GD3468552@ziepe.ca/
[4] https://lore.kernel.org/linux-iommu/20240911101911.6269-3-vasant.hegde@amd.…
[5] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[6] https://lore.kernel.org/linux-iommu/20240111073213.180020-1-baolu.lu@linux.…
Change log:
v4:
- Replace remove_dev_pasid() by supporting set_dev_pasid() for blocking domain (Kevin)
- This is done by the preparation series "Support attaching PASID to the blocked_domain"
- Misc tweaks to accommodate the merging of the iommufd iopf series. Three new patches are added:
- iommufd: Always pass iommu_attach_handle to iommu core
- iommufd: Move the iommufd_handle helpers to iommufd_private.h
- iommufd: Refactor __fault_domain_replace_dev() to be a wrapper of iommu_replace_group_handle()
- Rename patch 03 of v3 to be "iommufd: Support pasid attach/replace"
- Add test case for attaching/replacing iopf-capable hwpt to pasid
v3: https://lore.kernel.org/kvm/20240628090557.50898-1-yi.l.liu@intel.com/
- Split the set_dev_pasid op enhancements for domain replacement to be a
separate series "Make set_dev_pasid op supportting domain replacement" [1].
The below changes are made in the separate series.
*) set_dev_pasid() callback should keep the old config if it fails to attach to
a domain. This simplifies the caller a lot, as the caller does not need to attach
it back to the old domain explicitly. This also avoids some corner cases in which
the core may do duplicated domain attachment as described in the link below (Jason)
https://lore.kernel.org/linux-iommu/BN9PR11MB52768C98314A95AFCD2FA6478C0F2@…
*) Drop patch 10 of v2 as it's a bug fix and can be submitted separately (Kevin)
*) Rebase on top of Baolu's domain_alloc_paging refactor series (Jason)
- Drop the attach_data which includes attach_fn and pasid; instead, pass the
pasid through the device attach path. (Jason)
- Add a pasid-num-bits property to mock dev to make pasid selftest work (Kevin)
v2: https://lore.kernel.org/linux-iommu/20240412081516.31168-1-yi.l.liu@intel.c…
- Domain replace for pasid should be handled in set_dev_pasid() callbacks
instead of calling remove_dev_pasid and then set_dev_pasid afterward in the
iommu layer (Jason)
- Make xarray operations more self-contained in iommufd pasid attach/replace/detach
(Jason)
- Tweak the dev_iommu_get_max_pasids() to allow iommu driver to populate the
max_pasids. This makes the iommufd selftest simpler to meet the max_pasids
check in iommu_attach_device_pasid() (Jason)
v1: https://lore.kernel.org/kvm/20231127063428.127436-1-yi.l.liu@intel.com/#r
- Implement iommu_replace_device_pasid() to fall back to the original domain
if this replacement failed (Kevin)
- Add check in do_attach() to check the corresponding attach_fn per the pasid value.
rfc: https://lore.kernel.org/linux-iommu/20230926092651.17041-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (10):
iommu: Introduce a replace API for device pasid
iommufd: Refactor __fault_domain_replace_dev() to be a wrapper of
iommu_replace_group_handle()
iommufd: Move the iommufd_handle helpers to iommufd_private.h
iommufd: Always pass iommu_attach_handle to iommu core
iommufd: Pass pasid through the device attach/replace path
iommufd: Support pasid attach/replace
iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add coverage for iommufd pasid attach/detach
drivers/iommu/iommu-priv.h | 4 +
drivers/iommu/iommu.c | 90 +++++-
drivers/iommu/iommufd/Makefile | 1 +
drivers/iommu/iommufd/device.c | 46 ++--
drivers/iommu/iommufd/fault.c | 90 ++----
drivers/iommu/iommufd/hw_pagetable.c | 5 +-
drivers/iommu/iommufd/iommufd_private.h | 129 ++++++++-
drivers/iommu/iommufd/iommufd_test.h | 30 ++
drivers/iommu/iommufd/pasid.c | 157 +++++++++++
drivers/iommu/iommufd/selftest.c | 208 +++++++++++++-
include/linux/iommufd.h | 7 +
tools/testing/selftests/iommu/iommufd.c | 256 ++++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 29 +-
tools/testing/selftests/iommu/iommufd_utils.h | 78 ++++++
14 files changed, 1005 insertions(+), 125 deletions(-)
create mode 100644 drivers/iommu/iommufd/pasid.c
--
2.34.1
Hi Linus,
Please pull this kselftest fixes update for Linux 6.12-rc3.
This kselftest update for Linux 6.12-rc3 consists of several fixes
for build, run-time errors, and reporting errors:
-- ftrace: regression test for a kernel crash when running function graph
tracing and then enabling function profiler.
-- rseq: fix for mm_cid test failure.
-- vDSO:
- fixes to reporting skip and other error conditions.
- changes to unconditionally build chacha and getrandom tests on
all architectures to make it easier for them to run in CIs.
- fix for a build error by explicitly including sched.h to bring in
the CLONE_NEWTIME define.
diff is attached.
Note: Had to fix a commit message last minute on rseq patch right
before generating the pull request. The last 2 patches have been in
my tree longer than just a few hours. :)
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit c66be905cda24fb782b91053b196bd2e966f95b7:
selftests: breakpoints: use remaining time to check if suspend succeed (2024-10-02 14:37:30 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-fixes-6.12-rc3
for you to fetch changes up to 4ee5ca9a29384fcf3f18232fdf8474166dea8dca:
ftrace/selftest: Test combination of function_graph tracer and function profiler (2024-10-11 15:05:16 -0600)
----------------------------------------------------------------
linux_kselftest-fixes-6.12-rc3
This kselftest update for Linux 6.12-rc3 consists of several fixes
for build, run-time errors, and reporting errors:
-- ftrace: regression test for a kernel crash when running function graph
tracing and then enabling function profiler.
-- rseq: fix for mm_cid test failure.
-- vDSO:
- fixes to reporting skip and other error conditions.
- changes to unconditionally build chacha and getrandom tests on
all architectures to make it easier for them to run in CIs.
- fix for a build error by explicitly including sched.h to bring in
the CLONE_NEWTIME define.
----------------------------------------------------------------
Jason A. Donenfeld (3):
selftests: vDSO: unconditionally build chacha test
selftests: vDSO: unconditionally build getrandom test
selftests: vDSO: improve getrandom and chacha error messages
Mathieu Desnoyers (1):
selftests/rseq: Fix mm_cid test failure
Steven Rostedt (1):
ftrace/selftest: Test combination of function_graph tracer and function profiler
Yu Liao (1):
selftests: vDSO: Explicitly include sched.h
tools/arch/arm64/vdso | 1 -
tools/arch/loongarch/vdso | 1 -
tools/arch/powerpc/vdso | 1 -
tools/arch/s390/vdso | 1 -
tools/arch/x86/vdso | 1 -
.../ftrace/test.d/ftrace/fgraph-profiler.tc | 31 ++++++
tools/testing/selftests/rseq/rseq.c | 110 ++++++++++++++-------
tools/testing/selftests/rseq/rseq.h | 10 +-
tools/testing/selftests/vDSO/Makefile | 6 +-
tools/testing/selftests/vDSO/vdso_test_chacha.c | 36 ++++---
tools/testing/selftests/vDSO/vdso_test_getrandom.c | 76 +++++++-------
tools/testing/selftests/vDSO/vgetrandom-chacha.S | 18 ++++
12 files changed, 183 insertions(+), 109 deletions(-)
delete mode 120000 tools/arch/arm64/vdso
delete mode 120000 tools/arch/loongarch/vdso
delete mode 120000 tools/arch/powerpc/vdso
delete mode 120000 tools/arch/s390/vdso
delete mode 120000 tools/arch/x86/vdso
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/fgraph-profiler.tc
create mode 100644 tools/testing/selftests/vDSO/vgetrandom-chacha.S
----------------------------------------------------------------
This fix solves this error, when calling kselftest with targets "net/rds":
The error was found by running tests manually with the command:
make kselftest TARGETS="net/rds"
The patch also imports the ip() function explicitly from the utils module.
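The import mechanism this fix relies on can be tried in isolation. The
sketch below builds a temporary directory tree and appends the parent of
the script directory to sys.path, as the patch does. The package name
demo_selftest_lib and the stub ip() are hypothetical stand-ins for the
real selftests lib/py package, chosen so the example cannot collide with
an installed module.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical layout standing in for tools/testing/selftests/net/:
#   <root>/demo_selftest_lib/py/utils.py   (defines ip())
#   <root>/rds/                            (where the test script lives)
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_selftest_lib", "py")
os.makedirs(pkg_dir)
os.makedirs(os.path.join(root, "rds"))
for d in (os.path.join(root, "demo_selftest_lib"), pkg_dir):
    open(os.path.join(d, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "utils.py"), "w") as f:
    f.write("def ip(args):\n    return 'ip ' + args\n")

# What the patched script does; this_dir stands in for the directory
# that contains test.py.
this_dir = os.path.join(root, "rds")
sys.path.append(os.path.join(this_dir, "../"))
importlib.invalidate_caches()

# The package one level above the script directory is now importable.
from demo_selftest_lib.py.utils import ip

result = ip("link show")
```

Appending to sys.path (rather than inserting at the front) keeps any
already-importable modules ahead of the added directory in the search
order.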
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
Notes:
v2:
modified the way the parent path is added
added test to reproduce the error
tools/testing/selftests/net/rds/test.py | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/rds/test.py b/tools/testing/selftests/net/rds/test.py
index e6bb109bcead..4a7178d11193 100755
--- a/tools/testing/selftests/net/rds/test.py
+++ b/tools/testing/selftests/net/rds/test.py
@@ -14,8 +14,11 @@ import sys
import atexit
from pwd import getpwuid
from os import stat
-from lib.py import ip
+# Allow utils module to be imported from different directory
+this_dir = os.path.dirname(os.path.realpath(__file__))
+sys.path.append(os.path.join(this_dir, "../"))
+from lib.py.utils import ip
libc = ctypes.cdll.LoadLibrary('libc.so.6')
setns = libc.setns
--
2.43.0
This fix solves this error, when calling kselftest with targets
"drivers/net":
File "tools/testing/selftests/net/lib/py/nsim.py", line 64, in __init__
if e.errno == errno.ENOSPC:
NameError: name 'errno' is not defined
The error was found by running tests manually with the command:
make kselftest TARGETS="drivers/net"
The errno module provides the standard system error symbols.
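A minimal sketch of the pattern involved: comparing an exception's errno
attribute against errno.ENOSPC only works if the errno module has been
imported. The helper name is_out_of_space() is illustrative and does not
appear in nsim.py.

```python
import errno


def is_out_of_space(exc):
    """Return True if an OSError reports ENOSPC ("No space left on device")."""
    return exc.errno == errno.ENOSPC


full = OSError(errno.ENOSPC, "No space left on device")
missing = OSError(errno.ENOENT, "No such file or directory")

full_detected = is_out_of_space(full)
missing_detected = is_out_of_space(missing)
```

Without the import at the top, the comparison raises the NameError shown
in the traceback above as soon as the except branch is reached.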
Reviewed-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
Notes:
v2: added how to run the test
tools/testing/selftests/net/lib/py/nsim.py | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/lib/py/nsim.py b/tools/testing/selftests/net/lib/py/nsim.py
index f571a8b3139b..1a8cbe9acc48 100644
--- a/tools/testing/selftests/net/lib/py/nsim.py
+++ b/tools/testing/selftests/net/lib/py/nsim.py
@@ -1,5 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
+import errno
import json
import os
import random
--
2.43.0
This patchset creates a selftest for the robust list interface, to track
regressions and ensure that the interface keeps working as expected.
In this version I removed the kselftest_harness include, but I expanded the
current futex selftest API a little bit with basic ASSERT_ macros to make the
test easier to write and read. In the future, hopefully we can move all futex
selftests to the kselftest_harness API anyway.
Changes from v2:
- Create ASSERT_ macros for futex selftests
- Dropped kselftest_harness include, using just futex test API
- This is the expected output:
TAP version 13
1..6
ok 1 test_robustness
ok 2 test_set_robust_list_invalid_size
ok 3 test_get_robust_list_self
ok 4 test_get_robust_list_child
ok 5 test_set_list_op_pending
ok 6 test_robust_list_multiple_elements
# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
https://lore.kernel.org/lkml/20240903134033.816500-1-andrealmeid@igalia.com
André Almeida (2):
selftests/futex: Add ASSERT_ macros
selftests/futex: Create test for robust list
.../selftests/futex/functional/.gitignore | 1 +
.../selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/robust_list.c | 512 ++++++++++++++++++
.../testing/selftests/futex/include/logging.h | 28 +
4 files changed, 543 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/futex/functional/robust_list.c
--
2.46.0
From: Steven Rostedt <rostedt(a)goodmis.org>
The addition of recording both the function name and return address to the
function graph tracer updated the selftest to check for "=-5" from "= -5".
But this causes the test to fail on certain configs, as "= -5" is still a
value that can be returned if function addresses are not enabled (older kernels).
Check for both "=-5" and "= -5" as a success value.
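The effect of the two grep patterns can be modelled with plain substring
matching, since neither pattern contains regex metacharacters. The
sample lines below are made up and heavily simplified; they are not real
trace output, and count_matches() is a hypothetical helper that mirrors
the grep | grep | wc -l pipeline in the test.

```python
def count_matches(lines, needles):
    """Count lines that mention proc_reg_write and contain any needle."""
    return sum(
        1
        for line in lines
        if "proc_reg_write" in line and any(n in line for n in needles)
    )


# Made-up, simplified lines; real trace output has more columns.
sample = [
    "proc_reg_write ret=-5 */",
    "proc_reg_write = -5 */",
    "proc_reg_write ret=-55 */",
    "vfs_write ret=-5 */",
]

# Pattern before the fix: misses "= -5" and also matches "=-55".
old_count = count_matches(sample, ["=-5"])

# Patterns after the fix: both spellings, each followed by a space.
new_count = count_matches(sample, ["=-5 ", "= -5 "])
```

The trailing space in each new pattern is what keeps a value such as -55
from being counted as -5.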
Fixes: 21e92806d39c6 ("function_graph: Support recording and printing the function return address")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
Shuah, this update is only for changes in my tree, so you do not need to add it.
tools/testing/selftests/ftrace/test.d/ftrace/fgraph-retval.tc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-retval.tc b/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-retval.tc
index e8e46378b88d..4307d4eef417 100644
--- a/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-retval.tc
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-retval.tc
@@ -29,7 +29,7 @@ set -e
: "Test printing the error code in signed decimal format"
echo 0 > options/funcgraph-retval-hex
-count=`cat trace | grep 'proc_reg_write' | grep '=-5' | wc -l`
+count=`cat trace | grep 'proc_reg_write' | grep -e '=-5 ' -e '= -5 ' | wc -l`
if [ $count -eq 0 ]; then
fail "Return value can not be printed in signed decimal format"
fi
--
2.45.2
v25: https://patchwork.kernel.org/project/netdevbpf/list/?series=885396&state=*
===
Major changes:
- Moved devmem.h and mp_dmabuf_devmem.h to internal header files.
- Changed the page_pool_params to take in a queue_idx rather than
a struct netdev_rx_queue.
- Added WARN_ON_ONCE around __skb_checksum readability check and added
check to skb_checksum_help().
Other more minor feedback addressed as well.
v24: https://patchwork.kernel.org/project/netdevbpf/list/?series=884556&state=*
====
No major changes. Mostly addressing issues in the error paths of dmabuf
binding, and code cleanups/improvements from reviewers:
Changes:
- Fix failing ynl regen error.
- Error path fixes & extack error messages in dmabuf binding.
- Code cleanup in introspection.
- gitignore ynl.d generated file.
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v24/
v23: https://patchwork.kernel.org/project/netdevbpf/list/?series=882978&state=*
====
Fixing relatively minor issues called out in v22. (thanks again!)
Mostly code cleanups, extack error messages, and minor reworks. Nothing
major really changed, so the exact changes per commit are called out in
the commit messages.
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v23/
v22: https://patchwork.kernel.org/project/netdevbpf/list/?series=881158&state=*
====
v22 aims to resolve the pending issue pointed to in v21, which is the
interaction with xdp. In this series I rebase on top of the minor
refactor which refactors propagating xdp configuration to slave devices:
https://patchwork.kernel.org/project/netdevbpf/list/?series=881994&state=*
I then disable setting xdp on devices using memory providers, and
propagating xdp configuration to devices using memory providers.
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v22/
v21: https://patchwork.kernel.org/project/netdevbpf/list/?series=880735&state=*
====
v20 addressed some comments and resolved a test failure, but introduced
an unfortunate build error with a config edge case I wasn't testing. v21
simply resolves that error.
Major Changes:
- Resolve build error with CONFIG_PAGE_POOL=n && CONFIG_NET=y
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v21/
v20: https://patchwork.kernel.org/project/netdevbpf/list/?series=879373&state=*
====
v20 aims to resolve a couple of bug reports against v19, and addresses
some review comments around the page_pool_check_memory_provider
mechanism.
Major changes:
- Test edge cases such as header split disabled in selftest.
- Change `offset = 0` back to `offset = offset - start` to resolve issue
found in RX path by Taehee (thanks!)
- Address a few comments around page_pool_check_memory_provider() from
Pavel & Jakub.
- Removed some unnecessary includes across various patches in the
series.
- Removed unnecessary EXPORT_SYMBOL(page_pool_mem_providers) (Jakub).
- Fix regression caused by incorrect dev_get_max_mp_channel check, along
with rename (Jakub).
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v20/
v19: https://patchwork.kernel.org/project/netdevbpf/list/?series=876852&state=*
====
v18 got a thorough review (thanks!), and this iteration addresses the
feedback.
Major changes:
- Prevent deactivating mp bound queues.
- Prevent installing xdp on mp bound netdevs, or installing mps on xdp
installed netdevs.
- Fix corner cases in netlink API vis-a-vis missing attributes.
- Iron out the unreadable netmem driver support story. To be honest, the
conversation with Jakub & Pavel got a bit confusing for me. I've
implemented an approach in this set that makes sense to me, and
AFAICT, addresses the requirements. It may be good as-is, or it
may be a conversation starter/continuer. To be honest IMO there
are many ways to skin this cat and I don't see an extremely strong
reason to go for one approach over another. Here is one approach you
may like.
- Don't reset niov dma_addr on allocation & free.
- Add some tests to the selftest that catches some of the issues around
missing netlink attributes or deactivating mp-bound queues.
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v19/
v18: https://patchwork.kernel.org/project/netdevbpf/list/?series=874848&state=*
====
v17 got minor feedback: (a) to beef up the description on patch 1 and (b)
to remove the leading underscores in the header definition.
I applied (a). (b) seems to be against current conventions so I did not
apply before further discussion.
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v17/
v17: https://patchwork.kernel.org/project/netdevbpf/list/?series=869900&state=*
====
v16 also got a very thorough review and some testing (thanks again!).
This version addresses all the concerns reported on v15, in terms of
feedback and issues reported.
Major changes:
- Use ASSERT_RTNL.
- Moved around some of the page_pool helpers definitions so I can hide
some netmem helpers in private files as Jakub suggested.
- Don't make every net_iov hold a ref on the binding as Jakub suggested.
- Fix issue reported by Taehee where we access queues after they have
been freed.
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v17/
v16: https://patchwork.kernel.org/project/netdevbpf/list/?series=866353&state=*
====
v15 got a thorough review and some testing, and this version addresses almost
all the feedback. I left out some more minor comments where the authors
said they could be addressed later.
Major changes:
- Addition of dma-buf introspection to page-pool-get and queue-get.
- Fixes to selftests suggested by Taehee.
- Fixes to documentation suggested by Donald.
- A couple of suggestions and fixes to TCP patches by Eric and David.
- Fixes to number assignments suggested by Arnd.
- Use rtnl_lock()ing to guard against queue reconfiguration while the
page_pool initialization is happening. (Jakub).
- Fixes to a few warnings reproduced by Taehee.
- Fixes to dma-buf binding suggested by Taehee and Jakub.
- Fixes to netlink UAPI suggested by Jakub
- Applied a number of Reviewed-bys and Acked-bys (including ones I lost
from v13+).
Full devmem TCP changes including the full GVE driver implementation are
here:
https://github.com/mina/linux/commits/tcpdevmem-v16/
One caveat: Taehee reproduced a KASAN warning and reported it here:
https://lore.kernel.org/netdev/CAMArcTUdCxOBYGF3vpbq=eBvqZfnc44KBaQTN7H-wqd…
I estimate the issue to be minor and easily fixable:
https://lore.kernel.org/netdev/CAHS8izNgaqC--GGE2xd85QB=utUnOHmioCsDd1TNxJW…
I hope to be able to follow up with a fix to net tree as net-next closes
imminently, but if this iteration doesn't make it in, I will repost with
a fix squashed after net-next reopens, no problem.
v15: https://patchwork.kernel.org/project/netdevbpf/list/?series=865481&state=*
====
No material changes in this version, only a fix to linking against
libynl.a from the last version. Per Jakub's instructions I've pulled one
of his patches into this series, and now use the new libynl.a correctly,
I hope.
As usual, the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v15/
v14: https://patchwork.kernel.org/project/netdevbpf/list/?series=865135&archive=…
====
No material changes in this version. Only rebase and re-verification on
top of net-next. v13, I think, raced with commit ebad6d0334793
("net/ipv4: Use nested-BH locking for ipv4_tcp_sk.") being merged to
net-next that caused a patchwork failure to apply. This series should
apply cleanly on commit c4532232fa2a4 ("selftests: net: remove unneeded
IP_GRE config").
I did not wait the customary 24hr as Jakub said it's OK to repost as soon
as I build test the rebased version:
https://lore.kernel.org/netdev/20240625075926.146d769d@kernel.org/
v13: https://patchwork.kernel.org/project/netdevbpf/list/?series=861406&archive=…
====
Major changes:
--------------
This iteration addresses Pavel's review comments, applies his
reviewed-by's, and seeks to fix the patchwork build error (sorry!).
As usual, the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v13/
v12: https://patchwork.kernel.org/project/netdevbpf/list/?series=859747&state=*
====
Major changes:
--------------
This iteration only addresses one minor comment from Pavel with regards
to the trace printing of netmem, and the patchwork build error
introduced in v11 because I missed doing an allmodconfig build, sorry.
Other than that v11, AFAICT, received no feedback. There is one
discussion about the specifics of plugging io_uring memory through the
page pool, but it is not relevant to the content of this particular
patchset, AFAICT.
As usual, the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v12/
v11: https://patchwork.kernel.org/project/netdevbpf/list/?series=857457&state=*
====
Major Changes:
--------------
v11 addresses feedback received in v10. The major change is the removal
of the memory provider ops as requested by Christoph. We still
accomplish the same thing, but utilizing direct function calls with if
statements rather than generic ops.
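To illustrate the shape of that change (this is not code from the series),
here is a minimal userspace C sketch of dispatching through a direct call
behind an if statement instead of through an ops table. All demo_* names
are made up for illustration and the allocators are trivial stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; they only illustrate the dispatch shape. */
enum demo_provider {
	DEMO_PROVIDER_NONE = 0,  /* pool is backed by ordinary host pages */
	DEMO_PROVIDER_DMABUF,    /* pool is backed by a bound dma-buf */
};

struct demo_pool {
	enum demo_provider provider;
	int host_allocs;    /* allocations served by the host path */
	int devmem_allocs;  /* allocations served by the dma-buf path */
};

/* Stand-ins for the real allocators: they only count calls. */
static int demo_alloc_host(struct demo_pool *pool)
{
	return ++pool->host_allocs;
}

static int demo_alloc_dmabuf(struct demo_pool *pool)
{
	return ++pool->devmem_allocs;
}

/*
 * Direct dispatch: one branch on a field of the pool, then a direct call
 * to a known function. No function-pointer table is involved.
 */
static int demo_pool_alloc(struct demo_pool *pool)
{
	assert(pool != NULL);
	if (pool->provider == DEMO_PROVIDER_DMABUF)
		return demo_alloc_dmabuf(pool);
	return demo_alloc_host(pool);
}
```

The trade-off in this shape is that every new provider needs another
branch at each call site, in exchange for not exposing a generic hook.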
Additionally, this version addresses sparse warnings, bugs, and review
comments from folks who reviewed.
As usual, the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v11/
Detailed changelog:
-------------------
- Fixes in netdev_rx_queue_restart() from Pavel & David.
- Removed commit e650e8c3a36f5 ("net: page_pool: create hooks for
custom page providers") from the series to address Christoph's
feedback and rebased the other patches in the series on this change.
- Fixed build errors with CONFIG_DMA_SHARED_BUFFER &&
!CONFIG_GENERIC_ALLOCATOR build.
- Fixed sparse warnings pointed out by Paolo.
- Drop unnecessary gro_pull_from_frag0 checks.
- Added Bagas reviewed-by to docs.
v10: https://patchwork.kernel.org/project/netdevbpf/list/?series=852422&state=*
====
Major Changes:
--------------
v9 was sent right before the merge window closed (sorry!). v10 is almost
a re-send of the series now that the merge window re-opened. Only
rebased to latest net-next and addressed some minor iterative comments
received on v9.
As usual, the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v10/
Detailed changelog:
-------------------
- Fixed tokens leaking in DONTNEED setsockopt (Nikolay).
- Moved net_iov_dma_addr() to devmem.c and made it a devmem-specific
helper (David).
- Renamed hook alloc_pages to alloc_netmems, as alloc_pages is now
defined as a preprocessor macro and the old name causes a build error.
v9:
===
Major Changes:
--------------
GVE queue API has been merged. Submitting this version as non-RFC after
rebasing on top of the merged API and dropping the out-of-tree queue API
I was carrying on github. Addressed the little feedback v8 has received.
Detailed changelog:
------------------
- Added new patch from David Wei to this series for
netdev_rx_queue_restart()
- Fixed sparse error.
- Removed CONFIG_ checks in netmem_is_net_iov()
- Flipped skb->readable to skb->unreadable
- Minor fixes to selftests & docs.
RFC v8:
=======
Major Changes:
--------------
- Fixed build error generated by patch-by-patch build.
- Applied docs suggestions from Randy.
RFC v7:
=======
Major Changes:
--------------
This revision largely rebases on top of net-next and addresses the feedback
RFCv6 received from folks, namely Jakub, Yunsheng, Arnd, David, & Pavel.
The series remains in RFC because the queue-API ndos defined in this
series are not yet implemented. I have a GVE implementation I carry out
of tree for my testing. An upstreamable GVE implementation is in the
works. Aside from that, in my estimation all the patches are ready for
review/merge. Please do take a look.
As usual the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v7/
Detailed changelog:
- Use admin-perm in netlink API.
- Addressed feedback from Jakub with regards to netlink API
implementation.
- Renamed devmem.c functions to something more appropriate for that
file.
- Improve the performance seen through the page_pool benchmark.
- Fix the value definition of all the SO_DEVMEM_* uapi.
- Various fixes to documentation.
Perf - page-pool benchmark:
---------------------------
Improved performance of bench_page_pool_simple.ko tests compared to v6:
https://pastebin.com/raw/v5dYRg8L
net-next base: 8 cycle fast path.
RFC v6: 10 cycle fast path.
RFC v7: 9 cycle fast path.
RFC v7 with CONFIG_DMA_SHARED_BUFFER disabled: 8 cycle fast path,
same as baseline.
Perf - Devmem TCP benchmark:
---------------------
Perf is about the same regardless of the changes in v7, namely the
removal of the static_branch_unlikely to improve the page_pool benchmark
performance:
189/200gbps bi-directional throughput with RX devmem TCP and regular TCP
TX i.e. ~95% line rate.
RFC v6:
=======
Major Changes:
--------------
This revision largely rebases on top of net-next and addresses the little
feedback RFCv5 received.
The series remains in RFC because the queue-API ndos defined in this
series are not yet implemented. I have a GVE implementation I carry out
of tree for my testing. An upstreamable GVE implementation is in the
works. Aside from that, in my estimation all the patches are ready for
review/merge. Please do take a look.
As usual the full devmem TCP changes including the full GVE driver
implementation are here:
https://github.com/mina/linux/commits/tcpdevmem-v6/
This version also comes with some performance data recorded in the cover
letter (see below changelog).
Detailed changelog:
- Rebased on top of the merged netmem_ref changes.
- Converted skb->dmabuf to skb->readable (Pavel). Pavel's original
suggestion was to remove the skb->dmabuf flag entirely, but when I
looked into it closely, I found the issue that if we remove the flag
we have to dereference the shinfo(skb) pointer to obtain the first
frag to tell whether an skb is readable or not. This can cause a
performance regression if it dirties the cache line when the
shinfo(skb) was not really needed. Instead, I converted the skb->dmabuf
flag into a generic skb->readable flag which can be re-used by io_uring
0-copy RX.
- Squashed a few locking optimizations from Eric Dumazet in the RX path
and the DEVMEM_DONTNEED setsockopt.
- Expanded the tests a bit. Added validation for invalid scenarios and
added some more coverage.
Perf - page-pool benchmark:
---------------------------
bench_page_pool_simple.ko tests with and without these changes:
https://pastebin.com/raw/ncHDwAbn
AFAIK the number that really matters in the perf tests is the
'tasklet_page_pool01_fast_path Per elem'. This one measures at about 8
cycles without the changes but there is some 1 cycle noise in some
results.
With the patches this regresses to 9 cycles, but there is occasionally
1 cycle of noise when running this test repeatedly.
Lastly, I tried disabling the static_branch_unlikely() check in
netmem_is_net_iov(). To my surprise, disabling the
static_branch_unlikely() check reduces the fast path back to 8 cycles,
but the 1 cycle noise remains.
Perf - Devmem TCP benchmark:
---------------------
189/200gbps bi-directional throughput with RX devmem TCP and regular TCP
TX i.e. ~95% line rate.
Major changes in RFC v5:
========================
1. Rebased on top of 'Abstract page from net stack' series and used the
new netmem type to refer to LSB set pointers instead of re-using
struct page.
2. Downgraded this series back to RFC and called it RFC v5. This is
because this series is now dependent on 'Abstract page from net
stack'[1] and the queue API. Both are removed from the series to
reduce the patch # and those bits are fairly independent or
pre-requisite work.
3. Reworked the page_pool devmem support to use netmem and for some
more unified handling.
4. Reworked the reference counting of net_iov (renamed from
page_pool_iov) to use pp_ref_count for refcounting.
The full changes including the dependent series and GVE page pool
support are here:
https://github.com/mina/linux/commits/tcpdevmem-rfcv5/
[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=810774
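For readers unfamiliar with the "LSB set pointers" mentioned in item 1
above, here is a minimal userspace C sketch of the general idea: one
opaque handle carries either of two pointer types, and the lowest bit of
the value says which. The demo_* names are hypothetical and are not the
types from the 'Abstract page from net stack' series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a host page and a device-memory chunk. */
struct demo_page    { int id; };
struct demo_net_iov { int id; };

/* Opaque handle: a pointer value whose lowest bit is a type tag. */
typedef uintptr_t demo_netmem_ref;

#define DEMO_NET_IOV_BIT ((uintptr_t)1)

static demo_netmem_ref demo_page_to_netmem(struct demo_page *page)
{
	/* Struct alignment leaves the low bit of the address clear. */
	assert(((uintptr_t)page & DEMO_NET_IOV_BIT) == 0);
	return (uintptr_t)page;
}

static demo_netmem_ref demo_net_iov_to_netmem(struct demo_net_iov *niov)
{
	assert(((uintptr_t)niov & DEMO_NET_IOV_BIT) == 0);
	return (uintptr_t)niov | DEMO_NET_IOV_BIT;
}

static bool demo_netmem_is_net_iov(demo_netmem_ref ref)
{
	return (ref & DEMO_NET_IOV_BIT) != 0;
}

/* Returns NULL when the handle does not refer to a host page. */
static struct demo_page *demo_netmem_to_page(demo_netmem_ref ref)
{
	if (demo_netmem_is_net_iov(ref))
		return NULL;
	return (struct demo_page *)ref;
}

/* Returns NULL when the handle does not refer to device memory. */
static struct demo_net_iov *demo_netmem_to_net_iov(demo_netmem_ref ref)
{
	if (!demo_netmem_is_net_iov(ref))
		return NULL;
	return (struct demo_net_iov *)(ref & ~DEMO_NET_IOV_BIT);
}
```

Code that only passes the handle around never needs to know which kind
it holds; only the two accessors look at the tag bit.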
Major changes in v1:
====================
1. Implemented MVP queue API ndos to remove the userspace-visible
driver reset.
2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
3. Removed RFC tag.
Many smaller addressed comments across all the patches (patches have
individual change log).
Full tree including the rest of the GVE driver changes:
https://github.com/mina/linux/commits/tcpdevmem-v1
Changes in RFC v3:
==================
1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the
series reviewable and mergeable.
2. Implemented multi-rx-queue binding which was a todo in v2.
3. Fix to cmsg handling.
The sticking point in RFC v2[2] was the device reset required to refill
the device rx-queues after the dmabuf bind/unbind. The solution
suggested, as I understand it, is a subset of the per-queue management ops
Jakub suggested, or similar:
https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
This is not addressed in this revision, because:
1. This point was discussed at netconf & netdev and there is openness to
using the current approach of requiring a device reset.
2. Implementing individual queue resetting seems to be difficult for my
test bed with GVE. My prototype to test this ran into issues with the
rx-queues not coming back up properly if reset individually. At the
moment I'm unsure if it's a mistake in the POC or a genuine issue in
the virtualization stack behind GVE, which currently doesn't test
individual rx-queue restart.
3. Our usecases are not bothered by requiring a device reset to refill
the buffer queues, and we'd like to support NICs that run into this
limitation with resetting individual queues.
My thought is that drivers that have trouble with per-queue configs can
use the support in this series, while drivers that support new netdev
ops to reset individual queues can automatically reset the queue as
part of the dma-buf bind/unbind.
The same approach with device resets is presented again for consideration
with other sticking points addressed.
This proposal includes only the rx devmem path, which is proposed for merge. For a
snapshot of my entire tree which includes the GVE POC page pool support &
device memory support:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
[1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.…
[2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4…
Changes in RFC v2:
==================
The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].
Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
page pool.
2. Replaced the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
implementing the TX path with scatterlist approach, but leaving
out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
perf testing yet. I'm not sure there are regressions, but I removed
perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.
Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:
1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
the page pool (Patch 6).
[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c…
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX…
==================
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.
* Problem:
A large number of data transfers have device memory as the source and/or
destination. Accelerators have drastically increased the volume of such
transfers.
Some examples include:
- ML accelerators transferring large amounts of training data from storage into
GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
TPU compute time; improving data transfer throughput & efficiency can help
improve GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts,
exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with
remote SSDs; much of this data does not require host processing.
Today, the majority of the Device-to-Device data transfers over the network are
implemented as the following low level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers, and can
put significant strain on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel’s lack of semantics to express device to network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through the
root complex.
* Patch overview:
** Part 1: netlink API
Gives the user the ability to bind a dma-buf to an RX queue.
** Part 2: scatterlist support
Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs,
drivers, and page pool) operates on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.
Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
** Part 3: page pool support
We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
** Part 4: support for unreadable skb frags
Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.
** Part 5: recvmsg() APIs
We define user APIs for the user to send and receive device memory.
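As a rough illustration of what consuming such an API could look like on
the receive side, here is a minimal C sketch that walks the control
messages of an already-filled msghdr and collects per-fragment
descriptors. The cover letter does not spell out the cmsg type or the
descriptor layout, so struct demo_frag and DEMO_SCM_DEVMEM_FRAG below are
placeholders and not the uapi definitions from this series; only the
CMSG_* walking pattern is standard.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical per-fragment descriptor. The payload itself stays in
 * device memory; the host only learns where it landed.
 */
struct demo_frag {
	uint64_t offset; /* where the payload starts inside the dma-buf */
	uint32_t size;   /* payload length in bytes */
	uint32_t token;  /* handle the user later gives back to the kernel */
};

#define DEMO_SCM_DEVMEM_FRAG 0x7f01 /* placeholder cmsg type */

/*
 * Copy up to max fragment descriptors out of msg's control buffer.
 * Returns the number of descriptors copied into out.
 */
static size_t demo_collect_frags(struct msghdr *msg, struct demo_frag *out,
				 size_t max)
{
	struct cmsghdr *cm;
	size_t n = 0;

	assert(msg != NULL && out != NULL);
	for (cm = CMSG_FIRSTHDR(msg); cm && n < max; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_type != DEMO_SCM_DEVMEM_FRAG)
			continue; /* some other ancillary message */
		if (cm->cmsg_len < CMSG_LEN(sizeof(*out)))
			continue; /* too short to hold a descriptor */
		memcpy(&out[n++], CMSG_DATA(cm), sizeof(*out));
	}
	return n;
}
```

In a real receiver, msg would come back from recvmsg(), and each token
would eventually be returned to the kernel so the fragment can be reused;
the patch list names an SO_DEVMEM_DONTNEED setsockopt for that purpose.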
Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem
This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using device memory
for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support,
i.e. the NIC's ability to steer flows into certain rx queues. This allows the
sysadmin to enable devmem TCP on a subset of the rx queues, and steer
devmem TCP traffic onto these queues and non devmem TCP elsewhere.
The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.
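To make the header split requirement concrete, here is a minimal C sketch
that computes where the split point falls for a plain Ethernet II + IPv4 +
TCP frame. The NIC does this in hardware; the function name and the
restriction to untagged IPv4/TCP frames are assumptions made only to keep
the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN 14 /* Ethernet II header, no VLAN tag */

/*
 * Return the number of leading bytes of frame that are headers
 * (Ethernet + IPv4 + TCP), or 0 if the frame is truncated or is not a
 * plain IPv4/TCP packet. Bytes from that offset onward are payload.
 */
static size_t demo_header_len(const uint8_t *frame, size_t len)
{
	size_t ip_hlen, tcp_off, tcp_hlen;

	assert(frame != NULL);
	if (len < DEMO_ETH_HLEN + 20)
		return 0;
	/* EtherType 0x0800 means IPv4. */
	if (frame[12] != 0x08 || frame[13] != 0x00)
		return 0;
	/* Upper nibble is the IP version, lower nibble is IHL in 32-bit words. */
	if ((frame[DEMO_ETH_HLEN] >> 4) != 4)
		return 0;
	ip_hlen = (size_t)(frame[DEMO_ETH_HLEN] & 0x0f) * 4;
	if (ip_hlen < 20)
		return 0;
	/* IPv4 protocol field: 6 means TCP. */
	if (frame[DEMO_ETH_HLEN + 9] != 6)
		return 0;
	tcp_off = DEMO_ETH_HLEN + ip_hlen;
	if (len < tcp_off + 20)
		return 0;
	/* TCP data offset: upper nibble of byte 12, in 32-bit words. */
	tcp_hlen = (size_t)(frame[tcp_off + 12] >> 4) * 4;
	if (tcp_hlen < 20 || len < tcp_off + tcp_hlen)
		return 0;
	return tcp_off + tcp_hlen;
}
```

With header split, the first demo_header_len() bytes would land in a host
memory buffer and the remainder in device memory.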
* Testing:
The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.
** Test Setup
Kernel: net-next with this series and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Cc: Pavel Begunkov <asml.silence(a)gmail.com>
Cc: David Wei <dw(a)davidwei.uk>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Yunsheng Lin <linyunsheng(a)huawei.com>
Cc: Shailend Chand <shailend(a)google.com>
Cc: Harshitha Ramamurthy <hramamurthy(a)google.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Jeroen de Borst <jeroendb(a)google.com>
Cc: Praveen Kaligineedi <pkaligineedi(a)google.com>
Cc: Bagas Sanjaya <bagasdotme(a)gmail.com>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Nikolay Aleksandrov <razor(a)blackwall.org>
Cc: Taehee Yoo <ap420073(a)gmail.com>
Cc: Donald Hunter <donald.hunter(a)gmail.com>
Mina Almasry (13):
netdev: add netdev_rx_queue_restart()
net: netdev netlink api to bind dma-buf to a net device
netdev: support binding dma-buf to netdevice
netdev: netdevice devmem allocator
page_pool: devmem support
memory-provider: dmabuf devmem memory provider
net: support non paged skb frags
net: add support for skbs with unreadable frags
tcp: RX path for devmem TCP
net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags
net: add devmem TCP documentation
selftests: add ncdevmem, netcat for devmem TCP
netdev: add dmabuf introspection
Documentation/netlink/specs/netdev.yaml | 61 +++
Documentation/networking/devmem.rst | 269 +++++++++++
Documentation/networking/index.rst | 1 +
arch/alpha/include/uapi/asm/socket.h | 6 +
arch/mips/include/uapi/asm/socket.h | 6 +
arch/parisc/include/uapi/asm/socket.h | 6 +
arch/sparc/include/uapi/asm/socket.h | 6 +
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 61 ++-
include/linux/skbuff_ref.h | 9 +-
include/linux/socket.h | 1 +
include/net/netdev_rx_queue.h | 5 +
include/net/netmem.h | 132 +++++-
include/net/page_pool/helpers.h | 39 +-
include/net/page_pool/types.h | 23 +-
include/net/sock.h | 2 +
include/net/tcp.h | 3 +-
include/trace/events/page_pool.h | 12 +-
include/uapi/asm-generic/socket.h | 6 +
include/uapi/linux/netdev.h | 13 +
include/uapi/linux/uio.h | 17 +
net/Kconfig | 5 +
net/core/Makefile | 2 +
net/core/datagram.c | 6 +
net/core/dev.c | 33 +-
net/core/devmem.c | 389 ++++++++++++++++
net/core/devmem.h | 180 ++++++++
net/core/gro.c | 3 +-
net/core/mp_dmabuf_devmem.h | 44 ++
net/core/netdev-genl-gen.c | 23 +
net/core/netdev-genl-gen.h | 6 +
net/core/netdev-genl.c | 139 +++++-
net/core/netdev_rx_queue.c | 81 ++++
net/core/netmem_priv.h | 31 ++
net/core/page_pool.c | 120 +++--
net/core/page_pool_priv.h | 46 ++
net/core/page_pool_user.c | 32 +-
net/core/skbuff.c | 77 +++-
net/core/sock.c | 68 +++
net/ethtool/common.c | 8 +
net/ipv4/esp4.c | 3 +-
net/ipv4/tcp.c | 263 ++++++++++-
net/ipv4/tcp_input.c | 13 +-
net/ipv4/tcp_ipv4.c | 16 +
net/ipv4/tcp_minisocks.c | 2 +
net/ipv4/tcp_output.c | 5 +-
net/ipv6/esp6.c | 3 +-
net/packet/af_packet.c | 4 +-
net/xdp/xsk_buff_pool.c | 5 +
tools/include/uapi/linux/netdev.h | 13 +
tools/net/ynl/lib/.gitignore | 1 +
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 9 +
tools/testing/selftests/net/ncdevmem.c | 570 ++++++++++++++++++++++++
54 files changed, 2757 insertions(+), 124 deletions(-)
create mode 100644 Documentation/networking/devmem.rst
create mode 100644 net/core/devmem.c
create mode 100644 net/core/devmem.h
create mode 100644 net/core/mp_dmabuf_devmem.h
create mode 100644 net/core/netdev_rx_queue.c
create mode 100644 net/core/netmem_priv.h
create mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.46.0.469.g59c65b2a67-goog
A PACKET socket can retain its fanout membership through link down and up,
and can leave a fanout while closed regardless of link state.
However, a socket was forbidden from joining a fanout while it was not
RUNNING.
This patch allows a PACKET socket to join a fanout while not RUNNING.
Selftest psock_fanout is extended to test this scenario.
This is the only test that was performed.
This scenario was identified while studying DPDK pmd_af_packet_drv.
Since sockets are only created during initialization, there is no reason
to fail the initialization if a single link is temporarily down.
I hope it is not considered as breaking user space and that applications
are not designed to expect this failure.
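For context, the userspace call affected by this change is the
PACKET_FANOUT setsockopt. A minimal C sketch of issuing it is below; the
demo_* helper names are made up, and the socket is assumed to have been
created and bound elsewhere.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263 /* fallback for C libraries that do not define it */
#endif

/*
 * Pack the PACKET_FANOUT argument: group id in the low 16 bits, fanout
 * mode (plus any mode flags) in the high 16 bits.
 */
static int demo_fanout_arg(unsigned int group_id, unsigned int mode)
{
	assert(group_id <= 0xffff && mode <= 0xffff);
	return (int)((mode << 16) | group_id);
}

/*
 * Ask the kernel to add an AF_PACKET socket to a fanout group.
 * Returns 0 on success or -1 with errno set, like setsockopt().
 */
static int demo_fanout_join(int fd, unsigned int group_id, unsigned int mode)
{
	int arg = demo_fanout_arg(group_id, mode);

	return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
}
```

An application that opens one socket per link at startup, as described
above, would call demo_fanout_join(fd, id, PACKET_FANOUT_HASH) once per
socket and, with this patch, would no longer need the link to be up at
that moment.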
Changes:
V02:
* psock_fanout: use explicit loopback up/down instead of toggle.
* psock_fanout: don't try to restore loopback state on failure.
* Rephrase commit message about "leaving a fanout".
V01: https://lore.kernel.org/netdev/cover.1728303615.git.gur.stavi@huawei.com/
Gur Stavi (2):
af_packet: allow fanout_add when socket is not RUNNING
selftests: net/psock_fanout: socket joins fanout when link is down
net/packet/af_packet.c | 10 +++---
tools/testing/selftests/net/psock_fanout.c | 42 ++++++++++++++++++++--
2 files changed, 44 insertions(+), 8 deletions(-)
base-commit: f95b4725e796b12e5f347a0d161e1d3843142aa8
--
2.45.2