The selftest I provided can reproduce a panic:
'./test_progs -a cgroup_storage_update'
When we attach a program to cgroup and if prog->aux->cgroup_storage
exists, which means the cgroup_storage map is used in the program, we
will then allocate storage by bpf_cgroup_storages_alloc() and assign it
to pl->storage.
At the end, pl->storage will be assigned to
cgrp->bpf.effective[atype]->cgroup_storage by xxx_effective_progs().
But when we attach a program without the cgroup_storage map being used
(prog->aux->cgroup_storage is empty), the cgroup_storage in struct
bpf_prog_array_item is empty.
Then, if we use BPF_LINK_UPDATE to replace the old program with a new one
that uses the cgroup_storage map, we miss the cgroup_storage being
initialized.
This causes a panic when accessing storage in bpf_get_local_storage.
Jiayuan Chen (2):
bpf: Create cgroup storage if needed when updating link
selftests/bpf: Add link update test for cgroup_storage
kernel/bpf/cgroup.c | 24 +++++++---
.../selftests/bpf/prog_tests/cgroup_storage.c | 45 +++++++++++++++++++
.../selftests/bpf/progs/cgroup_storage.c | 6 +++
3 files changed, 70 insertions(+), 5 deletions(-)
--
2.47.1
If we try to access argument which is pointer to const void, it's an
UNKNOWN type, verifier will fail to load.
Use is_void_or_int_ptr to check if type is void or int pointer.
Add a selftest to check it.
---
KaFai Wan (2):
bpf: Allow access to const void pointer arguments in tracing programs
selftests/bpf: Add test to access const void pointer argument in
tracing program
kernel/bpf/btf.c | 13 +++----------
net/bpf/test_run.c | 8 +++++++-
.../selftests/bpf/progs/verifier_btf_ctx_access.c | 12 ++++++++++++
3 files changed, 22 insertions(+), 11 deletions(-)
Changelog:
v3->v4: Addressed comments from Alexei Starovoitov
- change SOB to match From email address
- add Acked-by from jirka
Details in here:
https://lore.kernel.org/all/20250417151548.1276279-1-kafai.wan@hotmail.com/
v2->v3: Addressed comments from jirka
- remove duplicate checks for void pointer
Details in here:
https://lore.kernel.org/bpf/20250416161756.1079178-1-kafai.wan@hotmail.com/
v1->v2: Addressed comments from jirka
- use btf_type_is_void to check if type is void
- merge is_void_ptr and is_int_ptr to is_void_or_int_ptr
- fix selftests
Details in here:
https://lore.kernel.org/all/20250412170626.3638516-1-kafai.wan@hotmail.com/
--
2.43.0
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with a
RCU protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. While the kernel must have CONFIG_DEBUG_FS for
the dmabuf_iter to be available, debugfs does not need to be mounted.
The BPF program loaded by userspace that extracts per-buffer information
gets to define its own interface which avoids the lack of ABI stability
with debugfs (even if it were mounted).
As this is a replacement for our use of CONFIG_DMABUF_SYSFS_STATS, the
last patch is a RFC for removing it from the kernel. Please see my
suggestion there regarding the timeline for that.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com/
T.J. Mercier (4):
dma-buf: Rename and expose debugfs symbols
bpf: Add dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
RFC: dma-buf: Remove DMA-BUF statistics
.../ABI/testing/sysfs-kernel-dmabuf-buffers | 24 ---
Documentation/driver-api/dma-buf.rst | 5 -
drivers/dma-buf/Kconfig | 15 --
drivers/dma-buf/Makefile | 1 -
drivers/dma-buf/dma-buf-sysfs-stats.c | 202 ------------------
drivers/dma-buf/dma-buf-sysfs-stats.h | 35 ---
drivers/dma-buf/dma-buf.c | 40 +---
include/linux/btf_ids.h | 1 +
include/linux/dma-buf.h | 6 +
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 130 +++++++++++
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 116 ++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 31 +++
14 files changed, 299 insertions(+), 311 deletions(-)
delete mode 100644 Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.c
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.h
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
--
2.49.0.604.gff1f9ca942-goog
On Tue, Apr 22, 2025 at 07:58:56PM -0400, Waiman Long <llong(a)redhat.com> wrote:
> Am I correct to assume that the purpose of 1d09069f5313f ("selftests:
> memcg: expect no low events in unprotected sibling") is to force a
> failure in the test_memcg_low test to force a change in the current
> behavior? Or was it the case that it didn't fail when you submit your
> patch?
Yes, the failure had been intended to mark unexpected mode of reclaim
(there's still a reproducer somewhere in the references). However, I
learnt that:
a) it ain't easy to fix,
b) the only occurence of the troublesome behavior was in the test and
never reported by users in real life.
I've started to prefer the variant where the particular check is
indefinite since that.
HTH,
Michal
Misbehaving guests can cause bus locks to degrade the performance of
a system. Non-WB (write-back) and misaligned locked RMW (read-modify-write)
instructions are referred to as "bus locks" and require system wide
synchronization among all processors to guarantee the atomicity. The bus
locks can impose notable performance penalties for all processors within
the system.
Support for the Bus Lock Threshold is indicated by CPUID
Fn8000_000A_EDX[29] BusLockThreshold=1, the VMCB provides a Bus Lock
Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count.
VMCB intercept bit
VMCB Offset Bits Function
14h 5 Intercept bus lock operations
Bus lock threshold count
VMCB Offset Bits Function
120h 15:0 Bus lock counter
During VMRUN, the bus lock threshold count is fetched and stored in an
internal count register. Prior to executing a bus lock within the guest,
the processor verifies the count in the bus lock register. If the count is
greater than zero, the processor executes the bus lock, reducing the count.
However, if the count is zero, the bus lock operation is not performed, and
instead, a Bus Lock Threshold #VMEXIT is triggered to transfer control to
the Virtual Machine Monitor (VMM).
A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code 0xA5h,
VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on a VMEXIT_BUSLOCK.
On a #VMEXIT, the processor writes the current value of the Bus Lock
Threshold Counter to the VMCB.
Note: Currently, virtualizing the Bus Lock Threshold feature for L1 guest is
not supported.
More details about the Bus Lock Threshold feature can be found in AMD APM
[1].
v3 -> v4
- Incorporated Sean's review comments
- Added a preparatory patch to move linear_rip out of kvm_pio_request, so
that it can be used by the bus lock threshold patches.
- Added complete_userspace_buslock() function to reload bus_lock_counter
to '1' only if the usespace has not changed the RIP.
- Added changes to continue running bus_lock_counter accross the nested
transitions.
v2 -> v3
- Drop parch to add virt tag in /proc/cpuinfo.
- Incorporated Tom's review comments.
v1 -> v2
- Incorporated misc review comments from Sean.
- Removed bus_lock_counter module parameter.
- Set the value of bus_lock_counter to zero by default and reload the value by 1
in bus lock exit handler.
- Add documentation for the behavioral difference for KVM_EXIT_BUS_LOCK.
- Improved selftest for buslock to work on SVM and VMX.
- Rewrite the commit messages.
Patches are prepared on kvm-next/next (c9ea48bb6ee6).
Testing done:
- Tested the Bus Lock Threshold functionality on normal, SEV, SEV-ES and SEV-SNP guests.
- Tested the Bus Lock Threshold functionality on nested guests.
v1: https://lore.kernel.org/kvm/20240709175145.9986-4-manali.shukla@amd.com/T/
v2: https://lore.kernel.org/kvm/20241001063413.687787-4-manali.shukla@amd.com/T/
v3: https://lore.kernel.org/kvm/20241004053341.5726-1-manali.shukla@amd.com/T/
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024,
Vol 2, 15.14.5 Bus Lock Threshold.
https://bugzilla.kernel.org/attachment.cgi?id=306250
Manali Shukla (3):
KVM: x86: Preparatory patch to move linear_rip out of kvm_pio_request
x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs
Nikunj A Dadhania (2):
KVM: SVM: Enable Bus lock threshold exit
KVM: selftests: Add bus lock exit test
Documentation/virt/kvm/api.rst | 19 +++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/include/asm/svm.h | 5 +-
arch/x86/include/uapi/asm/svm.h | 2 +
arch/x86/kvm/svm/nested.c | 42 ++++++
arch/x86/kvm/svm/svm.c | 38 +++++
arch/x86/kvm/svm/svm.h | 2 +
arch/x86/kvm/x86.c | 8 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/x86/kvm_buslock_test.c | 135 ++++++++++++++++++
11 files changed, 249 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/kvm_buslock_test.c
base-commit: c9ea48bb6ee6b28bbc956c1e8af98044618fed5e
--
2.34.1
v9: https://lore.kernel.org/netdev/20250415224756.152002-1-almasrymina@google.c…
Changelog:
- Use priv->bindings list instead of sock_bindings_list. This was missed
during the rebase as the bindings have been updated to use
priv->bindings recently (thanks Stan!)
v8: https://lore.kernel.org/netdev/20250308214045.1160445-1-almasrymina@google.…
Only address minor comments on V7
Changelog:
- Use netdev locking instead of rtnl_locking to match rx path.
- Now that iouring zcrx is in net-next, use NET_IOV_IOURING instead of
NET_IOV_UNSPECIFIED.
- Post send binding to net_devmem_dmabuf_bindings after it's been fully
initialized (Stan).
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.…
===
Changelog:
- Check the dmabuf net_iov binding belongs to the device the TX is going
out on. (Jakub)
- Provide detailed inspection of callsites of
__skb_frag_ref/skb_page_unref in patch 2's changelog (Jakub)
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.c…
===
v6 has no major changes. Addressed a few issues from Paolo and David,
and collected Acks from Stan. Thank you everyone for the review!
Changes:
- retain behavior to process MSG_FASTOPEN even if the provided cmsg is
invalid (Paolo).
- Rework the freeing of tx_vec slightly (it now has its own err label).
(Paolo).
- Squash the commit that makes dmabuf unbinding scheduled work into the
same one which implements the TX path so we don't run into future
errors on bisecting (Paolo).
- Fix/add comments to explain how dmabuf binding refcounting works
(David).
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.c…
===
v5 has no major changes; it clears up the relatively minor issues
pointed out to in v4, and rebases the series on top of net-next to
resolve the conflict with a patch that raced to the tree. It also
collects the review tags from v4.
Changes:
- Rebase to net-next
- Fix issues in selftest (Stan).
- Address comments in the devmem and netmem driver docs (Stan and Bagas)
- Fix zerocopy_fill_skb_from_devmem return error code (Stan).
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.…
===
v4 mainly addresses the critical driver support issue surfaced in v3 by
Paolo and Stan. Drivers aiming to support netmem_tx should make sure not
to pass the netmem dma-addrs to the dma-mapping APIs, as these dma-addrs
may come from dma-bufs.
Additionally other feedback from v3 is addressed.
Major changes:
- Add helpers to handle netmem dma-addrs. Add GVE support for
netmem_tx.
- Fix binding->tx_vec not being freed on error paths during the
tx binding.
- Add a minimal devmem_tx test to devmem.py.
- Clean up everything obsolete from the cover letter (Paolo).
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
===
Address minor comments from RFCv2 and fix a few build warnings and
ynl-regen issues. No major changes.
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
=======
RFC v2 addresses much of the feedback from RFC v1. I plan on sending
something close to this as net-next reopens, sending it slightly early
to get feedback if any.
Major changes:
--------------
- much improved UAPI as suggested by Stan. We now interpret the iov_base
of the passed in iov from userspace as the offset into the dmabuf to
send from. This removes the need to set iov.iov_base = NULL which may
be confusing to users, and enables us to send multiple iovs in the
same sendmsg() call. ncdevmem and the docs show a sample use of that.
- Removed the duplicate dmabuf iov_iter in binding->iov_iter. I think
this is good improvment as it was confusing to keep track of
2 iterators for the same sendmsg, and mistracking both iterators
caused a couple of bugs reported in the last iteration that are now
resolved with this streamlining.
- Improved test coverage in ncdevmem. Now multiple sendmsg() are tested,
and sending multiple iovs in the same sendmsg() is tested.
- Fixed issue where dmabuf unmapping was happening in invalid context
(Stan).
====================================================================
The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].
Full outline on usage of the TX path is detailed in the documentation
included with this series.
Test example is available via the kselftest included in the series as well.
The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.
Patch Overview:
---------------
1. Documentation & tests to give high level overview of the feature
being added.
1. Add netmem refcounting needed for the TX path.
2. Devmem TX netlink API.
3. Devmem TX net stack implementation.
4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
freed from contexts where we can't sleep.
5. Add devmem TX documentation.
6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
feature flag, and docs to enable drivers to declare netmem_tx support.
7. Guard netmem_tx against being enabled against drivers that don't
support it.
8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
devmem.py.
Testing:
--------
Testing is very similar to devmem TCP RX path. The ncdevmem test used
for the RX path is now augemented with client functionality to test TX
path.
* Test Setup:
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.
Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.
[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.…
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#…
Cc: sdf(a)fomichev.me
Cc: asml.silence(a)gmail.com
Cc: dw(a)davidwei.uk
Cc: Jamal Hadi Salim <jhs(a)mojatatu.com>
Cc: Victor Nogueira <victor(a)mojatatu.com>
Cc: Pedro Tammela <pctammela(a)mojatatu.com>
Cc: Samiullah Khawaja <skhawaja(a)google.com>
Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com>
Mina Almasry (8):
netmem: add niov->type attribute to distinguish different net_iov
types
net: add get_netmem/put_netmem support
net: devmem: Implement TX path
net: add devmem TCP TX documentation
net: enable driver support for netmem TX
gve: add netmem TX support to GVE DQO-RDA mode
net: check for driver support in netmem TX
selftests: ncdevmem: Implement devmem TCP TX
Stanislav Fomichev (1):
net: devmem: TCP tx netlink api
Documentation/netlink/specs/netdev.yaml | 12 +
Documentation/networking/devmem.rst | 150 ++++++++-
.../networking/net_cachelines/net_device.rst | 1 +
Documentation/networking/netdev-features.rst | 5 +
Documentation/networking/netmem.rst | 23 +-
drivers/net/ethernet/google/gve/gve_main.c | 4 +
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 8 +-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 17 +-
include/linux/skbuff_ref.h | 4 +-
include/net/netmem.h | 34 +-
include/net/sock.h | 1 +
include/uapi/linux/netdev.h | 1 +
io_uring/zcrx.c | 1 +
net/core/datagram.c | 48 ++-
net/core/dev.c | 34 +-
net/core/devmem.c | 139 ++++++--
net/core/devmem.h | 83 ++++-
net/core/netdev-genl-gen.c | 13 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 75 ++++-
net/core/skbuff.c | 48 ++-
net/core/sock.c | 6 +
net/ipv4/ip_output.c | 3 +-
net/ipv4/tcp.c | 50 ++-
net/ipv6/ip6_output.c | 3 +-
net/vmw_vsock/virtio_transport_common.c | 5 +-
tools/include/uapi/linux/netdev.h | 1 +
.../selftests/drivers/net/hw/devmem.py | 26 +-
.../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
30 files changed, 1009 insertions(+), 89 deletions(-)
base-commit: 240ce924d2718b8f6f622f2a9a9c219b9da736e8
--
2.49.0.805.g082f7c87e0-goog