Changes in RFC v3:
------------------
1. Pulled in the memory-provider dependency from Jakub's RFC[1] to make the
series reviewable and mergable.
2. Implemented multi-rx-queue binding which was a todo in v2.
3. Fix to cmsg handling.
The sticking point in RFC v2[2] was the device reset required to refill
the device rx-queues after the dmabuf bind/unbind. The solution
suggested as I understand is a subset of the per-queue management ops
Jakub suggested or similar:
https://lore.kernel.org/netdev/20230815171638.4c057dcd@kernel.org/
This is not addressed in this revision, because:
1. This point was discussed at netconf & netdev and there is openness to
using the current approach of requiring a device reset.
2. Implementing individual queue resetting seems to be difficult for my
test bed with GVE. My prototype to test this ran into issues with the
rx-queues not coming back up properly if reset individually. At the
moment I'm unsure if it's a mistake in the POC or a genuine issue in
the virtualization stack behind GVE, which currently doesn't test
individual rx-queue restart.
3. Our usecases are not bothered by requiring a device reset to refill
the buffer queues, and we'd like to support NICs that run into this
limitation with resetting individual queues.
My thought is that drivers that have trouble with per-queue configs can
use the support in this series, while drivers that support new netdev
ops to reset individual queues can automatically reset the queue as
part of the dma-buf bind/unbind.
The same approach with device resets is presented again for consideration
with other sticking points addressed.
This proposal includes the rx devmem path only proposed for merge. For a
snapshot of my entire tree which includes the GVE POC page pool support &
device memory support:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-v3
[1] https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.…
[2] https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4…
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Jeroen de Borst <jeroendb(a)google.com>
Cc: Praveen Kaligineedi <pkaligineedi(a)google.com>
Changes in RFC v2:
------------------
The sticking point in RFC v1[1] was the dma-buf pages approach we used to
deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
that attempts to resolve this by implementing scatterlist support in the
networking stack, such that we can import the dma-buf scatterlist
directly. This is the approach proposed at a high level here[2].
Detailed changes:
1. Replaced dma-buf pages approach with importing scatterlist into the
page pool.
2. Replace the dma-buf pages centric API with a netlink API.
3. Removed the TX path implementation - there is no issue with
implementing the TX path with scatterlist approach, but leaving
out the TX path makes it easier to review.
4. Functionality is tested with this proposal, but I have not conducted
perf testing yet. I'm not sure there are regressions, but I removed
perf claims from the cover letter until they can be re-confirmed.
5. Added Signed-off-by: contributors to the implementation.
6. Fixed some bugs with the RX path since RFC v1.
Any feedback welcome, but specifically the biggest pending questions
needing feedback IMO are:
1. Feedback on the scatterlist-based approach in general.
2. Netlink API (Patch 1 & 2).
3. Approach to handle all the drivers that expect to receive pages from
the page pool (Patch 6).
[1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@gmail.c…
[2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLX…
----------------------
* TL;DR:
Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
from device memory efficiently, without bouncing the data to a host memory
buffer.
* Problem:
A large amount of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the volume of such transfers.
Some examples include:
- ML accelerators transferring large amounts of training data from storage into
GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
TPU compute time, improving data transfer throughput & efficiency can help
improving GPU/TPU utilization.
- Distributed training, where ML accelerators, such as GPUs on different hosts,
exchange data among them.
- Distributed raw block storage applications transfer large amounts of data with
remote SSDs, much of this data does not require host processing.
Today, the majority of the Device-to-Device data transfers the network are
implemented as the following low level operations: Device-to-Host copy,
Host-to-Host network transfer, and Host-to-Device copy.
The implementation is suboptimal, especially for bulk data transfers, and can
put significant strains on system resources, such as host memory bandwidth,
PCIe bandwidth, etc. One important reason behind the current state is the
kernel’s lack of semantics to express device to network transfers.
* Proposal:
In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:
1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.
Packet _payloads_ go directly from the NIC to device memory for receive and from
device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
normally. The NIC _must_ support header split to achieve this.
Advantages:
- Alleviate host memory bandwidth pressure, compared to existing
network-transfer + device-copy semantics.
- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
of the PCIe tree, compared to traditional path which sends data through the
root complex.
* Patch overview:
** Part 1: netlink API
Gives user ability to bind dma-buf to an RX queue.
** Part 2: scatterlist support
Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, networking stack (skbs, drivers, and
page pool) operate on pages. We have 2 options:
1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.
Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
** part 3: page pool support
We piggy back on page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers
It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.
** part 4: support for unreadable skb frags
Page pool iovs are not accessible by the host; we implement changes
throughput the networking stack to correctly handle skbs with unreadable
frags.
** Part 5: recvmsg() APIs
We define user APIs for the user to send and receive device memory.
Not included with this RFC is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem
This RFC is built on top of net-next with Jakub's pp-providers changes
cherry-picked.
* NIC dependencies:
1. (strict) Devmem TCP require the NIC to support header split, i.e. the
capability to split incoming packets into a header + payload and to put
each into a separate buffer. Devmem TCP works by using device memory
for the packet payload, and host memory for the packet headers.
2. (optional) Devmem TCP works better with flow steering support & RSS support,
i.e. the NIC's ability to steer flows into certain rx queues. This allows the
sysadmin to enable devmem TCP on a subset of the rx queues, and steer
devmem TCP traffic onto these queues and non devmem TCP elsewhere.
The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would suffice.
I may be able to help reviewers bring up devmem TCP on their NICs.
* Testing:
The series includes a udmabuf kselftest that show a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.
** Test Setup
Kernel: net-next with this RFC and memory provider API cherry-picked
locally.
Hardware: Google Cloud A3 VMs.
NIC: GVE with header split & RSS & flow steering support.
Jakub Kicinski (2):
net: page_pool: factor out releasing DMA from releasing the page
net: page_pool: create hooks for custom page providers
Mina Almasry (10):
net: netdev netlink api to bind dma-buf to a net device
netdev: support binding dma-buf to netdevice
netdev: netdevice devmem allocator
memory-provider: dmabuf devmem memory provider
page-pool: device memory support
net: support non paged skb frags
net: add support for skbs with unreadable frags
tcp: RX path for devmem TCP
net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
selftests: add ncdevmem, netcat for devmem TCP
Documentation/netlink/specs/netdev.yaml | 28 ++
include/linux/netdevice.h | 93 ++++
include/linux/skbuff.h | 56 ++-
include/linux/socket.h | 1 +
include/net/netdev_rx_queue.h | 1 +
include/net/page_pool/helpers.h | 151 ++++++-
include/net/page_pool/types.h | 55 +++
include/net/sock.h | 2 +
include/net/tcp.h | 5 +-
include/uapi/asm-generic/socket.h | 6 +
include/uapi/linux/netdev.h | 10 +
include/uapi/linux/uio.h | 10 +
net/core/datagram.c | 6 +
net/core/dev.c | 240 +++++++++++
net/core/gro.c | 7 +-
net/core/netdev-genl-gen.c | 14 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 118 +++++
net/core/page_pool.c | 209 +++++++--
net/core/skbuff.c | 80 +++-
net/core/sock.c | 36 ++
net/ipv4/tcp.c | 205 ++++++++-
net/ipv4/tcp_input.c | 13 +-
net/ipv4/tcp_ipv4.c | 7 +
net/ipv4/tcp_output.c | 5 +-
net/packet/af_packet.c | 4 +-
tools/include/uapi/linux/netdev.h | 10 +
tools/net/ynl/generated/netdev-user.c | 42 ++
tools/net/ynl/generated/netdev-user.h | 47 ++
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 5 +
tools/testing/selftests/net/ncdevmem.c | 546 ++++++++++++++++++++++++
32 files changed, 1950 insertions(+), 64 deletions(-)
create mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.42.0.869.gea05f2083d-goog
Regressions that cause a device to no longer be probed by a driver can
have a big impact on the platform's functionality, and despite being
relatively common there isn't currently any generic test to detect them.
As an example, bootrr [1] does test for device probe, but it requires
defining the expected probed devices for each platform.
Given that the Devicetree already provides a static description of
devices on the system, it is a good basis for building such a test on
top.
This series introduces a test to catch regressions that prevent devices
from probing.
Patches 1 and 2 extend the existing dt-extract-compatibles to be able to
output only the compatibles that can be expected to match a Devicetree
node to a driver. Patch 2 adds a kselftest that walks over the
Devicetree nodes on the current platform and compares the compatibles to
the ones on the list, and on an ignore list, to point out devices that
failed to be probed.
A compatible list is needed because not all compatibles that can show up
in a Devicetree node can be used to match to a driver, for example the
code for that compatible might use "OF_DECLARE" type macros and avoid
the driver framework, or the node might be controlled by a driver that
was bound to a different node.
An ignore list is needed for the few cases where it's common for a
driver to match a device but not probe, like for the "simple-mfd"
compatible, where the driver only probes if that compatible is the
node's first compatible.
The reason for parsing the kernel source instead of relying on
information exposed by the kernel at runtime (say, looking at modaliases
or introducing some other mechanism), is to be able to catch issues
where a config was renamed or a driver moved across configs, and the
.config used by the kernel not updated accordingly. We need to parse the
source to find all compatibles present in the kernel independent of the
current config being run.
[1] https://github.com/kernelci/bootrr
Changes in v3:
- Added DT selftest path to MAINTAINERS
- Enabled device probe test for nodes with 'status = "ok"'
- Added pass/fail/skip totals to end of test output
Changes in v2:
- Extended dt-extract-compatibles script to be able to extract driver
matching compatibles, instead of adding a new one in Coccinelle
- Made kselftest output in the KTAP format
Nícolas F. R. A. Prado (3):
dt: dt-extract-compatibles: Handle cfile arguments in generator
function
dt: dt-extract-compatibles: Add flag for driver matching compatibles
kselftest: Add new test for detecting unprobed Devicetree devices
MAINTAINERS | 1 +
scripts/dtc/dt-extract-compatibles | 74 +++++++++++++----
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/dt/.gitignore | 1 +
tools/testing/selftests/dt/Makefile | 21 +++++
.../selftests/dt/compatible_ignore_list | 1 +
tools/testing/selftests/dt/ktap_helpers.sh | 70 ++++++++++++++++
.../selftests/dt/test_unprobed_devices.sh | 83 +++++++++++++++++++
8 files changed, 236 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/dt/.gitignore
create mode 100644 tools/testing/selftests/dt/Makefile
create mode 100644 tools/testing/selftests/dt/compatible_ignore_list
create mode 100644 tools/testing/selftests/dt/ktap_helpers.sh
create mode 100755 tools/testing/selftests/dt/test_unprobed_devices.sh
--
2.42.0
Changelog:
v8:
* Fixed a couple of build errors in the case of !CONFIG_MEMCG
* Simplified the online memcg selection scheme for the zswap global
limit reclaim (suggested by Michal Hocko and Johannes Weiner)
(patch 2 and patch 3)
* Added a new kconfig to allows users to enable zswap shrinker by
default. (suggested by Johannes Weiner) (patch 6)
v7:
* Added the mem_cgroup_iter_online() function to the API for the new
behavior (suggested by Andrew Morton) (patch 2)
* Fixed a missing list_lru_del -> list_lru_del_obj (patch 1)
v6:
* Rebase on top of latest mm-unstable.
* Fix/improve the in-code documentation of the new list_lru
manipulation functions (patch 1)
v5:
* Replace reference getting with an rcu_read_lock() section for
zswap lru modifications (suggested by Yosry)
* Add a new prep patch that allows mem_cgroup_iter() to return
online cgroup.
* Add a callback that updates pool->next_shrink when the cgroup is
offlined (suggested by Yosry Ahmed, Johannes Weiner)
v4:
* Rename list_lru_add to list_lru_add_obj and __list_lru_add to
list_lru_add (patch 1) (suggested by Johannes Weiner and
Yosry Ahmed)
* Some cleanups on the memcg aware LRU patch (patch 2)
(suggested by Yosry Ahmed)
* Use event interface for the new per-cgroup writeback counters.
(patch 3) (suggested by Yosry Ahmed)
* Abstract zswap's lruvec states and handling into
zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encounter a memcg
that is not online for the global limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Removed a redundant zswap pool getting (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix loongarch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write them out and
improved the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.
Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest
Nhat Pham (3):
list_lru: allows explicit memcg and NUMA node selection
memcontrol: implement mem_cgroup_tryget_online()
zswap: shrinks zswap pool based on memory pressure
Documentation/admin-guide/mm/zswap.rst | 10 +
drivers/android/binder_alloc.c | 7 +-
fs/dcache.c | 8 +-
fs/gfs2/quota.c | 6 +-
fs/inode.c | 4 +-
fs/nfs/nfs42xattr.c | 8 +-
fs/nfsd/filecache.c | 4 +-
fs/xfs/xfs_buf.c | 6 +-
fs/xfs/xfs_dquot.c | 2 +-
fs/xfs/xfs_qm.c | 2 +-
include/linux/list_lru.h | 54 ++-
include/linux/memcontrol.h | 15 +
include/linux/mmzone.h | 2 +
include/linux/vm_event_item.h | 1 +
include/linux/zswap.h | 27 +-
mm/Kconfig | 14 +
mm/list_lru.c | 48 ++-
mm/memcontrol.c | 3 +
mm/mmzone.c | 1 +
mm/swap.h | 3 +-
mm/swap_state.c | 26 +-
mm/vmstat.c | 1 +
mm/workingset.c | 4 +-
mm/zswap.c | 456 +++++++++++++++++---
tools/testing/selftests/cgroup/test_zswap.c | 74 ++--
25 files changed, 661 insertions(+), 125 deletions(-)
base-commit: 5cdba94229e58a39ca389ad99763af29e6b0c5a5
--
2.34.1
Hi all,
Here's v2 series to improve resctrl selftests with generalized test
framework and rewritten CAT test. In contrast to v1, this version does
not include L2 CAT test because it needs further work. In general, I
feel that v2 is in much better shape than v1 because I ended up
addressing a few small things beyond what came up during v1 review.
The series contains following improvements:
- Excludes shareable bits from CAT test allocation to avoid interference
- Replaces file "sink" with a volatile variable
- Alters read pattern to defeat HW prefetcher optimizations
- Rewrites CAT test to make the CAT test reliable and truly measure
if CAT is working or not
- Introduces generalized test framework making easier to add new tests
- Lots of other cleanups & refactoring
This series have been tested across a large number of systems from
different generations.
v2:
- Postpone adding L2 CAT test as more investigations are necessary
- Add patch to remove ctrlc_handler() from wrong place
- Improvements to changelogs
- Function comments improvements & comment cleanups
- Move some parts of the changes into more logical patch
- If checks: buf == NULL -> !buf
- Variable naming:
- p -> buf
- cbm_mask_path -> cbm_path
- Function naming:
- get_cbm_mask() -> get_full_cbm()
- cache_size() -> cache_portion_size()
- Use PATH_MAX
- Improved cache_portion_size() parameter names
- int count -> unsigned int
- Pass filename to measurement taking functions instead of
resctrl_val_param
- !lines ? : reversal
- Removed bogus static from function local variable
- Open perf fd only once, reset & enable in the innermost test loop
- Add perf fd ioctl() error handling
- Add patch to change compiler optimization prevention "sink" from file
to volatile variable
- Remove cpu_no and resource (the latter was added in v1) members from
resctrl_val_param (pass uparams and test where those are needed)
- Removed ARRAY_SIZE() macro
- Add patch to rename "resource_id" to "domain_id"
Ilpo Järvinen (26):
selftests/resctrl: Don't use ctrlc_handler() outside signal handling
selftests/resctrl: Split fill_buf to allow tests finer-grained control
selftests/resctrl: Refactor fill_buf functions
selftests/resctrl: Refactor get_cbm_mask() and rename to
get_full_cbm()
selftests/resctrl: Mark get_cache_size() cache_type const
selftests/resctrl: Create cache_portion_size() helper
selftests/resctrl: Exclude shareable bits from schemata in CAT test
selftests/resctrl: Split measure_cache_vals()
selftests/resctrl: Split show_cache_info() to test specific and
generic parts
selftests/resctrl: Remove unnecessary __u64 -> unsigned long
conversion
selftests/resctrl: Remove nested calls in perf event handling
selftests/resctrl: Consolidate naming of perf event related things
selftests/resctrl: Improve perf init
selftests/resctrl: Convert perf related globals to locals
selftests/resctrl: Move cat_val() to cat_test.c and rename to
cat_test()
selftests/resctrl: Open perf fd before start & add error handling
selftests/resctrl: Replace file write with volatile variable
selftests/resctrl: Read in less obvious order to defeat prefetch
optimizations
selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test
selftests/resctrl: Create struct for input parameters
selftests/resctrl: Introduce generalized test framework
selftests/resctrl: Pass write_schemata() resource instead of test name
selftests/resctrl: Add helper to convert L2/3 to integer
selftests/resctrl: Rename resource ID to domain ID
selftests/resctrl: Get domain id from cache id
selftests/resctrl: Add test groups and name L3 CAT test L3_CAT
tools/testing/selftests/resctrl/cache.c | 274 +++++----------
tools/testing/selftests/resctrl/cat_test.c | 328 +++++++++++-------
tools/testing/selftests/resctrl/cmt_test.c | 76 +++-
tools/testing/selftests/resctrl/fill_buf.c | 132 ++++---
tools/testing/selftests/resctrl/mba_test.c | 26 +-
tools/testing/selftests/resctrl/mbm_test.c | 28 +-
tools/testing/selftests/resctrl/resctrl.h | 117 +++++--
.../testing/selftests/resctrl/resctrl_tests.c | 205 +++++------
tools/testing/selftests/resctrl/resctrl_val.c | 56 +--
tools/testing/selftests/resctrl/resctrlfs.c | 246 ++++++++-----
10 files changed, 821 insertions(+), 667 deletions(-)
--
2.30.2
Hi Reinette, Fenghua,
This series introduces a new mount option enabling an alternate mode for
MBM to work around an issue on present AMD implementations and any other
resctrl implementation where there are more RMIDs (or equivalent) than
hardware counters.
The L3 External Bandwidth Monitoring feature of the AMD PQoS
extension[1] only guarantees that RMIDs currently assigned to a
processor will be tracked by hardware. The counters of any other RMIDs
which are no longer being tracked will be reset to zero. The MBM event
counters return "Unavailable" to indicate when this has happened.
An interval for effectively measuring memory bandwidth typically needs
to be multiple seconds long. In Google's workloads, it is not feasible
to bound the number of jobs with different RMIDs which will run in a
cache domain over any period of time. Consequently, on a
fully-committed system where all RMIDs are allocated, few groups'
counters return non-zero values.
To demonstrate the underlying issue, the first patch provides a test
case in tools/testing/selftests/resctrl/test_rmids.sh.
On an AMD EPYC 7B12 64-Core Processor with the default behavior:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g2: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g3: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
[..]
g238: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g239: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g240: mbm_total_bytes: Unavailable -> Unavailable (FAIL)
g241: mbm_total_bytes: Unavailable -> 660497472
g242: mbm_total_bytes: Unavailable -> 660793344
g243: mbm_total_bytes: Unavailable -> 660477312
g244: mbm_total_bytes: Unavailable -> 660495360
g245: mbm_total_bytes: Unavailable -> 660775360
g246: mbm_total_bytes: Unavailable -> 660645504
g247: mbm_total_bytes: Unavailable -> 660696128
g248: mbm_total_bytes: Unavailable -> 660605248
g249: mbm_total_bytes: Unavailable -> 660681280
g250: mbm_total_bytes: Unavailable -> 660834240
g251: mbm_total_bytes: Unavailable -> 660440064
g252: mbm_total_bytes: Unavailable -> 660501504
g253: mbm_total_bytes: Unavailable -> 660590720
g254: mbm_total_bytes: Unavailable -> 660548352
g255: mbm_total_bytes: Unavailable -> 660607296
255 groups, 0 returned counts in first pass, 15 in second
successfully measured bandwidth from 15/255 groups
To compare, here is the output from an Intel(R) Xeon(R) Platinum 8173M
CPU:
# ./test_rmids.sh
Created 223 monitoring groups.
g1: mbm_total_bytes: 0 -> 606126080
g2: mbm_total_bytes: 0 -> 613236736
g3: mbm_total_bytes: 0 -> 610254848
[..]
g221: mbm_total_bytes: 0 -> 584679424
g222: mbm_total_bytes: 0 -> 588808192
g223: mbm_total_bytes: 0 -> 587317248
223 groups, 223 returned counts in first pass, 223 in second
successfully measured bandwidth from 223/223 groups
To make better use of the hardware in such a use case, this patchset
introduces a "soft" RMID implementation, where each CPU is permanently
assigned a "hard" RMID. On context switches which change the current
soft RMID, the difference between each CPU's current event counts and
most recent counts is added to the totals for the current or outgoing
soft RMID.
This technique does not work for cache occupancy counters, so this patch
series disables cache occupancy events when soft RMIDs are enabled.
This series adds the "mbm_soft_rmid" mount option to allow users to
opt-in to the functionaltiy when they deem it helpful.
When the same system from the earlier AMD example enables the
mbm_soft_rmid mount option:
# ./test_rmids.sh
Created 255 monitoring groups.
g1: mbm_total_bytes: 0 -> 686560576
g2: mbm_total_bytes: 0 -> 668204416
[..]
g252: mbm_total_bytes: 0 -> 672651200
g253: mbm_total_bytes: 0 -> 666956800
g254: mbm_total_bytes: 0 -> 665917056
g255: mbm_total_bytes: 0 -> 671049600
255 groups, 255 returned counts in first pass, 255 in second
successfully measured bandwidth from 255/255 groups
(patches are based on tip/master)
[1] https://www.amd.com/system/files/TechDocs/56375_1.03_PUB.pdf
Peter Newman (8):
selftests/resctrl: Verify all RMIDs count together
x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
x86/resctrl: Flush MBM event counts on soft RMID change
x86/resctrl: Call mon_event_count() directly for soft RMIDs
x86/resctrl: Create soft RMID version of __mon_event_count()
x86/resctrl: Assign HW RMIDs to CPUs for soft RMID
x86/resctrl: Use mbm_update() to push soft RMID counts
x86/resctrl: Add mount option to enable soft RMID
Stephane Eranian (1):
x86/resctrl: Hold a spinlock in __rmid_read() on AMD
arch/x86/include/asm/resctrl.h | 29 +++-
arch/x86/kernel/cpu/resctrl/core.c | 80 ++++++++-
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 9 +-
arch/x86/kernel/cpu/resctrl/internal.h | 19 ++-
arch/x86/kernel/cpu/resctrl/monitor.c | 158 +++++++++++++++++-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 ++++++
tools/testing/selftests/resctrl/test_rmids.sh | 93 +++++++++++
7 files changed, 425 insertions(+), 15 deletions(-)
create mode 100755 tools/testing/selftests/resctrl/test_rmids.sh
base-commit: dd806e2f030e57dd5bac973372aa252b6c175b73
--
2.40.0.634.g4ca3ef3211-goog
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") fixed a wild-memory-access bug that could have
happened during the loading phase of test suites built and executed as
loadable modules. However, it also introduced a problematic side effect
that causes test suites modules to crash when they attempt to register
fake devices.
When a module is loaded, it traverses the MODULE_STATE_UNFORMED and
MODULE_STATE_COMING states before reaching the normal operating state
MODULE_STATE_LIVE. Finally, when the module is removed, it moves to
MODULE_STATE_GOING before being released. However, if the loading
function load_module() fails between complete_formation() and
do_init_module(), the module goes directly from MODULE_STATE_COMING to
MODULE_STATE_GOING without passing through MODULE_STATE_LIVE.
This behavior was causing kunit_module_exit() to be called without
having first executed kunit_module_init(). Since kunit_module_exit() is
responsible for freeing the memory allocated by kunit_module_init()
through kunit_filter_suites(), this behavior was resulting in a
wild-memory-access bug.
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") fixed this issue by running the tests when the
module is still in MODULE_STATE_COMING. However, modules in that state
are not fully initialized, lacking sysfs kobjects. Therefore, if a test
module attempts to register a fake device, it will inevitably crash.
This patch proposes a different approach to fix the original
wild-memory-access bug while restoring the normal module execution flow
by making kunit_module_exit() able to detect if kunit_module_init() has
previously initialized the tests suite set. In this way, test modules
can once again register fake devices without crashing.
This behavior is achieved by checking whether mod->kunit_suites is a
virtual or direct mapping address. If it is a virtual address, then
kunit_module_init() has allocated the suite_set in kunit_filter_suites()
using kmalloc_array(). On the contrary, if mod->kunit_suites is still
pointing to the original address that was set when looking up the
.kunit_test_suites section of the module, then the loading phase has
failed and there's no memory to be freed.
v2:
- add include <linux/mm.h>
Fixes: 2810c1e99867 ("kunit: Fix wild-memory-access bug in kunit_free_suite_set()")
Signed-off-by: Marco Pagani <marpagan(a)redhat.com>
---
lib/kunit/test.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index f2eb71f1a66c..0e829b9f8ce5 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -16,6 +16,7 @@
#include <linux/panic.h>
#include <linux/sched/debug.h>
#include <linux/sched.h>
+#include <linux/mm.h>
#include "debugfs.h"
#include "hooks-impl.h"
@@ -737,12 +738,14 @@ static void kunit_module_exit(struct module *mod)
};
const char *action = kunit_action();
+ if (!suite_set.start || !virt_addr_valid(suite_set.start))
+ return;
+
if (!action)
__kunit_test_suites_exit(mod->kunit_suites,
mod->num_kunit_suites);
- if (suite_set.start)
- kunit_free_suite_set(suite_set);
+ kunit_free_suite_set(suite_set);
}
static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
@@ -752,12 +755,12 @@ static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
switch (val) {
case MODULE_STATE_LIVE:
+ kunit_module_init(mod);
break;
case MODULE_STATE_GOING:
kunit_module_exit(mod);
break;
case MODULE_STATE_COMING:
- kunit_module_init(mod);
break;
case MODULE_STATE_UNFORMED:
break;
base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
--
2.42.0
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to lack of use cases where this approach would
be better than allocating a new page and copying the contents. Previous
upstraming attempts could be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
TODOs for follow-up improvements:
- cross-mm support. Known differences from single-mm and missing pieces:
- memcg recharging (might need to isolate pages in the process)
- mm counters
- cross-mm deposit table moves
- cross-mm test
- document the address space where src and dest reside in struct
uffdio_move
- TLB flush batching. Will require extensive changes to PTL locking in
move_pages_pte(). OTOH that might let us reuse parts of mremap code.
Changes since v4 [9]:
- added Acked-by in patch 1, per Peter Xu
- added description for ctx, mm and mode parameters of move_pages(),
per kernel test robot
- added Reviewed-by's, per Peter Xu and Axel Rasmussen
- removed unused operations in uffd_test_case_ops
- refactored uffd-unit-test changes to avoid using global variables and
handle pmd moves without page size overrides, per Peter Xu
Changes since v3 [8]:
- changed retry path in folio_lock_anon_vma_read() to unlock and then
relock RCU, per Peter Xu
- removed cross-mm support from initial patchset, per David Hildenbrand
- replaced BUG_ONs with VM_WARN_ON or WARN_ON_ONCE, per David Hildenbrand
- added missing cache flushing, per Lokesh Gidra and Peter Xu
- updated manpage text in the patch description, per Peter Xu
- renamed internal functions from "remap" to "move", per Peter Xu
- added mmap_changing check after taking mmap_lock, per Peter Xu
- changed uffd context check to ensure dst_mm is registered onto uffd we
are operating on, Peter Xu and David Hildenbrand
- changed to non-maybe variants of maybe*_mkwrite(), per David Hildenbrand
- fixed warning for CONFIG_TRANSPARENT_HUGEPAGE=n, per kernel test robot
- comments cleanup, per David Hildenbrand and Peter Xu
- checks for VM_IO,VM_PFNMAP,VM_HUGETLB,..., per David Hildenbrand
- prevent moving pinned pages, per Peter Xu
- changed uffd tests to call move uffd_test_ctx_clear() at the end of the
test run instead of in the beginning of the next run
- added support for testcase-specific ops
- added test for moving PMD-aligned blocks
Changes since v2 [5]:
- renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand
- rebase over mm-unstable to use folio_move_anon_rmap(),
per David Hildenbrand
- added text for manpage explaining DONTFORK and KSM requirements for this
feature, per David Hildenbrand
- check for anon_vma changes in the fast path of folio_lock_anon_vma_read,
per Peter Xu
- updated the title and description of the first patch,
per David Hildenbrand
- updating comments in folio_lock_anon_vma_read() explaining the need for
anon_vma checks, per David Hildenbrand
- changed all mapcount checks to PageAnonExclusive, per Jann Horn and
David Hildenbrand
- changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS,
per Jann Horn
- added a check for PTE change after folio is locked in remap_pages_pte(),
per Jann Horn
- added handling of PMD migration entries and bailout when pmd_devmap(),
per Jann Horn
- added checks to ensure both src and dst VMAs are writable, per Peter Xu
- added UFFD_FEATURE_MOVE, per Peter Xu
- removed obsolete comments, per Peter Xu
- renamed remap_anon_pte to remap_present_pte, per Peter Xu
- added a comment for folio_get_anon_vma() explaining the need for
anon_vma checks, per Peter Xu
- changed error handling in remap_pages() to make it more clear,
per Peter Xu
- changed EFAULT to EAGAIN to retry when a hugepage appears or disappears
from under us, per Peter Xu
- added links to previous upstreaming attempts, per David Hildenbrand
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when source
page or anon_vma is contended. Since pte_offset_map_nolock() start RCU
read section, we can't block anymore after mapping a pte, so have to unmap
the ptesm do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code now removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_MOVE
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
[5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/
[6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh…
[7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed…
[8] https://lore.kernel.org/all/20231009064230.2952396-1-surenb@google.com/
[9] https://lore.kernel.org/all/20231028003819.652322-1-surenb@google.com/
Andrea Arcangeli (2):
mm/rmap: support move to different root anon_vma in
folio_move_anon_rmap()
userfaultfd: UFFDIO_MOVE uABI
Suren Baghdasaryan (3):
selftests/mm: call uffd_test_ctx_clear at the end of the test
selftests/mm: add uffd_test_case_ops to allow test case-specific
operations
selftests/mm: add UFFDIO_MOVE ioctl test
Documentation/admin-guide/mm/userfaultfd.rst | 3 +
fs/userfaultfd.c | 72 +++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 11 +
include/uapi/linux/userfaultfd.h | 29 +-
mm/huge_memory.c | 122 ++++
mm/khugepaged.c | 3 +
mm/rmap.c | 30 +
mm/userfaultfd.c | 599 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 39 +-
tools/testing/selftests/mm/uffd-common.h | 9 +
tools/testing/selftests/mm/uffd-stress.c | 5 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 192 ++++++
13 files changed, 1115 insertions(+), 4 deletions(-)
--
2.43.0.rc1.413.gea7ed67945-goog
Fix an out of bounds access reported in
https://lore.kernel.org/oe-lkp/202311201431.57aae8f3-oliver.sang@intel.com
Make sure that the ctl_table header size is greater than 0 before
evaluating if a ctl_table is permanently empty; this evaluation accesses
the first element regardless of the size. Adjusted the ctl_table_size of
Permanently empty ctl_table registers to 1 as they the check now
requires them to have size greater than 0. Added tests for empty
directory handling; in response to the path followed by empty ctl_tables
changing slightly. Clarified the results of sysctl self tests to more
easily identify which ones are OK, Skipped and Failed.
Comments are greatly appreciated
Best
Signed-off-by: Joel Granados <j.granados(a)samsung.com>
---
Joel Granados (3):
sysctl: Fix out of bounds access for empty sysctl registers
sysctl: Add a selftest for handling empty dirs
sysclt: Clarify the results of selftest run
fs/proc/proc_sysctl.c | 9 +-
lib/test_sysctl.c | 29 ++++++
tools/testing/selftests/sysctl/sysctl.sh | 146 ++++++++++++++++++-------------
3 files changed, 121 insertions(+), 63 deletions(-)
---
base-commit: 8b793bcda61f6c3ed4f5b2ded7530ef6749580cb
change-id: 20231121-jag-fix_out_of_bounds_insert-86380d1bc95e
Best regards,
--
Joel Granados <j.granados(a)samsung.com>