Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages of address translations to get access
to the physical address. A stage-1 translation table is owned by userspace
(e.g. by a guest OS), while a stage-2 is owned by kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is guest I/O
page table. As the below diagram shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to host and be used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain means the same thing if no
special note. IOMMUFD has already supported allocating hw_pagetable linked
with an IOAS. However, a nesting case requires IOMMUFD to allow allocating
hw_pagetable with driver specific parameters and interface to sync stage-1
IOTLB as user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate IOMMU_DOMAIN_NESTED. And it then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID.
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3], it is on top of Joao's dirty page tracking
v6 series and fix patches. QEMU could can be found in [4].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Rebase on top of Joao's dirty tracking series:
https://lore.kernel.org/linux-iommu/20231024135109.73787-1-joao.m.martins@o…
- Rebase on top of the enforce_cache_coherency removal patch:
https://lore.kernel.org/linux-iommu/ZTcAhwYjjzqM0A5M@Asurada-Nvidia/
- Add parent and user_data check in iommu driver before the driver actually
supports the two input. This can make better bisect support, the change is
in patch 02.
v5: https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 157 +++++++---
drivers/iommu/iommufd/hw_pagetable.c | 271 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 73 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 151 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 72 ++++-
include/uapi/linux/iommufd.h | 31 +-
tools/testing/selftests/iommu/iommufd.c | 120 ++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 31 +-
13 files changed, 768 insertions(+), 182 deletions(-)
--
2.34.1
Clang uses a different set of CLI args for coverage, and the output
needs to be processed by a different set of tools.
Update the Makefile and add an example of usage in kunit docs.
Michał Winiarski (2):
arch: um: Add Clang coverage support
Documentation: kunit: Add clang UML coverage example
Documentation/dev-tools/kunit/running_tips.rst | 11 +++++++++++
arch/um/Makefile-skas | 5 +++++
2 files changed, 16 insertions(+)
--
2.42.0
In some conditions, background processes in udpgro don't have enough
time to set up the sockets. When foreground processes start, this
results in the bad GRO lookup test freezing or reporting that it
received 0 gro segments.
To fix this, increase the time given to background processes to complete
the startup before foreground processes start.
This is the same issue and the same fix as posted by Adrien Therry.
Link: https://lore.kernel.org/all/20221101184809.50013-1-athierry@redhat.com/
Signed-off-by: Lucas Karpinski <lkarpins(a)redhat.com>
---
tools/testing/selftests/net/udpgro.sh | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh
index 0c743752669a..4ccbcb2390ad 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -97,7 +97,8 @@ run_one_nat() {
echo "ok" || \
echo "failed"&
- sleep 0.1
+ # Hack: let bg programs complete the startup
+ sleep 0.2
./udpgso_bench_tx ${tx_args}
ret=$?
kill -INT $pid
--
2.41.0
Change ifconfig with ip command,
on a system where ifconfig is
not used this script will not
work correcly.
Test result with this patchset:
sudo make TARGETS="net" kselftest
....
TAP version 13
1..1
timeout set to 1500
selftests: net: route_localnet.sh
run arp_announce test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_announce = 2
net.ipv4.conf.veth1.arp_announce = 2
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4073ms
rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
ok
run arp_ignore test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_ignore = 3
net.ipv4.conf.veth1.arp_ignore = 3
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4092ms
rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
ok
ok 1 selftests: net: route_localnet.sh
...
Signed-off-by: Swarup Laxman Kotiaklapudi <swarupkotikalapudi(a)gmail.com>
---
tools/testing/selftests/net/route_localnet.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/route_localnet.sh b/tools/testing/selftests/net/route_localnet.sh
index 116bfeab72fa..3ab9beb4462c 100755
--- a/tools/testing/selftests/net/route_localnet.sh
+++ b/tools/testing/selftests/net/route_localnet.sh
@@ -18,8 +18,10 @@ setup() {
ip route del 127.0.0.0/8 dev lo table local
ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
- ifconfig veth0 127.25.3.4/24 up
- ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
+ ip a add 127.25.3.4/24 dev veth0
+ ip link set dev veth0 up
+ ip netns exec "${PEER_NS}" ip a add 127.25.3.14/24 dev veth1
+ ip netns exec "${PEER_NS}" ip link set dev veth1 up
ip route flush cache
ip netns exec "${PEER_NS}" ip route flush cache
--
2.34.1
On Fri, Oct 20, 2023 at 06:21:40AM +0200, Nicolas Schier wrote:
> On Thu, Oct 19, 2023 at 03:50:05PM -0300, Marcos Paulo de Souza wrote:
> > On Sat, Oct 14, 2023 at 05:35:55PM +0900, Masahiro Yamada wrote:
> > > On Tue, Oct 10, 2023 at 5:43 AM Marcos Paulo de Souza <mpdesouza(a)suse.de> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I found an issue while moving the livepatch kselftest modules to be built on the
> > > > fly, instead of building them on kernel building.
> > > >
> > > > If, for some reason, there is a recursive make invocation that starts from the
> > > > top level Makefile and in the leaf Makefile it tries to build a module (using M=
> > > > in the make invocation), it doesn't produce the module. This happens because the
> > > > toplevel Makefile checks for M= only once. This is controlled by the
> > > > sub_make_done variable, which is exported after checking the command line
> > > > options are passed to the top level Makefile. Once this variable is set it's
> > > > the M= setting is never checked again on the recursive call.
> > > >
> > > > This can be observed when cleaning the bpf kselftest dir. When calling
> > > >
> > > > $ make TARGETS="bpf" SKIP_TARGETS="" kselftest-clean
> > > >
> > > > What happens:
> > > >
> > > > 1. It checks for some command line settings (like M=) was passed (it wasn't),
> > > > set some definitions and exports sub_make_done.
> > > >
> > > > 2. Jump into tools/testing/selftests/bpf, and calls the clean target.
> > > >
> > > > 3. The clean target is overwritten to remove some files and then jump to
> > > > bpf_testmod dir and call clean there
> > > >
> > > > 4. On bpf_testmod/Makefile, the clean target will execute
> > > > $(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) clean
> > > >
> > > > 5. The KDIR is to toplevel dir. The top Makefile will check that sub_make_done was
> > > > already set, ignoring the M= setting.
> > > >
> > > > 6. As M= wasn't checked, KBUILD_EXTMOD isn't set, and the clean target applies
> > > > to the kernel as a whole, making it clean all generated code/objects and
> > > > everything.
> > > >
> > > > One way to avoid it is to call "unexport sub_make_done" on
> > > > tools/testing/selftests/bpf/bpf_testmod/Makefile before processing the all
> > > > target, forcing the toplevel Makefile to process the M=, producing the module
> > > > file correctly.
> > > >
> > > > If the M=dir points to /lib/modules/.../build, then it fails with "m2c: No such
> > > > file", which I already reported here[1]. At the time this problem was treated
> > > > like a problem with kselftest infrastructure.
> > > >
> > > > Important: The process works fine if the initial make invocation is targeted to a
> > > > different directory (using -C), since it doesn't goes through the toplevel
> > > > Makefile, and sub_make_done variable is not set.
> > > >
> > > > I attached a minimal reproducer, that can be used to better understand the
> > > > problem. The "make testmod" and "make testmod-clean" have the same effect that
> > > > can be seem with the bpf kselftests. There is a unexport call commented on
> > > > test-mods/Makefile, and once that is called the process works as expected.
> > > >
> > > > Is there a better way to fix this? Is this really a problem, or am I missing
> > > > something?
> > >
> > >
> > > Or, using KBUILD_EXTMOD will work too.
> >
> > Yes, that works, only if set to /lib/modules:
> >
> > $ make kselftest TARGETS=bpf SKIP_TARGETS=""
> > make[3]: Entering directory '/home/mpdesouza/git/linux/tools/testing/selftests/bpf'
> > MOD bpf_testmod.ko
> > warning: the compiler differs from the one used to build the kernel
> > The kernel was built by: gcc (SUSE Linux) 13.2.1 20230803 [revision cc279d6c64562f05019e1d12d0d825f9391b5553]
> > You are using: gcc (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
> > CC [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.o
> > MODPOST /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/Module.symvers
> > CC [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.mod.o
> > LD [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko
> > BTF [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko
> > Skipping BTF generation for /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko due to unavailability of vmlinux
> > BINARY xdp_synproxy
> > ...
> >
> > But if we set the KBUILD_EXTMOD to toplevel Makefile, it fails with a different
> > strange issue:
> >
> > $ make kselftest TARGETS=bpf SKIP_TARGETS=""
> > BINARY urandom_read
> > MOD bpf_testmod.ko
> > m2c -o scripts/Makefile.build -e scripts/Makefile.build scripts/Makefile.build.mod
> > make[6]: m2c: No such file or directory
> > make[6]: *** [<builtin>: scripts/Makefile.build] Error 127
> > make[5]: *** [Makefile:1913: /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod] Error 2
> > make[4]: *** [Makefile:19: all] Error 2
> > make[3]: *** [Makefile:229: /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod.ko] Error 2
> > make[3]: Leaving directory '/home/mpdesouza/git/linux/tools/testing/selftests/bpf'
> > make[2]: *** [Makefile:175: all] Error 2
> > make[1]: *** [/home/mpdesouza/git/linux/Makefile:1362: kselftest] Error 2
> >
> > I attached a patch that can reproduce the case where it works, and the case
> > where it doesn't by changing the value of KDIR.
> >
> > I understand that KBUILD_EXTMOD, as the name implies, was designed to build
> > "external" modules, and not ones that live inside kernel, but how could this be
> > solved?
>
> It seems to me as if there is some confusion about in-tree vs.
> out-of-tree kmods.
>
> KBUILD_EXTMOD and M are almost the same and indicate that you want to
> build _external_ (=out-of-tree) kernel modules. In-tree modules are
> only those that stay in-tree _and_ are built along with the kernel.
> Thus, 'make modules KBUILD_EXTMOD=fs/ext4' could be used to build ext4
> kmod as "out-of-tree" kernel module, that even taints the kernel if it
> gets loaded.
>
> If you want bpf_testmod.ko to be an in-tree kmod, it has to be build
> during the usual kernel build, not by running 'make kselftest'.
>
> If you use 'make -C $(KDIR)' for building out-of-tree kmods, KDIR has to
> point to the kernel build directory. (Or it may point to the source
> tree if you give O=$(BUILDDIR) as well).
Thanks for the explanation Nicolas. In this, I believe that the BPF module
should be moved into lib/, like lib/livepatch, when then be built along with
other in-tree modules.
Currently there is a bug when running the kselftests-clean target with bpf:
make kselftest-clean TARGETS=bpf SKIP_TARGETS=""
As the M= argument is ignore on the toplevel Makefile, this make invocation
applies the clean to all built kernel objects/modules/everything, which is bug
IMO.
There is a statement in the BPF docs saying that the selftests should be run
inside the tools/testing/selftests/bpf directory. At the same time, kselftests
should comply with all the targets defined in the documention, like gen_tar, and
run_tests. In this case should the build process be fixed, or just make
kselftests less restrict?
(CCing kselftests and bpf ML)
>
> HTH.
>
> Kind regards,
> Nicolas
>
>
> > For the sake of my initial about livepatch kselftests, KBUILD_EXTMOD
> > will suffice, since we will target /lib/modules, but I would like to know what
> > we can do in this case. Do you have other suggestions?
> >
> > Thanks in advance,
> > Marcos
> >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Masahiro Yamada
>
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/Makefile b/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > index 15cb36c4483a..1dce76f35405 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > @@ -1,5 +1,6 @@
> > BPF_TESTMOD_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
> > -KDIR ?= $(abspath $(BPF_TESTMOD_DIR)/../../../../..)
> > +#KDIR ?= $(abspath $(BPF_TESTMOD_DIR)/../../../../..)
> > +KDIR ?= /lib/modules/$(shell uname -r)/build
> >
> > ifeq ($(V),1)
> > Q =
> > @@ -12,9 +13,10 @@ MODULES = bpf_testmod.ko
> > obj-m += bpf_testmod.o
> > CFLAGS_bpf_testmod.o = -I$(src)
> >
> > +export KBUILD_EXTMOD := $(BPF_TESTMOD_DIR)
> > +
> > all:
> > - +$(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) modules
> > + +$(Q)make -C $(KDIR) modules
> >
> > clean:
> > - +$(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) clean
> > -
> > + +$(Q)make -C $(KDIR) clean
>
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to lack of use cases where this approach would
be better than allocating a new page and copying the contents. Previous
upstraming attempts could be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when source
page or anon_vma is contended. Since pte_offset_map_nolock() start RCU
read section, we can't block anymore after mapping a pte, so have to unmap
the ptesm do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code now removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_MOVE
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
Changes since v2 [5]:
- renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand
- rebase over mm-unstable to use folio_move_anon_rmap(),
per David Hildenbrand
- added text for manpage explaining DONTFORK and KSM requirements for this
feature, per David Hildenbrand
- check for anon_vma changes in the fast path of folio_lock_anon_vma_read,
per Peter Xu
- updated the title and description of the first patch,
per David Hildenbrand
- updating comments in folio_lock_anon_vma_read() explaining the need for
anon_vma checks, per David Hildenbrand
- changed all mapcount checks to PageAnonExclusive, per Jann Horn and
David Hildenbrand
- changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS,
per Jann Horn
- added a check for PTE change after folio is locked in remap_pages_pte(),
per Jann Horn
- added handling of PMD migration entries and bailout when pmd_devmap(),
per Jann Horn
- added checks to ensure both src and dst VMAs are writable, per Peter Xu
- added UFFD_FEATURE_MOVE, per Peter Xu
- removed obsolete comments, per Peter Xu
- renamed remap_anon_pte to remap_present_pte, per Peter Xu
- added a comment for folio_get_anon_vma() explaining the need for
anon_vma checks, per Peter Xu
- changed error handling in remap_pages() to make it more clear,
per Peter Xu
- changed EFAULT to EAGAIN to retry when a hugepage appears or disappears
from under us, per Peter Xu
- added links to previous upstreaming attempts, per David Hildenbrand
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
[5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/
[6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh…
[7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed…
The patchset applies over mm-unstable.
Andrea Arcangeli (2):
mm/rmap: support move to different root anon_vma in
folio_move_anon_rmap()
userfaultfd: UFFDIO_MOVE uABI
Suren Baghdasaryan (1):
selftests/mm: add UFFDIO_MOVE ioctl test
Documentation/admin-guide/mm/userfaultfd.rst | 3 +
fs/userfaultfd.c | 63 ++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 12 +
include/uapi/linux/userfaultfd.h | 29 +-
mm/huge_memory.c | 138 +++++
mm/khugepaged.c | 3 +
mm/rmap.c | 30 +
mm/userfaultfd.c | 602 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 41 +-
tools/testing/selftests/mm/uffd-common.h | 1 +
tools/testing/selftests/mm/uffd-unit-tests.c | 62 ++
12 files changed, 986 insertions(+), 3 deletions(-)
--
2.42.0.609.gbb76f46606-goog