Hello,
RFC v2 addresses comments in RFC v1 [1]. This series is also rebased
on kvm/next (v6.15-rc4).
Here's the series stitched together for your convenience:
https://github.com/googleprodkernel/linux-cc/tree/kvm-gmem-link-migrate-rfc…
Changes from RFC v1:
+ Adds patches to make guest mem use guest mem inodes instead of
anonymous inodes.
+ Changed the name of factored out gmem allocating function to
kvm_gmem_alloc_view().
+ Changed the flag name vm_move_enc_ctxt_supported to
use_vm_enc_ctxt_op.
+ Various small changes to make patchset compatible with latest version
of kvm/next.
As a refresher, split file/inode model was proposed in guest_mem v11,
where memslot bindings belong to the file and pages belong to the inode.
This model lends itself well to having different VMs use separate files
pointing to the same inode.
The split file/inode model has also been used by the other following
recent patch series:
+ mmap support for guest_memfd: [2]
+ NUMA mempolicy support for guest_memfd: [3]
+ HugeTLB support for guest_memfd: [4]
This RFC proposes an ioctl, KVM_LINK_GUEST_MEMFD, that takes a VM and
a gmem fd, and returns another gmem fd referencing a different file
and associated with VM. This RFC also includes an update to
KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to migrate memory context
(slot->arch.lpage_info and kvm->mem_attr_array) from source to
destination vm, intra-host.
Intended usage of the two ioctls:
1. Source VM’s fd is passed to destination VM via unix sockets.
2. Destination VM uses new ioctl KVM_LINK_GUEST_MEMFD to link source
VM’s fd to a new fd.
3. Destination VM will pass new fds to KVM_SET_USER_MEMORY_REGION,
which will bind the new file, pointing to the same inode that the
source VM’s file points to, to memslots.
4. Use KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM to move kvm->mem_attr_array
and slot->arch.lpage_info to the destination VM.
5. Run the destination VM as per normal.
Some other approaches considered were:
+ Using the linkat() syscall, but that requires a mount/directory for
a source fd to be linked to
+ Using the dup() syscall, but that only duplicates the fd, and both
fds point to the same file
[1] https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-2-tabba@google.com/
[3] https://lore.kernel.org/all/20250408112402.181574-6-shivankg@amd.com/
[4] https://lore.kernel.org/all/c1ee659c212b5a8b0e7a7f4d1763699176dd3a62.174726…
---
Ackerley Tng (12):
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_mem: Refactor out kvm_gmem_alloc_view()
KVM: guest_mem: Add ioctl KVM_LINK_GUEST_MEMFD
KVM: selftests: Add tests for KVM_LINK_GUEST_MEMFD ioctl
KVM: selftests: Test transferring private memory to another VM
KVM: x86: Refactor sev's flag migration_in_progress to kvm struct
KVM: x86: Refactor common code out of sev.c
KVM: x86: Refactor common migration preparation code out of
sev_vm_move_enc_context_from
KVM: x86: Let moving encryption context be configurable
KVM: x86: Handle moving of memory context for intra-host migration
KVM: selftests: Generalize migration functions from
sev_migrate_tests.c
KVM: selftests: Add tests for migration of private mem
David Hildenbrand (1):
fs: Refactor to provide function that allocates a secure anonymous
inode
arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/kvm/svm/sev.c | 82 +------
arch/x86/kvm/svm/svm.h | 3 +-
arch/x86/kvm/x86.c | 218 ++++++++++++++++-
arch/x86/kvm/x86.h | 6 +
fs/anon_inodes.c | 23 +-
include/linux/fs.h | 13 +-
include/linux/kvm_host.h | 18 ++
include/uapi/linux/kvm.h | 8 +
include/uapi/linux/magic.h | 1 +
mm/secretmem.c | 9 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 43 ++++
.../testing/selftests/kvm/include/kvm_util.h | 31 +++
.../kvm/x86/private_mem_migrate_tests.c | 93 ++++++++
.../selftests/kvm/x86/sev_migrate_tests.c | 48 ++--
virt/kvm/guest_memfd.c | 225 +++++++++++++++---
virt/kvm/kvm_main.c | 17 +-
virt/kvm/kvm_mm.h | 14 +-
19 files changed, 697 insertions(+), 159 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/private_mem_migrate_tests.c
--
2.49.0.1101.gccaa498523-goog
Cover three recent cases:
1. missing ops locking for the lowers during netdev_sync_lower_features
2. missing locking for dev_set_promiscuity (plus netdev_ops_assert_locked
with a comment on why/when it's needed)
3. rcu lock during team_change_rx_flags
Verified that each one triggers when the respective fix is reverted.
Not sure about the placement, but since it all relies on teaming,
added to the teaming directory.
One ugly bit is that I add NETIF_F_LRO to netdevsim; there is no way
to trigger netdev_sync_lower_features without it.
Signed-off-by: Stanislav Fomichev <stfomichev(a)gmail.com>
---
drivers/net/netdevsim/netdev.c | 2 +
net/core/dev.c | 10 ++-
.../selftests/drivers/net/team/Makefile | 2 +-
.../testing/selftests/drivers/net/team/config | 1 +
.../selftests/drivers/net/team/propagation.sh | 79 +++++++++++++++++++
5 files changed, 92 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/team/propagation.sh
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 0e0321a7ddd7..3bd1f8cffee8 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -879,11 +879,13 @@ static void nsim_setup(struct net_device *dev)
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+ NETIF_F_LRO |
NETIF_F_TSO;
dev->hw_features |= NETIF_F_HW_TC |
NETIF_F_SG |
NETIF_F_FRAGLIST |
NETIF_F_HW_CSUM |
+ NETIF_F_LRO |
NETIF_F_TSO;
dev->max_mtu = ETH_MAX_MTU;
dev->xdp_features = NETDEV_XDP_ACT_HW_OFFLOAD;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d891634c692..4debd4b8e0f5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9188,8 +9188,16 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
dev_change_rx_flags(dev, IFF_PROMISC);
}
- if (notify)
+ if (notify) {
+ /* The ops lock is only required to ensure consistent locking
+ * for `NETDEV_CHANGE` notifiers. This function is sometimes
+ * called without the lock, even for devices that are ops
+ * locked, such as in `dev_uc_sync_multiple` when using
+ * bonding or teaming.
+ */
+ netdev_ops_assert_locked(dev);
__dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL);
+ }
return 0;
}
diff --git a/tools/testing/selftests/drivers/net/team/Makefile b/tools/testing/selftests/drivers/net/team/Makefile
index 2d5a76d99181..eaf6938f100e 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for net selftests
-TEST_PROGS := dev_addr_lists.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh
TEST_INCLUDES := \
../bonding/lag_lib.sh \
diff --git a/tools/testing/selftests/drivers/net/team/config b/tools/testing/selftests/drivers/net/team/config
index b5e3a3aad4bf..636b3525b679 100644
--- a/tools/testing/selftests/drivers/net/team/config
+++ b/tools/testing/selftests/drivers/net/team/config
@@ -1,5 +1,6 @@
CONFIG_DUMMY=y
CONFIG_IPV6=y
CONFIG_MACVLAN=y
+CONFIG_NETDEVSIM=m
CONFIG_NET_TEAM=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
diff --git a/tools/testing/selftests/drivers/net/team/propagation.sh b/tools/testing/selftests/drivers/net/team/propagation.sh
new file mode 100755
index 000000000000..849a5f2cb3a7
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/propagation.sh
@@ -0,0 +1,79 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+NSIM_LRO_ID=$((256 + RANDOM % 256))
+NSIM_LRO_SYS=/sys/bus/netdevsim/devices/netdevsim$NSIM_LRO_ID
+
+NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
+NSIM_DEV_SYS_DEL=/sys/bus/netdevsim/del_device
+
+cleanup()
+{
+ ip link del dummyteam &>/dev/null
+ ip link del team0 &>/dev/null
+ echo $NSIM_LRO_ID > $NSIM_DEV_SYS_DEL
+}
+
+# Trigger LRO propagation to the lower.
+# https://lore.kernel.org/netdev/aBvOpkIoxcr9PfDg@mini-arch/
+team_lro()
+{
+ # using netdevsim because it supports NETIF_F_LRO
+ NSIM_LRO_NAME=$(find $NSIM_LRO_SYS/net -maxdepth 1 -type d ! \
+ -path $NSIM_LRO_SYS/net -exec basename {} \;)
+
+ ip link add name team0 type team
+ ip link set $NSIM_LRO_NAME down
+ ip link set dev $NSIM_LRO_NAME master team0
+ ip link set team0 up
+ ethtool -K team0 large-receive-offload off
+
+ ip link del team0
+}
+
+# Trigger promisc propagation to the lower during IFLA_MASTER.
+# https://lore.kernel.org/netdev/20250506032328.3003050-1-sdf@fomichev.me/
+team_promisc()
+{
+ ip link add name dummyteam type dummy
+ ip link add name team0 type team
+ ip link set dummyteam down
+ ip link set team0 promisc on
+ ip link set dev dummyteam master team0
+ ip link set team0 up
+
+ ip link del team0
+ ip link del dummyteam
+}
+
+# Trigger promisc propagation to the lower via netif_change_flags (aka
+# ndo_change_rx_flags).
+# https://lore.kernel.org/netdev/20250514220319.3505158-1-stfomichev@gmail.co…
+team_change_flags()
+{
+ ip link add name dummyteam type dummy
+ ip link add name team0 type team
+ ip link set dummyteam down
+ ip link set dev dummyteam master team0
+ ip link set team0 up
+ ip link set team0 promisc on
+
+ # Make sure we can add more L2 addresses without any issues.
+ ip link add link team0 address 00:00:00:00:00:01 team0.1 type macvlan
+ ip link set team0.1 up
+
+ ip link del team0.1
+ ip link del team0
+ ip link del dummyteam
+}
+
+trap cleanup EXIT
+modprobe netdevsim || :
+echo $NSIM_LRO_ID > $NSIM_DEV_SYS_NEW
+udevadm settle
+team_lro
+team_promisc
+team_change_flags
+modprobe -r netdevsim || :
--
2.49.0
When running 'make' in tools/testing/selftests/arm64/ without explicitly
setting the OUTPUT variable, the build system will creates test
directories (e.g., /bti) in the root filesystem due to OUTPUT defaulting
to an empty string. This causes unintended pollution of the root directory.
This patch adds proper handling for the OUTPUT variable: Sets OUTPUT
to the current directory (.) if not specified
Signed-off-by: tanze <tanze(a)kylinos.cn>
---
tools/testing/selftests/arm64/Makefile | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
index 22029e60eff3..c4c72ee2ef55 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -21,6 +21,8 @@ CFLAGS += $(KHDR_INCLUDES)
CFLAGS += -I$(top_srcdir)/tools/include
+OUTPUT ?= $(CURDIR)
+
export CFLAGS
export top_srcdir
--
2.25.1
Two guard_regions tests on userfaultfd fail when userfaultfd is not
present. Skip them instead.
hugevm test reads kernel config to get page table level information and
fails when neither /proc/config.gz nor /boot/config-* is present. Skip
it instead.
Zi Yan (2):
selftests/mm: skip guard_regions.uffd tests when uffd is not present.
selftests/mm: skip hugevm test if kernel config file is not present.
tools/testing/selftests/mm/guard-regions.c | 17 ++++++++++--
.../selftests/mm/va_high_addr_switch.sh | 26 +++++++------------
2 files changed, 24 insertions(+), 19 deletions(-)
--
2.47.2
The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM.
However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0
indicates that MTE_ASYNC is supported. On a host with
ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the
MTE capability enabled will incorrectly see MTE_ASYNC advertised as
supported. This series fixes that.
This was found by inspection and the current behaviour is not known to
break anything. Linux doesn't check MTE_frac, and wrongly, assumes
MTE async faults can be generated whenever MTE is supported. This is
a separate problem and not addressed here.
I am looking for feedback on whether this change is valuable or
otherwise.
Changes since v1:
Only pass MTE_Frac hw value to the guest when it is the exact failure case.
Changed base commit to v6.15-rc5 but still applies on v6.16-rc2 as well.
Ben Horgan (3):
arm64/sysreg: Expose MTE_frac so that it is visible to KVM
KVM: arm64: Make MTE_frac masking conditional on MTE capability
KVM: selftests: Confirm exposing MTE_frac does not break migration
arch/arm64/kernel/cpufeature.c | 1 +
arch/arm64/kvm/sys_regs.c | 28 ++++++-
.../testing/selftests/kvm/arm64/set_id_regs.c | 77 ++++++++++++++++++-
3 files changed, 103 insertions(+), 3 deletions(-)
base-commit: 92a09c47464d040866cf2b4cd052bc60555185fb
--
2.43.0
Add small grammar fixes in perf events and Real Time Clock tests'
output messages.
Include braces around a single if statement, when there are multiple
statements in the else branch, to align with the kernel coding style.
Signed-off-by: Hanne-Lotta Mäenpää <hannelotta(a)gmail.com>
---
tools/testing/selftests/perf_events/watermark_signal.c | 7 ++++---
tools/testing/selftests/rtc/rtctest.c | 10 +++++-----
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/perf_events/watermark_signal.c b/tools/testing/selftests/perf_events/watermark_signal.c
index 49dc1e831174..6176afd4950b 100644
--- a/tools/testing/selftests/perf_events/watermark_signal.c
+++ b/tools/testing/selftests/perf_events/watermark_signal.c
@@ -65,8 +65,9 @@ TEST(watermark_signal)
child = fork();
EXPECT_GE(child, 0);
- if (child == 0)
+ if (child == 0) {
do_child();
+ }
else if (child < 0) {
perror("fork()");
goto cleanup;
@@ -75,7 +76,7 @@ TEST(watermark_signal)
if (waitpid(child, &child_status, WSTOPPED) != child ||
!(WIFSTOPPED(child_status) && WSTOPSIG(child_status) == SIGSTOP)) {
fprintf(stderr,
- "failed to sycnhronize with child errno=%d status=%x\n",
+ "failed to synchronize with child errno=%d status=%x\n",
errno,
child_status);
goto cleanup;
@@ -84,7 +85,7 @@ TEST(watermark_signal)
fd = syscall(__NR_perf_event_open, &attr, child, -1, -1,
PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
- fprintf(stderr, "failed opening event %llx\n", attr.config);
+ fprintf(stderr, "failed to setup performance monitoring %llx\n", attr.config);
goto cleanup;
}
diff --git a/tools/testing/selftests/rtc/rtctest.c b/tools/testing/selftests/rtc/rtctest.c
index be175c0e6ae3..8fd4d5d3b527 100644
--- a/tools/testing/selftests/rtc/rtctest.c
+++ b/tools/testing/selftests/rtc/rtctest.c
@@ -138,10 +138,10 @@ TEST_F_TIMEOUT(rtc, date_read_loop, READ_LOOP_DURATION_SEC + 2) {
rtc_read = rtc_time_to_timestamp(&rtc_tm);
/* Time should not go backwards */
ASSERT_LE(prev_rtc_read, rtc_read);
- /* Time should not increase more then 1s at a time */
+ /* Time should not increase more than 1s per read */
ASSERT_GE(prev_rtc_read + 1, rtc_read);
- /* Sleep 11ms to avoid killing / overheating the RTC */
+ /* Sleep 11ms to avoid overheating the RTC */
nanosleep_with_retries(READ_LOOP_SLEEP_MS * 1000000);
prev_rtc_read = rtc_read;
@@ -236,7 +236,7 @@ TEST_F(rtc, alarm_alm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
- SKIP(return, "Skipping test since alarms has only minute granularity.");
+ SKIP(return, "Skipping test since alarms have only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &tm);
ASSERT_NE(-1, rc);
@@ -306,7 +306,7 @@ TEST_F(rtc, alarm_wkalm_set) {
if (alarm_state == RTC_ALARM_DISABLED)
SKIP(return, "Skipping test since alarms are not supported.");
if (alarm_state == RTC_ALARM_RES_MINUTE)
- SKIP(return, "Skipping test since alarms has only minute granularity.");
+ SKIP(return, "Skipping test since alarms have only minute granularity.");
rc = ioctl(self->fd, RTC_RD_TIME, &alarm.time);
ASSERT_NE(-1, rc);
@@ -502,7 +502,7 @@ int main(int argc, char **argv)
if (access(rtc_file, R_OK) == 0)
ret = test_harness_run(argc, argv);
else
- ksft_exit_skip("[SKIP]: Cannot access rtc file %s - Exiting\n",
+ ksft_exit_skip("Cannot access RTC file %s - exiting\n",
rtc_file);
return ret;
--
2.39.5