This series implements selftests targeting the feature floated by Chao via:
The changes below aim to test the fd-based approach for guest private memory
in context of normal (non-confidential) VMs executing on non-confidential
The private_mem_test.c file adds a selftest that accesses private memory from
the guest via private/shared accesses and checks whether the contents can be
leaked to or accessed by the VMM via the shared memory view before/after
conversions.
Updates in V2:
1) Simplified vcpu run loop implementation API
2) Removed VM creation logic from private mem library
Updates in V1 (Compared to RFC v3 patches):
1) Incorporated suggestions from Sean around simplifying KVM changes
2) Addressed comments from Sean
3) Added private mem test with shared memory backed by 2MB hugepages.
This series depends on the following patches:
1) V10 series patches from Chao mentioned above.
Github link for the patches posted as part of this series:
Vishal Annapurve (6):
KVM: x86: Add support for testing private memory
KVM: Selftests: Add support for private memory
KVM: selftests: x86: Add IS_ALIGNED/IS_PAGE_ALIGNED helpers
KVM: selftests: x86: Add helpers to execute VMs with private memory
KVM: selftests: Add get_free_huge_2m_pages
KVM: selftests: x86: Add selftest for private memory
arch/x86/kvm/mmu/mmu_internal.h | 6 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 2 +
.../selftests/kvm/include/kvm_util_base.h | 15 +-
.../testing/selftests/kvm/include/test_util.h | 5 +
.../kvm/include/x86_64/private_mem.h | 24 ++
.../selftests/kvm/include/x86_64/processor.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 58 ++++-
tools/testing/selftests/kvm/lib/test_util.c | 29 +++
.../selftests/kvm/lib/x86_64/private_mem.c | 139 ++++++++++++
.../selftests/kvm/x86_64/private_mem_test.c | 212 ++++++++++++++++++
virt/kvm/Kconfig | 4 +
virt/kvm/kvm_main.c | 3 +-
13 files changed, 490 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/x86_64/private_mem.h
create mode 100644 tools/testing/selftests/kvm/lib/x86_64/private_mem.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_test.c
During the course of implementing FEAT_LPA2 within the arm64 KVM port, I found a
couple of issues within the KVM selftest code, which I thought were worth
posting independently. The LPA2 patches, for which I will post v2 in the next
few days, depend on these fixes for their testing.
Ryan Roberts (2):
KVM: selftests: Fixup config fragment for access_tracking_perf_test
KVM: selftests: arm64: Fix pte encode/decode for PA bits > 48
tools/testing/selftests/kvm/config | 1 +
.../selftests/kvm/lib/aarch64/processor.c | 32 ++++++++++++++-----
2 files changed, 25 insertions(+), 8 deletions(-)
On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu(a)dxuuu.xyz> wrote:
> Hi Alexei,
> On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> > > === Context ===
> > >
> > > In the context of a middlebox, fragmented packets are tricky to handle.
> > > The full 5-tuple of a packet is often only available in the first
> > > fragment which makes enforcing consistent policy difficult. There are
> > > really only two stateless options, neither of which are very nice:
> > >
> > > 1. Enforce policy on first fragment and accept all subsequent fragments.
> > > This works but may let in certain attacks or allow data exfiltration.
> > >
> > > 2. Enforce policy on first fragment and drop all subsequent fragments.
> > > This does not really work b/c some protocols may rely on
> > > fragmentation. For example, DNS may rely on oversized UDP packets for
> > > large responses.
> > >
> > > So stateful tracking is the only sane option. RFC 8900 calls this
> > > out as well in section 6.3:
> > >
> > > Middleboxes [...] should process IP fragments in a manner that is
> > > consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
> > > must maintain state in order to achieve this goal.
> > >
> > > === BPF related bits ===
> > >
> > > However, when policy is enforced through BPF, the prog is run before the
> > > kernel reassembles fragmented packets. This leaves BPF developers in an
> > > awkward place: implement reassembly (possibly poorly) or use a stateless
> > > method as described above.
> > >
> > > Fortunately, the kernel has robust support for fragmented IP packets.
> > > This patchset wraps the existing defragmentation facilities in kfuncs so
> > > that BPF progs running on middleboxes can reassemble fragmented packets
> > > before applying policy.
> > >
> > > === Patchset details ===
> > >
> > > This patchset is (hopefully) relatively straightforward from BPF perspective.
> > > One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> > > did this to maintain the invariant that the ctx remains valid after prog
> > > has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> > > consume the skb if the skb is a fragment.
> > Instead of doing all that with extra skb copy can you hook bpf prog after
> > the networking stack already handled ip defrag?
> > What kind of middle box are you doing? Why does it have to run at TC layer?
> Unless I'm missing something, the only other relevant hooks would be
> socket hooks, right?
> Unfortunately I don't think my use case can do that. We are running the
> kernel as a router, so no sockets are involved.
Are you using bpf_fib_lookup and populating kernel routing
table and doing everything on your own including neigh ?
Have you considered to skb redirect to another netdev that does ip defrag?
Like macvlan does it under some conditions. This can be generalized.
Recently Florian proposed to allow calling bpf progs from all existing
You can pretend to local deliver and hook in NF_INET_LOCAL_IN ?
I feel it would be so much cleaner if stack does ip_defrag normally.
The general issue of skb ownership between bpf prog and defrag logic
isn't really solved with skb_copy. It's still an issue.
Bring back the Python scripts that were initially added with
TEST_GEN_FILES but now with TEST_FILES to avoid having them deleted
when doing a clean. Also fix the way the architecture is being
determined as they should also be installed when ARCH=x86_64 is
provided explicitly. Then also append extra files to TEST_FILES and
TEST_PROGS with += so they don't get discarded.
Fixes: ba2d788aa873 ("selftests: amd-pstate: Trigger tbench benchmark and test cpus")
Fixes: ac527cee87c9 ("selftests: amd-pstate: Don't delete source files via Makefile")
Signed-off-by: Guillaume Tucker <guillaume.tucker(a)collabora.com>
tools/testing/selftests/amd-pstate/Makefile | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/amd-pstate/Makefile b/tools/testing/selftests/amd-pstate/Makefile
index 5fd1424db37d..c382f579fe94 100644
@@ -4,10 +4,15 @@
# No binaries, but make sure arg-less "make" doesn't trigger "run_tests"
-uname_M := $(shell uname -m 2>/dev/null || echo not)
-ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
+ARCH ?= $(shell uname -m 2>/dev/null || echo not)
+ARCH := $(shell echo $(ARCH) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
-TEST_PROGS := run.sh
-TEST_FILES := basic.sh tbench.sh gitsource.sh
+TEST_FILES += ../../../power/x86/amd_pstate_tracer/amd_pstate_trace.py
+TEST_FILES += ../../../power/x86/intel_pstate_tracer/intel_pstate_tracer.py
+TEST_PROGS += run.sh
+TEST_FILES += basic.sh tbench.sh gitsource.sh
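
To illustrate the ARCH change in the diff above, here is a minimal Python
sketch of the string handling only. The function names are hypothetical, and
make's own precedence rules for command-line versus environment variables
are not modelled; `provided_arch` simply stands for "ARCH was already set
before the Makefile ran".

```python
import re


def normalize_arch(arch):
    """Mimic `sed -e s/i.86/x86/ -e s/x86_64/x86/` (first match only, no /g)."""
    arch = re.sub(r"i.86", "x86", arch, count=1)
    return re.sub(r"x86_64", "x86", arch, count=1)


def old_arch(provided_arch, uname_m):
    # Old Makefile: `ARCH ?= $(shell echo $(uname_M) | sed ...)`.
    # The sed step is part of the default, so it is skipped
    # whenever ARCH was already provided.
    if provided_arch is not None:
        return provided_arch
    return normalize_arch(uname_m)


def new_arch(provided_arch, uname_m):
    # New Makefile: `ARCH ?= $(shell uname -m ...)` followed by
    # `ARCH := $(shell echo $(ARCH) | sed ...)`, so the sed step
    # is applied to whatever value ARCH ended up with.
    arch = provided_arch if provided_arch is not None else uname_m
    return normalize_arch(arch)
```

With the old logic an explicitly provided x86_64 stays x86_64, so an
`ARCH == x86` check would not match; with the new logic it becomes x86.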
Building and running the 'ir' subsuite of kselftest shows the
ir_loopback: module rc-loopback is not found in /lib/modules/6.2.0-rc8-next-20230220 [SKIP]
By creating a config file with RC_LOOPBACK=m, LIRC=y and a few
IR_*DECODER=m in the selftests/ir/ directory, the tests pass.
Reported-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
tools/testing/selftests/ir/config | 13 +++++++++++++
1 file changed, 13 insertions(+)
create mode 100644 tools/testing/selftests/ir/config
diff --git a/tools/testing/selftests/ir/config b/tools/testing/selftests/ir/config
new file mode 100644
@@ -0,0 +1,13 @@
* Rebased on top of Jason's iommufd_hwpt branch:
Particularly the following series:
1) "Revise the hwpt lifetime model"
2) "Add iommufd physical device operations for replace and alloc hwpt"
* Dropped patches from this series accordingly. There were a couple of
VFIO patches that will be submitted after the VFIO cdev series. Also,
renamed the series to be "emulated".
* Moved dma_unmap sanity patch to the first in the series.
* Moved dma_unmap sanity to cover both VFIO and IOMMUFD pathways.
* Added Kevin's "Reviewed-by" to two of the patches.
* Fixed a NULL pointer bug in vfio_iommufd_emulated_bind().
* Moved unmap() call to the common place in iommufd_access_set_ioas().
* Rebased on top of vfio_device cdev v2 series.
* Update the kdoc and commit message of iommu_group_replace_domain().
* Dropped revert-to-core-domain part in iommu_group_replace_domain().
* Dropped !ops->dma_unmap check in vfio_iommufd_emulated_attach_ioas().
* Added missing rc value in vfio_iommufd_emulated_attach_ioas() from the
* Added a new patch in vfio_main to deny vfio_pin/unpin_pages() calls if
vdev->ops->dma_unmap is not implemented.
* Added a __iommmufd_device_detach helper and let the replace routine do
a partial detach().
* Added restriction on auto_domains to use the replace feature.
* Added the patch "iommufd/device: Make hwpt_list list_add/del symmetric"
from the has_group removal series.
The existing IOMMU APIs provide a pair of functions: iommu_attach_group()
for callers to attach a device from the default_domain (NULL if not being
supported) to a given iommu domain, and iommu_detach_group() for callers
to detach a device from a given domain to the default_domain. Internally,
the detach_dev op is deprecated for the newer drivers with default_domain.
This means that those drivers likely can switch an attaching domain to
another one, without staging the device at a blocking or default domain,
for use cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
to an S1 domain, or when switching between relevant S1 domains.
This series is rebased on top of Jason Gunthorpe's series that introduces
iommu_group_replace_domain API and IOMMUFD infrastructure for the IOMMUFD
"physical" devices. The IOMMUFD "emulated" devices will need some extra
steps to replace the access->ioas object and its iopt pointer.
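
As a rough illustration of the difference between detach-then-attach and a
direct replace described above, here is a toy Python model. It is only a
sketch: ToyDevice and both helper functions are hypothetical names, not
kernel or IOMMUFD APIs, and the model only records which domains a device
passes through.

```python
class ToyDevice:
    """Hypothetical stand-in for a device; records every domain it visits."""

    def __init__(self, domain):
        self.domain = domain
        self.history = [domain]

    def set_domain(self, domain):
        self.domain = domain
        self.history.append(domain)


def detach_then_attach(dev, new_domain, default_domain="default"):
    # Classic pair: the detach parks the device on the default
    # domain before the attach moves it to the new one.
    dev.set_domain(default_domain)
    dev.set_domain(new_domain)


def replace(dev, new_domain):
    # Replace: switch directly, with no intermediate domain.
    dev.set_domain(new_domain)
```

In this model, moving a device from "S2" to "S1" with detach_then_attach()
yields the history S2, default, S1, while replace() yields S2, S1.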
You can also find this series on Github:
Nicolin Chen (5):
vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()
iommufd: Create access in vfio_iommufd_emulated_bind()
iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage
iommufd: Add replace support in iommufd_access_set_ioas()
iommufd/selftest: Add coverage for access->ioas replacement
drivers/iommu/iommufd/device.c | 114 ++++++++++++++----
drivers/iommu/iommufd/iommufd_private.h | 2 +
drivers/iommu/iommufd/iommufd_test.h | 4 +
drivers/iommu/iommufd/selftest.c | 25 +++-
drivers/vfio/iommufd.c | 23 ++--
drivers/vfio/vfio_main.c | 4 +
include/linux/iommufd.h | 3 +-
tools/testing/selftests/iommu/iommufd.c | 29 ++++-
tools/testing/selftests/iommu/iommufd_utils.h | 22 +++-
9 files changed, 185 insertions(+), 41 deletions(-)
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
So stateful tracking is the only sane option. RFC 8900 calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
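
To make the stateful option concrete, here is a minimal userspace sketch in
Python. It is not the kernel's defrag code and not part of this patchset:
FragmentTracker and udp_ports are hypothetical names, and overlap handling,
timeouts and memory limits are deliberately left out. It only shows that
the ports become readable once the fragments are reassembled, regardless of
arrival order.

```python
import struct
from collections import defaultdict


class FragmentTracker:
    """Toy reassembler keyed by (src, dst, proto, ip_id)."""

    def __init__(self):
        # key -> {byte offset: (payload, more_fragments)}
        self._frags = defaultdict(dict)

    def add(self, key, offset, more_fragments, payload):
        """Store a fragment; return the full payload once complete, else None."""
        frags = self._frags[key]
        frags[offset] = (payload, more_fragments)
        expected = 0
        out = bytearray()
        while expected in frags:
            data, more = frags[expected]
            out += data
            if not more:
                del self._frags[key]
                return bytes(out)
            if not data:
                break  # malformed: empty non-final fragment
            expected += len(data)
        return None


def udp_ports(l4_payload):
    """Source and destination port from the start of a UDP header."""
    return struct.unpack("!HH", l4_payload[:4])
```

A policy hook would call add() for every fragment and only look at
udp_ports() once add() returns the reassembled payload.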
=== BPF related bits ===
However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in an
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.
Fortunately, the kernel has robust support for fragmented IP packets.
This patchset wraps the existing defragmentation facilities in kfuncs so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.
=== Patchset details ===
This patchset is (hopefully) relatively straightforward from BPF perspective.
One thing I'd like to call out is the skb_copy()ing of the prog skb. I
did this to maintain the invariant that the ctx remains valid after prog
has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
consume the skb if the skb is a fragment.
Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked ok, but it seemed too complex to modify the
surrounding assumptions about ctx validity.
Changes from v1:
* Add support for ipv6 defragmentation
Daniel Xu (8):
ip: frags: Return actual error codes from ip_check_defrag()
bpf: verifier: Support KF_CHANGES_PKT flag
bpf, net, frags: Add bpf_ip_check_defrag() kfunc
net: ipv6: Factor ipv6_frag_rcv() to take netns and user
bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
Documentation/bpf/kfuncs.rst | 7 +
drivers/net/macvlan.c | 2 +-
include/linux/btf.h | 1 +
include/net/ip.h | 11 +
include/net/ipv6.h | 1 +
include/net/ipv6_frag.h | 1 +
include/net/transp_v6.h | 1 +
kernel/bpf/verifier.c | 8 +
net/ipv4/Makefile | 1 +
net/ipv4/ip_fragment.c | 15 +-
net/ipv4/ip_fragment_bpf.c | 98 ++++++
net/ipv6/Makefile | 1 +
net/ipv6/af_inet6.c | 4 +
net/ipv6/reassembly.c | 16 +-
net/ipv6/reassembly_bpf.c | 143 ++++++++
net/packet/af_packet.c | 2 +-
tools/testing/selftests/bpf/Makefile | 3 +-
.../selftests/bpf/generate_udp_fragments.py | 90 +++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 +++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 327 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/ip_check_defrag.c | 133 +++++++
24 files changed, 931 insertions(+), 21 deletions(-)
create mode 100644 net/ipv4/ip_fragment_bpf.c
create mode 100644 net/ipv6/reassembly_bpf.c
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
The test_local_dnat_portonly() function initiates the client-side as
soon as it sets the listening side to the background. This could lead to
a race condition where the server may not be ready to listen. To ensure
that the server-side is up and running before initiating the
client-side, a delay is introduced to the test_local_dnat_portonly()
Before the fix:
PASS: netns routing/connectivity: ns0-rthlYrBU can reach ns1-rthlYrBU and ns2-rthlYrBU
PASS: ping to ns1-rthlYrBU was ip NATted to ns2-rthlYrBU
PASS: ping to ns1-rthlYrBU OK after ip nat output chain flush
PASS: ipv6 ping to ns1-rthlYrBU was ip6 NATted to ns2-rthlYrBU
2023/02/27 04:11:03 socat E connect(5, AF=2 10.0.1.99:2000, 16): Connection refused
ERROR: inet port rewrite
After the fix:
PASS: netns routing/connectivity: ns0-9sPJV6JJ can reach ns1-9sPJV6JJ and ns2-9sPJV6JJ
PASS: ping to ns1-9sPJV6JJ was ip NATted to ns2-9sPJV6JJ
PASS: ping to ns1-9sPJV6JJ OK after ip nat output chain flush
PASS: ipv6 ping to ns1-9sPJV6JJ was ip6 NATted to ns2-9sPJV6JJ
PASS: inet port rewrite without l3 address
Fixes: 282e5f8fe907 ("netfilter: nat: really support inet nat without l3 address")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
tools/testing/selftests/netfilter/nft_nat.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 924ecb3f1f73..dd40d9f6f259 100755
@@ -404,6 +404,8 @@ EOF
echo SERVER-$family | ip netns exec "$ns1" timeout 5 socat -u STDIN TCP-LISTEN:2000 &
+ sleep 1
result=$(ip netns exec "$ns0" timeout 1 socat TCP:$daddr:2000 STDOUT)
if [ "$result" = "SERVER-inet" ];then
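
The "Connection refused" failure in the log above can be reproduced outside
the test with a small Python sketch. This is only an illustration of the
failure mode on the loopback interface, under the assumption that a client
connecting before the server calls listen() is refused; it does not use
socat or network namespaces, and the function names are hypothetical.

```python
import socket


def try_connect(port):
    """Return True if a TCP connect to 127.0.0.1:port succeeds."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=1):
            return True
    except OSError:
        return False


def connect_before_and_after_listen():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind(("127.0.0.1", 0))   # kernel picks a free port
        port = srv.getsockname()[1]
        before = try_connect(port)   # bound, but listen() not called yet
        srv.listen(1)
        after = try_connect(port)    # listener is ready now
        return before, after
    finally:
        srv.close()
```

The first attempt corresponds to the client winning the race; the second
corresponds to the client running after the server is ready, which is what
the added sleep is meant to ensure.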
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on using,
we use the written-to terminology while using UFFD async WP under
the hood.
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here. This series adds features that weren't
- There is no atomic get soft-dirty/Written-to status and clear present in
- The pages which have been written-to can not be found in accurate way.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of the soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it as this is how the feature was written years ago.
Its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
Information about whether a page is file mapped, present or swapped is
required for the CRIU project. The required mask, any mask, excluded mask
and return mask are also required for the CRIU project.
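
One plausible reading of the four masks, sketched in Python. The semantics
here are an assumption inferred from the mask names (all required bits set,
at least one any-of bit set, no excluded bit set, return mask selects the
reported bits); the bit values are illustrative only and the function names
are hypothetical. The uapi header in the series is authoritative.

```python
# Illustrative bit values, not taken from the uapi header.
PAGE_IS_WRITTEN = 1 << 0
PAGE_IS_FILE = 1 << 1
PAGE_IS_PRESENT = 1 << 2
PAGE_IS_SWAPPED = 1 << 3


def page_matches(flags, required=0, anyof=0, excluded=0):
    """Decide whether a page with these flags should be reported."""
    if (flags & required) != required:
        return False          # some required bit is missing
    if anyof and not (flags & anyof):
        return False          # none of the any-of bits is set
    if flags & excluded:
        return False          # an excluded bit is set
    return True


def reported_flags(flags, return_mask):
    """Only the bits selected by the return mask are handed back."""
    return flags & return_mask
```

For example, required=PAGE_IS_WRITTEN with excluded=PAGE_IS_FILE would
select written-to anonymous pages under this reading.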
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of the pages are found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than the
vec_size. This restriction is needed to handle the worst case, when one
page_region only contains info of one page and it cannot be compacted.
This is needed to emulate the Windows getWriteWatch() syscall.
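
The compaction and the max_pages/vec_size interaction described above can
be sketched in Python as follows. This is a simplified model, not the
kernel implementation: scan_regions is a hypothetical name, regions are
plain (start, n_pages, flags) tuples rather than struct page_region, the
4096-byte page size is an assumption, and the input is assumed to be
already filtered and sorted by address.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


def scan_regions(pages, vec_size, max_pages=None):
    """Compact (address, flags) pairs into (start, n_pages, flags) regions.

    Stops when vec_size regions are full or max_pages pages were found.
    """
    if max_pages is not None and max_pages < vec_size:
        raise ValueError("max_pages must be >= vec_size")
    regions = []
    found = 0
    for addr, flags in pages:
        if max_pages is not None and found >= max_pages:
            break
        if regions:
            start, n, rflags = regions[-1]
            # Adjacent page with identical flags: extend the last region.
            if rflags == flags and start + n * PAGE_SIZE == addr:
                regions[-1] = (start, n + 1, rflags)
                found += 1
                continue
        if len(regions) == vec_size:
            break  # no room for another region
        regions.append((addr, 1, flags))
        found += 1
    return regions
```

Three adjacent pages with the same flags occupy a single region, while an
isolated page needs a region of its own, which is the worst case the
restriction above refers to.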
The patch series includes the detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usages as well.
Muhammad Usama Anjum
Muhammad Usama Anjum (6):
userfaultfd: Add UFFD WP Async support
userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: vm: add pagemap ioctl tests
Documentation/admin-guide/mm/pagemap.rst | 24 +
Documentation/admin-guide/mm/userfaultfd.rst | 7 +
fs/proc/task_mmu.c | 290 ++++++
fs/userfaultfd.c | 20 +-
include/linux/userfaultfd_k.h | 11 +
include/uapi/linux/fs.h | 50 ++
include/uapi/linux/userfaultfd.h | 10 +-
mm/memory.c | 23 +-
tools/include/uapi/linux/fs.h | 50 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++
12 files changed, 1364 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c