On 3/2/23 13:27, Stephen Boyd wrote:
> Quoting Rob Herring (2023-03-02 09:32:09)
>> On Thu, Mar 2, 2023 at 2:14 AM David Gow <davidgow(a)google.com> wrote:
>>>
>>> On Thu, 2 Mar 2023 at 09:38, Stephen Boyd <sboyd(a)kernel.org> wrote:
>>>>
>>>> This patch series adds unit tests for the clk fixed rate basic type and
>>>> the clk registration functions that use struct clk_parent_data. To get
>>>> there, we add support for loading a DTB into the UML kernel that's
>>>> running the unit tests along with probing platform drivers to bind to
>>>> device nodes specified in DT.
>>>>
>>>> With this series, we're able to exercise some of the code in the common
>>>> clk framework that uses devicetree lookups to find parents and the fixed
>>>> rate clk code that scans devicetree directly and creates clks. Please
>>>> review.
>>>>
>>>
>>> Thanks Stephen -- this is really neat!
>>>
>>> This works well here, and I love all of the tests for the
>>> KUnit/device-tree integration as well.
>>>
>>> I'm still looking through the details of it (alas, I've mostly lived
>>> in x86-land, so my device-tree knowledge is, uh, spotty to say the
>>> least), but apart from possibly renaming some things or similarly
>>> minor tweaks, I've not got any real suggestions thus far.
>>>
>>> I do wonder whether we'll want, on the KUnit side, to have some way of
>>> supporting KUnit device trees on non-UML architecctures (e.g., if we
>>> need to test something architecture-specific, or on a big-endian
>>> platform, etc), but I think that's a question for the future, rather
>>> than something that affects this series.
>>
>> I'll say that's a requirement. We should be able to structure the
>> tests to not interfere with the running system's DT. The DT unittest
>> does that.
>
> That could be another choice in the unit test choice menu.
> CONFIG_OF_KUNIT_NOT_UML that injects some built-in DTB overlay on an
> architecture that wants to run tests.
>
>>
>> As a side topic, Is anyone looking at getting UML to work on arm64?
>> It's surprising how much x86 stuff there is which is I guess one
>> reason it hasn't happened.
>
> I've no idea but it would be nice indeed.
>
>>
>>> Similarly, I wonder if there's something we could do with device tree
>>> overlays, in order to make it possible for tests to swap nodes in and
>>> out for testing.
>>
>> Yes, that's how the DT unittest works. But it is pretty much one big
>> overlay (ignoring the overlay tests). It could probably be more
>> modular where it is apply overlay, test, remove overlay, repeat.
>>
>
> I didn't want to rely on the overlay code to inject DT nodes. Having
> tests written for the fake KUnit machine is simple. It closely matches
> how clk code probes the DTB and how nodes are created and populated on
> the platform bus as devices. CLK_OF_DECLARE() would need the overlay to
> be applied early too, which doesn't happen otherwise as far as I know.
>
> But perhaps this design is too much of an end-to-end test and not a unit
> test? In the spirit of unit testing we shouldn't care about how the node
> is added to the live devicetree, just that there is a devicetree at all.
>
> Supporting overlays to more easily test combinations sounds like a good
> idea. Probably some kunit_*() prefixed functions could be used to
In an imaginary world where overlay support was completed, then _maybe_.
To me, the most important environment to test is where the devictree
data is populated in early boot from an FDT. This is the environment
that drivers currently exist in.
Populating devicetree data via an overlay adds in the functioning of the
overlay apply code (and how the rules behind that functioning may differ
from devicetree data populated in early boot from an FDT).
In an ideal world where overlay support was completed, most or all of the
devicetree tests that were performed against the devicetree data populated
in early boot from an FDT would be repeated, but against comparable
devicetree data populated via an overlay load. The tests with the overlay
data may have to be aware of some differences in how an overlay load
processes an FDT vs how the early boot processing of an FDT behaves.
This extra testing would verify that the overlay environment behaves
the same as the non-overlay environment (with some known exceptions
due to overlay policies).
Overlay support is not complete:
https://elinux.org/Device_Tree_Reference#Mainline_Linux_Supporthttps://elinux.org/Frank%27s_Evolving_Overlay_Thoughts
-Frank
> apply a test managed overlay and automatically remove it when the test
> is over would work. The clk registration tests could use this API to
> inject an overlay and then manually call the of_platform_populate()
> function to create the platform device(s). The overlay could be built in
> drivers/clk/ too and then probably some macroish function can find the
> blob and apply it.
>
> Is there some way to delete the platform devices that we populate from
> the overlay? I'd like the tests to be hermetic.
Building and running the subsuite 'ir' of kselftest, shows the
following issues:
ir_loopback: module rc-loopback is not found in /lib/modules/6.2.0-rc8-next-20230220 [SKIP]
By creating a config file with RC_LOOPBACK=m, LIRC=y and a few
IR_*DECODER=m in the selftests/ir/ directory the tests pass.
Reported-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Signed-off-by: Anders Roxell <anders.roxell(a)linaro.org>
---
tools/testing/selftests/ir/config | 13 +++++++++++++
1 file changed, 13 insertions(+)
create mode 100644 tools/testing/selftests/ir/config
diff --git a/tools/testing/selftests/ir/config b/tools/testing/selftests/ir/config
new file mode 100644
index 000000000000..a6031914fa3d
--- /dev/null
+++ b/tools/testing/selftests/ir/config
@@ -0,0 +1,13 @@
+CONFIG_LIRC=y
+CONFIG_IR_IMON_DECODER=m
+CONFIG_IR_JVC_DECODER=m
+CONFIG_IR_MCE_KBD_DECODER=m
+CONFIG_IR_NEC_DECODER=m
+CONFIG_IR_RC5_DECODER=m
+CONFIG_IR_RC6_DECODER=m
+CONFIG_IR_RCMM_DECODER=m
+CONFIG_IR_SANYO_DECODER=m
+CONFIG_IR_SHARP_DECODER=m
+CONFIG_IR_SONY_DECODER=m
+CONFIG_IR_XMP_DECODER=m
+CONFIG_RC_LOOPBACK=m
--
2.39.1
Changelog
v3:
* Rebased on top of Jason's iommufd_hwpt branch:
https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
Particularly the following series:
1) "Revise the hwpt lifetime model"
https://lore.kernel.org/linux-iommu/0-v2-406f7ac07936+6a-iommufd_hwpt_jgg@n…
2) "Add iommufd physical device operations for replace and alloc hwpt"
https://lore.kernel.org/linux-iommu/0-v1-7612f88c19f5+2f21-iommufd_alloc_jg…
* Dropped patches from this series accordingly. There were a couple of
VFIO patches that will be submitted after the VFIO cdev series. Also,
renamed the series to be "emulated".
* Moved dma_unmap sanity patch to the first in the series.
* Moved dma_unmap sanity to cover both VFIO and IOMMUFD pathways.
* Added Kevin's "Reviewed-by" to two of the patches.
* Fixed a NULL pointer bug in vfio_iommufd_emulated_bind().
* Moved unmap() call to the common place in iommufd_access_set_ioas().
v2:
https://lore.kernel.org/linux-iommu/cover.1675802050.git.nicolinc@nvidia.co…
* Rebased on top of vfio_device cdev v2 series.
* Update the kdoc and commit message of iommu_group_replace_domain().
* Dropped revert-to-core-domain part in iommu_group_replace_domain().
* Dropped !ops->dma_unmap check in vfio_iommufd_emulated_attach_ioas().
* Added missing rc value in vfio_iommufd_emulated_attach_ioas() from the
iommufd_access_set_ioas() call.
* Added a new patch in vfio_main to deny vfio_pin/unpin_pages() calls if
vdev->ops->dma_unmap is not implemented.
* Added a __iommmufd_device_detach helper and let the replace routine do
a partial detach().
* Added restriction on auto_domains to use the replace feature.
* Added the patch "iommufd/device: Make hwpt_list list_add/del symmetric"
from the has_group removal series.
v1:
https://lore.kernel.org/linux-iommu/cover.1675320212.git.nicolinc@nvidia.co…
Hi all,
The existing IOMMU APIs provide a pair of functions: iommu_attach_group()
for callers to attach a device from the default_domain (NULL if not being
supported) to a given iommu domain, and iommu_detach_group() for callers
to detach a device from a given domain to the default_domain. Internally,
the detach_dev op is deprecated for the newer drivers with default_domain.
This means that those drivers likely can switch an attaching domain to
another one, without stagging the device at a blocking or default domain,
for use cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
to an S1 domain, or when switching between relevant S1 domains.
This series is rebased on top of Jason Gunthorpe's series that introduces
iommu_group_replace_domain API and IOMMUFD infrastructure for the IOMMUFD
"physical" devices. The IOMMUFD "emulated" deivces will need some extra
steps to replace the access->ioas object and its iopt pointer.
You can also find this series on Github:
https://github.com/nicolinc/iommufd/commits/iommu_group_replace_domain-v3
Thank you
Nicolin Chen
Nicolin Chen (5):
vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()
iommufd: Create access in vfio_iommufd_emulated_bind()
iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage
iommufd: Add replace support in iommufd_access_set_ioas()
iommufd/selftest: Add coverage for access->ioas replacement
drivers/iommu/iommufd/device.c | 114 ++++++++++++++----
drivers/iommu/iommufd/iommufd_private.h | 2 +
drivers/iommu/iommufd/iommufd_test.h | 4 +
drivers/iommu/iommufd/selftest.c | 25 +++-
drivers/vfio/iommufd.c | 23 ++--
drivers/vfio/vfio_main.c | 4 +
include/linux/iommufd.h | 3 +-
tools/testing/selftests/iommu/iommufd.c | 29 ++++-
tools/testing/selftests/iommu/iommufd_utils.h | 22 +++-
9 files changed, 185 insertions(+), 41 deletions(-)
--
2.39.2
=== Context ===
In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which are very nice:
1. Enforce policy on first fragment and accept all subsequent fragments.
This works but may let in certain attacks or allow data exfiltration.
2. Enforce policy on first fragment and drop all subsequent fragments.
This does not really work b/c some protocols may rely on
fragmentation. For example, DNS may rely on oversized UDP packets for
large responses.
So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:
Middleboxes [...] should process IP fragments in a manner that is
consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
must maintain state in order to achieve this goal.
=== BPF related bits ===
However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in a
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.
Fortunately, the kernel has robust support for fragmented IP packets.
This patchset wraps the existing defragmentation facilities in kfuncs so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.
=== Patchset details ===
This patchset is (hopefully) relatively straightforward from BPF perspective.
One thing I'd like to call out is the skb_copy()ing of the prog skb. I
did this to maintain the invariant that the ctx remains valid after prog
has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
consume the skb if the skb is a fragment.
Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked ok, but it seemed too complex to modify the
surrounding assumptions about ctx validity.
[0]: https://datatracker.ietf.org/doc/html/rfc8900
===
Changes from v1:
* Add support for ipv6 defragmentation
Daniel Xu (8):
ip: frags: Return actual error codes from ip_check_defrag()
bpf: verifier: Support KF_CHANGES_PKT flag
bpf, net, frags: Add bpf_ip_check_defrag() kfunc
net: ipv6: Factor ipv6_frag_rcv() to take netns and user
bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
bpf: selftests: Support not connecting client socket
bpf: selftests: Support custom type and proto for client sockets
bpf: selftests: Add defrag selftests
Documentation/bpf/kfuncs.rst | 7 +
drivers/net/macvlan.c | 2 +-
include/linux/btf.h | 1 +
include/net/ip.h | 11 +
include/net/ipv6.h | 1 +
include/net/ipv6_frag.h | 1 +
include/net/transp_v6.h | 1 +
kernel/bpf/verifier.c | 8 +
net/ipv4/Makefile | 1 +
net/ipv4/ip_fragment.c | 15 +-
net/ipv4/ip_fragment_bpf.c | 98 ++++++
net/ipv6/Makefile | 1 +
net/ipv6/af_inet6.c | 4 +
net/ipv6/reassembly.c | 16 +-
net/ipv6/reassembly_bpf.c | 143 ++++++++
net/packet/af_packet.c | 2 +-
tools/testing/selftests/bpf/Makefile | 3 +-
.../selftests/bpf/generate_udp_fragments.py | 90 +++++
.../selftests/bpf/ip_check_defrag_frags.h | 57 +++
tools/testing/selftests/bpf/network_helpers.c | 26 +-
tools/testing/selftests/bpf/network_helpers.h | 3 +
.../bpf/prog_tests/ip_check_defrag.c | 327 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/ip_check_defrag.c | 133 +++++++
24 files changed, 931 insertions(+), 21 deletions(-)
create mode 100644 net/ipv4/ip_fragment_bpf.c
create mode 100644 net/ipv6/reassembly_bpf.c
create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c
--
2.39.1
The test_local_dnat_portonly() function initiates the client-side as
soon as it sets the listening side to the background. This could lead to
a race condition where the server may not be ready to listen. To ensure
that the server-side is up and running before initiating the
client-side, a delay is introduced to the test_local_dnat_portonly()
function.
Before the fix:
# ./nft_nat.sh
PASS: netns routing/connectivity: ns0-rthlYrBU can reach ns1-rthlYrBU and ns2-rthlYrBU
PASS: ping to ns1-rthlYrBU was ip NATted to ns2-rthlYrBU
PASS: ping to ns1-rthlYrBU OK after ip nat output chain flush
PASS: ipv6 ping to ns1-rthlYrBU was ip6 NATted to ns2-rthlYrBU
2023/02/27 04:11:03 socat[6055] E connect(5, AF=2 10.0.1.99:2000, 16): Connection refused
ERROR: inet port rewrite
After the fix:
# ./nft_nat.sh
PASS: netns routing/connectivity: ns0-9sPJV6JJ can reach ns1-9sPJV6JJ and ns2-9sPJV6JJ
PASS: ping to ns1-9sPJV6JJ was ip NATted to ns2-9sPJV6JJ
PASS: ping to ns1-9sPJV6JJ OK after ip nat output chain flush
PASS: ipv6 ping to ns1-9sPJV6JJ was ip6 NATted to ns2-9sPJV6JJ
PASS: inet port rewrite without l3 address
Fixes: 282e5f8fe907 ("netfilter: nat: really support inet nat without l3 address")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/netfilter/nft_nat.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh
index 924ecb3f1f73..dd40d9f6f259 100755
--- a/tools/testing/selftests/netfilter/nft_nat.sh
+++ b/tools/testing/selftests/netfilter/nft_nat.sh
@@ -404,6 +404,8 @@ EOF
echo SERVER-$family | ip netns exec "$ns1" timeout 5 socat -u STDIN TCP-LISTEN:2000 &
sc_s=$!
+ sleep 1
+
result=$(ip netns exec "$ns0" timeout 1 socat TCP:$daddr:2000 STDOUT)
if [ "$result" = "SERVER-inet" ];then
--
2.38.1
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
kernel already has soft-dirty feature inside which we have given up to
use, we are using written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty/Written-to status and clear present in
the kernel.
- The pages which have been written-to can not be found in accurate way.
(Kernel's soft-dirty PTE bit + sof_dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirtyi feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong as it shows more soft-dirty
pages than the actual soft-dirty pages. There is no interest in
correcting it [2][3] as this is how the feature was written years ago.
It shouldn't be updated to changed behaviour. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of the pages are found. The max_pages is
optional. If max_pages is specified, it must be equal or greater than the
vec_size. This restriction is needed to handle worse case when one
page_region only contains info of one page and it cannot be compacted.
This is needed to emulate the Windows getWriteWatch() syscall.
The patch series include the detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usages as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (6):
userfaultfd: Add UFFD WP Async support
userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC
fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about
PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: vm: add pagemap ioctl tests
Documentation/admin-guide/mm/pagemap.rst | 24 +
Documentation/admin-guide/mm/userfaultfd.rst | 7 +
fs/proc/task_mmu.c | 290 ++++++
fs/userfaultfd.c | 20 +-
include/linux/userfaultfd_k.h | 11 +
include/uapi/linux/fs.h | 50 ++
include/uapi/linux/userfaultfd.h | 10 +-
mm/memory.c | 23 +-
tools/include/uapi/linux/fs.h | 50 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++
12 files changed, 1364 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
--
2.30.2