Patch 1 fixes a possible deadlock in subflow_error_report() reported by
lockdep. The report was in fact a false positive but the modification
makes sense and silences lockdep to allow syzkaller to find real issues.
The regression has been introduced in v5.12.
Patch 2 is a refactoring needed to be able to fix the two next issues.
It improves the situation and can be backported up to v6.0.
Patches 3 and 4 fix UaF reported by KASAN. It fixes issues potentially
visible since v5.7 and v5.19 but only reproducible until recently
(v6.0). These two patches depend on patch 2/7.
Patch 5 fixes the order of the printed values: expected vs seen values.
The regression has been introduced recently: v6.3-rc1.
Patch 6 adds missing ro_after_init flags. A previous patch added them
for other functions but these two have been missed. This previous patch
has been backported to stable versions (up to v5.12) so probably better
to do the same here.
Patch 7 fixes tcp_set_state() being called twice in a row since v5.10.
Patch 8 fixes another lockdep false positive issue but this time in
MPTCP PM code. Same here, some modifications in the code has been made
to silence this issue and help finding real ones later. This issue can
be seen since v6.2.
Note that checkpatch.pl is now complaining about the "Closes" tag but
discussions are ongoing to add an exception:
https://lore.kernel.org/all/a27480c5-c3d4-b302-285e-323df0349b8f@tessares.n…
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Changes in v2:
- Patches 3 and 4 have been modified to fix the issue reported on netdev
- Patch 8 has been added
- Rebased
- Link to v1: https://lore.kernel.org/r/20230227-upstream-net-20230227-mptcp-fixes-v1-0-0…
---
Geliang Tang (1):
mptcp: add ro_after_init for tcp{,v6}_prot_override
Matthieu Baerts (2):
selftests: mptcp: userspace pm: fix printed values
mptcp: avoid setting TCP_CLOSE state twice
Paolo Abeni (5):
mptcp: fix possible deadlock in subflow_error_report
mptcp: refactor passive socket initialization
mptcp: use the workqueue to destroy unaccepted sockets
mptcp: fix UaF in listener shutdown
mptcp: fix lockdep false positive in mptcp_pm_nl_create_listen_socket()
net/mptcp/pm_netlink.c | 16 +++
net/mptcp/protocol.c | 64 +++++------
net/mptcp/protocol.h | 6 +-
net/mptcp/subflow.c | 128 +++++++---------------
tools/testing/selftests/net/mptcp/userspace_pm.sh | 2 +-
5 files changed, 95 insertions(+), 121 deletions(-)
---
base-commit: 67eeadf2f95326f6344adacb70c880bf2ccff57b
change-id: 20230227-upstream-net-20230227-mptcp-fixes-cc78f3a2f5b2
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
This is the basic functionality for iommufd to support
iommufd_device_replace() and IOMMU_HWPT_ALLOC for physical devices.
iommufd_device_replace() allows changing the HWPT associated with the
device to a new IOAS or HWPT. Replace does this in way that failure leaves
things unchanged, and utilizes the iommu iommu_group_replace_domain() API
to allow the iommu driver to perform an optional non-disruptive change.
IOMMU_HWPT_ALLOC allows HWPTs to be explicitly allocated by the user and
used by attach or replace. At this point it isn't very useful since the
HWPT is the same as the automatically managed HWPT from the IOAS. However
a following series will allow userspace to customize the created HWPT.
The implementation is complicated because we have to introduce some
per-iommu_group memory in iommufd and redo how we think about multi-device
groups to be more explicit. This solves all the locking problems in the
prior attempts.
This series is infrastructure work for the following series which:
- Add replace for attach
- Expose replace through VFIO APIs
- Implement driver parameters for HWPT creation (nesting)
Once review of this is complete I will keep it on a side branch and
accumulate the following series when they are ready so we can have a
stable base and make more incremental progress. When we have all the parts
together to get a full implementation it can go to Linus.
I have this on github:
https://github.com/jgunthorpe/linux/commits/iommufd_hwpt
v2:
- Use WARN_ON for the igroup->group test and move that logic to a
function iommufd_group_try_get()
- Change igroup->devices to igroup->device list
Replace will need to iterate over all attached idevs
- Rename to iommufd_group_setup_msi()
- New patch to export iommu_get_resv_regions()
- New patch to use per-device reserved regions instead of per-group
regions
- Split out the reorganizing of iommufd_device_change_pt() from the
replace patch
- Replace uses the per-dev reserved regions
- Use stdev_id in a few more places in the selftest
- Fix error handling in IOMMU_HWPT_ALLOC
- Clarify comments
- Rebase on v6.3-rc1
v1: https://lore.kernel.org/all/0-v1-7612f88c19f5+2f21-iommufd_alloc_jgg@nvidia…
Cc: Nicolin Chen <nicolinc(a)nvidia.com>
Cc: Yi Liu <yi.l.liu(a)intel.com>
Cc: kvm(a)vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Jason Gunthorpe (15):
iommufd: Move isolated msi enforcement to iommufd_device_bind()
iommufd: Add iommufd_group
iommufd: Replace the hwpt->devices list with iommufd_group
iommu: Export iommu_get_resv_regions()
iommufd: Keep track of each device's reserved regions instead of
groups
iommufd: Use the iommufd_group to avoid duplicate MSI setup
iommufd: Make sw_msi_start a group global
iommufd: Move putting a hwpt to a helper function
iommufd: Add enforced_cache_coherency to iommufd_hw_pagetable_alloc()
iommufd: Reorganize iommufd_device_attach into
iommufd_device_change_pt
iommufd: Add iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
Nicolin Chen (2):
iommu: Introduce a new iommu_group_replace_domain() API
iommufd/selftest: Test iommufd_device_replace()
drivers/iommu/iommu-priv.h | 10 +
drivers/iommu/iommu.c | 41 +-
drivers/iommu/iommufd/device.c | 498 +++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 96 +++-
drivers/iommu/iommufd/io_pagetable.c | 25 +-
drivers/iommu/iommufd/iommufd_private.h | 51 +-
drivers/iommu/iommufd/iommufd_test.h | 6 +
drivers/iommu/iommufd/main.c | 17 +-
drivers/iommu/iommufd/selftest.c | 40 ++
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 26 +
tools/testing/selftests/iommu/iommufd.c | 64 ++-
.../selftests/iommu/iommufd_fail_nth.c | 52 +-
tools/testing/selftests/iommu/iommufd_utils.h | 61 ++-
14 files changed, 789 insertions(+), 199 deletions(-)
create mode 100644 drivers/iommu/iommu-priv.h
base-commit: 4ed4791afb34c61650b17407846174a72e4034f4
--
2.39.2
Add support for sockmap to vsock.
We're testing usage of vsock as a way to redirect guest-local UDS
requests to the host and this patch series greatly improves the
performance of such a setup.
Compared to copying packets via userspace, this improves throughput by
121% in basic testing.
Tested as follows.
Setup: guest unix dgram sender -> guest vsock redirector -> host vsock
server
Threads: 1
Payload: 64k
No sockmap:
- 76.3 MB/s
- The guest vsock redirector was
"socat VSOCK-CONNECT:2:1234 UNIX-RECV:/path/to/sock"
Using sockmap (this patch):
- 168.8 MB/s (+121%)
- The guest redirector was a simple sockmap echo server,
redirecting unix ingress to vsock 2:1234 egress.
- Same sender and server programs
*Note: these numbers are from RFC v1
Only the virtio transport has been tested. The loopback transport was
used in writing bpf/selftests, but not thoroughly tested otherwise.
This series requires the skb patch.
Changes in v3:
- vsock/bpf: Refactor wait logic in vsock_bpf_recvmsg() to avoid
backwards goto
- vsock/bpf: Check psock before acquiring slock
- vsock/bpf: Return bool instead of int of 0 or 1
- vsock/bpf: Wrap macro args __sk/__psock in parens
- vsock/bpf: Place comment trailer */ on separate line
Changes in v2:
- vsock/bpf: rename vsock_dgram_* -> vsock_*
- vsock/bpf: change sk_psock_{get,put} and {lock,release}_sock() order
to minimize slock hold time
- vsock/bpf: use "new style" wait
- vsock/bpf: fix bug in wait log
- vsock/bpf: add check that recvmsg sk_type is one dgram, seqpacket, or
stream. Return error if not one of the three.
- virtio/vsock: comment __skb_recv_datagram() usage
- virtio/vsock: do not init copied in read_skb()
- vsock/bpf: add ifdef guard around struct proto in dgram_recvmsg()
- selftests/bpf: add vsock loopback config for aarch64
- selftests/bpf: add vsock loopback config for s390x
- selftests/bpf: remove vsock device from vmtest.sh qemu machine
- selftests/bpf: remove CONFIG_VIRTIO_VSOCKETS=y from config.x86_64
- vsock/bpf: move transport-related (e.g., if (!vsk->transport)) checks
out of fast path
Signed-off-by: Bobby Eshleman <bobby.eshleman(a)bytedance.com>
---
Bobby Eshleman (3):
vsock: support sockmap
selftests/bpf: add vsock to vmtest.sh
selftests/bpf: Add a test case for vsock sockmap
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 1 +
include/net/af_vsock.h | 17 ++
net/vmw_vsock/Makefile | 1 +
net/vmw_vsock/af_vsock.c | 55 ++++++-
net/vmw_vsock/virtio_transport.c | 2 +
net/vmw_vsock/virtio_transport_common.c | 24 +++
net/vmw_vsock/vsock_bpf.c | 175 +++++++++++++++++++++
net/vmw_vsock/vsock_loopback.c | 2 +
tools/testing/selftests/bpf/config.aarch64 | 2 +
tools/testing/selftests/bpf/config.s390x | 3 +
tools/testing/selftests/bpf/config.x86_64 | 3 +
.../selftests/bpf/prog_tests/sockmap_listen.c | 163 +++++++++++++++++++
13 files changed, 443 insertions(+), 6 deletions(-)
---
base-commit: d83115ce337a632f996e44c9f9e18cadfcf5a094
change-id: 20230118-support-vsock-sockmap-connectible-2e1297d2111a
Best regards,
--
Bobby Eshleman <bobby.eshleman(a)bytedance.com>
---
Bobby Eshleman (3):
vsock: support sockmap
selftests/bpf: add vsock to vmtest.sh
selftests/bpf: add a test case for vsock sockmap
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 1 +
include/net/af_vsock.h | 17 ++
net/vmw_vsock/Makefile | 1 +
net/vmw_vsock/af_vsock.c | 55 ++++++-
net/vmw_vsock/virtio_transport.c | 2 +
net/vmw_vsock/virtio_transport_common.c | 25 +++
net/vmw_vsock/vsock_bpf.c | 174 +++++++++++++++++++++
net/vmw_vsock/vsock_loopback.c | 2 +
tools/testing/selftests/bpf/config.aarch64 | 2 +
tools/testing/selftests/bpf/config.s390x | 3 +
tools/testing/selftests/bpf/config.x86_64 | 3 +
.../selftests/bpf/prog_tests/sockmap_listen.c | 163 +++++++++++++++++++
13 files changed, 443 insertions(+), 6 deletions(-)
---
base-commit: c2ea552065e43d05bce240f53c3185fd3a066204
change-id: 20230227-vsock-sockmap-upstream-9d65c84174a2
Best regards,
--
Bobby Eshleman <bobby.eshleman(a)bytedance.com>
On Fri, 10 Mar 2023 at 07:25, Stephen Boyd <sboyd(a)kernel.org> wrote:
>
> Quoting David Gow (2023-03-02 23:15:31)
> > On Thu, 2 Mar 2023 at 09:38, Stephen Boyd <sboyd(a)kernel.org> wrote:
> > >
> > > Introduce KUnit resource wrappers around platform_driver_register(),
> > > platform_device_alloc(), and platform_device_add() so that test authors
> > > can register platform drivers/devices from their tests and have the
> > > drivers/devices automatically be unregistered when the test is done.
> > >
> > > This makes test setup code simpler when a platform driver or platform
> > > device is needed. Add a few test cases at the same time to make sure the
> > > APIs work as intended.
> > >
> > > Cc: Brendan Higgins <brendan.higgins(a)linux.dev>
> > > Cc: David Gow <davidgow(a)google.com>
> > > Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> > > Cc: "Rafael J. Wysocki" <rafael(a)kernel.org>
> > > Signed-off-by: Stephen Boyd <sboyd(a)kernel.org>
> > > ---
> > >
> > > Should this be moved to drivers/base/ and called platform_kunit.c?
> > > The include/kunit/platform_driver.h could also be
> > > kunit/platform_device.h to match linux/platform_device.h if that is more
> > > familiar.
> >
> > DRM has a similar thing already (albeit with a root_device, which is
> > more common with KUnit tests generally):
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/inc…
> >
> > But that's reasonably drm-specific, so it makes sense that it lives
> > with DRM stuff. platform_device is a bit more generic.
> >
> > I'd probably personally err on the side of having these in
> > drivers/base/, as I think we'll ultimately need similar things for a
> > lot of different devices, and I'd rather not end up with things like
> > USB device helpers living in the lib/kunit directory alongside the
> > "core" KUnit code. But I could be persuaded otherwise.
>
> Ok no problem. I'll move it.
>
> >
> > >
> > > And I'm not super certain about allocating a driver structure and
> > > embedding it in a wrapper struct. Maybe the code should just use
> > > kunit_get_current_test() instead?
> >
> > I think there are enough cases througout the kernel where
> > device/driver structs are needed that having this makes sense.
> > Combined with the fact that, while kunit_get_current_test() can be
> > used even when KUnit is not loaded, actually doing anything with the
> > resulting struct kunit pointer will probably require (at least for the
> > moment) KUnit functions to be reachable, so would break if
> > CONFIG_KUNIT=m.
>
> Wouldn't it still work in that case? The unit tests would be modular as
> well because they depend on CONFIG_KUNIT.
>
Yeah, the only case where this starts to get hairy is if the tests end
up in the same module as the thing being tested (which sometimes
happens to avoid having to export a bunch of symbols: see, e.g.
thunderbolt and amdgpu), and then someone wants to build production
kernels with CONFIG_KUNIT=m (alas, Red Hat and Android).
So that's the only real place where you might need to avoid the
non-'hook' KUnit functions, but those drivers are pretty few and far
between, and most of the really useful functionality should be moving
to 'hooks' which will be patched out cleanly at runtime.
> >
> > So, unless you actually find kunit_get_current_test() and friends to
> > be easier to work with, I'd probably stick with this.
> >
>
> Alright thanks.
>
> > > diff --git a/lib/kunit/platform_driver.c b/lib/kunit/platform_driver.c
> > > new file mode 100644
> > > index 000000000000..11d155114936
> > > --- /dev/null
> > > +++ b/lib/kunit/platform_driver.c
> > > @@ -0,0 +1,207 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * Test managed platform driver
> > > + */
> > > +
> > > +#include <linux/device/driver.h>
> > > +#include <linux/platform_device.h>
> > > +
> > > +#include <kunit/resource.h>
> > > +
> > > +struct kunit_platform_device_alloc_params {
> > > + const char *name;
> > > + int id;
> > > +};
> >
> > FYI: It's my plan to eventually get rid of (or at least de-emphasize)
> > the whole 'init' function aspect of KUnit resources so we don't need
> > all of these extra structs and the like. It probably won't make it in
> > for 6.4, but we'll see...
>
> Will we be able to get the error values out of the init function? It's
> annoying that the error values can't be returned as error pointers to
> kunit_alloc_resource(). I end up skipping init, and doing it directly
> before or after calling the kunit_alloc_resource() function. I'll try to
> avoid init functions in the allocations.
Yeah, that's largely why the plan is to get rid of them: it just made
passing things around an enormous pain.
Just doing your own initialisation before adding it as a resource is
usually the right thing to do.
There's also going to be a simpler kunit_defer() wrapper around it,
which would just allow you to schedule a cleanup function to be called
(without the need to keep kunit_resource pointers around, etc), for
the cases where you don't need to look up resources elsewhere.
But just doing your own thing and calling kunit_alloc_resource() is
probably best for now, and should map well onto whatever this ends up
evolving into.
Cheers,
-- David
On Fri, 10 Mar 2023 at 07:19, Stephen Boyd <sboyd(a)kernel.org> wrote:
>
> Quoting David Gow (2023-03-02 23:15:04)
> > On Thu, 2 Mar 2023 at 09:38, Stephen Boyd <sboyd(a)kernel.org> wrote:
> > >
> > > To fully exercise common clk framework code in KUnit we need to
> > > associate 'struct device' pointers with 'struct device_node' pointers so
> > > that things like clk_get() can parse DT nodes for 'clocks' and so that
> > > clk providers can use DT to provide clks; the most common mode of
> > > operation for clk providers.
> > >
> > > Adding support to KUnit so that it loads a DTB is fairly simple after
> > > commit b31297f04e86 ("um: Add devicetree support"). We can simply pass a
> > > pre-compiled deviectree blob (DTB) on the kunit.py commandline and UML
> > > will load it. The problem is that tests won't know that the commandline
> > > has been modified, nor that a DTB has been loaded. Take a different
> > > approach so that tests can skip if a DTB hasn't been loaded.
> > >
> > > Reuse the Makefile logic from the OF unittests to build a DTB into the
> > > kernel. This DTB will be for the mythical machine "linux,kunit", i.e.
> > > the devicetree for the KUnit "board". In practice, it is a dtsi file
> > > that will gather includes for kunit tests that rely in part on a
> > > devicetree being loaded. The devicetree should only be loaded if
> > > CONFIG_OF_KUNIT=y. Make that a choice config parallel to the existing
> > > CONFIG_OF_UNITTEST so that only one devicetree can be loaded in the
> > > system at a time. Similarly, the kernel commandline option to load a
> > > DTB is ignored if CONFIG_OF_KUNIT is enabled so that only one DTB is
> > > loaded at a time.
> >
> > This feels a little bit like it's just papering over the real problem,
> > which is that there's no way tests can skip themselves if no DTB is
> > loaded.
>
> Hmm. I think you're suggesting that the unit test data be loaded
> whenever CONFIG_OF=y and CONFIG_KUNIT=y. Then tests can check for
> CONFIG_OF and skip if it isn't enabled?
>
More of the opposite: that we should have some way of supporting tests
which might want to use a DTB other than the built-in one. Mostly for
non-UML situations where an actual devicetree is needed to even boot
far enough to get test output (so we wouldn't be able to override it
with a compiled-in test one).
I think moving to overlays probably will render this idea obsolete:
but the thought was to give test code a way to check for the required
devicetree nodes at runtime, and skip the test if they weren't found.
That way, the failure mode for trying to boot this on something which
required another device tree for, e.g., serial, would be "these tests
are skipped because the wrong device tree is loaded", not "I get no
output because serial isn't working".
Again, though, it's only really needed for non-UML, and just loading
overlays as needed should be much more sensible anyway.
> >
> > That being said, I do think that there's probably some sense in
> > supporting the compiled-in DTB as well (it's definitely simpler than
> > patching kunit.py to always pass the extra command-line option in, for
> > example).
> > But maybe it'd be nice to have the command-line option override the
> > built-in one if present.
>
> Got it. I need to test loading another DTB on the commandline still, but
> I think this won't be a problem. We'll load the unittest-data DTB even
> with KUnit on UML, so assuming that works on UML right now it should be
> unchanged by this series once I resend.
Again, moving to overlays should render this mostly obsolete, no? Or
am I misunderstanding how the overlay stuff will work?
One possible future advantage of being able to test with custom DTs at
boot time would be for fuzzing (provide random DT properties, see what
happens in the test). We've got some vague plans to support a way of
passing custom data to tests to support this kind of case (though, if
we're using overlays, maybe the test could just patch those if we
wanted to do that).
Cheers,
-- David
On Fri, 10 Mar 2023 at 07:12, Stephen Boyd <sboyd(a)kernel.org> wrote:
>
> Quoting David Gow (2023-03-02 23:14:55)
> > On Thu, 2 Mar 2023 at 09:38, Stephen Boyd <sboyd(a)kernel.org> wrote:
> > >
> > > Document the linux,kunit board compatible string. This board is loaded
> > > into the Linux kernel when KUnit is testing devicetree dependent code.
> >
> > As with the series as a whole, this might need to change a little bit
> > if we want to either use devicetree overlays and/or other
> > architectures.
> >
> > That being said, I'm okay with having this until then: the only real
> > topic for bikeshedding is the name.
> > - Is KUnit best as a board name, or part of the vendor name?
> > - Do we want to include the architecture in the name?
> > Should it be "linux,kunit", "linux-kunit,uml", "linux,kunit-uml", etc?
>
> I think I will drop this patch. I have overlays working. I hijacked
> of_core_init() to load the testcase data from drivers/of/unittest-data
> and made a container node for kunit overlays to apply to.
Makes sense to me, thanks!
Looking forward to seeing how the overlays work in practice!
This series, currently based on 6.3-rc1, is divided into two parts:
- Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the
main goal of improving consistency and reducing the number of function args.
- Commit 4 adds UFFDIO_CONTINUE_MODE_WP.
The refactors are sorted by increasing controversial-ness, the idea being we
could drop some of the refactors if they are deemed not worth it.
Changelog:
v3->v4:
- massage the uffd_flags_t implementation to eliminate all sparse warnings
- add a couple inline helpers to make uffd_flags_t usage easier
- drop the refactor passing `struct uffdio_range *` around (previously 4/5)
- define a temporary `struct mm_struct *` in function with >=3 `vma->vm_mm`
- consistent argument order between `flags` and `pagep`
- expand on the use case in patch 4/4 message
v2->v3:
- rebase onto 6.3-rc1
- typedef a new type for mfill flags in patch 3/5 (suggested by Nadav)
v1->v2:
- refactor before adding the new flag, to avoid perpetuating messiness
Axel Rasmussen (4):
mm: userfaultfd: rename functions for clarity + consistency
mm: userfaultfd: don't pass around both mm and vma
mm: userfaultfd: combine 'mode' and 'wp_copy' arguments
mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs
fs/userfaultfd.c | 29 ++--
include/linux/hugetlb.h | 27 ++--
include/linux/shmem_fs.h | 9 +-
include/linux/userfaultfd_k.h | 68 +++++----
include/uapi/linux/userfaultfd.h | 7 +
mm/hugetlb.c | 28 ++--
mm/shmem.c | 14 +-
mm/userfaultfd.c | 170 +++++++++++------------
tools/testing/selftests/mm/userfaultfd.c | 4 +
9 files changed, 187 insertions(+), 169 deletions(-)
--
2.40.0.rc1.284.g88254d51c5-goog
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An example for
this are programs with forked workloads using a garbage collected language without
pointers. In such a language madvise cannot be made available.
In addition the addresses of objects get moved around as they are garbage
collected. KSM sharing needs to be enabled "from the outside" for these type of
workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings no
benefit or even has overhead. We'd like to be able to enable KSM on a workload
by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the current
process: it is a workload-local decision. A considerable number of sharing
opportuniites may exist across multiple workloads or jobs. Only a higler level
entity like a job scheduler or container can know for certain if its running
one or more instances of a job. That job scheduler however doesn't have
the necessary internal worklaod knowledge to make targeted madvise calls.
Security concerns:
In previous discussions security concerns have been brought up. The problem is
that an individual workload does not have the knowledge about what else is
running on a machine. Therefore it has to be very conservative in what memory
areas can be shared or not. However, if the system is dedicated to running
multiple jobs within the same security domain, its the job scheduler that has
the knowledge that sharing can be safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first
one allows to enable KSM at the process level and the second one to query the
setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all
the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be inherited by
the new child process.
In addition when KSM is disabled for a process, KSM will be disabled for the
VMA's where KSM has been enabled.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation, but not
calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit and ksm type metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests
This adds an option to specify the merge type to the ksm_tests. This allows to
test madvise and prctl KSM. It also adds a new option to query if prctl KSM has
been enabled. It adds a fork test to verify that the KSM process setting is
inherited by client processes.
Changes:
- V3:
- folded patch 1 - 6
- folded patch 7 - 14
- folded patch 15 - 19
- Expanded on the use cases in the cover letter
- Added a section on security concerns to the cover letter
- V2:
- Added use cases to the cover letter
- Removed the tracing patch from the patch series and posted it as an
individual patch
- Refreshed repo
Stefan Roesch (3):
mm: add new api to enable ksm per process
mm: add new KSM process and sysfs knobs
selftests/mm: add new selftests for KSM
Documentation/ABI/testing/sysfs-kernel-mm-ksm | 8 +
Documentation/admin-guide/mm/ksm.rst | 8 +-
fs/proc/base.c | 5 +
include/linux/ksm.h | 19 +-
include/linux/sched/coredump.h | 1 +
include/uapi/linux/prctl.h | 2 +
kernel/sys.c | 29 ++
mm/ksm.c | 114 +++++++-
tools/include/uapi/linux/prctl.h | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/ksm_tests.c | 254 +++++++++++++++---
11 files changed, 389 insertions(+), 56 deletions(-)
base-commit: 234a68e24b120b98875a8b6e17a9dead277be16a
--
2.30.2
Fix bug in debugfs logs that causes individual parameterized results to not
appear because the log is reinitialized (cleared) when each parameter is
run.
Ensure these results appear in the debugfs logs, increase log size to
allow for the size of parameterized results. As a result, append lines to
the log directly rather than using an intermediate variable that can cause
stack size warnings due to the increased log size.
Here is the debugfs log of ext4_inode_test which uses parameterized tests
before the fix:
KTAP version 1
# Subtest: ext4_inode_test
1..1
# Totals: pass:16 fail:0 skip:0 total:16
ok 1 ext4_inode_test
As you can see, this log does not include any of the individual
parametrized results.
After (in combination with the next two fixes to remove extra empty line
and ensure KTAP valid format):
KTAP version 1
1..1
KTAP version 1
# Subtest: ext4_inode_test
1..1
KTAP version 1
# Subtest: inode_test_xtimestamp_decoding
ok 1 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits
... (the rest of the individual parameterized tests)
ok 16 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra
# inode_test_xtimestamp_decoding: pass:16 fail:0 skip:0 total:16
ok 1 inode_test_xtimestamp_decoding
# Totals: pass:16 fail:0 skip:0 total:16
ok 1 ext4_inode_test
Signed-off-by: Rae Moar <rmoar(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Changes from v3 -> v4:
- No changes.
Changes from v2 -> v3:
- Fix a off-by-one bug in the kunit_log_append method.
Changes from v1 -> v2:
- Remove the use of the line variable in kunit_log_append that was
causing stack size warnings.
- Add before and after to the commit message.
include/kunit/test.h | 2 +-
lib/kunit/test.c | 18 ++++++++++++------
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 08d3559dd703..0668d29f3453 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -34,7 +34,7 @@ DECLARE_STATIC_KEY_FALSE(kunit_running);
struct kunit;
/* Size of log associated with test. */
-#define KUNIT_LOG_SIZE 512
+#define KUNIT_LOG_SIZE 1500
/* Maximum size of parameter description string. */
#define KUNIT_PARAM_DESC_SIZE 128
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index c9e15bb60058..c4d6304edd61 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -114,22 +114,27 @@ static void kunit_print_test_stats(struct kunit *test,
*/
void kunit_log_append(char *log, const char *fmt, ...)
{
- char line[KUNIT_LOG_SIZE];
va_list args;
- int len_left;
+ int len, log_len, len_left;
if (!log)
return;
- len_left = KUNIT_LOG_SIZE - strlen(log) - 1;
+ log_len = strlen(log);
+ len_left = KUNIT_LOG_SIZE - log_len - 1;
if (len_left <= 0)
return;
+ /* Evaluate length of line to add to log */
va_start(args, fmt);
- vsnprintf(line, sizeof(line), fmt, args);
+ len = vsnprintf(NULL, 0, fmt, args) + 1;
+ va_end(args);
+
+ /* Print formatted line to the log */
+ va_start(args, fmt);
+ vsnprintf(log + log_len, min(len, len_left), fmt, args);
va_end(args);
- strncat(log, line, len_left);
}
EXPORT_SYMBOL_GPL(kunit_log_append);
@@ -437,7 +442,6 @@ static void kunit_run_case_catch_errors(struct kunit_suite *suite,
struct kunit_try_catch_context context;
struct kunit_try_catch *try_catch;
- kunit_init_test(test, test_case->name, test_case->log);
try_catch = &test->try_catch;
kunit_try_catch_init(try_catch,
@@ -533,6 +537,8 @@ int kunit_run_tests(struct kunit_suite *suite)
struct kunit_result_stats param_stats = { 0 };
test_case->status = KUNIT_SKIPPED;
+ kunit_init_test(&test, test_case->name, test_case->log);
+
if (!test_case->generate_params) {
/* Non-parameterised test. */
kunit_run_case_catch_errors(suite, test_case, &test);
base-commit: 60684c2bd35064043360e6f716d1b7c20e967b7d
--
2.40.0.rc0.216.gc4246ad0f0-goog
Shuah,
I'd like this to go through your tree as this is timeout related.
In order to help me help developers run tests against the components
I maintain much easily I have enabled selftests support on kdevops [0].
kdevops deals with abstractsions like letting you pick virtualization
or cloud solutions to run the tests using kconfig, installs all
dependencies for you, and with just a few make target commands can get
you the latest linux-next tested against selftests.
If other find this useful and would like support for their selftests on
kdevops feel free to send patches. Eventually the idea is to be able to
run as many selftests in parallel using different guests for each main
selftest to speed up tests.
Prior to this I used to run tests manually, now the selftests helpers
are used (./tools/testing/selftests/run_kselftest.sh -s) and with this
the default selftest timeout is hit. This just increases that for the few
selftests I help maintain where obviously its not enough anymore.
Note: on the firmware side I am spotting an OOM triggered by running
tests in a loop, so far I hit in the android configuration but its
not clear if the issue is just for that setup.
[0] https://github.com/linux-kdevops/kdevops
Luis Chamberlain (2):
selftests/kmod: increase the kmod timeout from 45 to 165
selftests/firmware: increase timeout from 165 to 230
tools/testing/selftests/firmware/settings | 8 +++++++-
tools/testing/selftests/kmod/settings | 4 ++++
2 files changed, 11 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/kmod/settings
--
2.39.1
This series implements selftests targeting the feature floated by Chao via:
https://lore.kernel.org/lkml/20221202061347.1070246-10-chao.p.peng@linux.in…
Below changes aim to test the fd based approach for guest private memory
in context of normal (non-confidential) VMs executing on non-confidential
platforms.
private_mem_test.c file adds selftest to access private memory from the
guest via private/shared accesses and checking if the contents can be
leaked to/accessed by vmm via shared memory view before/after conversions.
Updates in V2:
1) Simplified vcpu run loop implementation API
2) Removed VM creation logic from private mem library
Updates in V1 (Compared to RFC v3 patches):
1) Incorporated suggestions from Sean around simplifying KVM changes
2) Addressed comments from Sean
3) Added private mem test with shared memory backed by 2MB hugepages.
V1 series:
https://lore.kernel.org/lkml/20221111014244.1714148-1-vannapurve@google.com…
This series has dependency on following patches:
1) V10 series patches from Chao mentioned above.
Github link for the patches posted as part of this series:
https://github.com/vishals4gh/linux/commits/priv_memfd_selftests_v2
Vishal Annapurve (6):
KVM: x86: Add support for testing private memory
KVM: Selftests: Add support for private memory
KVM: selftests: x86: Add IS_ALIGNED/IS_PAGE_ALIGNED helpers
KVM: selftests: x86: Add helpers to execute VMs with private memory
KVM: selftests: Add get_free_huge_2m_pages
KVM: selftests: x86: Add selftest for private memory
arch/x86/kvm/mmu/mmu_internal.h | 6 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 2 +
.../selftests/kvm/include/kvm_util_base.h | 15 +-
.../testing/selftests/kvm/include/test_util.h | 5 +
.../kvm/include/x86_64/private_mem.h | 24 ++
.../selftests/kvm/include/x86_64/processor.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 58 ++++-
tools/testing/selftests/kvm/lib/test_util.c | 29 +++
.../selftests/kvm/lib/x86_64/private_mem.c | 139 ++++++++++++
.../selftests/kvm/x86_64/private_mem_test.c | 212 ++++++++++++++++++
virt/kvm/Kconfig | 4 +
virt/kvm/kvm_main.c | 3 +-
13 files changed, 490 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/kvm/include/x86_64/private_mem.h
create mode 100644 tools/testing/selftests/kvm/lib/x86_64/private_mem.c
create mode 100644 tools/testing/selftests/kvm/x86_64/private_mem_test.c
--
2.39.0.rc0.267.gcb52ba06e7-goog
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Commit 62622dab0a28 ("ima: return IMA digest value only when IMA_COLLECTED
flag is set") caused bpf_ima_inode_hash() to refuse to give non-fresh
digests. IMA test #3 assumed the old behavior, that bpf_ima_inode_hash()
still returned also non-fresh digests.
Correct the test by accepting both cases. If the samples returned are 1,
assume that the commit above is applied and that the returned digest is
fresh. If the samples returned are 2, assume that the commit above is not
applied, and check both the non-fresh and fresh digest.
Fixes: 62622dab0a28 ("ima: return IMA digest value only when IMA_COLLECTED flag is set")
Reported by: David Vernet <void(a)manifault.com>
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
---
.../selftests/bpf/prog_tests/test_ima.c | 29 ++++++++++++++-----
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/test_ima.c b/tools/testing/selftests/bpf/prog_tests/test_ima.c
index b13feceb38f..810b14981c2 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_ima.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_ima.c
@@ -70,7 +70,7 @@ void test_test_ima(void)
u64 bin_true_sample;
char cmd[256];
- int err, duration = 0;
+ int err, duration = 0, fresh_digest_idx = 0;
struct ima *skel = NULL;
skel = ima__open_and_load();
@@ -129,7 +129,15 @@ void test_test_ima(void)
/*
* Test #3
* - Goal: confirm that bpf_ima_inode_hash() returns a non-fresh digest
- * - Expected result: 2 samples (/bin/true: non-fresh, fresh)
+ * - Expected result:
+ * 1 sample (/bin/true: fresh) if commit 62622dab0a28 applied
+ * 2 samples (/bin/true: non-fresh, fresh) if commit 62622dab0a28 is
+ * not applied
+ *
+ * If commit 62622dab0a28 ("ima: return IMA digest value only when
+ * IMA_COLLECTED flag is set") is applied, bpf_ima_inode_hash() refuses
+ * to give a non-fresh digest, hence the correct result is 1 instead of
+ * 2.
*/
test_init(skel->bss);
@@ -144,13 +152,18 @@ void test_test_ima(void)
goto close_clean;
err = ring_buffer__consume(ringbuf);
- ASSERT_EQ(err, 2, "num_samples_or_err");
- ASSERT_NEQ(ima_hash_from_bpf[0], 0, "ima_hash");
- ASSERT_NEQ(ima_hash_from_bpf[1], 0, "ima_hash");
- ASSERT_EQ(ima_hash_from_bpf[0], bin_true_sample, "sample_equal_or_err");
+ ASSERT_GE(err, 1, "num_samples_or_err");
+ if (err == 2) {
+ ASSERT_NEQ(ima_hash_from_bpf[0], 0, "ima_hash");
+ ASSERT_EQ(ima_hash_from_bpf[0], bin_true_sample,
+ "sample_equal_or_err");
+ fresh_digest_idx = 1;
+ }
+
+ ASSERT_NEQ(ima_hash_from_bpf[fresh_digest_idx], 0, "ima_hash");
/* IMA refreshed the digest. */
- ASSERT_NEQ(ima_hash_from_bpf[1], bin_true_sample,
- "sample_different_or_err");
+ ASSERT_NEQ(ima_hash_from_bpf[fresh_digest_idx], bin_true_sample,
+ "sample_equal_or_err");
/*
* Test #4
--
2.25.1
This series, currently based on 6.3-rc1, is divided into two parts:
- Commits 1-4 refactor userfaultfd ioctl code without behavior changes, with the
main goal of improving consistency and reducing the number of function args.
- Commit 5 adds UFFDIO_CONTINUE_MODE_WP.
The refactors are sorted by increasing controversial-ness, the idea being we
could drop some of the refactors if they are deemed not worth it.
Changelog:
v2->v3:
- rebase onto 6.3-rc1
- typedef a new type for mfill flags in patch 3/5 (suggested by Nadav)
v1->v2:
- refactor before adding the new flag, to avoid perpetuating messiness
Axel Rasmussen (5):
mm: userfaultfd: rename functions for clarity + consistency
mm: userfaultfd: don't pass around both mm and vma
mm: userfaultfd: combine 'mode' and 'wp_copy' arguments
mm: userfaultfd: don't separate addr + len arguments
mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs
fs/userfaultfd.c | 120 +++++-------
include/linux/hugetlb.h | 27 ++-
include/linux/shmem_fs.h | 9 +-
include/linux/userfaultfd_k.h | 61 +++---
include/uapi/linux/userfaultfd.h | 7 +
mm/hugetlb.c | 34 ++--
mm/shmem.c | 14 +-
mm/userfaultfd.c | 235 +++++++++++------------
tools/testing/selftests/mm/userfaultfd.c | 4 +
9 files changed, 247 insertions(+), 264 deletions(-)
--
2.40.0.rc0.216.gc4246ad0f0-goog
During the course of implementing FEAT_LPA2 within the arm64 KVM port, I found a
couple of issues within the KVM selftest code, which I thought were worth
posting independently. The LPA2 patches, for which I will post v2 in the next
few days, depend on these fixes for its testing.
Ryan Roberts (2):
KVM: selftests: Fixup config fragment for access_tracking_perf_test
KVM: selftests: arm64: Fix pte encode/decode for PA bits > 48
tools/testing/selftests/kvm/config | 1 +
.../selftests/kvm/lib/aarch64/processor.c | 32 ++++++++++++++-----
2 files changed, 25 insertions(+), 8 deletions(-)
--
2.25.1
This patchset updates __reg_combine_64_into_32 function to set 32-bit bounds
when lower 32-bit value is not wrapping, and add cases to for it.
Xu Kuohai (2):
bpf: update 32-bit bounds when the lower 32-bit value is not wrapping
selftests/bpf: check bounds not in the 32-bit range
kernel/bpf/verifier.c | 27 ++--
tools/testing/selftests/bpf/verifier/bounds.c | 121 ++++++++++++++++++
2 files changed, 132 insertions(+), 16 deletions(-)
--
2.30.2
These patches help make the logs a bit more friendly to work with by
adding human readable names for cards and controls alongside the numbers
assigned to them even when things are working well.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
kselftest/alsa - mixer: Always log control names
kselftest/alsa: Log card names during startup
tools/testing/selftests/alsa/mixer-test.c | 13 +++++++++++++
tools/testing/selftests/alsa/pcm-test.c | 10 ++++++++++
2 files changed, 23 insertions(+)
---
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
change-id: 20230223-alsa-log-ctl-name-fb07f30d7217
Best regards,
--
Mark Brown <broonie(a)kernel.org>
If a control has an invalid default value then we might fail to set it
when restoring the default value after our write tests, for example due to
correctly implemented range checks in put() operations. Currently this
causes us to report the tests we were running as failed even when the
operation we were trying to test is successful, making it look like there
are problems where none really exist. Stop doing this, only reporting any
issues during the actual test.
We already have validation for the initial readback being in spec and for
writing the default value back so failed tests will be reported for these
controls, and we log an error on the operation that failed when we write so
there will be a diagnostic warning the user that there is a problem.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/alsa/mixer-test.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/alsa/mixer-test.c b/tools/testing/selftests/alsa/mixer-test.c
index 05f1749ae19d..ac5efa42d488 100644
--- a/tools/testing/selftests/alsa/mixer-test.c
+++ b/tools/testing/selftests/alsa/mixer-test.c
@@ -755,7 +755,6 @@ static bool test_ctl_write_valid_enumerated(struct ctl_data *ctl)
static void test_ctl_write_valid(struct ctl_data *ctl)
{
bool pass;
- int err;
/* If the control is turned off let's be polite */
if (snd_ctl_elem_info_is_inactive(ctl->info)) {
@@ -797,9 +796,7 @@ static void test_ctl_write_valid(struct ctl_data *ctl)
}
/* Restore the default value to minimise disruption */
- err = write_and_verify(ctl, ctl->def_val, NULL);
- if (err < 0)
- pass = false;
+ write_and_verify(ctl, ctl->def_val, NULL);
ksft_test_result(pass, "write_valid.%d.%d\n",
ctl->card->card, ctl->elem);
@@ -1015,9 +1012,7 @@ static void test_ctl_write_invalid(struct ctl_data *ctl)
}
/* Restore the default value to minimise disruption */
- err = write_and_verify(ctl, ctl->def_val, NULL);
- if (err < 0)
- pass = false;
+ write_and_verify(ctl, ctl->def_val, NULL);
ksft_test_result(pass, "write_invalid.%d.%d\n",
ctl->card->card, ctl->elem);
---
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
change-id: 20230224-alsa-mixer-test-restore-invalid-2a57b98aeb7f
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Fix bug in debugfs logs that causes individual parameterized results to not
appear because the log is reinitialized (cleared) when each parameter is
run.
Ensure these results appear in the debugfs logs, increase log size to
allow for the size of parameterized results. As a result, append lines to
the log directly rather than using an intermediate variable that can cause
stack size warnings due to the increased log size.
Here is the debugfs log of ext4_inode_test which uses parameterized tests
before the fix:
KTAP version 1
# Subtest: ext4_inode_test
1..1
# Totals: pass:16 fail:0 skip:0 total:16
ok 1 ext4_inode_test
As you can see, this log does not include any of the individual
parametrized results.
After (in combination with the next two fixes to remove extra empty line
and ensure KTAP valid format):
KTAP version 1
1..1
KTAP version 1
# Subtest: ext4_inode_test
1..1
KTAP version 1
# Subtest: inode_test_xtimestamp_decoding
ok 1 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits
... (the rest of the individual parameterized tests)
ok 16 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra
# inode_test_xtimestamp_decoding: pass:16 fail:0 skip:0 total:16
ok 1 inode_test_xtimestamp_decoding
# Totals: pass:16 fail:0 skip:0 total:16
ok 1 ext4_inode_test
Signed-off-by: Rae Moar <rmoar(a)google.com>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Changes from v2 -> v3:
- Fix a off-by-one bug in the kunit_log_append method.
Changes from v1 -> v2:
- Remove the use of the line variable in kunit_log_append that was causing
stack size warnings.
- Add before and after to the commit message.
include/kunit/test.h | 2 +-
lib/kunit/test.c | 18 ++++++++++++------
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 08d3559dd703..0668d29f3453 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -34,7 +34,7 @@ DECLARE_STATIC_KEY_FALSE(kunit_running);
struct kunit;
/* Size of log associated with test. */
-#define KUNIT_LOG_SIZE 512
+#define KUNIT_LOG_SIZE 1500
/* Maximum size of parameter description string. */
#define KUNIT_PARAM_DESC_SIZE 128
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index c9e15bb60058..c4d6304edd61 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -114,22 +114,27 @@ static void kunit_print_test_stats(struct kunit *test,
*/
void kunit_log_append(char *log, const char *fmt, ...)
{
- char line[KUNIT_LOG_SIZE];
va_list args;
- int len_left;
+ int len, log_len, len_left;
if (!log)
return;
- len_left = KUNIT_LOG_SIZE - strlen(log) - 1;
+ log_len = strlen(log);
+ len_left = KUNIT_LOG_SIZE - log_len - 1;
if (len_left <= 0)
return;
+ /* Evaluate length of line to add to log */
va_start(args, fmt);
- vsnprintf(line, sizeof(line), fmt, args);
+ len = vsnprintf(NULL, 0, fmt, args) + 1;
+ va_end(args);
+
+ /* Print formatted line to the log */
+ va_start(args, fmt);
+ vsnprintf(log + log_len, min(len, len_left), fmt, args);
va_end(args);
- strncat(log, line, len_left);
}
EXPORT_SYMBOL_GPL(kunit_log_append);
@@ -437,7 +442,6 @@ static void kunit_run_case_catch_errors(struct kunit_suite *suite,
struct kunit_try_catch_context context;
struct kunit_try_catch *try_catch;
- kunit_init_test(test, test_case->name, test_case->log);
try_catch = &test->try_catch;
kunit_try_catch_init(try_catch,
@@ -533,6 +537,8 @@ int kunit_run_tests(struct kunit_suite *suite)
struct kunit_result_stats param_stats = { 0 };
test_case->status = KUNIT_SKIPPED;
+ kunit_init_test(&test, test_case->name, test_case->log);
+
if (!test_case->generate_params) {
/* Non-parameterised test. */
kunit_run_case_catch_errors(suite, test_case, &test);
base-commit: 60684c2bd35064043360e6f716d1b7c20e967b7d
--
2.40.0.rc0.216.gc4246ad0f0-goog
On Mon, Feb 27, 2023 at 5:57 PM Daniel Xu <dxu(a)dxuuu.xyz> wrote:
>
> Hi Alexei,
>
> On Mon, Feb 27, 2023 at 03:03:38PM -0800, Alexei Starovoitov wrote:
> > On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> > > === Context ===
> > >
> > > In the context of a middlebox, fragmented packets are tricky to handle.
> > > The full 5-tuple of a packet is often only available in the first
> > > fragment which makes enforcing consistent policy difficult. There are
> > > really only two stateless options, neither of which are very nice:
> > >
> > > 1. Enforce policy on first fragment and accept all subsequent fragments.
> > > This works but may let in certain attacks or allow data exfiltration.
> > >
> > > 2. Enforce policy on first fragment and drop all subsequent fragments.
> > > This does not really work b/c some protocols may rely on
> > > fragmentation. For example, DNS may rely on oversized UDP packets for
> > > large responses.
> > >
> > > So stateful tracking is the only sane option. RFC 8900 [0] calls this
> > > out as well in section 6.3:
> > >
> > > Middleboxes [...] should process IP fragments in a manner that is
> > > consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
> > > must maintain state in order to achieve this goal.
> > >
> > > === BPF related bits ===
> > >
> > > However, when policy is enforced through BPF, the prog is run before the
> > > kernel reassembles fragmented packets. This leaves BPF developers in a
> > > awkward place: implement reassembly (possibly poorly) or use a stateless
> > > method as described above.
> > >
> > > Fortunately, the kernel has robust support for fragmented IP packets.
> > > This patchset wraps the existing defragmentation facilities in kfuncs so
> > > that BPF progs running on middleboxes can reassemble fragmented packets
> > > before applying policy.
> > >
> > > === Patchset details ===
> > >
> > > This patchset is (hopefully) relatively straightforward from BPF perspective.
> > > One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> > > did this to maintain the invariant that the ctx remains valid after prog
> > > has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> > > consume the skb if the skb is a fragment.
> >
> > Instead of doing all that with extra skb copy can you hook bpf prog after
> > the networking stack already handled ip defrag?
> > What kind of middle box are you doing? Why does it have to run at TC layer?
>
> Unless I'm missing something, the only other relevant hooks would be
> socket hooks, right?
>
> Unfortunately I don't think my use case can do that. We are running the
> kernel as a router, so no sockets are involved.
Are you using bpf_fib_lookup and populating kernel routing
table and doing everything on your own including neigh ?
Have you considered to skb redirect to another netdev that does ip defrag?
Like macvlan does it under some conditions. This can be generalized.
Recently Florian proposed to allow calling bpf progs from all existing
netfilter hooks.
You can pretend to local deliver and hook in NF_INET_LOCAL_IN ?
I feel it would be so much cleaner if stack does ip_defrag normally.
The general issue of skb ownership between bpf prog and defrag logic
isn't really solved with skb_copy. It's still an issue.