*Changes in v21*
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that have been written to in a region of virtual memory.
This syscall is used by Windows applications and games. It is currently
emulated in userspace in a rather slow manner. Our purpose is to
enhance the kernel so that it can be emulated efficiently.
Currently some out-of-tree hack patches are being used to emulate it
efficiently in some kernels. We intend to replace those with these patches,
so that gaming on Linux as a whole can benefit. It means there would be
many users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But when using the soft-dirty flag, we sometimes get extra
pages which weren't even written. They had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening until
[6] got introduced. We discussed whether we could revert these patches, but
could not reach any conclusion. So at this point, I made a couple of attempts
to solve this whole VM_SOFTDIRTY issue by correcting the soft-dirty
implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we abandoned soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is required
to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on using,
we use the written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- The kernel has no operation that atomically gets and clears the
soft-dirty/written-to status.
- The pages which have been written-to cannot be found accurately.
(The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
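For reference, the existing procfs interface mentioned above can be sketched
roughly as follows. This is only an illustration, not code from the series.
It assumes the documented pagemap layout (one 64-bit entry per virtual page,
bit 55 being the soft-dirty bit), that writing "4" to /proc/PID/clear_refs
clears the soft-dirty bits of the whole process, and a kernel built with
CONFIG_MEM_SOFT_DIRTY. All helper names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty bit (see pagemap.rst). */
#define PM_SOFT_DIRTY_BIT 55

/* Byte offset of the 64-bit pagemap entry for the page containing addr. */
uint64_t pagemap_offset(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0);
	return (addr / page_size) * sizeof(uint64_t);
}

/* Extract the soft-dirty bit from a raw pagemap entry. */
int entry_is_soft_dirty(uint64_t entry)
{
	return (int)((entry >> PM_SOFT_DIRTY_BIT) & 1);
}

/*
 * Clear the soft-dirty bits of ALL pages of the calling process.
 * There is no way to limit this to a region, which is the limitation
 * discussed above. Returns 0 on success, -1 on failure.
 */
int clear_soft_dirty_all(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/* Returns 1 if the page at addr is soft-dirty, 0 if not, -1 on error. */
int page_is_soft_dirty(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry, off;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	off = pagemap_offset((uint64_t)(uintptr_t)addr, (uint64_t)page_size);
	if (lseek(fd, (off_t)off, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	n = read(fd, &entry, sizeof(entry));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return entry_is_soft_dirty(entry);
}
```

Note that the read and the clear are two separate steps, so a write that
lands between them is lost; this is the gap the atomic get + clear
operation is meant to close.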
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and wrong, as it reports more soft-dirty
pages than have actually been written. There is no interest in
correcting it [2][3] as this is how the feature was written years ago,
and its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The required mask,
any mask, excluded mask and return mask are also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only wants
to get a specific number of pages. So there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than the
vec_size. This restriction is needed to handle the worst case, when each
page_region only contains info of one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
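The mask filtering, the compaction into regions and the max_pages cut-off
described above can be illustrated with a small userspace model. This is not
the kernel implementation: struct region, struct masks, the flag values and
the exact matching rule are assumptions made for the sketch, and the real
layout is the one defined by the uapi header in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; the values are assumptions, not the uapi ones. */
#define PG_WRITTEN 0x1u
#define PG_FILE    0x2u
#define PG_PRESENT 0x4u
#define PG_SWAPPED 0x8u

struct region {			/* hypothetical stand-in for struct page_region */
	size_t start;		/* index of the first page in the region */
	size_t len;		/* number of pages in the region */
	uint32_t flags;		/* reported flags, same for every page in it */
};

struct masks {			/* hypothetical grouping of the four masks */
	uint32_t required, anyof, excluded, ret;
};

/* Assumed rule: all required bits, at least one anyof bit, no excluded bit. */
static int page_matches(uint32_t flags, const struct masks *m)
{
	if ((flags & m->required) != m->required)
		return 0;
	if (m->anyof && !(flags & m->anyof))
		return 0;
	if (flags & m->excluded)
		return 0;
	return 1;
}

/*
 * Walk npages per-page flag words and emit compacted regions into vec.
 * Adjacent matching pages with identical reported flags share one region.
 * Stops when vec is full or max_pages pages were reported (0 = no limit).
 * Returns the number of regions written.
 */
size_t scan_pages(const uint32_t *page_flags, size_t npages,
		  const struct masks *m, struct region *vec, size_t vec_len,
		  size_t max_pages)
{
	size_t nregions = 0, found = 0;

	assert(page_flags && m && vec);
	for (size_t i = 0; i < npages; i++) {
		struct region *last = nregions ? &vec[nregions - 1] : NULL;
		uint32_t out;

		if (!page_matches(page_flags[i], m))
			continue;
		if (max_pages && found == max_pages)
			break;
		out = page_flags[i] & m->ret;
		if (last && last->start + last->len == i && last->flags == out) {
			last->len++;		/* extend the current region */
		} else {
			if (nregions == vec_len)
				break;		/* output vector is full */
			vec[nregions++] = (struct region){ i, 1, out };
		}
		found++;
	}
	return nregions;
}
```

In the worst case no two matching pages are adjacent, so every region holds
exactly one page and the output vector fills up as fast as pages are found.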
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 560 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 54 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 54 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2329 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Hi Linus,
Please pull the following Kselftest update for Linux 6.5-rc1.
This kselftest update for Linux 6.5-rc1 consists of:
- change to allow runners to override the timeout
This change is made to avoid future increases of long
timeouts
- several other spelling fixes and cleanups
- a new subtest to video_device_test
- enhancements to test coverage in clone3 test
- other fixes to ftrace and cpufreq tests
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 858fd168a95c5b9669aac8db6c14a9aeab446375:
Linux 6.4-rc6 (2023-06-11 14:35:30 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-next-6.5-rc1
for you to fetch changes up to 8cd0d8633e2de4e6dd9ddae7980432e726220fdb:
selftests/ftace: Fix KTAP output ordering (2023-06-12 16:40:22 -0600)
----------------------------------------------------------------
linux-kselftest-next-6.5-rc1
This kselftest update for Linux 6.5-rc1 consists of:
- change to allow runners to override the timeout
This change is made to avoid future increases of long
timeouts
- several other spelling fixes and cleanups
- a new subtest to video_device_test
- enhancements to test coverage in clone3 test
- other fixes to ftrace and cpufreq tests
----------------------------------------------------------------
Akanksha J N (1):
selftests/ftrace: Add new test case which checks for optimized probes
Colin Ian King (2):
selftests: prctl: Fix spelling mistake "anonynous" -> "anonymous"
kselftest: vDSO: Fix accumulation of uninitialized ret when CLOCK_REALTIME is undefined
Ivan Orlov (1):
selftests: media_tests: Add new subtest to video_device_test
Luis Chamberlain (1):
selftests: allow runners to override the timeout
Mark Brown (2):
selftests/cpufreq: Don't enable generic lock debugging options
selftests/ftace: Fix KTAP output ordering
Rishabh Bhatnagar (1):
kselftests: Sort the collections list to avoid duplicate tests
Tobias Klauser (1):
selftests/clone3: test clone3 with exit signal in flags
Ziqi Zhao (1):
selftest: pidfd: Omit long and repeating outputs
Documentation/dev-tools/kselftest.rst | 22 ++++
tools/testing/selftests/clone3/clone3.c | 5 +-
tools/testing/selftests/cpufreq/config | 8 --
tools/testing/selftests/ftrace/ftracetest | 2 +-
.../ftrace/test.d/kprobe/kprobe_opt_types.tc | 34 +++++++
tools/testing/selftests/kselftest/runner.sh | 11 +-
.../selftests/media_tests/video_device_test.c | 111 +++++++++++++++------
tools/testing/selftests/pidfd/pidfd.h | 1 -
tools/testing/selftests/pidfd/pidfd_fdinfo_test.c | 1 +
tools/testing/selftests/pidfd/pidfd_test.c | 3 +-
.../selftests/prctl/set-anon-vma-name-test.c | 2 +-
tools/testing/selftests/run_kselftest.sh | 7 +-
.../selftests/vDSO/vdso_test_clock_getres.c | 4 +-
13 files changed, 166 insertions(+), 45 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_opt_types.tc
----------------------------------------------------------------
Make sv39 the default address space for mmap as some applications
currently depend on this assumption. The RISC-V specification enforces
that bits outside of the virtual address range are not used, so
restricting the size of the default address space as such should be
temporary. A hint address passed to mmap will cause the largest address
space that fits entirely into the hint to be used. If the hint is less
than or equal to 1<<38, a 39-bit address will be used. After an address
space is completely full, the next smallest address space will be used.
Documentation is also added to the RISC-V virtual memory section to explain
these changes.
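The hint rule can be read as the following selection logic. This is a sketch
of the description above, not the code from the patches. It assumes that the
user half of an svN address space is 1 << (N - 1) bytes and that "fits
entirely into the hint" means that this size is less than or equal to the
hint. The function name is hypothetical, and the fallback to a smaller
address space once the chosen one is full is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the user half of each address space: 1 << (N - 1) for svN. */
#define USER_SV39 (UINT64_C(1) << 38)
#define USER_SV48 (UINT64_C(1) << 47)
#define USER_SV57 (UINT64_C(1) << 56)

/*
 * Pick the number of virtual address bits used for an mmap() call.
 * hint:        the address passed to mmap(), 0 when no hint is given
 * max_va_bits: largest mode the hardware supports (39, 48 or 57)
 * Without a large enough hint the sv39 default is used.
 */
int va_bits_for_hint(uint64_t hint, int max_va_bits)
{
	assert(max_va_bits == 39 || max_va_bits == 48 || max_va_bits == 57);

	if (hint >= USER_SV57 && max_va_bits >= 57)
		return 57;
	if (hint >= USER_SV48 && max_va_bits >= 48)
		return 48;
	return 39;
}
```

Under these assumptions an application that wants addresses beyond 39 bits
has to ask for them explicitly, which keeps software that assumes sv39
working on sv48 and sv57 hardware.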
Charlie Jenkins (2):
RISC-V: mm: Restrict address space for sv39,sv48,sv57
RISC-V: mm: Update documentation and include test
Documentation/riscv/vm-layout.rst | 20 ++++++++
arch/riscv/include/asm/elf.h | 2 +-
arch/riscv/include/asm/pgtable.h | 21 ++++++--
arch/riscv/include/asm/processor.h | 41 +++++++++++++---
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/mm/Makefile | 22 +++++++++
.../selftests/riscv/mm/testcases/mmap.c | 49 +++++++++++++++++++
7 files changed, 144 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/riscv/mm/Makefile
create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c
base-commit: eef509789cecdce895020682192d32e8bac790e8
--
2.34.1
Hello!
Here is v4 of the mremap start address optimization / fix for exec warning. It
took me a while to write a test that catches the issue Linus and I discussed in
the last version, and I verified that the kernel crashes without the check. See
below.
The main change in this series is:
Care is taken when a move happens purely within a VMA; in other words, this
check in call_align_down():
	if (vma->vm_start != addr_masked)
		return false;
As an example of why this is needed:
Consider the following range which is 2MB aligned and is
a part of a larger 10MB range which is not shown. Each
character is 256KB below making the source and destination
2MB each. The lower case letters are moved (s to d) and the
upper case letters are not moved.
|DDDDddddSSSSssss|
If we align down 'ssss' to start from the 'SSSS', we will end up destroying
SSSS. The above if statement prevents that and I verified it.
I also added a test for this in the last patch.
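As a standalone illustration of the quoted condition, the check can be
modelled in userspace as below. This is a sketch and not the kernel function:
the struct, the names and the fixed 2MB PMD size are assumptions, and only
the one condition quoted above is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE_SKETCH (UINT64_C(2) << 20)	/* 2MB, as in the example */
#define PMD_MASK_SKETCH (~(PMD_SIZE_SKETCH - 1))

struct vma_sketch {		/* hypothetical stand-in for the VMA */
	uint64_t vm_start;	/* first address of the mapping */
	uint64_t vm_end;	/* first address past the mapping */
};

/*
 * Return true if addr may be aligned down to its PMD boundary, i.e. if
 * the mapping starts exactly at that boundary. Otherwise the range
 * between the boundary and addr holds data that must not be moved.
 */
bool can_align_down_sketch(const struct vma_sketch *vma, uint64_t addr)
{
	uint64_t addr_masked = addr & PMD_MASK_SKETCH;

	assert(vma && addr >= vma->vm_start && addr < vma->vm_end);
	if (vma->vm_start != addr_masked)
		return false;
	return true;
}
```

For the 10MB range of the example, assumed here to start at address 0, the
boundary below 'ssss' (2MB) is not the start of the mapping, so the sketch
refuses to align down and SSSS is left alone.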
History of patches
==================
v3->v4:
1. Make sure to check address to align is beginning of VMA
2. Add test to check this (test fails with a kernel crash if we don't do this).
v2->v3:
1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
2. We now handle moves happening purely within a VMA, a new test is added to handle this.
3. More code comments.
v1->v2:
1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
that it works correctly.
2. Fix issue with bogus return value found by Linus if we broke out of the
above loop for the first PMD itself.
v1: Initial RFC.
Description of patches
======================
These patches optimize the start addresses in move_page_tables() and test the
changes. It addresses a warning [1] that occurs due to a downward, overlapping
move on a mutually-aligned offset within a PMD during exec. By initiating the
copy process at the PMD level when such alignment is present, we can prevent
this warning and speed up the copying process at the same time. Linus Torvalds
suggested this idea.
Please check the individual patches for more details.
thanks,
- Joel
[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
Joel Fernandes (Google) (7):
mm/mremap: Optimize the start addresses in move_page_tables()
mm/mremap: Allow moves within the same VMA for stack
selftests: mm: Fix failure case when new remap region was not found
selftests: mm: Add a test for mutually aligned moves > PMD size
selftests: mm: Add a test for remapping to area immediately after
existing mapping
selftests: mm: Add a test for remapping within a range
selftests: mm: Add a test for moving from an offset from start of
mapping
fs/exec.c | 2 +-
include/linux/mm.h | 2 +-
mm/mremap.c | 63 ++++-
tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
4 files changed, 319 insertions(+), 49 deletions(-)
--
2.41.0.rc2.161.g9c6817b8e7-goog
Hi Linus,
Please pull the following KUnit next update for Linux 6.5-rc1.
This KUnit update for Linux 6.5-rc1 consists of:
- kunit_add_action() API to defer a call until test exit.
- Update document to add kunit_add_action() usage notes.
- Changes to always run cleanup from a test kthread.
- Documentation updates to clarify cleanup usage
- assertions should not be used in cleanup
- Documentation update to clearly indicate that exit
functions should run even if init fails
- Several fixes and enhancements to existing tests.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:
Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-kunit-6.5-rc1
for you to fetch changes up to 2e66833579ed759d7b7da1a8f07eb727ec6e80db:
MAINTAINERS: Add source tree entry for kunit (2023-06-15 09:16:01 -0600)
----------------------------------------------------------------
linux-kselftest-kunit-6.5-rc1
This KUnit update for Linux 6.5-rc1 consists of:
- kunit_add_action() API to defer a call until test exit.
- Update document to add kunit_add_action() usage notes.
- Changes to always run cleanup from a test kthread.
- Documentation updates to clarify cleanup usage
- assertions should not be used in cleanup
- Documentation update to clearly indicate that exit
functions should run even if init fails
- Several fixes and enhancements to existing tests.
----------------------------------------------------------------
Daniel Latypov (1):
kunit: tool: undo type subscripts for subprocess.Popen
David Gow (11):
kunit: Always run cleanup from a test kthread
Documentation: kunit: Note that assertions should not be used in cleanup
Documentation: kunit: Warn that exit functions run even if init fails
kunit: example: Provide example exit functions
kunit: Add kunit_add_action() to defer a call until test exit
kunit: executor_test: Use kunit_add_action()
kunit: kmalloc_array: Use kunit_add_action()
Documentation: kunit: Add usage notes for kunit_add_action()
kunit: Fix obsolete name in documentation headers (func->action)
kunit: Move kunit_abort() call out of kunit_do_failed_assertion()
Documentation: kunit: Rename references to kunit_abort()
Geert Uytterhoeven (1):
Documentation: kunit: Modular tests should not depend on KUNIT=y
Michal Wajdeczko (3):
kunit/test: Add example test showing parameterized testing
kunit: Fix reporting of the skipped parameterized tests
kunit: Update kunit_print_ok_not_ok function
SeongJae Park (1):
MAINTAINERS: Add source tree entry for kunit
Takashi Sakamoto (1):
Documentation: Kunit: add MODULE_LICENSE to sample code
Documentation/dev-tools/kunit/architecture.rst | 4 +-
Documentation/dev-tools/kunit/start.rst | 7 +-
Documentation/dev-tools/kunit/usage.rst | 69 ++++++++++-
MAINTAINERS | 2 +
include/kunit/resource.h | 92 +++++++++++++++
include/kunit/test.h | 34 ++++--
lib/kunit/executor_test.c | 11 +-
lib/kunit/kunit-example-test.c | 56 +++++++++
lib/kunit/kunit-test.c | 88 +++++++++++++-
lib/kunit/resource.c | 99 ++++++++++++++++
lib/kunit/test.c | 157 ++++++++++++++-----------
tools/testing/kunit/kunit_kernel.py | 6 +-
tools/testing/kunit/mypy.ini | 6 +
tools/testing/kunit/run_checks.py | 2 +-
14 files changed, 538 insertions(+), 95 deletions(-)
create mode 100644 tools/testing/kunit/mypy.ini
----------------------------------------------------------------
Hi Shuah,
This series contains updates to the rseq selftests.
* A typo in the Makefile prevents basic_percpu_ops_mm_cid_test from using
the mm_cid field.
* Fix load-acquire/store-release macros which were buggy on arm64.
(this depends on commit "Implement rseq_unqual_scalar_typeof").
* The change "Use rseq_unqual_scalar_typeof in macros" is not a fix
per se, but improves the generated assembly.
Can you pick these up in the selftests tree please?
Thanks,
Mathieu
Mathieu Desnoyers (4):
selftests/rseq: Fix CID_ID typo in Makefile
selftests/rseq: Implement rseq_unqual_scalar_typeof
selftests/rseq: Fix arm64 buggy load-acquire/store-release macros
selftests/rseq: Use rseq_unqual_scalar_typeof in macros
tools/testing/selftests/rseq/Makefile | 2 +-
tools/testing/selftests/rseq/compiler.h | 26 ++++++++++
tools/testing/selftests/rseq/rseq-arm.h | 4 +-
tools/testing/selftests/rseq/rseq-arm64.h | 58 ++++++++++++-----------
tools/testing/selftests/rseq/rseq-mips.h | 4 +-
tools/testing/selftests/rseq/rseq-ppc.h | 4 +-
tools/testing/selftests/rseq/rseq-riscv.h | 6 +--
tools/testing/selftests/rseq/rseq-s390.h | 4 +-
tools/testing/selftests/rseq/rseq-x86.h | 4 +-
9 files changed, 70 insertions(+), 42 deletions(-)
--
2.25.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Changes from v1:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (6):
udp: re-score reuseport groups when connected sockets are present
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate reuseport_lookup functions
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 84 ++++++++-
include/net/inet_hashtables.h | 77 +++++++-
include/net/sock.h | 7 +-
include/net/udp.h | 8 +
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 70 +++++---
net/ipv4/udp.c | 88 ++++-----
net/ipv6/inet6_hashtables.c | 73 +++++---
net/ipv6/udp.c | 98 ++++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
14 files changed, 676 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
v3:
- [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
- Change the new control file from root-only "cpuset.cpus.reserve" to
non-root "cpuset.cpus.exclusive" which lists the set of exclusive
CPUs distributed down the hierarchy.
- Add a patch to restrict boot-time isolated CPUs to isolated
partitions only.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
v2:
- [v1] https://lore.kernel.org/lkml/20230412153758.3088111-1-longman@redhat.com/
- Dropped the special "isolcpus" partition in v1
- Add the root only "cpuset.cpus.reserve" control file for reserving
CPUs used for remote isolated partitions.
- Update the test_cpuset_prs.sh test script and documentation
accordingly.
This patch series introduces a new cpuset control file
"cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus"
and the parent's "cpuset.cpus.exclusive". This control file lists
the exclusive CPUs to be distributed down the hierarchy. Any one
of the exclusive CPUs can only be distributed to at most one child
cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive"
will be rejected with an error. This new control file has no effect on
the behavior of the cpuset until it turns into a partition root. At that
point, its effective CPUs will be set to its exclusive CPUs unless some
of them are offline.
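The constraints on the new control file can be summarized as a bitmask check.
This is a userspace sketch of the rules stated above and not the cpuset code:
the 64-bit mask type, the function names and the way sibling values are
passed in are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpumask64;	/* bit n set = CPU n; hypothetical type */

/* True if every CPU in a is also in b. */
static bool is_subset(cpumask64 a, cpumask64 b)
{
	return (a & ~b) == 0;
}

/*
 * Decide whether new_excl is acceptable as "cpuset.cpus.exclusive":
 *  - it must be a subset of the cpuset's own "cpuset.cpus"
 *  - it must be a subset of the parent's "cpuset.cpus.exclusive"
 *  - it must not overlap a sibling's exclusive CPUs, since an exclusive
 *    CPU can be distributed to at most one child
 * Invalid input is rejected (false) rather than silently adjusted.
 */
bool exclusive_cpus_valid(cpumask64 new_excl, cpumask64 own_cpus,
			  cpumask64 parent_excl,
			  const cpumask64 *sibling_excl, size_t nsiblings)
{
	assert(sibling_excl || nsiblings == 0);

	if (!is_subset(new_excl, own_cpus))
		return false;
	if (!is_subset(new_excl, parent_excl))
		return false;
	for (size_t i = 0; i < nsiblings; i++)
		if (new_excl & sibling_excl[i])
			return false;
	return true;
}
```

For example, with "cpuset.cpus" covering CPUs 0-3 and a sibling already
holding CPUs 2-3 exclusively, requesting CPUs 0-1 passes the check while
requesting CPUs 1-2 is rejected.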
This patch series also introduces a new category of cpuset partition
called remote partitions. The existing partition category where the
partition roots have to be clustered around the root cgroup in a
hierarchical way is now referred to as local partitions.
A remote partition can be formed far from the root cgroup
with no partition root parent. Local partitions can be
created without touching "cpuset.cpus.exclusive", as it can be set
automatically when a cpuset becomes a local partition root. Creating
a remote partition, however, requires properly set
"cpuset.cpus.exclusive" values down the hierarchy.
Both scheduling and isolated partitions can be formed in a remote
partition. A local partition can be created under a remote partition.
A remote partition, however, cannot be formed under a local partition
for now.
Modern container orchestration tools like Kubernetes use the cgroup
hierarchy to manage different containers, and rely on other
middleware like systemd to help manage it. If a container needs to
use isolated CPUs, it is hard to get those with local partitions,
as that requires the administrative parent cgroup to be a partition
root too, which tools like systemd may not be ready to manage.
With this patch series, we allow the creation of remote partitions
far from the root. The container management tool can manage the
"cpuset.cpus.exclusive" file without impacting the other cpuset
files that are managed by other middleware. Of course, invalid
"cpuset.cpus.exclusive" values will be rejected and changes to
"cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to
the requirement that it has to be a subset of the former control file.
Waiman Long (9):
cgroup/cpuset: Inherit parent's load balance state in v2
cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE
handling
cgroup/cpuset: Improve temporary cpumasks handling
cgroup/cpuset: Allow suppression of sched domain rebuild in
update_cpumasks_hier()
cgroup/cpuset: Add cpuset.cpus.exclusive for v2
cgroup/cpuset: Introduce remote partition
cgroup/cpuset: Check partition conflict with housekeeping setup
cgroup/cpuset: Documentation update for partition
cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition
Documentation/admin-guide/cgroup-v2.rst | 100 +-
kernel/cgroup/cpuset.c | 1352 ++++++++++++-----
.../selftests/cgroup/test_cpuset_prs.sh | 398 +++--
3 files changed, 1297 insertions(+), 553 deletions(-)
--
2.31.1