Create a selftest that exercises the race between page faults and
madvise(MADV_DONTNEED) on the same huge page. Do it by running two
threads that touch the huge page and madvise(MADV_DONTNEED) it at the
same time.
If a SIGBUS arrives during the page fault, the test fails, since we
hit the bug.
The test doesn't install a signal handler, so when it fails, it fails
like the following:
----------------------------------
running ./hugetlb_fault_after_madv
----------------------------------
./run_vmtests.sh: line 186: 595563 Bus error (core dumped) "$@"
[FAIL]
This selftest goes together with the fix of the bug[1] itself.
[1] https://lore.kernel.org/all/20231001005659.2185316-1-riel@surriel.com/#r
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/mm/Makefile | 1 +
.../selftests/mm/hugetlb_fault_after_madv.c | 73 +++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
3 files changed, 78 insertions(+)
create mode 100644 tools/testing/selftests/mm/hugetlb_fault_after_madv.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 6a9fc5693145..e71ec9910c62 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -68,6 +68,7 @@ TEST_GEN_FILES += split_huge_page_test
TEST_GEN_FILES += ksm_tests
TEST_GEN_FILES += ksm_functional_tests
TEST_GEN_FILES += mdwe_test
+TEST_GEN_FILES += hugetlb_fault_after_madv
ifneq ($(ARCH),arm64)
TEST_GEN_PROGS += soft-dirty
diff --git a/tools/testing/selftests/mm/hugetlb_fault_after_madv.c b/tools/testing/selftests/mm/hugetlb_fault_after_madv.c
new file mode 100644
index 000000000000..73b81c632366
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb_fault_after_madv.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "vm_util.h"
+#include "../kselftest.h"
+
+#define MMAP_SIZE (1 << 21)
+#define INLOOP_ITER 100
+
+char *huge_ptr;
+
+/* Touch the memory while it is being madvised() */
+void *touch(void *unused)
+{
+ char *ptr = (char *)huge_ptr;
+
+ for (int i = 0; i < INLOOP_ITER; i++)
+ ptr[0] = '.';
+
+ return NULL;
+}
+
+void *madv(void *unused)
+{
+ usleep(rand() % 10);
+
+ for (int i = 0; i < INLOOP_ITER; i++)
+ madvise(huge_ptr, MMAP_SIZE, MADV_DONTNEED);
+
+ return NULL;
+}
+
+int main(void)
+{
+ unsigned long free_hugepages;
+ pthread_t thread1, thread2;
+ /*
+ * On kernel 6.4, we are able to reproduce the problem with ~1000
+ * iterations.
+ */
+ int max = 10000;
+
+ srand(getpid());
+
+ free_hugepages = get_free_hugepages();
+ if (free_hugepages != 1) {
+ ksft_exit_skip("This test needs one and only one page to execute. Got %lu\n",
+ free_hugepages);
+ }
+
+ while (max--) {
+ huge_ptr = mmap(NULL, MMAP_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+ -1, 0);
+
+ if (huge_ptr == MAP_FAILED)
+ ksft_exit_skip("Failed to allocate huge page\n");
+
+ pthread_create(&thread1, NULL, madv, NULL);
+ pthread_create(&thread2, NULL, touch, NULL);
+
+ pthread_join(thread1, NULL);
+ pthread_join(thread2, NULL);
+ munmap(huge_ptr, MMAP_SIZE);
+ }
+
+ return KSFT_PASS;
+}
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 3e2bc818d566..9f53f7318a38 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -221,6 +221,10 @@ CATEGORY="hugetlb" run_test ./hugepage-mremap
CATEGORY="hugetlb" run_test ./hugepage-vmemmap
CATEGORY="hugetlb" run_test ./hugetlb-madvise
+# For this test, we need one and just one huge page
+echo 1 > /proc/sys/vm/nr_hugepages
+CATEGORY="hugetlb" run_test ./hugetlb_fault_after_madv
+
if test_selected "hugetlb"; then
echo "NOTE: These hugetlb tests provide minimal coverage. Use"
echo " https://github.com/libhugetlbfs/libhugetlbfs.git for"
--
2.34.1
In the PMTU test, when all previous tests are skipped and the new test
passes, the exit code is set to 0. However, the current check is missing
the spaces around '=', so the test is given a single non-empty string and
succeeds every time rather than comparing the two values. Consequently,
regardless of how many tests have failed, if the latest test passes, the
PMTU test will report a pass.
Fixes: 2a9d3716b810 ("selftests: pmtu.sh: improve the test result processing")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v2: use "-eq" instead of "=" to make it less error-prone
---
tools/testing/selftests/net/pmtu.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index f838dd370f6a..b3b2dc5a630c 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -2048,7 +2048,7 @@ run_test() {
case $ret in
0)
all_skipped=false
- [ $exitcode=$ksft_skip ] && exitcode=0
+ [ $exitcode -eq $ksft_skip ] && exitcode=0
;;
$ksft_skip)
[ $all_skipped = true ] && exitcode=$ksft_skip
--
2.41.0
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to lack of use cases where this approach would
be better than allocating a new page and copying the contents. Previous
upstreaming attempts can be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages within the same
vma without touching them. Today this can only be done with mremap, which
forces splitting the vma.
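For illustration, below is a rough sketch of how a userspace compaction
thread might resolve a fault by moving a recycled page instead of copying
it. The field names (dst/src/len/mode/move) follow the uapi proposed in
this series and should be treated as illustrative; error handling is
omitted.

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Resolve a missing fault at 'dst' by donating the page at 'src', which
 * userspace no longer needs, instead of allocating and UFFDIO_COPYing a
 * new one. Assumes the uffd was opened with UFFD_FEATURE_MOVE and the
 * dst range is registered with UFFDIO_REGISTER_MODE_MISSING.
 */
static int move_page(int uffd, unsigned long dst, unsigned long src,
		     unsigned long len)
{
	struct uffdio_move move = {
		.dst = dst,
		.src = src,
		.len = len,
		.mode = 0,
	};

	if (ioctl(uffd, UFFDIO_MOVE, &move))
		return -1;

	/* 'move' reports how many bytes were actually moved */
	return move.move == (long)len ? 0 : -1;
}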
TODOs for follow-up improvements:
- cross-mm support. Known differences from single-mm and missing pieces:
- memcg recharging (might need to isolate pages in the process)
- mm counters
- cross-mm deposit table moves
- cross-mm test
- document the address space where src and dest reside in struct
uffdio_move
- TLB flush batching. Will require extensive changes to PTL locking in
move_pages_pte(). OTOH that might let us reuse parts of mremap code.
Changes since v3 [8]:
- changed retry path in folio_lock_anon_vma_read() to unlock and then
relock RCU, per Peter Xu
- removed cross-mm support from initial patchset, per David Hildenbrand
- replaced BUG_ONs with VM_WARN_ON or WARN_ON_ONCE, per David Hildenbrand
- added missing cache flushing, per Lokesh Gidra and Peter Xu
- updated manpage text in the patch description, per Peter Xu
- renamed internal functions from "remap" to "move", per Peter Xu
- added mmap_changing check after taking mmap_lock, per Peter Xu
- changed uffd context check to ensure dst_mm is registered onto the uffd we
are operating on, per Peter Xu and David Hildenbrand
- changed to non-maybe variants of maybe*_mkwrite(), per David Hildenbrand
- fixed warning for CONFIG_TRANSPARENT_HUGEPAGE=n, per kernel test robot
- comments cleanup, per David Hildenbrand and Peter Xu
- checks for VM_IO,VM_PFNMAP,VM_HUGETLB,..., per David Hildenbrand
- prevent moving pinned pages, per Peter Xu
- changed uffd tests to call uffd_test_ctx_clear() at the end of the
test run instead of at the beginning of the next run
- added support for testcase-specific ops
- added test for moving PMD-aligned blocks
Changes since v2 [5]:
- renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand
- rebase over mm-unstable to use folio_move_anon_rmap(),
per David Hildenbrand
- added text for manpage explaining DONTFORK and KSM requirements for this
feature, per David Hildenbrand
- check for anon_vma changes in the fast path of folio_lock_anon_vma_read,
per Peter Xu
- updated the title and description of the first patch,
per David Hildenbrand
- updating comments in folio_lock_anon_vma_read() explaining the need for
anon_vma checks, per David Hildenbrand
- changed all mapcount checks to PageAnonExclusive, per Jann Horn and
David Hildenbrand
- changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS,
per Jann Horn
- added a check for PTE change after folio is locked in remap_pages_pte(),
per Jann Horn
- added handling of PMD migration entries and bailout when pmd_devmap(),
per Jann Horn
- added checks to ensure both src and dst VMAs are writable, per Peter Xu
- added UFFD_FEATURE_MOVE, per Peter Xu
- removed obsolete comments, per Peter Xu
- renamed remap_anon_pte to remap_present_pte, per Peter Xu
- added a comment for folio_get_anon_vma() explaining the need for
anon_vma checks, per Peter Xu
- changed error handling in remap_pages() to make it more clear,
per Peter Xu
- changed EFAULT to EAGAIN to retry when a hugepage appears or disappears
from under us, per Peter Xu
- added links to previous upstreaming attempts, per David Hildenbrand
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when source
page or anon_vma is contended. Since pte_offset_map_nolock() starts an RCU
read section, we can't block anymore after mapping a pte, so we have to
unmap the pte, do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code now removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_MOVE
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
[5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/
[6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh…
[7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed…
[8] https://lore.kernel.org/all/20231009064230.2952396-1-surenb@google.com/
The patchset applies over mm-unstable.
Andrea Arcangeli (2):
mm/rmap: support move to different root anon_vma in
folio_move_anon_rmap()
userfaultfd: UFFDIO_MOVE uABI
Suren Baghdasaryan (3):
selftests/mm: call uffd_test_ctx_clear at the end of the test
selftests/mm: add uffd_test_case_ops to allow test case-specific
operations
selftests/mm: add UFFDIO_MOVE ioctl test
Documentation/admin-guide/mm/userfaultfd.rst | 3 +
fs/userfaultfd.c | 72 +++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 11 +
include/uapi/linux/userfaultfd.h | 29 +-
mm/huge_memory.c | 122 ++++
mm/khugepaged.c | 3 +
mm/rmap.c | 30 +
mm/userfaultfd.c | 596 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 51 +-
tools/testing/selftests/mm/uffd-common.h | 11 +
tools/testing/selftests/mm/uffd-stress.c | 5 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 144 +++++
13 files changed, 1078 insertions(+), 4 deletions(-)
--
2.42.0.820.g83a721a137-goog
*Changes in v33*:
- Add PAGE_IS_FILE support for THPs
*Changes in v31 and v32*:
- Minor updates
*Changes in v30*:
- Rebase on top of next-20230815
- Minor nitpicks
*Changes in v29:*
- Polish IOCTL and improve documentation
*Changes in v28:*
- Fix walk_end and add 17 test cases in selftests patch
*Changes in v27:*
- Handle review comments and minor improvements
- Add performance improvement patch on top with test for easy review
*Changes in v26:*
- Code re-structuring and API changes in PAGEMAP_IOCTL
*Changes in v25*:
- Do proper filtering on hole as well (hole got missed earlier)
*Changes in v24*:
- Rebase on top of next-20230710
- Place WP markers in case of hole as well
*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity
*Changes in v22*:
- Interface change:
- Replace [start start + len) with [start, end)
- Return the ending address of the address walk in start
*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on
partial hugetlb
*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give up on using uffd_wp_range() and write new helpers, flush TLB only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() and ResetWriteWatch() syscalls [1]. The GetWriteWatch()
retrieves the addresses of the pages that are written to in a region of
virtual memory.
This syscall is used in Windows applications and games etc., and is
currently being emulated in userspace in a pretty slow manner. Our purpose
is to enhance the kernel so that it can be translated efficiently.
Currently some out-of-tree hack patches are being used to emulate it
efficiently in some kernels. We intend to replace those with these patches,
so that gaming on Linux as a whole can benefit from this. It means there
would be a large number of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we felt that the kernel's soft-dirty
feature could be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But with the soft-dirty flag we sometimes get extra pages
which weren't even written: they had become soft-dirty because of VMA
merging and the VM_SOFTDIRTY flag, which breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY, until David reported that mprotect etc. messes up the
soft-dirty flag while VM_SOFTDIRTY is ignored [5]. This wasn't happening
until [6] got introduced. We discussed whether we could revert these
patches, but could not reach any conclusion. So at that point, I made a
couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting the
soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had potential to cause
regression. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
The reply was: don't increase the size of the VMA by 8 bytes.
At that point we left soft-dirty behind, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing the soft-dirty emulation on the userfaultfd WP feature,
where the kernel resolves the faults itself when the WP_ASYNC feature is
used. It was straightforward to add the WP_ASYNC feature to userfaultfd.
Now we get only those pages reported as dirty or written-to which were
really written. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
required, to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU developers to
make the interface more generic and useful.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
* Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on using,
we use the written-to terminology when using UFFD async WP under
the hood.
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no way to atomically get and clear the soft-dirty/written-to
status in the kernel.
- The pages which have been written to cannot be found in an accurate way.
(The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
getWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. It is too delicate and misleading, as it shows more soft-dirty
pages than there actually are. There is no interest in
correcting it [2][3], as this is how the feature was written years ago
and it shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is
an asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
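As a rough illustration of that flow (assuming the UFFD_FEATURE_WP_ASYNC
flag added by this series together with the existing WP_UNPOPULATED
feature; error handling omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Arm async write-protection on [addr, addr + len): the faults are
 * resolved by the kernel itself, and written-to pages can later be
 * discovered through pagemap (!PM_UFFD_WP) or the PAGEMAP_SCAN ioctl.
 */
static int wp_async_arm(void *addr, unsigned long len)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC | UFFD_FEATURE_WP_UNPOPULATED,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	return uffd;
}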
Information about whether a page is file-mapped, present or swapped is
required for the CRIU project [5][6]. The addition of the required mask,
any mask, excluded mask and return mask is also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. max_pages is needed to support the use case where the user only wants
a specific number of pages, so there is no need to find all the pages of
interest in the range when max_pages is specified: the IOCTL returns as
soon as the maximum number of pages has been found. max_pages is optional;
if specified, it must be equal to or greater than vec_size. This
restriction is needed to handle the worst case, where each page_region
contains info about only one page and cannot be compacted.
This is needed to emulate the Windows getWriteWatch() syscall.
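A rough sketch of the intended usage (the struct, flag and category names
follow the uapi proposed in this series and are illustrative; error
handling is omitted; pagemap_fd is an fd on /proc/<pid>/pagemap):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/*
 * GetWriteWatch()-style query: report up to vec_len regions of
 * written-to pages in [start, end) and atomically re-write-protect
 * them. Requires the range to be registered for async WP beforehand.
 * Returns the number of page_region entries filled in.
 */
static long scan_written(int pagemap_fd, unsigned long start,
			 unsigned long end, struct page_region *vec,
			 unsigned long vec_len)
{
	struct pm_scan_arg arg = {
		.size = sizeof(arg),
		.flags = PM_SCAN_WP_MATCHING,
		.start = start,
		.end = end,
		.vec = (unsigned long)vec,
		.vec_len = vec_len,
		.category_mask = PAGE_IS_WRITTEN,
		.return_mask = PAGE_IS_WRITTEN,
	};

	return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}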
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (5):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
fs/proc/task_mmu: Add fast paths to get/clear PAGE_IS_WRITTEN flag
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 89 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 722 ++++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 28 +-
include/uapi/linux/fs.h | 59 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 28 +-
tools/include/uapi/linux/fs.h | 59 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1660 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2736 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
--
2.40.1
The immediate is incorrectly cast to u32 before being spilled, losing sign
information, so the range information is incorrect after it is loaded
again. Fix the immediate spill by removing the cast. The second patch adds
a test case for this.
Signed-off-by: Hao Sun <sunhao.th(a)gmail.com>
---
Changes in v3:
- Change the expected log to fix the test case
- Link to v2: https://lore.kernel.org/r/20231101-fix-check-stack-write-v2-0-cb7c17b869b0@…
Changes in v2:
- Add fix and cc tags.
- Link to v1: https://lore.kernel.org/r/20231026-fix-check-stack-write-v1-0-6b325ef3ce7e@…
---
Hao Sun (2):
bpf: Fix check_stack_write_fixed_off() to correctly spill imm
selftests/bpf: Add test for immediate spilled to stack
kernel/bpf/verifier.c | 2 +-
tools/testing/selftests/bpf/verifier/bpf_st_mem.c | 32 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 1 deletion(-)
---
base-commit: f2fbb908112311423b09cd0d2b4978f174b99585
change-id: 20231026-fix-check-stack-write-c40996694dfa
Best regards,
--
Hao Sun <sunhao.th(a)gmail.com>
Hi Linus,
Please pull the following KUnit next update for Linux 6.7-rc1.
This kunit update for Linux 6.7-rc1 consists of:
-- string-stream testing enhancements
-- several memory leak fixes
-- fix to reset status during parameter handling
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit ce9ecca0238b140b88f43859b211c9fdfd8e5b70:
Linux 6.6-rc2 (2023-09-17 14:40:24 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.7-rc1
for you to fetch changes up to 8040345fdae4cb256c5d981f91ae0f22bea8adcc:
kunit: test: Fix the possible memory leak in executor_test (2023-09-28 08:51:07 -0600)
----------------------------------------------------------------
linux_kselftest-kunit-6.7-rc1
This kunit update for Linux 6.7-rc1 consists of:
-- string-stream testing enhancements
-- several memory leak fixes
-- fix to reset status during parameter handling
----------------------------------------------------------------
Jinjie Ruan (4):
kunit: Fix missed memory release in kunit_free_suite_set()
kunit: Fix the wrong kfree of copy for kunit_filter_suites()
kunit: Fix possible memory leak in kunit_filter_suites()
kunit: test: Fix the possible memory leak in executor_test
Michal Wajdeczko (1):
kunit: Reset test status on each param iteration
Richard Fitzgerald (10):
kunit: string-stream: Don't create a fragment for empty strings
kunit: string-stream: Improve testing of string_stream
kunit: string-stream: Add option to make all lines end with newline
kunit: string-stream-test: Add cases for string_stream newline appending
kunit: Don't use a managed alloc in is_literal()
kunit: string-stream: Add kunit_alloc_string_stream()
kunit: string-stream: Decouple string_stream from kunit
kunit: string-stream: Add tests for freeing resource-managed string_stream
kunit: Use string_stream for test log
kunit: string-stream: Test performance of string_stream
include/kunit/test.h | 14 +-
lib/kunit/assert.c | 14 +-
lib/kunit/debugfs.c | 36 ++-
lib/kunit/executor.c | 23 +-
lib/kunit/executor_test.c | 36 +--
lib/kunit/kunit-example-test.c | 5 +-
lib/kunit/kunit-test.c | 56 ++++-
lib/kunit/string-stream-test.c | 525 +++++++++++++++++++++++++++++++++++++++--
lib/kunit/string-stream.c | 100 ++++++--
lib/kunit/string-stream.h | 16 +-
lib/kunit/test.c | 56 +----
11 files changed, 734 insertions(+), 147 deletions(-)
----------------------------------------------------------------
This patchset adds two kfunc helpers, bpf_xdp_get_xfrm_state() and
bpf_xdp_xfrm_state_release() that wrap xfrm_state_lookup() and
xfrm_state_put(). The intent is to support software RSS (via XDP) for
the ongoing/upcoming ipsec pcpu work [0]. Recent experiments performed
on (hopefully) reproducible AWS testbeds indicate that single tunnel
pcpu ipsec can reach line rate on 100G ENA nics.
Note this patchset only tests/shows generic xfrm_state access. The
"secret sauce" (if you can really even call it that) involves accessing
a soon-to-be-upstreamed pcpu_num field in xfrm_state. Early example is
available here [1].
[0]: https://datatracker.ietf.org/doc/html/draft-ietf-ipsecme-multi-sa-performan…
[1]: https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce286…
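For reference, a rough sketch of how an XDP program could use the two
kfuncs; the bpf_xfrm_state_opts layout shown here mirrors what this series
proposes and is illustrative only (packet parsing and error handling
omitted):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern struct xfrm_state *
bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts,
		       u32 opts__sz) __ksym;
extern void bpf_xdp_xfrm_state_release(struct xfrm_state *x) __ksym;

SEC("xdp")
int steer_by_sa(struct xdp_md *ctx)
{
	struct bpf_xfrm_state_opts opts = {
		.netns_id = BPF_F_CURRENT_NETNS,
		.spi = 0,		/* would be parsed from the ESP header */
		.daddr = { .a4 = 0 },	/* would be parsed from the outer IP header */
		.proto = 50,		/* IPPROTO_ESP */
		.family = 2,		/* AF_INET */
	};
	struct xfrm_state *x;

	x = bpf_xdp_get_xfrm_state(ctx, &opts, sizeof(opts));
	if (!x)
		return XDP_PASS;

	/* ... pick a CPU/queue based on state data (e.g. a pcpu_num field) ... */

	bpf_xdp_xfrm_state_release(x);	/* KF_RELEASE: must drop the acquired ref */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";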
Changes from RFCv1:
* Add Antony's commit tags
* Add KF_ACQUIRE and KF_RELEASE semantics
Daniel Xu (7):
bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc
bpf: selftests: test_tunnel: Use ping -6 over ping6
bpf: selftests: test_tunnel: Mount bpffs if necessary
bpf: selftests: test_tunnel: Use vmlinux.h declarations
bpf: selftests: test_tunnel: Disable CO-RE relocations
bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
include/net/xfrm.h | 9 ++
net/xfrm/Makefile | 1 +
net/xfrm/xfrm_policy.c | 2 +
net/xfrm/xfrm_state_bpf.c | 121 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/test_tunnel_kern.c | 98 ++++++++------
tools/testing/selftests/bpf/test_tunnel.sh | 43 +++++--
7 files changed, 221 insertions(+), 54 deletions(-)
create mode 100644 net/xfrm/xfrm_state_bpf.c
--
2.42.0
The immediate is incorrectly cast to u32 before being spilled, losing sign
information, so the range information is incorrect after it is loaded
again. Fix the immediate spill by removing the cast. The second patch adds
a test case for this.
Signed-off-by: Hao Sun <sunhao.th(a)gmail.com>
---
Changes in v2:
- Add fix and cc tags.
- Link to v1: https://lore.kernel.org/r/20231026-fix-check-stack-write-v1-0-6b325ef3ce7e@…
---
Hao Sun (2):
bpf: Fix check_stack_write_fixed_off() to correctly spill imm
selftests/bpf: Add test for immediate spilled to stack
kernel/bpf/verifier.c | 2 +-
tools/testing/selftests/bpf/verifier/bpf_st_mem.c | 32 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 1 deletion(-)
---
base-commit: f1c73396133cb3d913e2075298005644ee8dfade
change-id: 20231026-fix-check-stack-write-c40996694dfa
Best regards,
--
Hao Sun <sunhao.th(a)gmail.com>
This patchset adds a kfunc helper, bpf_xdp_get_xfrm_state(), that wraps
xfrm_state_lookup(). The intent is to support software RSS (via XDP) for
the ongoing/upcoming ipsec pcpu work [0]. Recent experiments performed
on (hopefully) reproducible AWS testbeds indicate that single tunnel
pcpu ipsec can reach line rate on 100G ENA nics.
More details about that will be presented at netdev next week [1].
Antony did the initial stable bpf helper - I later ported it to unstable
kfuncs. So for the series, please apply a Co-developed-by for Antony,
provided he acks and signs off on this.
[0]: https://datatracker.ietf.org/doc/html/draft-ietf-ipsecme-multi-sa-performan…
[1]: https://netdevconf.info/0x17/sessions/workshop/security-workshop.html
Daniel Xu (6):
bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
bpf: selftests: test_tunnel: Use ping -6 over ping6
bpf: selftests: test_tunnel: Mount bpffs if necessary
bpf: selftests: test_tunnel: Use vmlinux.h declarations
bpf: selftests: test_tunnel: Disable CO-RE relocations
bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
include/net/xfrm.h | 9 ++
net/xfrm/Makefile | 1 +
net/xfrm/xfrm_policy.c | 2 +
net/xfrm/xfrm_state_bpf.c | 105 ++++++++++++++++++
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/test_tunnel_kern.c | 95 +++++++++-------
tools/testing/selftests/bpf/test_tunnel.sh | 43 ++++---
7 files changed, 202 insertions(+), 54 deletions(-)
create mode 100644 net/xfrm/xfrm_state_bpf.c
--
2.42.0
Hello everyone,
This series implements the Permission Overlay Extension introduced in 2022
VMSA enhancements [1]. It is based on v6.6-rc3.
Changes since v1[2]:
# Added Kconfig option
# Added KVM support
# Move VM_PKEY* defines into arch/
# Add isb() for POR_EL0 context switch
# Added hwcap test, get-reg-list-test, signal frame handling test
ptrace support is missing, I will add that for v3.
The Permission Overlay Extension allows constraining permissions on memory
regions. This can be done from userspace (EL0) without a system call or TLB
invalidation.
POE is used to implement the Memory Protection Keys [3] Linux syscall.
The first few patches add the basic framework, then the PKEYS interface is
implemented, and then the selftests are made to work on arm64.
There was discussion about what the 'default' protection key value should be;
I used disallow-all (apart from pkey 0), which matches what x86 does.
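For context, the userspace-facing interface this series wires up is the
existing arch-independent pkeys API; a minimal sketch (error handling
omitted), with the per-key access rights then living in POR_EL0 on arm64 so
a thread can later toggle them without a syscall or TLB invalidation:

#define _GNU_SOURCE
#include <sys/mman.h>

/*
 * Allocate a key whose initial rights deny writes, and tag a mapping
 * with it. The PTE protection stays PROT_READ | PROT_WRITE; the
 * effective permission is the overlay applied by the thread's
 * permission register (POR_EL0 on arm64, PKRU on x86).
 */
int protect_region(void *addr, size_t len)
{
	int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);

	if (pkey < 0)
		return -1;

	pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, pkey);
	return pkey;
}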
I have tested the modified protection_keys test on x86_64 [5], but not PPC.
I haven't build-tested the x86/PPC changes; I will work on getting at least
an x86 build environment working.
Thanks,
Joey
[1] https://community.arm.com/arm-community-blogs/b/architectures-and-processor…
[2] https://lore.kernel.org/linux-arm-kernel/20230927140123.5283-1-joey.gouly@a…
[3] Documentation/core-api/protection-keys.rst
[4] https://lore.kernel.org/linux-arm-kernel/20230919092850.1940729-7-mark.rutl…
[5] test_ptrace_modifies_pkru asserts for me on a Ubuntu 5.4 kernel, but does so before my changes as well
Joey Gouly (24):
arm64/sysreg: add system register POR_EL{0,1}
arm64/sysreg: update CPACR_EL1 register
arm64: cpufeature: add Permission Overlay Extension cpucap
arm64: disable trapping of POR_EL0 to EL2
arm64: context switch POR_EL0 register
KVM: arm64: Save/restore POE registers
arm64: enable the Permission Overlay Extension for EL0
arm64: add POIndex defines
arm64: define VM_PKEY_BIT* for arm64
arm64: mask out POIndex when modifying a PTE
arm64: enable ARCH_HAS_PKEYS on arm64
arm64: handle PKEY/POE faults
arm64: stop using generic mm_hooks.h
arm64: implement PKEYS support
arm64: add POE signal support
arm64: enable PKEY support for CPUs with S1POE
arm64: enable POE and PIE to coexist
kselftest/arm64: move get_header()
selftests: mm: move fpregs printing
selftests: mm: make protection_keys test work on arm64
kselftest/arm64: add HWCAP test for FEAT_S1POE
kselftest/arm64: parse POE_MAGIC in a signal frame
kselftest/arm64: Add test case for POR_EL0 signal frame records
KVM: selftests: get-reg-list: add Permission Overlay registers
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
arch/arm64/Kconfig | 18 +++
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 10 +-
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/mman.h | 8 +-
arch/arm64/include/asm/mmu.h | 2 +
arch/arm64/include/asm/mmu_context.h | 51 ++++++-
arch/arm64/include/asm/page.h | 10 ++
arch/arm64/include/asm/pgtable-hwdef.h | 10 ++
arch/arm64/include/asm/pgtable-prot.h | 8 +-
arch/arm64/include/asm/pgtable.h | 26 +++-
arch/arm64/include/asm/pkeys.h | 110 ++++++++++++++
arch/arm64/include/asm/por.h | 33 +++++
arch/arm64/include/asm/processor.h | 1 +
arch/arm64/include/asm/sysreg.h | 16 ++
arch/arm64/include/asm/traps.h | 1 +
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/sigcontext.h | 7 +
arch/arm64/kernel/cpufeature.c | 23 +++
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/process.c | 19 +++
arch/arm64/kernel/signal.c | 51 +++++++
arch/arm64/kernel/traps.c | 12 +-
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 10 ++
arch/arm64/kvm/sys_regs.c | 2 +
arch/arm64/mm/fault.c | 44 +++++-
arch/arm64/mm/mmap.c | 9 ++
arch/arm64/mm/mmu.c | 40 +++++
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 15 +-
arch/powerpc/include/asm/page.h | 11 ++
arch/x86/include/asm/page.h | 10 ++
fs/proc/task_mmu.c | 2 +
include/linux/mm.h | 13 --
tools/testing/selftests/arm64/abi/hwcap.c | 13 ++
.../testing/selftests/arm64/signal/.gitignore | 1 +
.../arm64/signal/testcases/poe_siginfo.c | 86 +++++++++++
.../arm64/signal/testcases/testcases.c | 27 +---
.../arm64/signal/testcases/testcases.h | 28 +++-
.../selftests/kvm/aarch64/get-reg-list.c | 14 ++
tools/testing/selftests/mm/Makefile | 2 +-
tools/testing/selftests/mm/pkey-arm64.h | 138 ++++++++++++++++++
tools/testing/selftests/mm/pkey-helpers.h | 8 +
tools/testing/selftests/mm/pkey-powerpc.h | 3 +
tools/testing/selftests/mm/pkey-x86.h | 4 +
tools/testing/selftests/mm/protection_keys.c | 29 ++--
49 files changed, 880 insertions(+), 66 deletions(-)
create mode 100644 arch/arm64/include/asm/pkeys.h
create mode 100644 arch/arm64/include/asm/por.h
create mode 100644 tools/testing/selftests/arm64/signal/testcases/poe_siginfo.c
create mode 100644 tools/testing/selftests/mm/pkey-arm64.h
--
2.25.1
The kernel has recently added support for shadow stacks, currently
x86 only, using its CET feature, but both arm64 and RISC-V have
equivalent features (GCS and Zisslpcfi respectively); I am actively
working on GCS [1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads; instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface was done in this manner
mainly because the shadow stack patches had been in development since
before clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone(). When the shadow stack is specified
explicitly the kernel will not free it, the inconsistency with
implicitly allocated shadow stacks is a bit awkward but that's existing
ABI so we can't change it.
The memory provided must have been allocated for use as a shadow stack,
the expectation is that this will be done using the map_shadow_stack()
syscall. I opted not to add validation for this in clone3() since it
will be enforced by hardware anyway.
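To illustrate the intended usage (the clone_args field names below are an
assumption based on the analogy with stack/stack_size and may differ from
the actual patches; map_shadow_stack() and SHADOW_STACK_SET_TOKEN are the
existing x86 uapi):

#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args, CLONE_* */
#include <asm/mman.h>		/* SHADOW_STACK_SET_TOKEN (x86) */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Spawn a thread with an explicitly sized, caller-owned shadow stack. */
static pid_t spawn_with_shstk(void *stack, size_t stack_size,
			      size_t shstk_size)
{
	/* allocate shadow stack memory the only sanctioned way */
	unsigned long shstk = syscall(__NR_map_shadow_stack, 0, shstk_size,
				      SHADOW_STACK_SET_TOKEN);
	struct clone_args args = {
		.flags = CLONE_VM,
		.stack = (unsigned long)stack,
		.stack_size = stack_size,
		/* proposed extension (illustrative field names): */
		.shadow_stack = shstk,
		.shadow_stack_size = shstk_size,
	};

	return syscall(__NR_clone3, &args, sizeof(args));
}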
Please note that the x86 portions of this code are build-tested only; I
don't appear to have a CET-capable system available to me. I have
done testing with an integration into my pending work for GCS. There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks; this is
still under discussion.
A new architecture feature Kconfig option for shadow stacks is added
here; this was suggested as part of the review comments for the arm64
GCS series, and since we need to detect whether shadow stacks are supported
it seemed sensible to roll it in here.
The selftest portions of this depend on 34dce23f7e40 ("selftests/clone3:
Report descriptive test names") in -next[2].
[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/
[2] https://lore.kernel.org/r/20231018-kselftest-clone3-output-v1-1-12b7c50ea2c…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (5):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
fork: Add shadow stack support to clone3()
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
kselftest/clone3: Test shadow stack support
arch/x86/Kconfig | 1 +
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 36 ++++-
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/sched/task.h | 2 +
include/uapi/linux/sched.h | 17 +-
kernel/fork.c | 40 ++++-
mm/Kconfig | 6 +
tools/testing/selftests/clone3/clone3.c | 180 +++++++++++++++++-----
tools/testing/selftests/clone3/clone3_selftests.h | 5 +
12 files changed, 247 insertions(+), 57 deletions(-)
---
base-commit: 80ab9b52e8d4add7735abdfb935877354b69edb6
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Changelog:
v4:
* Rename list_lru_add to list_lru_add_obj and __list_lru_add to
list_lru_add (patch 1) (suggested by Johannes Weiner and
Yosry Ahmed)
* Some cleanups on the memcg aware LRU patch (patch 2)
(suggested by Yosry Ahmed)
* Use event interface for the new per-cgroup writeback counters.
(patch 3) (suggested by Yosry Ahmed)
* Abstract zswap's lruvec states and handling into
zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encounter a memcg
that is not online for the global limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Removed a redundant zswap pool getting (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix loongarch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform workload-specific shrinking - a memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for a memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e. memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
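The per-memcg, per-NUMA LRUs build on the list_lru change in patch 1. As a
minimal sketch of the resulting kernel-internal API (the signature follows
the changelog above; the caller below is hypothetical, for illustration
only): callers that already know which memcg and node an object belongs to
pass them explicitly via list_lru_add(), while existing users keep the old
derive-from-the-object behaviour under the new list_lru_add_obj() name.

#include <linux/list_lru.h>
#include <linux/memcontrol.h>

struct cached_entry {			/* hypothetical cached object */
	struct list_head lru;
};

static void cached_entry_add(struct list_lru *lru, struct cached_entry *e,
			     int nid, struct mem_cgroup *memcg)
{
	/* place the entry on the per-memcg, per-node list directly */
	list_lru_add(lru, &e->lru, nid, memcg);
}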
As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write it out and
improve the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.
Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest
Nhat Pham (2):
list_lru: allows explicit memcg and NUMA node selection
zswap: shrinks zswap pool based on memory pressure
Documentation/admin-guide/mm/zswap.rst | 7 +
drivers/android/binder_alloc.c | 5 +-
fs/dcache.c | 8 +-
fs/gfs2/quota.c | 6 +-
fs/inode.c | 4 +-
fs/nfs/nfs42xattr.c | 8 +-
fs/nfsd/filecache.c | 4 +-
fs/xfs/xfs_buf.c | 6 +-
fs/xfs/xfs_dquot.c | 2 +-
fs/xfs/xfs_qm.c | 2 +-
include/linux/list_lru.h | 46 ++-
include/linux/memcontrol.h | 5 +
include/linux/mmzone.h | 2 +
include/linux/vm_event_item.h | 1 +
include/linux/zswap.h | 25 +-
mm/list_lru.c | 48 ++-
mm/memcontrol.c | 1 +
mm/mmzone.c | 1 +
mm/swap.h | 3 +-
mm/swap_state.c | 25 +-
mm/vmstat.c | 1 +
mm/workingset.c | 4 +-
mm/zswap.c | 365 ++++++++++++++++----
tools/testing/selftests/cgroup/test_zswap.c | 74 ++--
24 files changed, 526 insertions(+), 127 deletions(-)
--
2.34.1
In the PMTU test, when all previous tests are skipped and the new test
passes, the exit code is set to 0. However, the current check is missing
the spaces around '=', so the test is given a single non-empty string and
succeeds every time rather than comparing the two values. Consequently,
regardless of how many tests have failed, if the latest test passes, the
PMTU test will report a pass.
Fixes: 2a9d3716b810 ("selftests: pmtu.sh: improve the test result processing")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
tools/testing/selftests/net/pmtu.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index f838dd370f6a..b9648da4c371 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -2048,7 +2048,7 @@ run_test() {
case $ret in
0)
all_skipped=false
- [ $exitcode=$ksft_skip ] && exitcode=0
+ [ $exitcode = $ksft_skip ] && exitcode=0
;;
$ksft_skip)
[ $all_skipped = true ] && exitcode=$ksft_skip
--
2.41.0
---
On Ubuntu and probably other distros, ptrace permissions are tightened a
bit by default; i.e., /proc/sys/kernel/yama/ptrace_scope is set to 1.
This causes memfd_secret's ptrace attach test to fail with a permission
error. Set it to 0 prior to running the program.
Signed-off-by: Itaru Kitayama <itaru.kitayama(a)linux.dev>
---
tools/testing/selftests/mm/run_vmtests.sh | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 3e2bc818d566..7d31718ce834 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -303,6 +303,7 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
# MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
CATEGORY="madv_populate" run_test ./madv_populate
+echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
CATEGORY="memfd_secret" run_test ./memfd_secret
# KSM KSM_MERGE_TIME_HUGE_PAGES test with size of 100
---
base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
change-id: 20231030-selftest-c75b1b460817
Best regards,
--
Itaru Kitayama <itaru.kitayama(a)linux.dev>
According to the awk manual, the -e option does not need to be specified
in front of 'program' (unless you need to mix program-file).
The redundant -e option can cause errors when users use awk tools other
than gawk (for example, mawk does not support the -e option).
Error Example:
awk: not an option: -e
Cgroup v2 mount point not found!
Signed-off-by: Juntong Deng <juntong.deng(a)outlook.com>
---
tools/testing/selftests/cgroup/test_cpuset_prs.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 4afb132e4e4f..6820653e8432 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -20,7 +20,7 @@ skip_test() {
WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
# Find cgroup v2 mount point
-CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+CGROUP2=$(mount -t cgroup2 | head -1 | awk '{print $3}')
[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//")
--
2.39.2
From: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Add a test case for probing on a symbol in a module without module name.
When probing on a symbol in a module, ftrace accepts both the syntax that
<MODNAME>:<SYMBOL> and <SYMBOL>. Current test case only checks the former
syntax. This adds a test for the latter one.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
---
.../ftrace/test.d/kprobe/kprobe_module.tc | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc
index 7e74ee11edf9..4b32e1b9a8d3 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc
@@ -13,6 +13,12 @@ fi
MOD=trace_printk
FUNC=trace_printk_irq_work
+:;: "Add an event on a module function without module name" ;:
+
+echo "p:event0 $FUNC" > kprobe_events
+test -d events/kprobes/event0 || exit_failure
+echo "-:kprobes/event0" >> kprobe_events
+
:;: "Add an event on a module function without specifying event name" ;:
echo "p $MOD:$FUNC" > kprobe_events
This is part of an effort to improve detection of regressions impacting
device probe on all platforms. The recently merged DT kselftest [1]
detects probe issues for all devices described statically in the DT.
That leaves out devices discovered at run-time from discoverable busses.
This is where this test comes in. All of the devices that are connected
through discoverable busses (i.e. USB and PCI), and which are internal and
therefore always present, can be described in a per-platform file so
they can be checked for. The test will check that the device has been
instantiated and bound to a driver.
Patch 1 introduces the test. Patch 2 adds the test definitions for the
google,spherion machine (Acer Chromebook 514) as an example.
This is the sample output from the test running on Spherion:
TAP version 13
Using board file: boards/google,spherion
1..10
ok 1 usb.camera.0.device
ok 2 usb.camera.0.driver
ok 3 usb.camera.1.device
ok 4 usb.camera.1.driver
ok 5 usb.bluetooth.0.device
ok 6 usb.bluetooth.0.driver
ok 7 usb.bluetooth.1.device
ok 8 usb.bluetooth.1.driver
ok 9 pci.wifi.device
ok 10 pci.wifi.driver
Totals: pass:10 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] https://lore.kernel.org/all/20230828211424.2964562-1-nfraprado@collabora.co…
Nícolas F. R. A. Prado (2):
kselftest: Add test to verify probe of devices from discoverable
busses
kselftest: devices: Add board file for google,spherion
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/devices/.gitignore | 1 +
tools/testing/selftests/devices/Makefile | 8 +
.../selftests/devices/boards/google,spherion | 3 +
.../devices/test_discoverable_devices.sh | 165 ++++++++++++++++++
5 files changed, 178 insertions(+)
create mode 100644 tools/testing/selftests/devices/.gitignore
create mode 100644 tools/testing/selftests/devices/Makefile
create mode 100644 tools/testing/selftests/devices/boards/google,spherion
create mode 100755 tools/testing/selftests/devices/test_discoverable_devices.sh
--
2.42.0
There is a spelling mistake in a printf message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/sched/cs_prctl_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
index 3e1619b6bf2d..7b4fc02a0d05 100644
--- a/tools/testing/selftests/sched/cs_prctl_test.c
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -276,7 +276,7 @@ int main(int argc, char *argv[])
if (setpgid(0, 0) != 0)
handle_error("process group");
- printf("\n## Create a thread/process/process group hiearchy\n");
+ printf("\n## Create a thread/process/process group hierarchy\n");
create_processes(num_processes, num_threads, procs);
need_cleanup = 1;
disp_processes(num_processes, procs);
--
2.39.2
The immediate is incorrectly cast to u32 before being spilled, losing sign
information, so the range information is incorrect after it is loaded
again. Fix the immediate spill by removing the cast. The second patch adds
a test case for this.
Signed-off-by: Hao Sun <sunhao.th(a)gmail.com>
---
Hao Sun (2):
bpf: Fix check_stack_write_fixed_off() to correctly spill imm
selftests/bpf: Add test for immediate spilled to stack
kernel/bpf/verifier.c | 2 +-
tools/testing/selftests/bpf/verifier/bpf_st_mem.c | 32 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 1 deletion(-)
---
base-commit: 399f6185a1c02f39bcadb8749bc2d9d48685816f
change-id: 20231026-fix-check-stack-write-c40996694dfa
Best regards,
--
Hao Sun <sunhao.th(a)gmail.com>
KUnit recently gained support for setting up test attributes, the first one
being the speed of a given test, allowing slow tests to be filtered out.
A slow test is defined in the documentation as taking more than one
second. There's another speed attribute called "very slow", but its
definition is less clear.
Add support to the test runner to check the test execution time, and
report tests that should be marked as slow but aren't.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
To: Brendan Higgins <brendan.higgins(a)linux.dev>
To: David Gow <davidgow(a)google.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Rae Moar <rmoar(a)google.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: linux-kernel(a)vger.kernel.org
Changes from v2:
- Add defines and comments to make the warning reporting threshold more
obvious
- Switch the duration comparisons to timespec64_compare to be more
accurate
- Link: https://lore.kernel.org/all/20230920084903.1522728-1-mripard@kernel.org/
Changes from v1:
- Split the patch out of the series
- Change to trigger the warning only if the runtime is twice the
threshold (Jani, Rae)
- Split the speed check into a separate function (Rae)
- Link: https://lore.kernel.org/all/20230911-kms-slow-tests-v1-0-d3800a69a1a1@kerne…
---
lib/kunit/test.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 49698a168437..4b710c92340a 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -372,6 +372,36 @@ void kunit_init_test(struct kunit *test, const char *name, char *log)
}
EXPORT_SYMBOL_GPL(kunit_init_test);
+/* Only warn when a test takes more than twice the threshold */
+#define KUNIT_SPEED_WARNING_MULTIPLIER 2
+
+/* Slow tests are defined as taking more than 1s */
+#define KUNIT_SPEED_SLOW_THRESHOLD_S 1
+
+#define KUNIT_SPEED_SLOW_WARNING_THRESHOLD_S \
+ (KUNIT_SPEED_WARNING_MULTIPLIER * KUNIT_SPEED_SLOW_THRESHOLD_S)
+
+#define s_to_timespec64(s) ns_to_timespec64((s) * NSEC_PER_SEC)
+
+static void kunit_run_case_check_speed(struct kunit *test,
+ struct kunit_case *test_case,
+ struct timespec64 duration)
+{
+ struct timespec64 slow_thr =
+ s_to_timespec64(KUNIT_SPEED_SLOW_WARNING_THRESHOLD_S);
+ enum kunit_speed speed = test_case->attr.speed;
+
+ if (timespec64_compare(&duration, &slow_thr) < 0)
+ return;
+
+ if (speed == KUNIT_SPEED_VERY_SLOW || speed == KUNIT_SPEED_SLOW)
+ return;
+
+ kunit_warn(test,
+ "Test should be marked slow (runtime: %lld.%09lds)",
+ duration.tv_sec, duration.tv_nsec);
+}
+
/*
* Initializes and runs test case. Does not clean up or do post validations.
*/
@@ -379,6 +409,8 @@ static void kunit_run_case_internal(struct kunit *test,
struct kunit_suite *suite,
struct kunit_case *test_case)
{
+ struct timespec64 start, end;
+
if (suite->init) {
int ret;
@@ -390,7 +422,13 @@ static void kunit_run_case_internal(struct kunit *test,
}
}
+ ktime_get_ts64(&start);
+
test_case->run_case(test);
+
+ ktime_get_ts64(&end);
+
+ kunit_run_case_check_speed(test, test_case, timespec64_sub(end, start));
}
static void kunit_case_internal_cleanup(struct kunit *test)
--
2.41.0
This is the first part of adding Intel VT-d nested translation based on the
IOMMUFD nesting infrastructure. As of the iommufd nesting infrastructure
series [1], the iommu core supports new ops to allocate domains with user
data. For nesting, the user data is vendor-specific: IOMMU_HWPT_DATA_VTD_S1
is defined for the Intel VT-d stage-1 page table and will be used in the
stage-1 domain allocation path, and struct iommu_hwpt_vtd_s1 is defined to
pass the user_data for the Intel VT-d stage-1 domain allocation. This series
does not include the cache invalidation path; it will be added in part 2/2.
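As a rough userspace-side sketch of that allocation path (not part of this
series): the structure layout follows the uAPI proposed here, and the iommufd
file descriptor, device id and stage-2 nesting parent hwpt id are assumed to
have been set up beforehand.

#include <linux/iommufd.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Illustrative only: dev_id and s2_hwpt_id come from earlier iommufd setup. */
static int alloc_vtd_s1_hwpt(int iommufd, uint32_t dev_id, uint32_t s2_hwpt_id,
                             uint64_t guest_pgtbl_gpa, uint32_t addr_width)
{
        struct iommu_hwpt_vtd_s1 vtd = {
                .pgtbl_addr = guest_pgtbl_gpa,  /* guest stage-1 page table (GPA) */
                .addr_width = addr_width,       /* e.g. 48-bit guest address width */
        };
        struct iommu_hwpt_alloc cmd = {
                .size = sizeof(cmd),
                .dev_id = dev_id,
                .pt_id = s2_hwpt_id,            /* nesting parent (stage-2) hwpt */
                .data_type = IOMMU_HWPT_DATA_VTD_S1,
                .data_len = sizeof(vtd),
                .data_uptr = (uintptr_t)&vtd,
        };

        if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &cmd))
                return -1;

        return cmd.out_hwpt_id;                 /* the new stage-1 (nested) hwpt */
}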
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware erratum [2] requiring special
treatment. This erratum occurs when a stage-1 page table page (at any
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during the page table walk.
A flag, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17, is introduced to report
this erratum to userspace. With that restriction the user should either
disable nested translation in favor of RO stage-2 mappings, or ensure there
are no RO stage-2 mappings in order to enable nested translation.
The intel-iommu driver is armed with the necessary checks to prevent such
a mix in patch 8 of this series.
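For completeness, a hedged sketch of how userspace could probe the new flag
through the already-merged hw info reporting path; the struct layouts follow
the iommufd uAPI, and the bound device id is assumed to exist.

#include <linux/iommufd.h>
#include <sys/ioctl.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: dev_id comes from binding the device to iommufd. */
static bool vtd_has_errata_772415(int iommufd, uint32_t dev_id)
{
        struct iommu_hw_info_vtd vtd = {};
        struct iommu_hw_info cmd = {
                .size = sizeof(cmd),
                .dev_id = dev_id,
                .data_len = sizeof(vtd),
                .data_uptr = (uintptr_t)&vtd,
        };

        if (ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd))
                return false;
        if (cmd.out_data_type != IOMMU_HW_INFO_TYPE_INTEL_VTD)
                return false;

        /* If set, keep RO stage-2 mappings out or skip nesting entirely. */
        return vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
}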
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space, which certainly
includes RO regions, e.g. the vBIOS.
In reality we don't know of any usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in the Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3]; the corresponding QEMU code can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231026043938.63898-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v8:
- Adopt changes suggested by Jason on domain_alloc_user() op
https://lore.kernel.org/linux-iommu/20231024230319.GW3952@nvidia.com/
- Add Kevin's r-b on patch 06
- Fix description for IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 (Kevin)
v7: https://lore.kernel.org/linux-iommu/20231024151412.50046-1-yi.l.liu@intel.c…
- Rebase on top of latest iommufd nesting part 1/2
- Add the nested_parent flag in patch 07 and sanitize it for nested domain
allocation (Baolu)
- Fail the nested domain allocation if dirty tracking flag is set
v6: https://lore.kernel.org/linux-iommu/20231020093246.17015-1-yi.l.liu@intel.c…
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patches 2, 3, 5, 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor-specific hwpt alloc data structure before enum iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It would make sense to check every pfn used in the stage-1 page
table, but that is not feasible, so just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exist in the stage-2 domain
due to the erratum on read-only mappings on the Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 60 +++++++++---------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 117 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 345 insertions(+), 36 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMUs. It uses two stages of address translation to reach the physical
address. A stage-1 translation table is owned by userspace (e.g. by a guest
OS), while the stage-2 table is owned by the kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example: the stage-1 translation table is the guest
I/O page table. As the diagram below shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to the host and is used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain mean the same thing unless
otherwise noted. IOMMUFD already supports allocating a hw_pagetable linked
with an IOAS. However, the nesting case requires IOMMUFD to allow allocating
a hw_pagetable with driver-specific parameters, plus an interface to sync the
stage-1 IOTLB, as the user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate an IOMMU_DOMAIN_NESTED domain. It then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID, as sketched below.
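A minimal userspace sketch of that two-step flow, assuming the uAPI shape
proposed by this series (IOMMU_HWPT_ALLOC_NEST_PARENT comes from the already
merged nest parent work; the ids and the driver data blob are placeholders):

#include <linux/iommufd.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Illustrative only: ioas_id/dev_id come from earlier iommufd setup. */
static int alloc_nested_hwpt(int iommufd, uint32_t dev_id, uint32_t ioas_id,
                             uint32_t data_type, void *data, uint32_t data_len)
{
        struct iommu_hwpt_alloc parent = {
                .size = sizeof(parent),
                .flags = IOMMU_HWPT_ALLOC_NEST_PARENT,  /* stage-2 hwpt usable as parent */
                .dev_id = dev_id,
                .pt_id = ioas_id,
        };
        struct iommu_hwpt_alloc nested = { .size = sizeof(nested) };

        if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &parent))
                return -1;

        nested.dev_id = dev_id;
        nested.pt_id = parent.out_hwpt_id;      /* parent hwpt ID, new in this series */
        nested.data_type = data_type;           /* driver-specific data type */
        nested.data_len = data_len;
        nested.data_uptr = (uintptr_t)data;     /* driver-specific stage-1 data */

        if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &nested))
                return -1;

        return nested.out_hwpt_id;
}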
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3]; it is on top of Joao's dirty page tracking
v6 series and fix patches. The QEMU code can be found in [4].
Lastly, this is a team effort together with Nicolin Chen and Lu Baolu. Thanks
to them for the help. ^_^ Looking forward to your feedback.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v7:
- Fix a bug from Kevin
- Add r-b from Kevin
- Adopt Jason's suggestion to plumb user_data pointer to hwpt_paging allocation
https://lore.kernel.org/linux-iommu/20231024173009.GQ3952@nvidia.com/
- Select bit 6 for __IOMMU_DOMAIN_NESTED (Jason)
- Other compiling fixes per linux-next integration (Jason/Joao)
- Move patch "iommu: Pass in parent domain with user_data to domain_alloc_user op"
right before "iommufd: Add a nested HW pagetable object" (Jason)
v6: https://lore.kernel.org/linux-iommu/20231024150609.46884-1-yi.l.liu@intel.c…
- Rebase on top of Joao's dirty tracking series:
https://lore.kernel.org/linux-iommu/20231024135109.73787-1-joao.m.martins@o…
- Rebase on top of the enforce_cache_coherency removal patch:
https://lore.kernel.org/linux-iommu/ZTcAhwYjjzqM0A5M@Asurada-Nvidia/
- Add parent and user_data checks in the iommu driver before the driver actually
supports the two inputs. This makes for better bisect support; the change is
in patch 02.
v5: https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/amd/iommu.c | 9 +-
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 156 ++++++++---
drivers/iommu/iommufd/hw_pagetable.c | 265 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 70 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 153 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 71 ++++-
include/uapi/linux/iommufd.h | 31 +-
tools/testing/selftests/iommu/iommufd.c | 115 ++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 30 +-
14 files changed, 758 insertions(+), 186 deletions(-)
--
2.34.1
This series enables support for the data processing extensions in the
newly released 2023 architecture; this is mainly support for 8 bit
floating point formats. Most of the extensions only introduce new
instructions and therefore only require hwcaps, but there is a new EL0
visible control register, FPMR, used to control the 8 bit floating point
formats; we need to manage traps for this and context switch it.
The sharing of floating point save code between the host and guest
kernels slightly complicates the introduction of KVM support, we first
introduce host support with some placeholders for KVM then replace those
with the actual KVM support.
I've not added test coverage for ptrace; I've got a not quite finished
test program which exercises all the FP ptrace interfaces and their
interactions together, and my plan is to cover it there rather than add
another tiny test program that duplicates the boilerplate for tracing a
target and doesn't actually run the traced program.
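As a tiny userspace sketch of how the feature would be discovered once this
lands (the HWCAP2_FPMR name is the hwcap this series is expected to define,
so treat it as an assumption rather than final ABI):

#include <sys/auxv.h>
#include <asm/hwcap.h>
#include <stdio.h>

int main(void)
{
#ifdef HWCAP2_FPMR      /* hwcap name assumed from this series */
        if (getauxval(AT_HWCAP2) & HWCAP2_FPMR) {
                printf("FPMR present: FP8 format controls usable from EL0\n");
                return 0;
        }
#endif
        printf("FPMR hwcap not reported, skipping FP8 coverage\n");
        return 0;
}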
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (21):
arm64/sysreg: Add definition for ID_AA64PFR2_EL1
arm64/sysreg: Update ID_AA64ISAR2_EL1 defintion for DDI0601 2023-09
arm64/sysreg: Add definition for ID_AA64ISAR3_EL1
arm64/sysreg: Add definition for ID_AA64FPFR0_EL1
arm64/sysreg: Update ID_AA64SMFR0_EL1 definition for DDI0601 2023-09
arm64/sysreg: Update SCTLR_EL1 for DDI0601 2023-09
arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
arm64/sysreg: Add definition for FPMR
arm64/cpufeature: Hook new identification registers up to cpufeature
arm64/fpsimd: Enable host kernel access to FPMR
arm64/fpsimd: Support FEAT_FPMR
arm64/signal: Add FPMR signal handling
arm64/ptrace: Expose FPMR via ptrace
KVM: arm64: Add newly allocated ID registers to register descriptions
KVM: arm64: Support FEAT_FPMR for guests
arm64/hwcap: Define hwcaps for 2023 DPISA features
kselftest/arm64: Handle FPMR context in generic signal frame parser
kselftest/arm64: Add basic FPMR test
kselftest/arm64: Add 2023 DPISA hwcap test coverage
KVM: arm64: selftests: Document feature registers added in 2023 extensions
KVM: arm64: selftests: Teach get-reg-list about FPMR
Documentation/arch/arm64/elf_hwcaps.rst | 49 +++++
arch/arm64/include/asm/cpu.h | 3 +
arch/arm64/include/asm/cpufeature.h | 5 +
arch/arm64/include/asm/fpsimd.h | 2 +
arch/arm64/include/asm/hwcap.h | 15 ++
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/include/asm/processor.h | 2 +
arch/arm64/include/uapi/asm/hwcap.h | 15 ++
arch/arm64/include/uapi/asm/sigcontext.h | 8 +
arch/arm64/kernel/cpufeature.c | 72 +++++++
arch/arm64/kernel/cpuinfo.c | 18 ++
arch/arm64/kernel/fpsimd.c | 13 ++
arch/arm64/kernel/ptrace.c | 42 ++++
arch/arm64/kernel/signal.c | 59 ++++++
arch/arm64/kvm/fpsimd.c | 19 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 7 +-
arch/arm64/kvm/sys_regs.c | 17 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 153 ++++++++++++++-
include/uapi/linux/elf.h | 1 +
tools/testing/selftests/arm64/abi/hwcap.c | 217 +++++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../arm64/signal/testcases/fpmr_siginfo.c | 82 ++++++++
.../selftests/arm64/signal/testcases/testcases.c | 8 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
tools/testing/selftests/kvm/aarch64/get-reg-list.c | 11 +-
27 files changed, 810 insertions(+), 18 deletions(-)
---
base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
change-id: 20231003-arm64-2023-dpisa-2f3d25746474
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hello,
kernel test robot noticed "kernel-selftests.uevent.uevent_filtering.fail" on:
commit: 5b45a753776be5d21cf395ec97e81c9187fbeaca ("selftests: uevent filtering: fix return on error in uevent_listener")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[test failed on linux-next/master 2030579113a1b1b5bfd7ff24c0852847836d8fd1]
in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:
group: group-03
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
We also noticed this issue does not always happen. As shown below, we saw 15
failures out of 50 runs; however, the parent commit keeps passing.
37013b557b7f39e6  5b45a753776be5d21cf395ec97e
----------------  ---------------------------
       fail:runs  %reproduction      fail:runs
           |             |               |
             :50        30%            15:50   kernel-selftests.uevent.uevent_filtering.fail
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310261454.46082aaa-oliver.sang@intel.com
TAP version 13
1..1
# timeout set to 300
# selftests: uevent: uevent_filtering
# TAP version 13
# 1..1
# # Starting 1 tests from 1 test cases.
# # RUN global.uevent_filtering ...
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3532
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3546
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3556
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3585
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3595
# No buffer space available - Failed to receive uevent
# # uevent_filtering.c:479:uevent_filtering:Expected 0 (0) == ret (-1)
# # uevent_filtering: Test failed at step #10
# # FAIL global.uevent_filtering
# not ok 1 global.uevent_filtering
# # FAILED: 0 / 1 tests passed.
# # Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: uevent: uevent_filtering # exit=1
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231026/202310261454.46082aaa-oliv…
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
KUnit recently gained support to set up attributes, the first one being
the speed of a given test, which then allows filtering out slow tests.
A slow test is defined in the documentation as taking more than one
second. There's another speed attribute called "super slow", but its
definition is less clear.
Add support to the test runner to check the test execution time, and
report tests that should be marked as slow but aren't.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
To: Brendan Higgins <brendan.higgins(a)linux.dev>
To: David Gow <davidgow(a)google.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Rae Moar <rmoar(a)google.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: linux-kernel(a)vger.kernel.org
Changes from v1:
- Split the patch out of the series
- Change to trigger the warning only if the runtime is twice the
threshold (Jani, Rae)
- Split the speed check into a separate function (Rae)
- Link: https://lore.kernel.org/all/20230911-kms-slow-tests-v1-0-d3800a69a1a1@kerne…
---
lib/kunit/test.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 49698a168437..a1d5dd2bf87d 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -372,6 +372,25 @@ void kunit_init_test(struct kunit *test, const char *name, char *log)
}
EXPORT_SYMBOL_GPL(kunit_init_test);
+#define KUNIT_SPEED_SLOW_THRESHOLD_S 1
+
+static void kunit_run_case_check_speed(struct kunit *test,
+ struct kunit_case *test_case,
+ struct timespec64 duration)
+{
+ enum kunit_speed speed = test_case->attr.speed;
+
+ if (duration.tv_sec < (2 * KUNIT_SPEED_SLOW_THRESHOLD_S))
+ return;
+
+ if (speed == KUNIT_SPEED_VERY_SLOW || speed == KUNIT_SPEED_SLOW)
+ return;
+
+ kunit_warn(test,
+ "Test should be marked slow (runtime: %lld.%09lds)",
+ duration.tv_sec, duration.tv_nsec);
+}
+
/*
* Initializes and runs test case. Does not clean up or do post validations.
*/
@@ -379,6 +398,8 @@ static void kunit_run_case_internal(struct kunit *test,
struct kunit_suite *suite,
struct kunit_case *test_case)
{
+ struct timespec64 start, end;
+
if (suite->init) {
int ret;
@@ -390,7 +411,13 @@ static void kunit_run_case_internal(struct kunit *test,
}
}
+ ktime_get_ts64(&start);
+
test_case->run_case(test);
+
+ ktime_get_ts64(&end);
+
+ kunit_run_case_check_speed(test, test_case, timespec64_sub(end, start));
}
static void kunit_case_internal_cleanup(struct kunit *test)
--
2.41.0
This is the first part of adding Intel VT-d nested translation based on the
IOMMUFD nesting infrastructure. As of the iommufd nesting infrastructure
series [1], the iommu core supports new ops to allocate domains with user
data. For nesting, the user data is vendor-specific: IOMMU_HWPT_DATA_VTD_S1
is defined for the Intel VT-d stage-1 page table and will be used in the
stage-1 domain allocation path, and struct iommu_hwpt_vtd_s1 is defined to
pass the user_data for the Intel VT-d stage-1 domain allocation. This series
does not include the cache invalidation path; it will be added in part 2/2.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware erratum [2] requiring special
treatment. This erratum occurs when a stage-1 page table page (at any
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during the page table walk.
A flag, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17, is introduced to report
this erratum to userspace. With that restriction the user should either
disable nested translation in favor of RO stage-2 mappings, or ensure there
are no RO stage-2 mappings in order to enable nested translation.
The intel-iommu driver is armed with the necessary checks to prevent such
a mix in patch 8 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space, which certainly
includes RO regions, e.g. the vBIOS.
In reality we don't know of any usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in the Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3]; the corresponding QEMU code can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231024150609.46884-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v7:
- Rebase on top of latest iommufd nesting part 1/2
- Add the nested_parent flag in patch 07 and sanitize it for nested domain
allocation (Baolu)
- Fail the nested domain allocation if dirty tracking flag is set
v6: https://lore.kernel.org/linux-iommu/20231020093246.17015-1-yi.l.liu@intel.c…
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patches 2, 3, 5, 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor-specific hwpt alloc data structure before enum iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It would make sense to check every pfn used in the stage-1 page
table, but that is not feasible, so just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exist in the stage-2 domain
due to the erratum on read-only mappings on the Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 88 +++++++++++++++++----------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 362 insertions(+), 39 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Hi,
while testing a new patch on the livepatch kselftests, I tried the gen_tar
target and noticed that we only copy the resulting binaries into the final tar
file.
Per the kselftests documentation[1], the gen_tar target is used to package the
tests to run "on different systems". But what if the different system has
different libraries/library versions? Wouldn't that be a problem?
This question came up when I was working to build the livepatch modules as part
of the kselftests testing suite. The plan was to just package the test
scripts/programs/modules and then run the tests on a different system, such as
a different SLE version. Since the kernel would be different in this case, I
expected that gen_tar would copy the module source files so they can be compiled
on the target system.
While the current approach can work when the selftests rely solely on shell
scripts (cpufreq, kexec), those that compile userspace binaries (cgroup, alsa,
sched, ...) may not work.
Am I missing something? Is gen_tar only meant to copy tests to be run on
systems with the same libraries, or at least libraries of exactly the same
version?
Thanks in advance,
Marcos
[1]: https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
An isolated cpuset partition can currently be created to contain an
exclusive set of CPUs not used in other cgroups, with load balancing
disabled to reduce interference from the scheduler.
The main purpose of this isolated partition type is to dynamically
emulate what can be done via the "isolcpus" boot command line option,
specifically the default domain flag. One effect of the "isolcpus" option
is to remove the isolated CPUs from the cpumasks of unbound workqueues
since running work functions in an isolated CPU can be a major source
of interference. Changing the unbound workqueue cpumasks can be done at
run time by writing an appropriate cpumask without the isolated CPUs to
/sys/devices/virtual/workqueue/cpumask. So one can set up an isolated
cpuset partition and then write to the cpumask sysfs file to achieve a
similar level of CPU isolation. However, this manual process can be
error prone, as sketched below.
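To illustrate what that manual process looks like (a sketch only: the cgroup
name and CPU list are made up, the cpuset cgroup is assumed to exist already,
and parent-partition prerequisites are glossed over):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        /* 1. Give the (pre-created) cpuset cgroup its exclusive CPUs. */
        write_str("/sys/fs/cgroup/isol/cpuset.cpus", "2-3");
        /* 2. Turn it into an isolated partition (no load balancing). */
        write_str("/sys/fs/cgroup/isol/cpuset.cpus.partition", "isolated");
        /* 3. Manually push unbound workqueues off those CPUs; this is the
         *    step the series automates.  The mask excludes CPUs 2-3 on an
         *    8-CPU system. */
        write_str("/sys/devices/virtual/workqueue/cpumask", "f3");
        return 0;
}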
This patch series implements automatic exclusion of isolated CPUs from
unbound workqueue cpumasks when an isolated cpuset partition is created
and then adds those CPUs back when the isolated partition is destroyed.
There are also other places in the kernel that look at the HK_FLAG_DOMAIN
cpumask or other HK_FLAG_* cpumasks and exclude the isolated CPUs from
certain actions to further reduce interference. CPUs in an isolated
cpuset partition will not be able to avoid those interferences yet. That
may change in the future as the need arises.
Waiman Long (4):
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs
from wq_unbound_cpumask
selftests/cgroup: Minor code cleanup and reorganization of
test_cpuset_prs.sh
cgroup/cpuset: Keep track of CPUs in isolated partitions
cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
Documentation/admin-guide/cgroup-v2.rst | 10 +-
include/linux/workqueue.h | 2 +-
kernel/cgroup/cpuset.c | 237 +++++++++++++-----
kernel/workqueue.c | 42 +++-
.../selftests/cgroup/test_cpuset_prs.sh | 209 +++++++++------
5 files changed, 350 insertions(+), 150 deletions(-)
--
2.39.3
Nested translation is a hardware feature that is supported by many modern
IOMMUs. It uses two stages of address translation to reach the physical
address. A stage-1 translation table is owned by userspace (e.g. by a guest
OS), while the stage-2 table is owned by the kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example: the stage-1 translation table is the guest
I/O page table. As the diagram below shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to the host and is used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain mean the same thing unless
otherwise noted. IOMMUFD already supports allocating a hw_pagetable linked
with an IOAS. However, the nesting case requires IOMMUFD to allow allocating
a hw_pagetable with driver-specific parameters, plus an interface to sync the
stage-1 IOTLB, as the user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate an IOMMU_DOMAIN_NESTED domain, as sketched below. It then extends
the IOMMU_HWPT_ALLOC ioctl to accept user data and a parent hwpt ID.
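On the driver side, a hedged sketch of how the relayed parent/user_data pair
and the new copy helper are meant to be consumed by a domain_alloc_user
implementation; the "mydrv" names and the IOMMU_HWPT_DATA_MYDRV type are
hypothetical, and the op/helper signatures follow this series:

#include <linux/iommu.h>
#include <linux/err.h>

/* Hypothetical driver-side sketch, not a real driver. */
#define IOMMU_HWPT_DATA_MYDRV   64      /* hypothetical data type value */

struct mydrv_hwpt_data {
        __aligned_u64 pgtbl_addr;
        __u32 addr_width;
        __u32 __reserved;
};

/* Hypothetical helpers implemented elsewhere in the driver. */
struct iommu_domain *mydrv_domain_alloc_paging(struct device *dev, u32 flags);
struct iommu_domain *mydrv_alloc_nested_domain(struct iommu_domain *parent,
                                               struct mydrv_hwpt_data *data);

static struct iommu_domain *
mydrv_domain_alloc_user(struct device *dev, u32 flags,
                        struct iommu_domain *parent,
                        const struct iommu_user_data *user_data)
{
        struct mydrv_hwpt_data data;
        int ret;

        /* No parent/user_data means the plain (paging) allocation path. */
        if (!parent || !user_data)
                return mydrv_domain_alloc_paging(dev, flags);

        /* Copies and size-checks the user blob against the declared type. */
        ret = iommu_copy_struct_from_user(&data, user_data,
                                          IOMMU_HWPT_DATA_MYDRV, __reserved);
        if (ret)
                return ERR_PTR(ret);

        return mydrv_alloc_nested_domain(parent, &data);
}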
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3]; it is on top of Joao's dirty page tracking
v6 series and fix patches. The QEMU code can be found in [4].
Lastly, this is a team effort together with Nicolin Chen and Lu Baolu. Thanks
to them for the help. ^_^ Looking forward to your feedback.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Rebase on top of Joao's dirty tracking series:
https://lore.kernel.org/linux-iommu/20231024135109.73787-1-joao.m.martins@o…
- Rebase on top of the enforce_cache_coherency removal patch:
https://lore.kernel.org/linux-iommu/ZTcAhwYjjzqM0A5M@Asurada-Nvidia/
- Add parent and user_data checks in the iommu driver before the driver actually
supports the two inputs. This makes for better bisect support; the change is
in patch 02.
v5: https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 157 +++++++---
drivers/iommu/iommufd/hw_pagetable.c | 271 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 73 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 151 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 72 ++++-
include/uapi/linux/iommufd.h | 31 +-
tools/testing/selftests/iommu/iommufd.c | 120 ++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 31 +-
13 files changed, 768 insertions(+), 182 deletions(-)
--
2.34.1
Clang uses a different set of CLI args for coverage, and the output
needs to be processed by a different set of tools.
Update the Makefile and add an example of usage in kunit docs.
Michał Winiarski (2):
arch: um: Add Clang coverage support
Documentation: kunit: Add clang UML coverage example
Documentation/dev-tools/kunit/running_tips.rst | 11 +++++++++++
arch/um/Makefile-skas | 5 +++++
2 files changed, 16 insertions(+)
--
2.42.0