July 2023 - Linux-kselftest-mirror

[RESEND PATCH v3 0/2] RISC-V: mm: Make SV48 the default address space

by Charlie Jenkins

Make sv48 the default address space for mmap as some applications currently depend on this assumption. Also enable users to select desired address space using a non-zero hint address to mmap. Previous kernel changes caused Java and other applications to be broken on sv57 which this patch fixes.

Documentation is also added to the RISC-V virtual memory section to explain these changes.

Charlie Jenkins (2):
  RISC-V: mm: Restrict address space for sv39,sv48,sv57
  RISC-V: mm: Update documentation and include test

 Documentation/riscv/vm-layout.rst             | 22 +++++++++
 arch/riscv/include/asm/elf.h                  |  2 +-
 arch/riscv/include/asm/pgtable.h              | 21 ++++++--
 arch/riscv/include/asm/processor.h            | 34 ++++++++++---
 tools/testing/selftests/riscv/Makefile        |  2 +-
 tools/testing/selftests/riscv/mm/.gitignore   |  1 +
 tools/testing/selftests/riscv/mm/Makefile     | 21 ++++++++
 .../selftests/riscv/mm/testcases/mmap.c       | 49 +++++++++++++++++++
 8 files changed, 139 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
 create mode 100644 tools/testing/selftests/riscv/mm/Makefile
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap.c

--
2.41.0
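The hint-address behaviour described in this cover letter can be probed from userspace along the following lines. This is only a minimal sketch and not the selftest shipped with the series: the helper names `map_with_hint` and `addr_bits` are made up for illustration, and the sketch assumes nothing about which hint values select which address space (that is defined by the series' vm-layout.rst changes).

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous, private region and pass `hint`
 * as the first argument of mmap(). Without MAP_FIXED the kernel treats
 * the address purely as a hint. Returns MAP_FAILED on error.
 */
static void *map_with_hint(uintptr_t hint, size_t len)
{
    return mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/*
 * Hypothetical helper: number of significant bits in an address, i.e. the
 * position of its highest set bit. Useful for eyeballing whether a mapping
 * landed inside a given virtual address width.
 */
static int addr_bits(const void *p)
{
    uintptr_t v = (uintptr_t)p;
    int bits = 0;

    while (v) {
        bits++;
        v >>= 1;
    }
    return bits;
}
```

Comparing `addr_bits(map_with_hint(0, len))` with the result for a large non-zero hint would show which range the kernel picked on a given machine; on architectures other than RISC-V the hint is simply advisory and the two results may not differ.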


[PATCH bpf-next v2 0/6] Support defragmenting IPv(4|6) packets in BPF

by Daniel Xu

=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment, which makes enforcing consistent policy difficult. There are really only two stateless options, neither of which is very nice:

1. Enforce policy on first fragment and accept all subsequent fragments. This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments. This does not really work because some protocols may rely on fragmentation. For example, DNS may rely on oversized UDP packets for large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.

=== BPF related bits ===

Policy has traditionally been enforced from XDP/TC hooks. Both hooks run before kernel reassembly facilities. However, with the new BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing netfilter reassembly infra.

The basic idea is we bump a refcnt on the netfilter defrag module and then run the bpf prog after the defrag module runs. This allows bpf progs to transparently see full, reassembled packets. The nice thing about this is that progs don't have to carry around logic to detect fragments.

=== Changelog ===

Changes from v1:

* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active

[0]: https://datatracker.ietf.org/doc/html/rfc8900

Daniel Xu (6):
  netfilter: defrag: Add glue hooks for enabling/disabling defrag
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  netfilter: bpf: Prevent defrag module unload while link active
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 include/linux/netfilter.h                     |  15 +
 include/uapi/linux/bpf.h                      |   5 +
 net/ipv4/netfilter/nf_defrag_ipv4.c           |  17 +-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c     |  11 +
 net/netfilter/core.c                          |   6 +
 net/netfilter/nf_bpf_link.c                   | 149 ++++++++-
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/testing/selftests/bpf/Makefile          |   4 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 ++++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 ++++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 282 ++++++++++++++++++
 .../selftests/bpf/progs/ip_check_defrag.c     | 104 +++++++
 14 files changed, 752 insertions(+), 22 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

--
2.41.0
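To make the "only the first fragment has the full 5-tuple" point concrete, here is a small sketch of the fragment-detection logic that a program running before reassembly would otherwise have to carry. It is not code from the series; the macro and function names are illustrative assumptions. The bit layout of the IPv4 flags/fragment-offset field follows RFC 791.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout of the 16-bit IPv4 flags + fragment offset field (RFC 791). */
#define DEMO_FRAG_MF      0x2000u /* "more fragments" flag */
#define DEMO_FRAG_OFFMASK 0x1fffu /* fragment offset, in 8-byte units */

/*
 * `frag_off` is the header field already converted to host byte order.
 * A packet is a fragment if MF is set or the offset is non-zero.
 */
static bool demo_ipv4_is_fragment(uint16_t frag_off)
{
    return (frag_off & (DEMO_FRAG_MF | DEMO_FRAG_OFFMASK)) != 0;
}

/*
 * Only a packet with fragment offset 0 starts with the L4 header, so only
 * it exposes the ports needed for a 5-tuple policy lookup.
 */
static bool demo_ipv4_has_l4_header(uint16_t frag_off)
{
    return (frag_off & DEMO_FRAG_OFFMASK) == 0;
}
```

A stateless hook sees `demo_ipv4_has_l4_header()` return false for every fragment after the first, which is exactly the case where it must either accept or drop blindly. Running the program after the defrag module, as the series proposes, removes the need for such checks.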


[PATCH bpf-next] selftests/bpf: Bump and validate MAX_SYMS

by Björn Töpel

From: Björn Töpel <bjorn(a)rivosinc.com>

BPF tests that load /proc/kallsyms, e.g. bpf_cookie, will perform a buffer overrun if the number of syms on the system is larger than MAX_SYMS.

Bump the MAX_SYMS to 400000, and add a runtime check that bails out if the maximum is reached.

Signed-off-by: Björn Töpel <bjorn(a)rivosinc.com>
---
 tools/testing/selftests/bpf/trace_helpers.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 9b070cdf44ac..f83d9f65c65b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -18,7 +18,7 @@
 #define TRACEFS_PIPE	"/sys/kernel/tracing/trace_pipe"
 #define DEBUGFS_PIPE	"/sys/kernel/debug/tracing/trace_pipe"
 
-#define MAX_SYMS 300000
+#define MAX_SYMS 400000
 static struct ksym syms[MAX_SYMS];
 static int sym_cnt;
 
@@ -46,6 +46,9 @@ int load_kallsyms_refresh(void)
 			break;
 		if (!addr)
 			continue;
+		if (i >= MAX_SYMS)
+			return -EFBIG;
+
 		syms[i].addr = (long) addr;
 		syms[i].name = strdup(func);
 		i++;

base-commit: fd283ab196a867f8f65f36913e0fadd031fcb823
--
2.39.2
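The bounds check added by this patch follows a common pattern for filling a fixed-size table from an input of unknown length. The sketch below shows that pattern in isolation. It is not the selftest's `load_kallsyms_refresh()`: all `demo_` names are hypothetical, the table is deliberately tiny, and the input format is assumed to be kallsyms-like "address type name" lines.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_SYMS 4 /* deliberately tiny; the patch uses 400000 */

struct demo_sym {
    unsigned long addr;
    char *name;
};

static struct demo_sym demo_syms[DEMO_MAX_SYMS];
static int demo_sym_cnt;

/*
 * Parse "addr type name" lines from `f` into demo_syms[].
 * Returns 0 on success, -EFBIG if the input holds more symbols than the
 * table can store, or -ENOMEM if a name cannot be duplicated.
 */
static int demo_load_syms(FILE *f)
{
    char line[256], name[128], type;
    unsigned long addr;
    int i = 0;

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%lx %c %127s", &addr, &type, name) != 3)
            break;
        if (!addr)
            continue;
        /* Check the index before the store, never after it. */
        if (i >= DEMO_MAX_SYMS) {
            demo_sym_cnt = i;
            return -EFBIG;
        }
        demo_syms[i].addr = addr;
        demo_syms[i].name = strdup(name);
        if (!demo_syms[i].name)
            return -ENOMEM;
        i++;
    }
    demo_sym_cnt = i;
    return 0;
}
```

The important detail is that the check sits immediately before the array write, so no input size can push the index past the end of the table. A caller that receives `-EFBIG` knows the table is full rather than silently working with corrupted memory.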


[linux-next:master] BUILD REGRESSION c36ac601a98fb148147640bae219108ee81566f8

by kernel test robot

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: c36ac601a98fb148147640bae219108ee81566f8  Add linux-next specific files for 20230706

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202306122223.HHER4zOo-lkp@intel.com
https://lore.kernel.org/oe-kbuild-all/202307050034.tAJSN9qy-lkp@intel.com

Error/Warning: (recently discovered and may have been fixed)

arch/parisc/kernel/pdt.c:67:6: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
arch/riscv/kernel/crash_core.c:14:64: error: 'VMEMMAP_START' undeclared (first use in this function)
arch/riscv/kernel/crash_core.c:15:62: error: 'VMEMMAP_END' undeclared (first use in this function); did you mean 'MEMREMAP_ENC'?
arch/riscv/kernel/crash_core.c:8:27: error: 'VA_BITS' undeclared (first use in this function)
lib/kunit/executor_test.c:138:4: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict]
lib/kunit/test.c:775:38: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict]

Unverified Error/Warning (likely false positive, please contact us if interested):

drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c:98 mlx5_devcom_register_device() error: uninitialized symbol 'tmp_dev'.
kernel/trace/trace_functions_graph.c:1012 print_graph_return() warn: bitwise AND condition is false here
kernel/trace/trace_functions_graph.c:726 print_graph_entry_leaf() warn: bitwise AND condition is false here
{standard input}: Error: local label `"2" (instance number 9 of a fb label)' is not defined

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- i386-randconfig-m021-20230705
|   |-- kernel-trace-trace_functions_graph.c-print_graph_entry_leaf()-warn:bitwise-AND-condition-is-false-here
|   `-- kernel-trace-trace_functions_graph.c-print_graph_return()-warn:bitwise-AND-condition-is-false-here
|-- parisc-randconfig-r003-20230706
|   `-- arch-parisc-kernel-pdt.c:warning:no-previous-prototype-for-arch_report_meminfo
|-- parisc-randconfig-r081-20230703
|   `-- arch-parisc-kernel-pdt.c:warning:no-previous-prototype-for-arch_report_meminfo
|-- riscv-randconfig-r042-20230706
|   |-- arch-riscv-kernel-crash_core.c:error:VA_BITS-undeclared-(first-use-in-this-function)
|   |-- arch-riscv-kernel-crash_core.c:error:VMEMMAP_END-undeclared-(first-use-in-this-function)
|   `-- arch-riscv-kernel-crash_core.c:error:VMEMMAP_START-undeclared-(first-use-in-this-function)
|-- s390-randconfig-m041-20230705
|   `-- drivers-net-ethernet-mellanox-mlx5-core-lib-devcom.c-mlx5_devcom_register_device()-error:uninitialized-symbol-tmp_dev-.
`-- sh-allmodconfig
    `-- standard-input:Error:local-label-(instance-number-of-a-fb-label)-is-not-defined
clang_recent_errors
|-- arm64-randconfig-r004-20230706
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- hexagon-randconfig-r041-20230706
|   |-- lib-kunit-executor_test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
`-- powerpc-allyesconfig
    `-- clang:error:unsupported-option-fsanitize-thread-for-target-powerpc-unknown-linux-gnu

elapsed time: 735m

configs tested: 144
configs skipped: 8

tested configs:
alpha        allyesconfig                         gcc
alpha        defconfig                            gcc
alpha        randconfig-r006-20230706             gcc
arc          allyesconfig                         gcc
arc          defconfig                            gcc
arc          randconfig-r043-20230706             gcc
arm          allmodconfig                         gcc
arm          allyesconfig                         gcc
arm          defconfig                            gcc
arm          gemini_defconfig                     gcc
arm          imx_v4_v5_defconfig                  clang
arm          jornada720_defconfig                 gcc
arm          milbeaut_m10v_defconfig              clang
arm          mps2_defconfig                       gcc
arm          mv78xx0_defconfig                    clang
arm          pxa910_defconfig                     gcc
arm          randconfig-r046-20230706             clang
arm          s5pv210_defconfig                    clang
arm          spear3xx_defconfig                   clang
arm64        allyesconfig                         gcc
arm64        defconfig                            gcc
arm64        randconfig-r004-20230706             clang
arm64        randconfig-r024-20230706             gcc
csky         defconfig                            gcc
hexagon      alldefconfig                         clang
hexagon      randconfig-r041-20230706             clang
hexagon      randconfig-r045-20230706             clang
i386         allyesconfig                         gcc
i386         buildonly-randconfig-r004-20230706   clang
i386         buildonly-randconfig-r005-20230706   clang
i386         buildonly-randconfig-r006-20230706   clang
i386         debian-10.3                          gcc
i386         defconfig                            gcc
i386         randconfig-i001-20230706             clang
i386         randconfig-i002-20230706             clang
i386         randconfig-i003-20230706             clang
i386         randconfig-i004-20230706             clang
i386         randconfig-i005-20230706             clang
i386         randconfig-i006-20230706             clang
i386         randconfig-i011-20230706             gcc
i386         randconfig-i012-20230706             gcc
i386         randconfig-i013-20230706             gcc
i386         randconfig-i014-20230706             gcc
i386         randconfig-i015-20230706             gcc
i386         randconfig-i016-20230706             gcc
i386         randconfig-r035-20230706             clang
loongarch    allmodconfig                         gcc
loongarch    allnoconfig                          gcc
loongarch    defconfig                            gcc
loongarch    randconfig-r001-20230706             gcc
loongarch    randconfig-r025-20230706             gcc
loongarch    randconfig-r031-20230706             gcc
m68k         allmodconfig                         gcc
m68k         allyesconfig                         gcc
m68k         defconfig                            gcc
m68k         sun3_defconfig                       gcc
m68k         sun3x_defconfig                      gcc
microblaze   randconfig-r005-20230706             gcc
mips         allmodconfig                         gcc
mips         allyesconfig                         gcc
mips         ci20_defconfig                       gcc
mips         db1xxx_defconfig                     gcc
mips         rs90_defconfig                       clang
nios2        defconfig                            gcc
openrisc     or1klitex_defconfig                  gcc
openrisc     randconfig-r015-20230706             gcc
parisc       allyesconfig                         gcc
parisc       defconfig                            gcc
parisc       randconfig-r003-20230706             gcc
parisc       randconfig-r005-20230706             gcc
parisc       randconfig-r032-20230705             gcc
parisc       randconfig-r036-20230706             gcc
parisc64     defconfig                            gcc
powerpc      allmodconfig                         gcc
powerpc      allnoconfig                          gcc
powerpc      g5_defconfig                         clang
powerpc      mpc5200_defconfig                    clang
powerpc      mpc834x_itx_defconfig                gcc
powerpc      pcm030_defconfig                     gcc
powerpc      randconfig-r013-20230706             gcc
powerpc      randconfig-r036-20230705             gcc
powerpc      skiroot_defconfig                    clang
powerpc      walnut_defconfig                     clang
powerpc      xes_mpc85xx_defconfig                clang
riscv        allmodconfig                         gcc
riscv        allnoconfig                          gcc
riscv        allyesconfig                         gcc
riscv        defconfig                            gcc
riscv        randconfig-r003-20230706             clang
riscv        randconfig-r021-20230706             gcc
riscv        randconfig-r023-20230706             gcc
riscv        randconfig-r042-20230706             gcc
riscv        rv32_defconfig                       gcc
s390         allmodconfig                         gcc
s390         allyesconfig                         gcc
s390         defconfig                            gcc
s390         randconfig-r031-20230705             gcc
s390         randconfig-r044-20230706             gcc
sh           allmodconfig                         gcc
sh           ecovec24_defconfig                   gcc
sh           rsk7264_defconfig                    gcc
sh           titan_defconfig                      gcc
sparc        allyesconfig                         gcc
sparc        defconfig                            gcc
sparc        randconfig-r016-20230706             gcc
sparc        sparc64_defconfig                    gcc
sparc64      randconfig-r002-20230706             gcc
sparc64      randconfig-r035-20230705             gcc
um           allmodconfig                         clang
um           allnoconfig                          clang
um           allyesconfig                         clang
um           defconfig                            gcc
um           i386_defconfig                       gcc
um           randconfig-r011-20230706             clang
um           randconfig-r034-20230706             gcc
um           x86_64_defconfig                     gcc
x86_64       allyesconfig                         gcc
x86_64       buildonly-randconfig-r001-20230706   clang
x86_64       buildonly-randconfig-r002-20230706   clang
x86_64       buildonly-randconfig-r003-20230706   clang
x86_64       defconfig                            gcc
x86_64       kexec                                gcc
x86_64       randconfig-r026-20230706             gcc
x86_64       randconfig-r033-20230706             clang
x86_64       randconfig-x001-20230706             gcc
x86_64       randconfig-x002-20230706             gcc
x86_64       randconfig-x003-20230706             gcc
x86_64       randconfig-x004-20230706             gcc
x86_64       randconfig-x005-20230706             gcc
x86_64       randconfig-x006-20230706             gcc
x86_64       randconfig-x011-20230706             clang
x86_64       randconfig-x012-20230706             clang
x86_64       randconfig-x013-20230706             clang
x86_64       randconfig-x014-20230706             clang
x86_64       randconfig-x015-20230706             clang
x86_64       randconfig-x016-20230706             clang
x86_64       rhel-8.3-rust                        clang
x86_64       rhel-8.3                             gcc
xtensa       audio_kc705_defconfig                gcc
xtensa       cadence_csp_defconfig                gcc
xtensa       randconfig-r002-20230706             gcc
xtensa       randconfig-r004-20230706             gcc
xtensa       randconfig-r022-20230706             gcc
xtensa       randconfig-r034-20230705             gcc

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


[PATCH v23 0/5] Implement IOCTL to get and optionally clear info about PTEs

by Muhammad Usama Anjum

*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity

*Changes in v22*:
- Interface change:
  - Replace [start, start + len) with [start, end)
  - Return the ending address of the address walk in start

*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on partial hugetlb

*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO

*Changes in v19*
- Minor changes and interface updates

*Changes in v18*
- Rebase on top of next-20230613
- Minor updates

*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch

*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back

*Changes in v15*
- Build fix (Add missed build fix in RESEND)

*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs

*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only once

*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates

*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with UFFDIO_WRITEPROTECT

*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation

*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code

*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation

*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty flags

*Motivation*

The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of the pages that are written to in a region of virtual memory.

This syscall is used in Windows applications and games etc. It is currently being emulated in a pretty slow manner in userspace. Our purpose is to enhance the kernel such that we can translate it efficiently. Currently some out of tree hack patches are being used to efficiently emulate it in some kernels. We intend to replace those with these patches. So the whole of gaming on Linux can effectively benefit from this. It means there would be tons of users of this code.

CRIU use case [2] was mentioned by Andrei and Danylo:

> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.

Andrei defines the following uses of this code:

* It is more granular and allows us to track changed pages more effectively. The current interface can clear dirty bits for the entire process only. In addition, reading info about pages is a separate operation. It means we must freeze the process to read information about all its pages, reset dirty bits, only then we can start dumping pages. The information about pages becomes more and more outdated, while we are processing pages. The new interface solves both these downsides. First, it allows us to read pte bits and clear the soft-dirty bit atomically. It means that CRIU will not need to freeze processes to pre-dump their memory. Second, it clears soft-dirty bits for a specified region of memory. It means CRIU will have actual info about pages to the moment of dumping them.

* The new interface has to be much faster because basic page filtering is happening in the kernel. With the old interface, we have to read pagemap for each page.

*Implementation Evolution (Short Summary)*

From the definition of GetWriteWatch(), we feel like kernel's soft-dirty feature can be used under the hood with some additions like:

* reset soft-dirty flag for only a specific region of memory instead of clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically

So we decided to use ioctl on pagemap file to read and/or reset soft-dirty flag. But using soft-dirty flag, sometimes we get extra pages which weren't even written. They had become soft-dirty because of VMA merging and VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David reported that mprotect etc messes up the soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We discussed if we can revert these patches. But we could not reach any conclusion. So at this point, I made a couple of tries to solve this whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation:

* [7] Correct the bug fixed wrongly back in 2014. It had potential to cause regression. We left it behind.
* [8] Keep a list of soft-dirty part of a VMA across splits and merges. I got the reply don't increase the size of the VMA by 8 bytes.

At this point, we left soft-dirty considering it is too delicate and userfaultfd [9] seemed like the only way forward. From there onward, we have been basing soft-dirty emulation on userfaultfd wp feature where kernel resolves the faults itself when WP_ASYNC feature is used. It was straightforward to add WP_ASYNC feature in userfaultfd. Now we get only those pages dirty or written-to which are really written in reality. (PS: another userfaultfd feature, WP_UNPOPULATED, is also required to avoid pre-faulting memory before write-protecting [9].)

All the different masks were added on the request of CRIU devs to create interface more generic and better.

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com

*Original Cover letter from v8*

Hello,

Note: Soft-dirty pages and pages which have been written-to are synonyms. As kernel already has soft-dirty feature inside which we have given up to use, we are using written-to terminology while using UFFD async WP under the hood.

This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:

- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

It is possible to find and clear soft-dirty pages entirely in userspace. But it isn't efficient:

- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping

Some benchmarks can be seen here[1]. This series adds features that weren't present earlier:

- There is no atomic get soft-dirty/Written-to status and clear present in the kernel.
- The pages which have been written-to can not be found in accurate way. (Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty pages than there actually are.)

Historically, soft-dirty PTE bit tracking has been used in the CRIU project. The procfs interface is enough for finding the soft-dirty bit status and clearing the soft-dirty bit of all the pages of a process. We have the use case where we need to track the soft-dirty PTE bit for only specific pages on-demand. We need this tracking and clear mechanism of a region of memory while the process is running to emulate the GetWriteWatch() syscall of Windows.

*(Moved to using UFFD instead of soft-dirty feature to find pages which have been written-to from v7 patch series)*:

Stop using the soft-dirty flags for finding which pages have been written to. It is too delicate and wrong as it shows more soft-dirty pages than the actual soft-dirty pages. There is no interest in correcting it [2][3] as this is how the feature was written years ago. It shouldn't be updated to changed behaviour. Peter Xu has suggested using the async version of the UFFD WP [4] as it is based inherently on the PTEs.

So in this patch series, I've added a new mode to the UFFD which is asynchronous version of the write protect. When this variant of the UFFD WP is used, the page faults are resolved automatically by the kernel. The pages which have been written-to can be found by reading pagemap file (!PM_UFFD_WP). This feature can be used successfully to find which pages have been written to from the time the pages were write protected. This works just like the soft-dirty flag without showing any extra pages which aren't soft-dirty in reality.

The information related to pages if the page is file mapped, present and swapped is required for the CRIU project [5][6]. The addition of the required mask, any mask, excluded mask and return masks are also required for the CRIU project [5].

The IOCTL returns the addresses of the pages which match the specific masks. The page addresses are returned in struct page_region in a compact form. The max_pages is needed to support a use case where user only wants to get a specific number of pages. So there is no need to find all the pages of interest in the range when max_pages is specified. The IOCTL returns when the maximum number of the pages are found. The max_pages is optional. If max_pages is specified, it must be equal to or greater than the vec_size. This restriction is needed to handle the worst case when one page_region only contains info of one page and it cannot be compacted. This is needed to emulate the Windows GetWriteWatch() syscall.

The patch series include the detailed selftest which can be used as an example for the uffd async wp test and PAGEMAP_IOCTL. It shows the interface usages as well.

[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (4):
  fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
  tools headers UAPI: Update linux/fs.h with the kernel sources
  mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
  selftests: mm: add pagemap ioctl tests

Peter Xu (1):
  userfaultfd: UFFD_FEATURE_WP_ASYNC

 Documentation/admin-guide/mm/pagemap.rst     |   58 +
 Documentation/admin-guide/mm/userfaultfd.rst |   35 +
 fs/proc/task_mmu.c                           |  577 +++++++
 fs/userfaultfd.c                             |   26 +-
 include/linux/hugetlb.h                      |    1 +
 include/linux/userfaultfd_k.h                |   21 +-
 include/uapi/linux/fs.h                      |   55 +
 include/uapi/linux/userfaultfd.h             |    9 +-
 mm/hugetlb.c                                 |   34 +-
 mm/memory.c                                  |   27 +-
 tools/include/uapi/linux/fs.h                |   55 +
 tools/testing/selftests/mm/.gitignore        |    2 +
 tools/testing/selftests/mm/Makefile          |    3 +-
 tools/testing/selftests/mm/config            |    1 +
 tools/testing/selftests/mm/pagemap_ioctl.c   | 1464 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh    |    4 +
 16 files changed, 2348 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
 mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh

--
2.39.2
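The "compact form" and the interaction between the output vector size and max_pages described in this cover letter can be illustrated with a small userspace model. This is a sketch of the idea only, not the kernel implementation and not the UAPI: `struct demo_region`, its field names and `demo_compact()` are assumptions invented for this example, and the per-page flag words stand in for whatever categories the ioctl would report.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the series' struct page_region (assumed layout). */
struct demo_region {
    uint64_t start;  /* address of the first page in the run */
    uint64_t npages; /* number of consecutive pages in the run */
    uint64_t flags;  /* flag word shared by every page in the run */
};

/*
 * Walk `npages` per-page flag words for pages starting at `base` and merge
 * consecutive pages with identical non-zero flags into one region.
 * Stops when `vec` is full or when `max_pages` pages have been reported
 * (max_pages == 0 means "no limit"). Returns the number of regions written.
 */
static size_t demo_compact(const uint64_t *page_flags, size_t npages,
                           uint64_t base, uint64_t page_size,
                           struct demo_region *vec, size_t vec_len,
                           size_t max_pages)
{
    size_t used = 0, reported = 0;

    for (size_t i = 0; i < npages; i++) {
        uint64_t flags = page_flags[i];
        uint64_t addr = base + i * page_size;
        struct demo_region *last = used ? &vec[used - 1] : NULL;

        if (!flags)
            continue; /* page does not match, nothing to report */
        if (max_pages && reported == max_pages)
            break;

        if (last && last->flags == flags &&
            last->start + last->npages * page_size == addr) {
            last->npages++; /* extend the current run */
        } else {
            if (used == vec_len)
                break; /* no room for a new region */
            vec[used].start = addr;
            vec[used].npages = 1;
            vec[used].flags = flags;
            used++;
        }
        reported++;
    }
    return used;
}
```

In the worst case no two matching pages are adjacent or share flags, so every reported page occupies its own region. That is the situation the cover letter's restriction on max_pages relative to vec_size is meant to account for.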


[PATCH v22 0/5] Implement IOCTL to get and optionally clear info about PTEs

by Muhammad Usama Anjum

Changes in v22: - Interface change: - Replace [start start + len) with [start, end) - Return the ending address of the address walk in start Changes in v21: - Abort walk instead of returning error if WP is to be performed on partial hugetlb *Changes in v20* - Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO *Changes in v19* - Minor changes and interface updates *Changes in v18* - Rebase on top of next-20230613 - Minor updates *Changes in v17* - Rebase on top of next-20230606 - Minor improvements in PAGEMAP_SCAN IOCTL patch *Changes in v16* - Fix a corner case - Add exclusive PM_SCAN_OP_WP back *Changes in v15* - Build fix (Add missed build fix in RESEND) *Changes in v14* - Fix build error caused by #ifdef added at last minute in some configs *Changes in v13* - Rebase on top of next-20230414 - Give-up on using uffd_wp_range() and write new helpers, flush tlb only once *Changes in v12* - Update and other memory types to UFFD_FEATURE_WP_ASYNC - Rebaase on top of next-20230406 - Review updates *Changes in v11* - Rebase on top of next-20230307 - Base patches on UFFD_FEATURE_WP_UNPOPULATED - Do a lot of cosmetic changes and review updates - Remove ENGAGE_WP + !GET operation as it can be performed with UFFDIO_WRITEPROTECT *Changes in v10* - Add specific condition to return error if hugetlb is used with wp async - Move changes in tools/include/uapi/linux/fs.h to separate patch - Add documentation *Changes in v9:* - Correct fault resolution for userfaultfd wp async - Fix build warnings and errors which were happening on some configs - Simplify pagemap ioctl's code *Changes in v8:* - Update uffd async wp implementation - Improve PAGEMAP_IOCTL implementation *Changes in v7:* - Add uffd wp async - Update the IOCTL to use uffd under the hood instead of soft-dirty flags *Motivation* The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows GetWriteWatch() syscall [1]. 
The GetWriteWatch{} retrieves the addresses of the pages that are written to in a region of virtual memory. This syscall is used in Windows applications and games etc. This syscall is being emulated in pretty slow manner in userspace. Our purpose is to enhance the kernel such that we translate it efficiently in a better way. Currently some out of tree hack patches are being used to efficiently emulate it in some kernels. We intend to replace those with these patches. So the whole gaming on Linux can effectively get benefit from this. It means there would be tons of users of this code. CRIU use case [2] was mentioned by Andrei and Danylo: > Use cases for migrating sparse VMAs are binaries sanitized with ASAN, > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of > shadow memory [4]. Being able to migrate such binaries allows to highly > reduce the amount of work needed to identify and fix post-migration > crashes, which happen constantly. Andrei's defines the following uses of this code: * it is more granular and allows us to track changed pages more effectively. The current interface can clear dirty bits for the entire process only. In addition, reading info about pages is a separate operation. It means we must freeze the process to read information about all its pages, reset dirty bits, only then we can start dumping pages. The information about pages becomes more and more outdated, while we are processing pages. The new interface solves both these downsides. First, it allows us to read pte bits and clear the soft-dirty bit atomically. It means that CRIU will not need to freeze processes to pre-dump their memory. Second, it clears soft-dirty bits for a specified region of memory. It means CRIU will have actual info about pages to the moment of dumping them. * The new interface has to be much faster because basic page filtering is happening in the kernel. With the old interface, we have to read pagemap for each page. 
*Implementation Evolution (Short Summary)* From the definition of GetWriteWatch(), we feel like kernel's soft-dirty feature can be used under the hood with some additions like: * reset soft-dirty flag for only a specific region of memory instead of clearing the flag for the entire process * get and clear soft-dirty flag for a specific region atomically So we decided to use ioctl on pagemap file to read or/and reset soft-dirty flag. But using soft-dirty flag, sometimes we get extra pages which weren't even written. They had become soft-dirty because of VMA merging and VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were able to by-pass this short coming by ignoring VM_SOFTDIRTY until David reported that mprotect etc messes up the soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We discussed if we can revert these patches. But we could not reach to any conclusion. So at this point, I made couple of tries to solve this whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation: * [7] Correct the bug fixed wrongly back in 2014. It had potential to cause regression. We left it behind. * [8] Keep a list of soft-dirty part of a VMA across splits and merges. I got the reply don't increase the size of the VMA by 8 bytes. At this point, we left soft-dirty considering it is too much delicate and userfaultfd [9] seemed like the only way forward. From there onward, we have been basing soft-dirty emulation on userfaultfd wp feature where kernel resolves the faults itself when WP_ASYNC feature is used. It was straight forward to add WP_ASYNC feature in userfautlfd. Now we get only those pages dirty or written-to which are really written in reality. (PS There is another WP_UNPOPULATED userfautfd feature is required which is needed to avoid pre-faulting memory before write-protecting [9].) All the different masks were added on the request of CRIU devs to create interface more generic and better. 
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com

*Original Cover letter from v8*

Hello,

Note: Soft-dirty pages and pages which have been written-to are synonyms. As the kernel already has a soft-dirty feature which we have given up on using, we use the written-to terminology while using UFFD async WP under the hood.

This IOCTL, PAGEMAP_SCAN on the pagemap file, can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

It is possible to find and clear soft-dirty pages entirely in userspace. But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping

Some benchmarks can be seen here[1].

This series adds features that weren't present earlier:
- There is no atomic get soft-dirty/written-to status and clear present in the kernel.
- The pages which have been written-to cannot be found in an accurate way. (The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty pages than there actually are.)

Historically, soft-dirty PTE bit tracking has been used in the CRIU project. The procfs interface is enough for finding the soft-dirty bit status and clearing the soft-dirty bit of all the pages of a process. We have the use case where we need to track the soft-dirty PTE bit for only specific pages on demand. We need this tracking and clear mechanism for a region of memory while the process is running to emulate the GetWriteWatch() syscall of Windows.

*(Moved to using UFFD instead of the soft-dirty feature to find pages which have been written-to from the v7 patch series)*: Stop using the soft-dirty flags for finding which pages have been written to. It is too delicate and wrong, as it shows more soft-dirty pages than there actually are. There is no interest in correcting it [2][3], as this is how the feature was written years ago. It shouldn't be updated to change its behaviour. Peter Xu has suggested using the async version of the UFFD WP [4], as it is based inherently on the PTEs.

So in this patch series, I've added a new mode to the UFFD which is an asynchronous version of the write protect. When this variant of the UFFD WP is used, the page faults are resolved automatically by the kernel. The pages which have been written-to can be found by reading the pagemap file (!PM_UFFD_WP). This feature can be used successfully to find which pages have been written to since the time the pages were write protected. This works just like the soft-dirty flag without showing any extra pages which aren't soft-dirty in reality.

The information about whether a page is file mapped, present and swapped is required for the CRIU project [5][6]. The addition of the required mask, any mask, excluded mask and return mask is also required for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific masks. The page addresses are returned in struct page_region in a compact form. The max_pages is needed to support a use case where the user only wants to get a specific number of pages. So there is no need to find all the pages of interest in the range when max_pages is specified. The IOCTL returns when the maximum number of pages is found. The max_pages is optional. If max_pages is specified, it must be equal to or greater than the vec_size. This restriction is needed to handle the worst case, when one page_region only contains info of one page and it cannot be compacted. This is needed to emulate the Windows GetWriteWatch() syscall.

The patch series includes a detailed selftest which can be used as an example for the uffd async wp test and PAGEMAP_IOCTL. It shows the interface usages as well.

[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (4):
  fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
  tools headers UAPI: Update linux/fs.h with the kernel sources
  mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
  selftests: mm: add pagemap ioctl tests

Peter Xu (1):
  userfaultfd: UFFD_FEATURE_WP_ASYNC

 Documentation/admin-guide/mm/pagemap.rst     |   58 +
 Documentation/admin-guide/mm/userfaultfd.rst |   35 +
 fs/proc/task_mmu.c                           |  565 +++++++
 fs/userfaultfd.c                             |   26 +-
 include/linux/hugetlb.h                      |    1 +
 include/linux/userfaultfd_k.h                |   21 +-
 include/uapi/linux/fs.h                      |   55 +
 include/uapi/linux/userfaultfd.h             |    9 +-
 mm/hugetlb.c                                 |   34 +-
 mm/memory.c                                  |   27 +-
 tools/include/uapi/linux/fs.h                |   55 +
 tools/testing/selftests/mm/.gitignore        |    2 +
 tools/testing/selftests/mm/Makefile          |    3 +-
 tools/testing/selftests/mm/config            |    1 +
 tools/testing/selftests/mm/pagemap_ioctl.c   | 1464 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh    |    4 +
 16 files changed, 2336 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
 mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh

-- 
2.39.2
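
The mask matching, max_pages limit and page_region compaction described in the PAGEMAP_SCAN cover letter above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the series' UAPI: the flag bit values, the struct layouts, the mask semantics (all required bits set, at least one "any" bit set when that mask is non-zero, no excluded bit set) and the names model_region, model_masks, page_matches and model_scan are all hypothetical. The write-protect step (PAGEMAP_WP_ENGAGE) is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit assignments; the cover letter only names the flags. */
#define PAGE_IS_WRITTEN  (1u << 0)
#define PAGE_IS_FILE     (1u << 1)
#define PAGE_IS_PRESENT  (1u << 2)
#define PAGE_IS_SWAPPED  (1u << 3)

#define MODEL_PAGE_SIZE 4096ull

/* Hypothetical stand-in for the series' struct page_region. */
struct model_region {
	uint64_t start;   /* address of the first page in the run */
	uint64_t len;     /* number of pages in the run */
	uint32_t bitmap;  /* reported flags shared by every page in the run */
};

/* Hypothetical grouping of the four masks named in the cover letter. */
struct model_masks {
	uint32_t required;  /* all of these bits must be set */
	uint32_t anyof;     /* if non-zero, at least one must be set */
	uint32_t excluded;  /* none of these bits may be set */
	uint32_t ret;       /* bits reported back to the caller */
};

static int page_matches(uint32_t flags, const struct model_masks *m)
{
	if ((flags & m->required) != m->required)
		return 0;
	if (m->anyof && !(flags & m->anyof))
		return 0;
	if (flags & m->excluded)
		return 0;
	return 1;
}

/*
 * Walk npages per-page flag words starting at address base. Adjacent
 * matching pages with identical reported bits are merged into one
 * region. Stops once max_pages pages were reported (0 means no limit)
 * or the output vector is full. Returns the number of regions written.
 */
static size_t model_scan(const uint32_t *flags, size_t npages, uint64_t base,
			 const struct model_masks *m,
			 struct model_region *vec, size_t vec_len,
			 size_t max_pages)
{
	size_t nregions = 0, found = 0;

	assert(vec != NULL && vec_len > 0);
	/* The cover letter requires max_pages >= vec_size when it is set. */
	assert(max_pages == 0 || max_pages >= vec_len);

	for (size_t i = 0; i < npages; i++) {
		uint64_t addr = base + (uint64_t)i * MODEL_PAGE_SIZE;
		struct model_region *last = nregions ? &vec[nregions - 1] : NULL;
		uint32_t bits;

		if (max_pages && found == max_pages)
			break;
		if (!page_matches(flags[i], m))
			continue;

		bits = flags[i] & m->ret;
		if (last && last->bitmap == bits &&
		    last->start + last->len * MODEL_PAGE_SIZE == addr) {
			last->len++;		/* extend the current run */
		} else {
			if (nregions == vec_len)
				break;		/* output vector is full */
			vec[nregions].start = addr;
			vec[nregions].len = 1;
			vec[nregions].bitmap = bits;
			nregions++;
		}
		found++;
	}
	return nregions;
}
```

In the worst case every reported page lands in its own region, which is consistent with the stated rule that max_pages must not be smaller than the vector size.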

[PATCH v2 1/6] mm: userfaultfd: add new UFFDIO_POISON ioctl

by Axel Rasmussen

The basic idea here is to "simulate" memory poisoning for VMs. A VM running on some host might encounter a memory error, after which some page(s) are poisoned (i.e., future accesses SIGBUS). They expect that once poisoned, pages can never become "un-poisoned". So, when we live migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as quickly as possible. So, we start it running before all memory has been copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker so any future accesses will SIGBUS. Because the pte is now "present", future accesses won't generate more userfaultfd events, they'll just SIGBUS directly.

UFFDIO_POISON does not handle unmapping previously-present PTEs. This isn't needed, because during live migration we want to intercept all accesses with userfaultfd (not just writes, so WP mode isn't useful for this). So whether minor or missing mode is being used (or both), the PTE won't be present in any case, so handling that case isn't needed.

Why return VM_FAULT_HWPOISON instead of VM_FAULT_SIGBUS when one of these markers is encountered? For "normal" userspace programs there isn't a big difference, both yield a SIGBUS. The difference for KVM is key though: VM_FAULT_HWPOISON will result in an MCE being injected into the guest (which is the behavior we want). With VM_FAULT_SIGBUS, the hypervisor would need to catch the SIGBUS and deal with the MCE injection itself.
Signed-off-by: Axel Rasmussen <axelrasmussen(a)google.com>
---
 fs/userfaultfd.c                 | 63 ++++++++++++++++++++++++++++++++
 include/linux/swapops.h          |  3 +-
 include/linux/userfaultfd_k.h    |  4 ++
 include/uapi/linux/userfaultfd.h | 25 +++++++++++--
 mm/memory.c                      |  4 ++
 mm/userfaultfd.c                 | 62 ++++++++++++++++++++++++++++++-
 6 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..c26a883399c9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1965,6 +1965,66 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_poison uffdio_poison;
+	struct uffdio_poison __user *user_uffdio_poison;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_poison = (struct uffdio_poison __user *)arg;
+
+	ret = -EAGAIN;
+	if (atomic_read(&ctx->mmap_changing))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_poison, user_uffdio_poison,
+			   /* don't copy the output fields */
+			   sizeof(uffdio_poison) - (sizeof(__s64))))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_poison.range.start,
+			     uffdio_poison.range.len);
+	if (ret)
+		goto out;
+
+	ret = -EINVAL;
+	/* double check for wraparound just in case. */
+	if (uffdio_poison.range.start + uffdio_poison.range.len <=
+	    uffdio_poison.range.start) {
+		goto out;
+	}
+	if (uffdio_poison.mode & ~UFFDIO_POISON_MODE_DONTWAKE)
+		goto out;
+
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
+					  uffdio_poison.range.len,
+					  &ctx->mmap_changing, 0);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
+	if (unlikely(put_user(ret, &user_uffdio_poison->updated)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) {
+		range.start = uffdio_poison.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_poison.range.len ? 0 : -EAGAIN;
+
+out:
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -2066,6 +2126,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_CONTINUE:
 		ret = userfaultfd_continue(ctx, arg);
 		break;
+	case UFFDIO_POISON:
+		ret = userfaultfd_poison(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4c932cb45e0b..8259fee32421 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -394,7 +394,8 @@ typedef unsigned long pte_marker;
 
 #define PTE_MARKER_UFFD_WP		BIT(0)
 #define PTE_MARKER_SWAPIN_ERROR		BIT(1)
-#define PTE_MARKER_MASK			(BIT(2) - 1)
+#define PTE_MARKER_UFFD_POISON		BIT(2)
+#define PTE_MARKER_MASK			(BIT(3) - 1)
 
 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
 {
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac7b0c96d351..ac8c6854097c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -46,6 +46,7 @@ enum mfill_atomic_mode {
 	MFILL_ATOMIC_COPY,
 	MFILL_ATOMIC_ZEROPAGE,
 	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
 	NR_MFILL_ATOMIC_MODES,
 };
 
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 				     unsigned long len, atomic_t *mmap_changing,
 				     uffd_flags_t flags);
+extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+				   unsigned long len, atomic_t *mmap_changing,
+				   uffd_flags_t flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 66dd4cd277bd..62151706c5a3 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -39,7 +39,8 @@
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
-			   UFFD_FEATURE_WP_UNPOPULATED)
+			   UFFD_FEATURE_WP_UNPOPULATED |	\
+			   UFFD_FEATURE_POISON)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -49,12 +50,14 @@
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE |		\
 	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
-	 (__u64)1 << _UFFDIO_CONTINUE)
+	 (__u64)1 << _UFFDIO_CONTINUE |		\
+	 (__u64)1 << _UFFDIO_POISON)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
+	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
 	 (__u64)1 << _UFFDIO_CONTINUE |		\
-	 (__u64)1 << _UFFDIO_WRITEPROTECT)
+	 (__u64)1 << _UFFDIO_POISON)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -71,6 +74,7 @@
 #define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_CONTINUE		(0x07)
+#define _UFFDIO_POISON			(0x08)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -91,6 +95,8 @@
 				      struct uffdio_writeprotect)
 #define UFFDIO_CONTINUE		_IOWR(UFFDIO, _UFFDIO_CONTINUE,	\
 				      struct uffdio_continue)
+#define UFFDIO_POISON		_IOWR(UFFDIO, _UFFDIO_POISON, \
+				      struct uffdio_poison)
 
 /* read() structure */
 struct uffd_msg {
@@ -225,6 +231,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
+#define UFFD_FEATURE_POISON			(1<<14)
 	__u64 features;
 
 	__u64 ioctls;
@@ -321,6 +328,18 @@ struct uffdio_continue {
 	__s64 mapped;
 };
 
+struct uffdio_poison {
+	struct uffdio_range range;
+#define UFFDIO_POISON_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * Fields below here are written by the ioctl and must be at the end:
+	 * the copy_from_user will not read past here.
+	 */
+	__s64 updated;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
diff --git a/mm/memory.c b/mm/memory.c
index d8a9a770b1f1..7fbda39e060d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3692,6 +3692,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (WARN_ON_ONCE(!marker))
 		return VM_FAULT_SIGBUS;
 
+	/* Poison emulation explicitly requested for this PTE. */
+	if (marker & PTE_MARKER_UFFD_POISON)
+		return VM_FAULT_HWPOISON;
+
 	/* Higher priority than uffd-wp when data corrupted */
 	if (marker & PTE_MARKER_SWAPIN_ERROR)
 		return VM_FAULT_SIGBUS;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a2bf37ee276d..87b62ca1e09e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -286,6 +286,51 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	goto out;
 }
 
+/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
+static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
+				   struct vm_area_struct *dst_vma,
+				   unsigned long dst_addr,
+				   uffd_flags_t flags)
+{
+	int ret;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+
+	_dst_pte = make_pte_marker(PTE_MARKER_UFFD_POISON);
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (vma_is_shmem(dst_vma)) {
+		struct inode *inode;
+		pgoff_t offset, max_off;
+
+		/* serialize against truncate with the page table lock */
+		inode = dst_vma->vm_file->f_inode;
+		offset = linear_page_index(dst_vma, dst_addr);
+		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+		ret = -EFAULT;
+		if (unlikely(offset >= max_off))
+			goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	/*
+	 * For now, we don't handle unmapping pages, so only support filling in
+	 * none PTEs, or replacing PTE markers.
+	 */
+	if (!pte_none_mostly(*dst_pte))
+		goto out_unlock;
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -336,8 +381,12 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * supported by hugetlb. A PMD_SIZE huge pages may exist as used
 	 * by THP. Since we can not reliably insert a zero page, this
 	 * feature is not supported.
+	 *
+	 * PTE marker handling for hugetlb is a bit special, so for now
+	 * UFFDIO_POISON is not supported.
 	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
+	    uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -481,6 +530,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
 		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
 						 dst_addr, flags);
+	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
+					       dst_addr, flags);
 	}
 
 	/*
@@ -702,6 +754,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
+ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+			    unsigned long len, atomic_t *mmap_changing,
+			    uffd_flags_t flags)
+{
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
+}
+
 long uffd_wp_range(struct vm_area_struct *dst_vma,
 		   unsigned long start, unsigned long len, bool enable_wp)
 {
-- 
2.41.0.255.g8b1d071c50-goog
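
The argument handling in userfaultfd_poison() above (copying only the input fields, rejecting wrapped or empty ranges and unknown mode bits, and turning the number of bytes handled into the final return value) can be modelled in plain userspace C. This is a minimal sketch, assuming local mirror structs so it builds without patched kernel headers; the model_* types and the helpers poison_input_size, poison_args_valid, poison_result and poison_request_for are hypothetical names, and the page-alignment step in poison_request_for is an assumption about how a fault handler would build a request, not something the patch prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the structs in the patch (not the kernel headers). */
struct model_uffdio_range {
	uint64_t start;
	uint64_t len;
};

#define MODEL_POISON_MODE_DONTWAKE ((uint64_t)1 << 0)

struct model_uffdio_poison {
	struct model_uffdio_range range;
	uint64_t mode;
	int64_t updated;	/* output only: written by the ioctl */
};

/* Bytes copied in from userspace: everything before the output field. */
static size_t poison_input_size(void)
{
	return sizeof(struct model_uffdio_poison) - sizeof(int64_t);
}

/* Mirrors the wraparound and mode checks; returns 0 or -EINVAL. */
static int poison_args_valid(const struct model_uffdio_poison *p)
{
	/* Unsigned overflow or len == 0 makes the sum <= start. */
	if (p->range.start + p->range.len <= p->range.start)
		return -EINVAL;
	if (p->mode & ~MODEL_POISON_MODE_DONTWAKE)
		return -EINVAL;
	return 0;
}

/*
 * Mirrors the tail of userfaultfd_poison(): a negative count is passed
 * through as the error, a short count becomes -EAGAIN, a full count 0.
 */
static int poison_result(int64_t updated, uint64_t requested_len)
{
	if (updated < 0)
		return (int)updated;
	return (uint64_t)updated == requested_len ? 0 : -EAGAIN;
}

/* Build a one-page request for the page containing fault_addr. */
static struct model_uffdio_poison poison_request_for(uint64_t fault_addr,
						     uint64_t page_size,
						     int dontwake)
{
	struct model_uffdio_poison p;

	/* page_size must be a non-zero power of two for the mask below. */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	p.range.start = fault_addr & ~(page_size - 1);
	p.range.len = page_size;
	p.mode = dontwake ? MODEL_POISON_MODE_DONTWAKE : 0;
	p.updated = 0;
	return p;
}
```

On a kernel with this patch applied, a structure filled this way would be passed to ioctl(uffd, UFFDIO_POISON, ...) using the real struct uffdio_poison from the patched UAPI header; a -EAGAIN result means only part of the range was handled and the caller may retry.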

[linux-next:master] BUILD REGRESSION e1f6a8eaf1c271a0158114a03e3605f4fba059ad

by kernel test robot

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: e1f6a8eaf1c271a0158114a03e3605f4fba059ad  Add linux-next specific files for 20230705

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202306122223.HHER4zOo-lkp@intel.com
https://lore.kernel.org/oe-kbuild-all/202306260401.qZlYQpV2-lkp@intel.com
https://lore.kernel.org/oe-kbuild-all/202306291857.nyJjYwqk-lkp@intel.com
https://lore.kernel.org/oe-kbuild-all/202306301756.x8dgyYnL-lkp@intel.com

Error/Warning: (recently discovered and may have been fixed)

arch/parisc/kernel/pdt.c:67:6: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
arch/riscv/kernel/crash_core.c:12:57: warning: format specifies type 'unsigned long' but the argument has type 'int' [-Wformat]
arch/riscv/kernel/crash_core.c:14:57: error: use of undeclared identifier 'VMEMMAP_START'
arch/riscv/kernel/crash_core.c:15:55: error: use of undeclared identifier 'VMEMMAP_END'; did you mean 'MEMREMAP_ENC'?
arch/riscv/kernel/crash_core.c:8:20: error: use of undeclared identifier 'VA_BITS'
drivers/bluetooth/btmtk.c:386:44: error: 'struct hci_dev' has no member named 'dump'
drivers/char/mem.c:164:25: error: implicit declaration of function 'unxlate_dev_mem_ptr'; did you mean 'xlate_dev_mem_ptr'? [-Werror=implicit-function-declaration]
drivers/gpu/drm/i915/soc/intel_gmch.c:41:13: error: variable 'mchbar_addr' set but not used [-Werror=unused-but-set-variable]
drivers/mfd/max77541.c:176:18: warning: cast to smaller integer type 'enum max7754x_ids' from 'const void *' [-Wvoid-pointer-to-enum-cast]
lib/kunit/executor_test.c:138:4: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict]
lib/kunit/test.c:775:38: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict]

Unverified Error/Warning (likely false positive, please contact us if interested):

drivers/tty/serial/fsl_lpuart.c:1314 lpuart_timer_func() error: uninitialized symbol 'flags'.
kernel/trace/trace_functions_graph.c:1012 print_graph_return() warn: bitwise AND condition is false here
kernel/trace/trace_functions_graph.c:726 print_graph_entry_leaf() warn: bitwise AND condition is false here
{standard input}: Error: local label `"2" (instance number 9 of a fb label)' is not defined

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- arc-randconfig-r026-20230705
|   `-- drivers-bluetooth-btmtk.c:error:struct-hci_dev-has-no-member-named-dump
|-- csky-randconfig-m041-20230705
|   `-- drivers-tty-serial-fsl_lpuart.c-lpuart_timer_func()-error:uninitialized-symbol-flags-.
|-- i386-buildonly-randconfig-r004-20230705
|   `-- drivers-gpu-drm-i915-soc-intel_gmch.c:error:variable-mchbar_addr-set-but-not-used
|-- i386-randconfig-m021-20230705
|   |-- kernel-trace-trace_functions_graph.c-print_graph_entry_leaf()-warn:bitwise-AND-condition-is-false-here
|   `-- kernel-trace-trace_functions_graph.c-print_graph_return()-warn:bitwise-AND-condition-is-false-here
|-- loongarch-randconfig-r091-20230703
|   `-- drivers-bluetooth-btmtk.c:error:struct-hci_dev-has-no-member-named-dump
|-- microblaze-randconfig-r001-20230705
|   `-- drivers-bluetooth-btmtk.c:error:struct-hci_dev-has-no-member-named-dump
|-- parisc-randconfig-r081-20230703
|   `-- arch-parisc-kernel-pdt.c:warning:no-previous-prototype-for-arch_report_meminfo
|-- sh-allmodconfig
|   |-- drivers-char-mem.c:error:implicit-declaration-of-function-unxlate_dev_mem_ptr
|   `-- standard-input:Error:local-label-(instance-number-of-a-fb-label)-is-not-defined
`-- sh-randconfig-r004-20230705
    |-- drivers-bluetooth-btmtk.c:error:struct-hci_dev-has-no-member-named-dump
    `-- drivers-char-mem.c:error:implicit-declaration-of-function-unxlate_dev_mem_ptr
clang_recent_errors
|-- arm64-randconfig-r023-20230705
|   |-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- arm64-randconfig-r024-20230705
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- powerpc-randconfig-r011-20230705
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- powerpc-randconfig-r025-20230705
|   `-- clang:error:unsupported-option-fsanitize-thread-for-target-powerpc-unknown-linux-gnu
|-- riscv-randconfig-r042-20230705
|   |-- arch-riscv-kernel-crash_core.c:error:use-of-undeclared-identifier-VA_BITS
|   |-- arch-riscv-kernel-crash_core.c:error:use-of-undeclared-identifier-VMEMMAP_END
|   |-- arch-riscv-kernel-crash_core.c:error:use-of-undeclared-identifier-VMEMMAP_START
|   |-- arch-riscv-kernel-crash_core.c:warning:format-specifies-type-unsigned-long-but-the-argument-has-type-int
|   |-- lib-kunit-executor_test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- s390-randconfig-r014-20230705
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- s390-randconfig-r021-20230705
|   `-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void
|-- x86_64-randconfig-r024-20230705
|   `-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void
|-- x86_64-randconfig-x002-20230705
|   `-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void
|-- x86_64-randconfig-x003-20230705
|   `-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void
`-- x86_64-randconfig-x005-20230705
    `-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void

elapsed time: 734m

configs tested: 142
configs skipped: 7

tested configs:
alpha       allyesconfig                         gcc
alpha       defconfig                            gcc
alpha       randconfig-r003-20230705             gcc
alpha       randconfig-r036-20230705             gcc
arc         allyesconfig                         gcc
arc         defconfig                            gcc
arc         nsimosci_hs_defconfig                gcc
arc         randconfig-r011-20230705             gcc
arc         randconfig-r026-20230705             gcc
arc         randconfig-r043-20230705             gcc
arm         allmodconfig                         gcc
arm         allyesconfig                         gcc
arm         assabet_defconfig                    gcc
arm         defconfig                            gcc
arm         dove_defconfig                       clang
arm         gemini_defconfig                     gcc
arm         pxa910_defconfig                     gcc
arm         randconfig-r046-20230705             gcc
arm         sp7021_defconfig                     clang
arm         wpcm450_defconfig                    gcc
arm64       alldefconfig                         gcc
arm64       allyesconfig                         gcc
arm64       defconfig                            gcc
arm64       randconfig-r023-20230705             clang
arm64       randconfig-r024-20230705             clang
arm64       randconfig-r035-20230705             gcc
csky        defconfig                            gcc
hexagon     randconfig-r041-20230705             clang
hexagon     randconfig-r045-20230705             clang
i386        allyesconfig                         gcc
i386        buildonly-randconfig-r004-20230705   gcc
i386        buildonly-randconfig-r005-20230705   gcc
i386        buildonly-randconfig-r006-20230705   gcc
i386        debian-10.3                          gcc
i386        defconfig                            gcc
i386        randconfig-i001-20230705             gcc
i386        randconfig-i002-20230705             gcc
i386        randconfig-i003-20230705             gcc
i386        randconfig-i004-20230705             gcc
i386        randconfig-i005-20230705             gcc
i386        randconfig-i006-20230705             gcc
i386        randconfig-i011-20230705             clang
i386        randconfig-i012-20230705             clang
i386        randconfig-i013-20230705             clang
i386        randconfig-i014-20230705             clang
i386        randconfig-i015-20230705             clang
i386        randconfig-i016-20230705             clang
i386        randconfig-r015-20230705             clang
i386        randconfig-r031-20230705             gcc
i386        randconfig-r032-20230705             gcc
loongarch   allmodconfig                         gcc
loongarch   allnoconfig                          gcc
loongarch   defconfig                            gcc
loongarch   randconfig-r006-20230705             gcc
loongarch   randconfig-r012-20230705             gcc
loongarch   randconfig-r014-20230705             gcc
m68k        allmodconfig                         gcc
m68k        allyesconfig                         gcc
m68k        defconfig                            gcc
m68k        q40_defconfig                        gcc
m68k        randconfig-r033-20230705             gcc
m68k        sun3x_defconfig                      gcc
microblaze  randconfig-r001-20230705             gcc
mips        allmodconfig                         gcc
mips        allyesconfig                         gcc
mips        cavium_octeon_defconfig              clang
mips        jazz_defconfig                       gcc
mips        randconfig-r005-20230705             clang
nios2       defconfig                            gcc
nios2       randconfig-r021-20230705             gcc
openrisc    randconfig-r022-20230705             gcc
parisc      allyesconfig                         gcc
parisc      defconfig                            gcc
parisc64    defconfig                            gcc
powerpc     allmodconfig                         gcc
powerpc     allnoconfig                          gcc
powerpc     mgcoge_defconfig                     gcc
powerpc     mpc832x_rdb_defconfig                clang
powerpc     mpc866_ads_defconfig                 clang
powerpc     mvme5100_defconfig                   clang
powerpc     pasemi_defconfig                     gcc
powerpc     randconfig-r011-20230705             clang
powerpc     randconfig-r025-20230705             clang
powerpc     taishan_defconfig                    gcc
powerpc     warp_defconfig                       gcc
riscv       allmodconfig                         gcc
riscv       allnoconfig                          gcc
riscv       allyesconfig                         gcc
riscv       defconfig                            gcc
riscv       nommu_k210_sdcard_defconfig          gcc
riscv       randconfig-r042-20230705             clang
riscv       rv32_defconfig                       gcc
s390        allmodconfig                         gcc
s390        allyesconfig                         gcc
s390        defconfig                            gcc
s390        randconfig-r014-20230705             clang
s390        randconfig-r021-20230705             clang
s390        randconfig-r044-20230705             clang
sh          allmodconfig                         gcc
sh          microdev_defconfig                   gcc
sh          randconfig-r004-20230705             gcc
sh          se7343_defconfig                     gcc
sh          se7750_defconfig                     gcc
sh          sh2007_defconfig                     gcc
sh          sh7785lcr_32bit_defconfig            gcc
sh          urquell_defconfig                    gcc
sparc       allyesconfig                         gcc
sparc       defconfig                            gcc
sparc       randconfig-r016-20230705             gcc
sparc64     randconfig-r012-20230705             gcc
sparc64     randconfig-r015-20230705             gcc
sparc64     randconfig-r026-20230705             gcc
sparc64     randconfig-r034-20230705             gcc
um          allmodconfig                         clang
um          allnoconfig                          clang
um          allyesconfig                         clang
um          defconfig                            gcc
um          i386_defconfig                       gcc
um          randconfig-r016-20230705             gcc
um          x86_64_defconfig                     gcc
x86_64      allyesconfig                         gcc
x86_64      buildonly-randconfig-r001-20230705   gcc
x86_64      buildonly-randconfig-r002-20230705   gcc
x86_64      buildonly-randconfig-r003-20230705   gcc
x86_64      defconfig                            gcc
x86_64      kexec                                gcc
x86_64      randconfig-r024-20230705             clang
x86_64      randconfig-x001-20230705             clang
x86_64      randconfig-x002-20230705             clang
x86_64      randconfig-x003-20230705             clang
x86_64      randconfig-x004-20230705             clang
x86_64      randconfig-x005-20230705             clang
x86_64      randconfig-x006-20230705             clang
x86_64      randconfig-x011-20230705             gcc
x86_64      randconfig-x012-20230705             gcc
x86_64      randconfig-x013-20230705             gcc
x86_64      randconfig-x014-20230705             gcc
x86_64      randconfig-x015-20230705             gcc
x86_64      randconfig-x016-20230705             gcc
x86_64      rhel-8.3-rust                        clang
x86_64      rhel-8.3                             gcc
xtensa      randconfig-r013-20230705             gcc

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

[PATCH bpf-next 0/2] BPF kselftest cross-build/RISC-V fixes

by Björn Töpel

From: Björn Töpel <bjorn(a)rivosinc.com>

This series has two minor fixes, found when cross-compiling for the RISC-V architecture.

Some RISC-V systems do not define HAVE_EFFICIENT_UNALIGNED_ACCESS, which made some of the tests bail out. Fix the failing tests by adding F_NEEDS_EFFICIENT_UNALIGNED_ACCESS.

...and some RISC-V systems *do* define HAVE_EFFICIENT_UNALIGNED_ACCESS. In this case the autoconf.h was not correctly picked up by the build system.

Cheers,
Björn

Björn Töpel (2):
  selftests/bpf: Add F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to some tests
  selftests/bpf: Honor $(O) when figuring out paths

 tools/testing/selftests/bpf/Makefile                  | 4 ++++
 tools/testing/selftests/bpf/verifier/atomic_cmpxchg.c | 1 +
 tools/testing/selftests/bpf/verifier/ctx_skb.c        | 2 ++
 tools/testing/selftests/bpf/verifier/jmp32.c          | 8 ++++++++
 tools/testing/selftests/bpf/verifier/map_kptr.c       | 2 ++
 tools/testing/selftests/bpf/verifier/precise.c        | 2 +-
 6 files changed, 18 insertions(+), 1 deletion(-)

base-commit: a94098d490e17d652770f2309fcb9b46bc4cf864
-- 
2.39.2

[PATCH v1] selftests:bpf:Fix repeated initialization

by Wang Ming

In the use_missing_map function, value is assigned twice. There is no connection between the two assignments. This patch fixes that.

Signed-off-by: Wang Ming <machel(a)vivo.com>
---
 tools/testing/selftests/bpf/progs/test_log_fixup.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_log_fixup.c b/tools/testing/selftests/bpf/progs/test_log_fixup.c
index 1bd48feaaa42..1c49b2f9be6c 100644
--- a/tools/testing/selftests/bpf/progs/test_log_fixup.c
+++ b/tools/testing/selftests/bpf/progs/test_log_fixup.c
@@ -52,13 +52,9 @@ struct {
 SEC("?raw_tp/sys_enter")
 int use_missing_map(const void *ctx)
 {
-	int zero = 0, *value;
+	int zero = 0;
 
-	value = bpf_map_lookup_elem(&existing_map, &zero);
-
-	value = bpf_map_lookup_elem(&missing_map, &zero);
-
-	return value != NULL;
+	return bpf_map_lookup_elem(&missing_map, &zero) != NULL;
 }
 
 extern int bpf_nonexistent_kfunc(void) __ksym __weak;
-- 
2.25.1
