=== Overview
arm64 has a feature called Top Byte Ignore (TBI), which allows embedding pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.
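To illustrate (a userspace sketch, not part of this patchset): with TBI enabled, the hardware ignores bits 63:56 of a pointer on loads and stores, so a tagged pointer dereferences the same memory as its untagged counterpart:

	#include <stdint.h>
	#include <stdlib.h>

	int main(void)
	{
		int *p = malloc(sizeof(int));
		/* Stash an arbitrary 8-bit tag (0x2a here) in bits 63:56. */
		int *tagged = (int *)((uint64_t)p | (0x2aUL << 56));

		/* TBI: the top byte is ignored on dereference, so this
		 * stores to the same location as *p = 42 would. */
		*tagged = 42;
		return (*p == 42) ? 0 : 1;
	}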
The kernel is already able to handle user faults on tagged pointers, thanks to these patches:
1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged pointers")
This patchset extends tagged pointer support to syscall arguments.
As per the proposed ABI change [3], tagged pointers may only be passed to syscalls when they point to memory ranges obtained through anonymous mmap() or sbrk() (see the patchset [3] for more details).
For non-memory syscalls this is done by untagging user pointers when the kernel performs pointer checking to find out whether the pointer comes from userspace (most notably in access_ok). The untagging is done only while the pointer is being checked: the tag is preserved as the pointer makes its way through the kernel, and the pointer stays tagged when the kernel dereferences it to perform user memory accesses.
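For reference, untagging on arm64 boils down to sign-extending the address from bit 55, so that bits 63:56 become copies of bit 55 (a sketch that mirrors the semantics of untagged_addr() in this series; the helper name is illustrative):

	/* User addresses (bit 55 == 0) lose their tag; kernel addresses
	 * (bit 55 == 1) keep their all-ones top byte. */
	static inline unsigned long untag_user_addr(unsigned long addr)
	{
		return (unsigned long)((long)(addr << 8) >> 8);
	}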
Memory syscalls (mmap, mprotect, etc.) don't do user memory accesses but rather deal with memory ranges, and untagged pointers are better suited to describe memory ranges internally. Thus for memory syscalls we untag pointers completely when they enter the kernel.
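For example (a userspace sketch, assuming the relaxed ABI [3] is in place), a tagged pointer derived from an anonymous mapping can be passed to a memory syscall directly:

	size_t len = 4096;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *tagged = (void *)((unsigned long)p | (0x2aUL << 56));

	/* The tag is stripped in the syscall prologue; this affects
	 * the same range as mprotect(p, len, PROT_READ) would. */
	mprotect(tagged, len, PROT_READ);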
=== Other approaches
One alternative approach to untagging that was considered is to strip the pointer tag completely as the pointer enters the kernel, using some kind of syscall wrapper. That won't work with the countless ioctl calls, however: we would need a custom wrapper for each ioctl variation, which doesn't seem practical.
Another alternative to untagging pointers in memory syscall prologues is to instead allow tagged pointers to be passed to find_vma() (and other vma-related functions) and untag them there. Unfortunately, a lot of find_vma() callers then compare or subtract the returned vma's vm_start and vm_end fields against the pointer that was searched for, so this approach would still require changing all find_vma() callers.
=== Testing
The following testing approaches have been taken to find potential issues with user pointer untagging (the check used in item 4 is sketched after this list):
1. Static testing (with sparse [2] and separately with a custom static analyzer based on Clang) to track casts of __user pointers to integer types to find places where untagging needs to be done.
2. Static testing with grep to find parts of the kernel that call find_vma() (and other similar functions) or directly compare against vm_start/vm_end fields of vma.
3. Static testing with grep to find parts of the kernel that compare user pointers with TASK_SIZE or other similar consts and macros.
4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running a modified syzkaller version that passes tagged pointers to the kernel.
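The check in item 4 amounts to the following (has_tag() is a helper written for the experiment, not part of the patchset):

	static inline bool has_tag(unsigned long addr)
	{
		/* Any user address reaching find_vma() with a non-zero
		 * top byte indicates a missed untagging site. */
		return (addr & (0xffUL << 56)) != 0;
	}

	/* In find_vma(): */
	BUG_ON(has_tag(addr));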
Based on the results of this testing, the required patches have been added to the patchset.
=== Notes
This patchset is meant to be merged together with "arm64 relaxed ABI" [3].
This patchset is a prerequisite for ARM's memory tagging hardware feature support [4].
This patchset has been merged into the Pixel 2 & 3 kernel trees and is now being used to enable testing of Pixel phones with HWASan.
Thanks!
[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e...
[3] https://lkml.org/lkml/2019/3/18/819
[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture...
Changes in v15:
- Removed unnecessary untagging from radeon_ttm_tt_set_userptr().
- Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr().
- Moved untagging to validate_range() in userfaultfd code.
- Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr().
- Rebased onto 5.1.
Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific
  implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since
  the provided user pointers don't come from an anonymous map and thus are
  not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU
  notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction
  pointers, dropped the following patches:
  - Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
  - Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
  - Dropped "bpf, arm64: untag user pointers in
    stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).
Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which helped identify a few
  other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in
  amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in
  radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in
  videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".
Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups
  and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3]
  handles that.
Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset"
  patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to
  correctly perform subtraction with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and
  SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from
  include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through
  tagged pointers.
- Updated the documentation to mention that passing tagged pointers to
  memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.
Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".
Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
  arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.
Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
  user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
  the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
  support tagged user pointers.
Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
  the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
  passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
  patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.
Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).
Changes in v5:
- Added 3 new patches that add untagging to places found with static
  analysis.
- Rebased onto 44c929e1 (4.18-rc8).
Changes in v4:
- Added a selftest for checking that passing tagged pointers to the kernel
  succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).
Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.
Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.
Changes in v1:
- Rebased onto 4.17-rc1.
Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
  defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Andrey Konovalov (17):
  uaccess: add untagged_addr definition for other arches
  arm64: untag user pointers in access_ok and __uaccess_mask_ptr
  lib, arm64: untag user pointers in strn*_user
  mm: add ksys_ wrappers to memory syscalls
  arm64: untag user pointers passed to memory syscalls
  mm: untag user pointers in do_pages_move
  mm, arm64: untag user pointers in mm/gup.c
  mm, arm64: untag user pointers in get_vaddr_frames
  fs, arm64: untag user pointers in copy_mount_options
  fs, arm64: untag user pointers in fs/userfaultfd.c
  drm/amdgpu, arm64: untag user pointers
  drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl
  IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()
  media/v4l2-core, arm64: untag user pointers in
    videobuf_dma_contig_user_get
  tee, arm64: untag user pointers in tee_shm_register
  vfio/type1, arm64: untag user pointers in vaddr_get_pfn
  selftests, arm64: add a selftest for passing tagged pointers to kernel
 arch/arm64/include/asm/uaccess.h              |  10 +-
 arch/arm64/kernel/sys.c                       | 128 ++++++++++++++++-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |   2 +
 drivers/gpu/drm/radeon/radeon_gem.c           |   2 +
 drivers/infiniband/core/uverbs_cmd.c          |   4 +
 drivers/media/v4l2-core/videobuf-dma-contig.c |   9 +-
 drivers/tee/tee_shm.c                         |   1 +
 drivers/vfio/vfio_iommu_type1.c               |   2 +
 fs/namespace.c                                |   2 +-
 fs/userfaultfd.c                              |  22 +--
 include/linux/mm.h                            |   4 +
 include/linux/syscalls.h                      |  22 +++
 ipc/shm.c                                     |   7 +-
 lib/strncpy_from_user.c                       |   3 +-
 lib/strnlen_user.c                            |   3 +-
 mm/frame_vector.c                             |   2 +
 mm/gup.c                                      |   4 +
 mm/madvise.c                                  | 129 +++++++++---------
 mm/mempolicy.c                                |  21 ++-
 mm/migrate.c                                  |   1 +
 mm/mincore.c                                  |  57 ++++----
 mm/mlock.c                                    |  20 ++-
 mm/mmap.c                                     |  30 +++-
 mm/mprotect.c                                 |   6 +-
 mm/mremap.c                                   |  27 ++--
 mm/msync.c                                    |  35 +++--
 tools/testing/selftests/arm64/.gitignore      |   1 +
 tools/testing/selftests/arm64/Makefile        |  11 ++
 .../testing/selftests/arm64/run_tags_test.sh  |  12 ++
 tools/testing/selftests/arm64/tags_test.c     |  21 +++
 31 files changed, 436 insertions(+), 164 deletions(-)
 create mode 100644 tools/testing/selftests/arm64/.gitignore
 create mode 100644 tools/testing/selftests/arm64/Makefile
 create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
 create mode 100644 tools/testing/selftests/arm64/tags_test.c
To allow arm64 syscalls to accept tagged pointers from userspace, we must untag them when they are passed to the kernel. Since untagging is done in generic parts of the kernel, the untagged_addr macro needs to be defined for all architectures.
Define it as a noop for architectures other than arm64.
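With the fallback in place, generic code can call untagged_addr() unconditionally; a sketch of a typical call site (illustrative, not taken from this patch):

	struct vm_area_struct *vma;

	/* On arm64 this strips the pointer tag; on all other
	 * architectures it expands to just (addr). */
	vma = find_vma(current->mm, untagged_addr(addr));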
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
---
 include/linux/mm.h | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b10c21630f5..44041df804a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #include <asm/pgtable.h>
 #include <asm/processor.h>
 
+#ifndef untagged_addr
+#define untagged_addr(addr) (addr)
+#endif
+
 #ifndef __pa_symbol
 #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
 #endif
On Mon, 2019-05-06 at 18:30 +0200, Andrey Konovalov wrote:
To allow arm64 syscalls to accept tagged pointers from userspace, we must untag them when they are passed to the kernel. Since untagging is done in generic parts of the kernel, the untagged_addr macro needs to be defined for all architectures.
Define it as a noop for architectures other than arm64.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

 include/linux/mm.h | 4 ++++
 1 file changed, 4 insertions(+)
As discussed in the other thread Chris started, there is a generic need to untag addresses in the kernel, and this patch gets us ready for that.
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b10c21630f5..44041df804a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #include <asm/pgtable.h>
 #include <asm/processor.h>
 
+#ifndef untagged_addr
+#define untagged_addr(addr) (addr)
+#endif
+
 #ifndef __pa_symbol
 #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
 #endif
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
copy_from_user (and a few other similar functions) is used to copy data from user memory into kernel memory and vice versa. Since a user can provide a tagged pointer to one of the syscalls that use copy_from_user, we need to handle such pointers correctly.
Do this by untagging user pointers in access_ok and in __uaccess_mask_ptr, before performing access validity checks.
Note that this patch only temporarily untags the pointers to perform the checks, but then passes them as is into the kernel internals.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
---
 arch/arm64/include/asm/uaccess.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index e5d5f31c6d36..9164ecb5feca 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -94,7 +94,7 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si
 	return ret;
 }
 
-#define access_ok(addr, size)	__range_ok(addr, size)
+#define access_ok(addr, size)	__range_ok(untagged_addr(addr), size)
 #define user_addr_max		get_fs
 
 #define _ASM_EXTABLE(from, to)						\
@@ -226,7 +226,8 @@ static inline void uaccess_enable_not_uao(void)
 
 /*
  * Sanitise a uaccess pointer such that it becomes NULL if above the
- * current addr_limit.
+ * current addr_limit. In case the pointer is tagged (has the top byte set),
+ * untag the pointer before checking.
  */
 #define uaccess_mask_ptr(ptr) (__typeof__(ptr))__uaccess_mask_ptr(ptr)
 static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
@@ -234,10 +235,11 @@ static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
 	void __user *safe_ptr;
 
 	asm volatile(
-	"	bics	xzr, %1, %2\n"
+	"	bics	xzr, %3, %2\n"
 	"	csel	%0, %1, xzr, eq\n"
 	: "=&r" (safe_ptr)
-	: "r" (ptr), "r" (current_thread_info()->addr_limit)
+	: "r" (ptr), "r" (current_thread_info()->addr_limit),
+	  "r" (untagged_addr(ptr))
 	: "cc");
 
 	csdb();
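The effect is that uaccess routines built on top of these checks accept tagged pointers transparently; a hedged illustration (assumes a TBI-enabled arm64 task and a tagged_user_ptr obtained from userspace):

	char buf[16];

	/* access_ok() strips the tag for the range check; the actual
	 * load then uses the tagged pointer as is, which TBI permits. */
	if (copy_from_user(buf, tagged_user_ptr, sizeof(buf)))
		return -EFAULT;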
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
strncpy_from_user and strnlen_user accept user addresses as arguments, and do not go through the same path as copy_from_user and others, so here we need to handle the case of tagged user addresses separately.
Untag user pointers passed to these functions.
Note that this patch only temporarily untags the pointers to perform validity checks, but then uses them as is to perform user memory accesses.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
---
 lib/strncpy_from_user.c | 3 ++-
 lib/strnlen_user.c      | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 58eacd41526c..6209bb9507c7 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -6,6 +6,7 @@
 #include <linux/uaccess.h>
 #include <linux/kernel.h>
 #include <linux/errno.h>
+#include <linux/mm.h>
 
 #include <asm/byteorder.h>
 #include <asm/word-at-a-time.h>
@@ -107,7 +108,7 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
 		return 0;
 
 	max_addr = user_addr_max();
-	src_addr = (unsigned long)src;
+	src_addr = (unsigned long)untagged_addr(src);
 	if (likely(src_addr < max_addr)) {
 		unsigned long max = max_addr - src_addr;
 		long retval;
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 1c1a1b0e38a5..8ca3d2ac32ec 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -2,6 +2,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/uaccess.h>
+#include <linux/mm.h>
 
 #include <asm/word-at-a-time.h>
 
@@ -109,7 +110,7 @@ long strnlen_user(const char __user *str, long count)
 		return 0;
 
 	max_addr = user_addr_max();
-	src_addr = (unsigned long)str;
+	src_addr = (unsigned long)untagged_addr(str);
 	if (likely(src_addr < max_addr)) {
 		unsigned long max = max_addr - src_addr;
 		long retval;
On Mon, May 06, 2019 at 06:30:49PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
strncpy_from_user and strnlen_user accept user addresses as arguments, and do not go through the same path as copy_from_user and others, so here we need to handle the case of tagged user addresses separately.
Untag user pointers passed to these functions.
Note that this patch only temporarily untags the pointers to perform validity checks, but then uses them as is to perform user memory accesses.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Just to keep track of where I am with the reviews while the ABI discussion continues:
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch adds ksys_ wrappers to the following memory syscalls:
brk, get_mempolicy (renamed kernel_get_mempolicy -> ksys_get_mempolicy), madvise, mbind (renamed kernel_mbind -> ksys_mbind), mincore, mlock (renamed do_mlock -> ksys_mlock), mlock2, mmap_pgoff, mprotect (renamed do_mprotect_pkey -> ksys_mprotect_pkey), mremap, msync, munlock, munmap, remap_file_pages, shmat, shmdt.
The next patch in this series will add a custom implementation for these syscalls that makes them accept tagged pointers on arm64.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
---
 include/linux/syscalls.h |  22 +++++++
 ipc/shm.c                |   7 ++-
 mm/madvise.c             | 129 ++++++++++++++++++++-------------------
 mm/mempolicy.c           |  21 +++----
 mm/mincore.c             |  57 +++++++++--------
 mm/mlock.c               |  20 ++++--
 mm/mmap.c                |  30 ++++++---
 mm/mprotect.c            |   6 +-
 mm/mremap.c              |  27 +++++---
 mm/msync.c               |  35 ++++++-----
 10 files changed, 213 insertions(+), 141 deletions(-)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806a561f..70008f5ed84f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1260,6 +1260,28 @@ int ksys_ipc(unsigned int call, int first, unsigned long second,
 		unsigned long third, void __user * ptr, long fifth);
 int compat_ksys_ipc(u32 call, int first, int second,
 		u32 third, u32 ptr, u32 fifth);
+unsigned long ksys_mremap(unsigned long addr, unsigned long old_len,
+		unsigned long new_len, unsigned long flags,
+		unsigned long new_addr);
+int ksys_munmap(unsigned long addr, size_t len);
+unsigned long ksys_brk(unsigned long brk);
+int ksys_get_mempolicy(int __user *policy, unsigned long __user *nmask,
+		unsigned long maxnode, unsigned long addr, unsigned long flags);
+int ksys_madvise(unsigned long start, size_t len_in, int behavior);
+long ksys_mbind(unsigned long start, unsigned long len,
+		unsigned long mode, const unsigned long __user *nmask,
+		unsigned long maxnode, unsigned int flags);
+__must_check int ksys_mlock(unsigned long start, size_t len, vm_flags_t flags);
+__must_check int ksys_mlock2(unsigned long start, size_t len, vm_flags_t flags);
+int ksys_munlock(unsigned long start, size_t len);
+int ksys_mprotect_pkey(unsigned long start, size_t len,
+		unsigned long prot, int pkey);
+int ksys_msync(unsigned long start, size_t len, int flags);
+long ksys_mincore(unsigned long start, size_t len, unsigned char __user *vec);
+unsigned long ksys_remap_file_pages(unsigned long start, unsigned long size,
+		unsigned long prot, unsigned long pgoff, unsigned long flags);
+long ksys_shmat(int shmid, char __user *shmaddr, int shmflg);
+long ksys_shmdt(char __user *shmaddr);
 
 /*
  * The following kernel syscall equivalents are just wrappers to fs-internal
diff --git a/ipc/shm.c b/ipc/shm.c
index ce1ca9f7c6e9..557b43968c0e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1588,7 +1588,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	return err;
 }
 
-SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+long ksys_shmat(int shmid, char __user *shmaddr, int shmflg)
 {
 	unsigned long ret;
 	long err;
@@ -1600,6 +1600,11 @@ SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
 	return (long)ret;
 }
 
+SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+	return ksys_shmat(shmid, shmaddr, shmflg);
+}
+
 #ifdef CONFIG_COMPAT
 
 #ifndef COMPAT_SHMLBA
diff --git a/mm/madvise.c b/mm/madvise.c
index 21a7881a2db4..c27f5f14e2ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -738,68 +738,7 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area.  The idea is to help the kernel
- * use appropriate read-ahead and caching techniques.  The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- *  MADV_NORMAL - the default behavior is to read clusters.  This
- *		results in some read-ahead and read-behind.
- *  MADV_RANDOM - the system should read the minimum amount of data
- *		on any access, since it is unlikely that the appli-
- *		cation will need more than what it asks for.
- *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
- *		once, so they can be aggressively read ahead, and
- *		can be freed soon after they are accessed.
- *  MADV_WILLNEED - the application is notifying the system to read
- *		some pages ahead.
- *  MADV_DONTNEED - the application is finished with the given range,
- *		so the kernel can free resources associated with it.
- *  MADV_FREE - the application marks pages in the given range as lazy free,
- *		where actual purges are postponed until memory pressure happens.
- *  MADV_REMOVE - the application wants to free up the given range of
- *		pages and associated backing store.
- *  MADV_DONTFORK - omit this area from child's address space when forking:
- *		typically, to avoid COWing pages pinned by get_user_pages().
- *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
- *		range after a fork.
- *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- *  MADV_HWPOISON - trigger memory error handler as if the given memory range
- *		were corrupted by unrecoverable hardware memory failure.
- *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- *		this area with pages of identical content from other such areas.
- *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- *  MADV_HUGEPAGE - the application wants to back the given range by transparent
- *		huge pages in the future. Existing pages might be coalesced and
- *		new pages might be allocated as THP.
- *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- *		transparent huge pages so the existing pages will not be
- *		coalesced into THP and new pages will not be allocated as THP.
- *  MADV_DONTDUMP - the application wants to prevent pages in the given range
- *		from being included in its core dump.
- *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- *  zero    - success
- *  -EINVAL - start + len < 0, start is not page-aligned,
- *		"behavior" is not a valid value, or application
- *		is attempting to release locked or shared pages,
- *		or the specified address range includes file, Huge TLB,
- *		MAP_SHARED or VMPFNMAP range.
- *  -ENOMEM - addresses in the specified range are not currently
- *		mapped, or are outside the AS of the process.
- *  -EIO    - an I/O error occurred while paging in data.
- *  -EBADF  - map exists, but area maps something that isn't a file.
- *  -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+int ksys_madvise(unsigned long start, size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -894,3 +833,69 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 	return error;
 }
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area.  The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques.  The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ *  MADV_NORMAL - the default behavior is to read clusters.  This
+ *		results in some read-ahead and read-behind.
+ *  MADV_RANDOM - the system should read the minimum amount of data
+ *		on any access, since it is unlikely that the appli-
+ *		cation will need more than what it asks for.
+ *  MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ *		once, so they can be aggressively read ahead, and
+ *		can be freed soon after they are accessed.
+ *  MADV_WILLNEED - the application is notifying the system to read
+ *		some pages ahead.
+ *  MADV_DONTNEED - the application is finished with the given range,
+ *		so the kernel can free resources associated with it.
+ *  MADV_FREE - the application marks pages in the given range as lazy free,
+ *		where actual purges are postponed until memory pressure happens.
+ *  MADV_REMOVE - the application wants to free up the given range of
+ *		pages and associated backing store.
+ *  MADV_DONTFORK - omit this area from child's address space when forking:
+ *		typically, to avoid COWing pages pinned by get_user_pages().
+ *  MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ *  MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ *		range after a fork.
+ *  MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ *  MADV_HWPOISON - trigger memory error handler as if the given memory range
+ *		were corrupted by unrecoverable hardware memory failure.
+ *  MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ *  MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ *		this area with pages of identical content from other such areas.
+ *  MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ *  MADV_HUGEPAGE - the application wants to back the given range by transparent
+ *		huge pages in the future. Existing pages might be coalesced and
+ *		new pages might be allocated as THP.
+ *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ *		transparent huge pages so the existing pages will not be
+ *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_DONTDUMP - the application wants to prevent pages in the given range
+ *		from being included in its core dump.
+ *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ *  zero    - success
+ *  -EINVAL - start + len < 0, start is not page-aligned,
+ *		"behavior" is not a valid value, or application
+ *		is attempting to release locked or shared pages,
+ *		or the specified address range includes file, Huge TLB,
+ *		MAP_SHARED or VMPFNMAP range.
+ *  -ENOMEM - addresses in the specified range are not currently
+ *		mapped, or are outside the AS of the process.
+ *  -EIO    - an I/O error occurred while paging in data.
+ *  -EBADF  - map exists, but area maps something that isn't a file.
+ *  -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+	return ksys_madvise(start, len_in, behavior);
+}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2219e747df49..c2f82a045ceb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1352,9 +1352,9 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
-static long kernel_mbind(unsigned long start, unsigned long len,
-			 unsigned long mode, const unsigned long __user *nmask,
-			 unsigned long maxnode, unsigned int flags)
+long ksys_mbind(unsigned long start, unsigned long len,
+		unsigned long mode, const unsigned long __user *nmask,
+		unsigned long maxnode, unsigned int flags)
 {
 	nodemask_t nodes;
 	int err;
@@ -1377,7 +1377,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 		unsigned long, mode, const unsigned long __user *, nmask,
 		unsigned long, maxnode, unsigned int, flags)
 {
-	return kernel_mbind(start, len, mode, nmask, maxnode, flags);
+	return ksys_mbind(start, len, mode, nmask, maxnode, flags);
 }
 
 /* Set the process memory policy */
@@ -1507,11 +1507,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
 
 /* Retrieve NUMA policy */
-static int kernel_get_mempolicy(int __user *policy,
-				unsigned long __user *nmask,
-				unsigned long maxnode,
-				unsigned long addr,
-				unsigned long flags)
+int ksys_get_mempolicy(int __user *policy, unsigned long __user *nmask,
+		unsigned long maxnode, unsigned long addr, unsigned long flags)
 {
 	int err;
 	int uninitialized_var(pval);
@@ -1538,7 +1535,7 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 		unsigned long __user *, nmask, unsigned long, maxnode,
 		unsigned long, addr, unsigned long, flags)
 {
-	return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
+	return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
 }
 
 #ifdef CONFIG_COMPAT
@@ -1559,7 +1556,7 @@ COMPAT_SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 	if (nmask)
 		nm = compat_alloc_user_space(alloc_size);
 
-	err = kernel_get_mempolicy(policy, nm, nr_bits+1, addr, flags);
+	err = ksys_get_mempolicy(policy, nm, nr_bits+1, addr, flags);
 
 	if (!err && nmask) {
 		unsigned long copy_size;
@@ -1613,7 +1610,7 @@ COMPAT_SYSCALL_DEFINE6(mbind, compat_ulong_t, start, compat_ulong_t, len,
 		return -EFAULT;
 	}
 
-	return kernel_mbind(start, len, mode, nm, nr_bits+1, flags);
+	return ksys_mbind(start, len, mode, nm, nr_bits+1, flags);
 }
 
 COMPAT_SYSCALL_DEFINE4(migrate_pages, compat_pid_t, pid,
diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..a609bd8128da 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -197,32 +197,7 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 	return (end - addr) >> PAGE_SHIFT;
 }
 
-/*
- * The mincore(2) system call.
- *
- * mincore() returns the memory residency status of the pages in the
- * current process's address space specified by [addr, addr + len).
- * The status is returned in a vector of bytes.  The least significant
- * bit of each byte is 1 if the referenced page is in memory, otherwise
- * it is zero.
- *
- * Because the status of a page can change after mincore() checks it
- * but before it returns to the application, the returned vector may
- * contain stale information.  Only locked pages are guaranteed to
- * remain in memory.
- *
- * return values:
- *  zero    - success
- *  -EFAULT - vec points to an illegal address
- *  -EINVAL - addr is not a multiple of PAGE_SIZE
- *  -ENOMEM - Addresses in the range [addr, addr + len] are
- *		invalid for the address space of this process, or
- *		specify one or more pages which are not currently
- *		mapped
- *  -EAGAIN - A kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
-		unsigned char __user *, vec)
+long ksys_mincore(unsigned long start, size_t len, unsigned char __user *vec)
 {
 	long retval;
 	unsigned long pages;
@@ -271,3 +246,33 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 	free_page((unsigned long) tmp);
 	return retval;
 }
+
+/*
+ * The mincore(2) system call.
+ *
+ * mincore() returns the memory residency status of the pages in the
+ * current process's address space specified by [addr, addr + len).
+ * The status is returned in a vector of bytes.  The least significant
+ * bit of each byte is 1 if the referenced page is in memory, otherwise
+ * it is zero.
+ *
+ * Because the status of a page can change after mincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information.  Only locked pages are guaranteed to
+ * remain in memory.
+ *
+ * return values:
+ *  zero    - success
+ *  -EFAULT - vec points to an illegal address
+ *  -EINVAL - addr is not a multiple of PAGE_SIZE
+ *  -ENOMEM - Addresses in the range [addr, addr + len] are
+ *		invalid for the address space of this process, or
+ *		specify one or more pages which are not currently
+ *		mapped
+ *  -EAGAIN - A kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	return ksys_mincore(start, len, vec);
+}
diff --git a/mm/mlock.c b/mm/mlock.c
index 080f3b36415b..09e449447539 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -668,7 +668,7 @@ static int count_mm_mlocked_page_nr(struct mm_struct *mm,
 	return count >> PAGE_SHIFT;
 }
 
-static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
+__must_check int ksys_mlock(unsigned long start, size_t len, vm_flags_t flags)
 {
 	unsigned long locked;
 	unsigned long lock_limit;
@@ -715,10 +715,10 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 
 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 {
-	return do_mlock(start, len, VM_LOCKED);
+	return ksys_mlock(start, len, VM_LOCKED);
 }
 
-SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+__must_check int ksys_mlock2(unsigned long start, size_t len, vm_flags_t flags)
 {
 	vm_flags_t vm_flags = VM_LOCKED;
 
@@ -728,10 +728,15 @@ SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
 	if (flags & MLOCK_ONFAULT)
 		vm_flags |= VM_LOCKONFAULT;
 
-	return do_mlock(start, len, vm_flags);
+	return ksys_mlock(start, len, vm_flags);
 }
 
-SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+{
+	return ksys_mlock2(start, len, flags);
+}
+
+int ksys_munlock(unsigned long start, size_t len)
 {
 	int ret;
 
@@ -746,6 +751,11 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
 	return ret;
 }
 
+SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+{
+	return ksys_munlock(start, len);
+}
+
 /*
  * Take the MCL_* flags passed into mlockall (or 0 if called from munlockall)
  * and translate into the appropriate modifications to mm->def_flags and/or the
diff --git a/mm/mmap.c b/mm/mmap.c
index bd7b9f293b39..09bfaf36b961 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -189,7 +189,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 
 static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
 		struct list_head *uf);
-SYSCALL_DEFINE1(brk, unsigned long, brk)
+
+unsigned long ksys_brk(unsigned long brk)
 {
 	unsigned long retval;
 	unsigned long newbrk, oldbrk, origbrk;
@@ -288,6 +289,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	return retval;
 }
 
+SYSCALL_DEFINE1(brk, unsigned long, brk)
+{
+	return ksys_brk(brk);
+}
+
 static long vma_compute_subtree_gap(struct vm_area_struct *vma)
 {
 	unsigned long max, prev_end, subtree_gap;
@@ -2870,18 +2876,19 @@ int vm_munmap(unsigned long start, size_t len)
 }
 EXPORT_SYMBOL(vm_munmap);
 
-SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
+int ksys_munmap(unsigned long addr, size_t len)
 {
 	profile_munmap(addr);
 	return __vm_munmap(addr, len, true);
 }
 
+SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
+{
+	return ksys_munmap(addr, len);
+}
 
-/*
- * Emulation of deprecated remap_file_pages() syscall.
- */
-SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
-		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+unsigned long ksys_remap_file_pages(unsigned long start, unsigned long size,
+		unsigned long prot, unsigned long pgoff, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
@@ -2976,6 +2983,15 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	return ret;
 }
 
+/*
+ * Emulation of deprecated remap_file_pages() syscall.
+ */
+SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
+		unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+{
+	return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
+
 /*
  *  this is really a simplified "do_mmap".  it only handles
  *  anonymous maps.  eventually we may be able to do some
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 028c724dcb1a..07344bdd7a04 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -454,7 +454,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 /*
  * pkey==-1 when doing a legacy mprotect()
  */
-static int do_mprotect_pkey(unsigned long start, size_t len,
+int ksys_mprotect_pkey(unsigned long start, size_t len,
 		unsigned long prot, int pkey)
 {
 	unsigned long nstart, end, tmp, reqprot;
@@ -578,7 +578,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
-	return do_mprotect_pkey(start, len, prot, -1);
+	return ksys_mprotect_pkey(start, len, prot, -1);
 }
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -586,7 +586,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot, int, pkey)
 {
-	return do_mprotect_pkey(start, len, prot, pkey);
+	return ksys_mprotect_pkey(start, len, prot, pkey);
 }
 
 SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
diff --git a/mm/mremap.c b/mm/mremap.c
index e3edef6b7a12..fec1f9911388 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -584,16 +584,9 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
 	return 1;
 }
 
-/*
- * Expand (or shrink) an existing mapping, potentially moving it at the
- * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
- *
- * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
- * This option implies MREMAP_MAYMOVE.
- */
-SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
-		unsigned long, new_len, unsigned long, flags,
-		unsigned long, new_addr)
+unsigned long ksys_mremap(unsigned long addr, unsigned long old_len,
+		unsigned long new_len, unsigned long flags,
+		unsigned long new_addr)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
@@ -726,3 +719,17 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	userfaultfd_unmap_complete(mm, &uf_unmap);
 	return ret;
 }
+
+/*
+ * Expand (or shrink) an existing mapping, potentially moving it at the
+ * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
+ *
+ * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
+ * This option implies MREMAP_MAYMOVE.
+ */
+SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
+		unsigned long, new_len, unsigned long, flags,
+		unsigned long, new_addr)
+{
+	return ksys_mremap(addr, old_len, new_len, flags, new_addr);
+}
diff --git a/mm/msync.c b/mm/msync.c
index ef30a429623a..b5a013549626 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -15,21 +15,7 @@
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 
-/*
- * MS_SYNC syncs the entire file - including mappings.
- *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
- *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
- */
-SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
+int ksys_msync(unsigned long start, size_t len, int flags)
 {
 	unsigned long end;
 	struct mm_struct *mm = current->mm;
@@ -106,3 +92,22 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 out:
 	return error ? : unmapped_error;
 }
+
+/*
+ * MS_SYNC syncs the entire file - including mappings.
+ *
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
+ * write out the dirty pages and wait on the writeout and check the result.
+ * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
+ * async writeout immediately.
+ * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
+ * applications.
+ */
+SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
+{
+	return ksys_msync(start, len, flags);
+}
On Mon, May 06, 2019 at 06:30:50PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch adds ksys_ wrappers to the following memory syscalls:
brk, get_mempolicy (renamed kernel_get_mempolicy -> ksys_get_mempolicy), madvise, mbind (renamed kernel_mbind -> ksys_mbind), mincore, mlock (renamed do_mlock -> ksys_mlock), mlock2, mmap_pgoff, mprotect (renamed do_mprotect_pkey -> ksys_mprotect_pkey), mremap, msync, munlock, munmap, remap_file_pages, shmat, shmdt.
The next patch in this series will add a custom implementation for these syscalls that makes them accept tagged pointers on arm64.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
---
 arch/arm64/kernel/sys.c | 128 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 127 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index b44065fb1616..933bb9f3d6ec 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 {
 	if (offset_in_page(off) != 0)
 		return -EINVAL;
-
+	addr = untagged_addr(addr);
 	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
 }
 
+SYSCALL_DEFINE6(arm64_mmap_pgoff, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, pgoff)
+{
+	addr = untagged_addr(addr);
+	return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
+}
+
+SYSCALL_DEFINE5(arm64_mremap, unsigned long, addr, unsigned long, old_len,
+		unsigned long, new_len, unsigned long, flags,
+		unsigned long, new_addr)
+{
+	addr = untagged_addr(addr);
+	new_addr = untagged_addr(new_addr);
+	return ksys_mremap(addr, old_len, new_len, flags, new_addr);
+}
+
+SYSCALL_DEFINE2(arm64_munmap, unsigned long, addr, size_t, len)
+{
+	addr = untagged_addr(addr);
+	return ksys_munmap(addr, len);
+}
+
 SYSCALL_DEFINE1(arm64_personality, unsigned int, personality)
 {
 	if (personality(personality) == PER_LINUX32 &&
@@ -47,10 +70,113 @@ SYSCALL_DEFINE1(arm64_personality, unsigned int, personality)
 	return ksys_personality(personality);
 }
 
+SYSCALL_DEFINE1(arm64_brk, unsigned long, brk)
+{
+	brk = untagged_addr(brk);
+	return ksys_brk(brk);
+}
+
+SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
+		unsigned long __user *, nmask, unsigned long, maxnode,
+		unsigned long, addr, unsigned long, flags)
+{
+	addr = untagged_addr(addr);
+	return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
+}
+
+SYSCALL_DEFINE3(arm64_madvise, unsigned long, start,
+		size_t, len_in, int, behavior)
+{
+	start = untagged_addr(start);
+	return ksys_madvise(start, len_in, behavior);
+}
+
+SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
+		unsigned long, mode, const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_mbind(start, len, mode, nmask, maxnode, flags);
+}
+
+SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
+
+SYSCALL_DEFINE3(arm64_mlock2, unsigned long, start, size_t, len, int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_mlock2(start, len, flags);
+}
+
+SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_munlock(start, len);
+}
+
+SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	start = untagged_addr(start);
+	return ksys_mprotect_pkey(start, len, prot, -1);
+}
+
+SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_msync(start, len, flags);
+}
+
+SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	start = untagged_addr(start);
+	return ksys_mincore(start, len, vec);
+}
+
+SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
+		unsigned long, size, unsigned long, prot,
+		unsigned long, pgoff, unsigned long, flags)
+{
+	start = untagged_addr(start);
+	return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
+
+SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmat(shmid, shmaddr, shmflg);
+}
+
+SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmdt(shmaddr);
+}
+
 /*
  * Wrappers to pass the pt_regs argument.
  */
 #define sys_personality		sys_arm64_personality
+#define sys_mmap_pgoff		sys_arm64_mmap_pgoff
+#define sys_mremap		sys_arm64_mremap
+#define sys_munmap		sys_arm64_munmap
+#define sys_brk			sys_arm64_brk
+#define sys_get_mempolicy	sys_arm64_get_mempolicy
+#define sys_madvise		sys_arm64_madvise
+#define sys_mbind		sys_arm64_mbind
+#define sys_mlock		sys_arm64_mlock
+#define sys_mlock2		sys_arm64_mlock2
+#define sys_munlock		sys_arm64_munlock
+#define sys_mprotect		sys_arm64_mprotect
+#define sys_msync		sys_arm64_msync
+#define sys_mincore		sys_arm64_mincore
+#define sys_remap_file_pages	sys_arm64_remap_file_pages
+#define sys_shmat		sys_arm64_shmat
+#define sys_shmdt		sys_arm64_shmdt
 
 asmlinkage long sys_ni_syscall(const struct pt_regs *);
 #define __arm64_sys_ni_syscall	sys_ni_syscall
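The selftest added at the end of this series exercises exactly this path; a sketch in its spirit (details assumed, see tools/testing/selftests/arm64/tags_test.c for the real test):

	#include <stdlib.h>
	#include <sys/mman.h>

	#define TAG_SHIFT	56

	int main(void)
	{
		size_t len = 4096;
		void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		void *tagged;

		if (ptr == MAP_FAILED)
			return EXIT_FAILURE;
		tagged = (void *)((unsigned long)ptr | (0x42UL << TAG_SHIFT));

		/* With this patch applied, the tagged address is accepted. */
		return madvise(tagged, len, MADV_DONTNEED) ? EXIT_FAILURE
							   : EXIT_SUCCESS;
	}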
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
I'll go through them one by one to see if we can tighten the expected ABI while keeping MTE in mind.
diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index b44065fb1616..933bb9f3d6ec 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 {
 	if (offset_in_page(off) != 0)
 		return -EINVAL;
-
+	addr = untagged_addr(addr);
 	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
 }
If the user passes a tagged pointer to mmap() and the address is honoured (or MAP_FIXED is given), what is the expected return pointer? Does it need to be tagged with the value from the hint?
With MTE, we may want to use this as a request for the default colour of the mapped pages (still under discussion).
+SYSCALL_DEFINE6(arm64_mmap_pgoff, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, pgoff)
+{
+	addr = untagged_addr(addr);
+	return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
+}
We don't have __NR_mmap_pgoff on arm64.
+SYSCALL_DEFINE5(arm64_mremap, unsigned long, addr, unsigned long, old_len,
+		unsigned long, new_len, unsigned long, flags,
+		unsigned long, new_addr)
+{
+	addr = untagged_addr(addr);
+	new_addr = untagged_addr(new_addr);
+	return ksys_mremap(addr, old_len, new_len, flags, new_addr);
+}
Similar comment as for mmap(): do we want the tag from new_addr to be preserved? In addition, should we check that the two tags are identical, or should mremap() become a way to repaint a memory region?
+SYSCALL_DEFINE2(arm64_munmap, unsigned long, addr, size_t, len)
+{
+	addr = untagged_addr(addr);
+	return ksys_munmap(addr, len);
+}
This looks fine.
+SYSCALL_DEFINE1(arm64_brk, unsigned long, brk)
+{
+	brk = untagged_addr(brk);
+	return ksys_brk(brk);
+}
I wonder whether brk() should simply not accept tags, and should not return them (similar to the prctl(PR_SET_MM) discussion). We could document this in the ABI requirements.
+SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
+		unsigned long __user *, nmask, unsigned long, maxnode,
+		unsigned long, addr, unsigned long, flags)
+{
+	addr = untagged_addr(addr);
+	return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
+}
+SYSCALL_DEFINE3(arm64_madvise, unsigned long, start,
+		size_t, len_in, int, behavior)
+{
+	start = untagged_addr(start);
+	return ksys_madvise(start, len_in, behavior);
+}
+SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
+		unsigned long, mode, const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_mbind(start, len, mode, nmask, maxnode, flags);
+}
+SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
+SYSCALL_DEFINE3(arm64_mlock2, unsigned long, start, size_t, len, int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_mlock2(start, len, flags);
+}
+SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_munlock(start, len);
+}
+SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	start = untagged_addr(start);
+	return ksys_mprotect_pkey(start, len, prot, -1);
+}
+SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_msync(start, len, flags);
+}
+SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	start = untagged_addr(start);
+	return ksys_mincore(start, len, vec);
+}
These look fine.
+SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
+		unsigned long, size, unsigned long, prot,
+		unsigned long, pgoff, unsigned long, flags)
+{
+	start = untagged_addr(start);
+	return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
While this has been deprecated for some time, I presume user space still invokes it?
+SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmat(shmid, shmaddr, shmflg);
+}
+SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmdt(shmaddr);
+}
Do we actually want to allow shared tagged memory? Who's going to tag it? If not, we can document it as not supported.
On Wed, May 22, 2019 at 4:49 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
I'll go through them one by one to see if we can tighten the expected ABI while keeping MTE in mind.
diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index b44065fb1616..933bb9f3d6ec 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 {
 	if (offset_in_page(off) != 0)
 		return -EINVAL;
-
+	addr = untagged_addr(addr);
 	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
 }
If the user passes a tagged pointer to mmap() and the address is honoured (or MAP_FIXED is given), what is the expected return pointer? Does it need to be tagged with the value from the hint?
For HWASan the most convenient would be to use the tag from the hint. But since in TBI (not MTE) mode the kernel has no idea what meaning userspace assigns to pointer tags, perhaps it should not try to guess, and should return the raw (zero-tagged) address instead.
With MTE, we may want to use this as a request for the default colour of the mapped pages (still under discussion).
I like this - and in that case it would make sense to return the pointer that can be immediately dereferenced without crashing the process, i.e. with the matching tag.
+SYSCALL_DEFINE6(arm64_mmap_pgoff, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, pgoff)
+{
+	addr = untagged_addr(addr);
+	return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
+}
We don't have __NR_mmap_pgoff on arm64.
+SYSCALL_DEFINE5(arm64_mremap, unsigned long, addr, unsigned long, old_len,
+		unsigned long, new_len, unsigned long, flags,
+		unsigned long, new_addr)
+{
+	addr = untagged_addr(addr);
+	new_addr = untagged_addr(new_addr);
+	return ksys_mremap(addr, old_len, new_len, flags, new_addr);
+}
Similar comment as for mmap(): do we want the tag from new_addr to be preserved? In addition, should we check that the two tags are identical, or should mremap() become a way to repaint a memory region?
+SYSCALL_DEFINE2(arm64_munmap, unsigned long, addr, size_t, len)
+{
+	addr = untagged_addr(addr);
+	return ksys_munmap(addr, len);
+}
This looks fine.
+SYSCALL_DEFINE1(arm64_brk, unsigned long, brk)
+{
+	brk = untagged_addr(brk);
+	return ksys_brk(brk);
+}
I wonder whether brk() should simply not accept tags, and should not return them (similar to the prctl(PR_SET_MM) discussion). We could document this in the ABI requirements.
+SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
+		unsigned long __user *, nmask, unsigned long, maxnode,
+		unsigned long, addr, unsigned long, flags)
+{
+	addr = untagged_addr(addr);
+	return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
+}
+SYSCALL_DEFINE3(arm64_madvise, unsigned long, start,
+		size_t, len_in, int, behavior)
+{
+	start = untagged_addr(start);
+	return ksys_madvise(start, len_in, behavior);
+}
+SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
+		unsigned long, mode, const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_mbind(start, len, mode, nmask, maxnode, flags);
+}
+SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
+SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
+SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_munlock(start, len);
+}
+SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	start = untagged_addr(start);
+	return ksys_mprotect_pkey(start, len, prot, -1);
+}
+SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_msync(start, len, flags);
+}
+SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	start = untagged_addr(start);
+	return ksys_mincore(start, len, vec);
+}
These look fine.
+SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
+		unsigned long, size, unsigned long, prot,
+		unsigned long, pgoff, unsigned long, flags)
+{
+	start = untagged_addr(start);
+	return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
While this has been deprecated for some time, I presume user space still invokes it?
+SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmat(shmid, shmaddr, shmflg);
+}
+SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmdt(shmaddr);
+}
Do we actually want to allow shared tagged memory? Who's going to tag it? If not, we can document it as not supported.
-- Catalin
On Wed, May 22, 2019 at 02:16:57PM -0700, Evgenii Stepanov wrote:
On Wed, May 22, 2019 at 4:49 AM Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
I'll go through them one by one to see if we can tighten the expected ABI while having the MTE in mind.
diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index b44065fb1616..933bb9f3d6ec 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
{
	if (offset_in_page(off) != 0)
		return -EINVAL;
+	addr = untagged_addr(addr);
	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}
If user passes a tagged pointer to mmap() and the address is honoured (or MAP_FIXED is given), what is the expected return pointer? Does it need to be tagged with the value from the hint?
For HWASan the most convenient would be to use the tag from the hint. But since in the TBI (not MTE) mode the kernel has no idea what meaning userspace assigns to pointer tags, perhaps it should not try to guess, and should return raw (zero-tagged) address instead.
Then, just to relax the ABI for hwasan, shall we simply disallow tagged pointers on mmap() arguments? We can leave them in for mremap(old_address), madvise().
With MTE, we may want to use this as a request for the default colour of the mapped pages (still under discussion).
I like this - and in that case it would make sense to return the pointer that can be immediately dereferenced without crashing the process, i.e. with the matching tag.
This came up from the Android investigation work where large memory allocations (using mmap) could be more efficiently pre-tagged by the kernel on page fault. Not sure about the implementation details yet.
On Thu, May 23, 2019 at 2:04 AM Catalin Marinas catalin.marinas@arm.com wrote:
On Wed, May 22, 2019 at 02:16:57PM -0700, Evgenii Stepanov wrote:
On Wed, May 22, 2019 at 4:49 AM Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
I'll go through them one by one to see if we can tighten the expected ABI while having the MTE in mind.
diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index b44065fb1616..933bb9f3d6ec 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
{
	if (offset_in_page(off) != 0)
		return -EINVAL;
+	addr = untagged_addr(addr);
	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}
If user passes a tagged pointer to mmap() and the address is honoured (or MAP_FIXED is given), what is the expected return pointer? Does it need to be tagged with the value from the hint?
For HWASan the most convenient would be to use the tag from the hint. But since in the TBI (not MTE) mode the kernel has no idea what meaning userspace assigns to pointer tags, perhaps it should not try to guess, and should return raw (zero-tagged) address instead.
Then, just to relax the ABI for hwasan, shall we simply disallow tagged pointers on mmap() arguments? We can leave them in for mremap(old_address), madvise().
I think this would be fine. We should allow tagged pointers in mprotect though.
With MTE, we may want to use this as a request for the default colour of the mapped pages (still under discussion).
I like this - and in that case it would make sense to return the pointer that can be immediately dereferenced without crashing the process, i.e. with the matching tag.
This came up from the Android investigation work where large memory allocations (using mmap) could be more efficiently pre-tagged by the kernel on page fault. Not sure about the implementation details yet.
-- Catalin
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
+SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
+SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
I think this may be a copy/paste error...
Shouldn't mlock2 have a third 'flags' argument to distinguish it from mlock?
Thanks,
Andrew Murray
+SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_munlock(start, len);
+}
+SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
+		unsigned long, prot)
+{
+	start = untagged_addr(start);
+	return ksys_mprotect_pkey(start, len, prot, -1);
+}
+SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_msync(start, len, flags);
+}
+SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
+		unsigned char __user *, vec)
+{
+	start = untagged_addr(start);
+	return ksys_mincore(start, len, vec);
+}
+SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
+		unsigned long, size, unsigned long, prot,
+		unsigned long, pgoff, unsigned long, flags)
+{
+	start = untagged_addr(start);
+	return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
+SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmat(shmid, shmaddr, shmflg);
+}
+SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
+{
+	shmaddr = untagged_addr(shmaddr);
+	return ksys_shmdt(shmaddr);
+}
/*
 * Wrappers to pass the pt_regs argument.
 */
#define sys_personality		sys_arm64_personality
+#define sys_mmap_pgoff		sys_arm64_mmap_pgoff
+#define sys_mremap		sys_arm64_mremap
+#define sys_munmap		sys_arm64_munmap
+#define sys_brk			sys_arm64_brk
+#define sys_get_mempolicy	sys_arm64_get_mempolicy
+#define sys_madvise		sys_arm64_madvise
+#define sys_mbind		sys_arm64_mbind
+#define sys_mlock		sys_arm64_mlock
+#define sys_mlock2		sys_arm64_mlock2
+#define sys_munlock		sys_arm64_munlock
+#define sys_mprotect		sys_arm64_mprotect
+#define sys_msync		sys_arm64_msync
+#define sys_mincore		sys_arm64_mincore
+#define sys_remap_file_pages	sys_arm64_remap_file_pages
+#define sys_shmat		sys_arm64_shmat
+#define sys_shmdt		sys_arm64_shmdt

asmlinkage long sys_ni_syscall(const struct pt_regs *);
#define __arm64_sys_ni_syscall sys_ni_syscall
2.21.0.1020.gf2820cf01a-goog
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
+SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
+		unsigned long __user *, nmask, unsigned long, maxnode,
+		unsigned long, addr, unsigned long, flags)
+{
+	addr = untagged_addr(addr);
+	return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
+}
[...]
+SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
+		unsigned long, mode, const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	start = untagged_addr(start);
+	return ksys_mbind(start, len, mode, nmask, maxnode, flags);
+}
The kernel fails to build with CONFIG_NUMA disabled because the above are in mm/mempolicy.c which is no longer compiled in.
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
+SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
+{
+	start = untagged_addr(start);
+	return ksys_mlock(start, len, VM_LOCKED);
+}
Copy/paste error: sys_mlock2() has 3 arguments and should call ksys_mlock2().
Still tracking down an LTP failure on test mlock01.
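A fixed wrapper would presumably look like the sketch below; it assumes a three-argument ksys_mlock2() helper (mirroring the generic mlock2() syscall) is available to the series:

SYSCALL_DEFINE3(arm64_mlock2, unsigned long, start, size_t, len,
		int, flags)
{
	/* Untag as before, but forward the flags argument instead of
	 * hard-coding VM_LOCKED via the mlock path. */
	start = untagged_addr(start);
	return ksys_mlock2(start, len, flags);
}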
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Actually, I don't think any of these wrappers get called (have you tested this patch?). Following commit 4378a7d4be30 ("arm64: implement syscall wrappers"), I think we have other macro names for overriding the sys_* ones.
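After that commit, the syscall table is built from __arm64_sys_* symbols generated by the wrapper macros, so the overrides would presumably have to follow the pattern already used for personality, along these lines (a sketch, not the actual fix):

/* Override the wrapper-generated symbols rather than the sys_* names. */
#define __arm64_sys_personality		__arm64_sys_arm64_personality
#define __arm64_sys_mremap		__arm64_sys_arm64_mremap
#define __arm64_sys_munmap		__arm64_sys_arm64_munmap
#define __arm64_sys_brk			__arm64_sys_arm64_brk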
On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Actually, I don't think any of these wrappers get called (have you tested this patch?). Following commit 4378a7d4be30 ("arm64: implement syscall wrappers"), I think we have other macro names for overriding the sys_* ones.
What is the value in adding these wrappers?
The untagged_addr macro is defined for all architectures in linux/mm.h, and these patches already use untagged_addr in generic code. Thus adding a few more untagged_addr calls in the generic syscall handlers (which turn into a no-op for most architectures) is surely better than adding wrappers?
Even if other architectures implement untagged_addr in the future, it would be more consistent if they untagged in the same places, and thus not adding these wrappers enforces that.
Thanks,
Andrew Murray
-- Catalin
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Tue, May 28, 2019 at 03:54:11PM +0100, Andrew Murray wrote:
On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Actually, I don't think any of these wrappers get called (have you tested this patch?). Following commit 4378a7d4be30 ("arm64: implement syscall wrappers"), I think we have other macro names for overriding the sys_* ones.
What is the value in adding these wrappers?
Not much value, initially proposed just to keep the core changes small. I'm fine with moving them back in the generic code (but see below).
I think another aspect is how we define the ABI. Is allowing tags to mlock() for example something specific to arm64 or would sparc ADI need the same? In the absence of other architectures defining such ABI, my preference would be to keep the wrappers in the arch code.
Assuming sparc won't implement untagged_addr(), we can place the macros back in the generic code but, as per the review here, we need to be more restrictive on where we allow tagged addresses. For example, if mmap() gets a tagged address with MAP_FIXED, is it expected to return the tag?
My thoughts on allowing tags (quick look):
brk - no
get_mempolicy - yes
madvise - yes
mbind - yes
mincore - yes
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
mmap_pgoff - not used on arm64
mprotect - yes
mremap - yes for old_address, no for new_address (on par with mmap)
msync - yes
munmap - probably no (mmap does not return tagged ptrs)
remap_file_pages - no (also deprecated syscall)
shmat, shmdt - shall we allow tagged addresses on shared memory?
The above is only about the TBI ABI while ignoring hardware MTE. For the latter, we may want to change the mmap() to allow pre-colouring on page fault which means that munmap()/mprotect() should also support tagged pointers. Possibly mremap() as well but we need to decide whether it should allow re-colouring the page (probably no, in which case old_address and new_address should have the same tag). For some of these we'll end up with arm64 specific wrappers again, unless sparc ADI adopts exactly the same ABI restrictions.
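To make the mmap() idea concrete, pre-colouring would look roughly like this from userspace (purely hypothetical ABI; under the current series the tag in the hint is simply untagged and otherwise ignored):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical MTE extension: a tagged hint asks the kernel to
 * pre-colour faulted-in pages with the hint's tag. */
static void *map_precoloured(size_t len, unsigned long tag)
{
	void *hint = (void *)(tag << 56);	/* tag in the top byte */

	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}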
On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
[...]
My thoughts on allowing tags (quick look):
brk - no
[...]
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
[...]
mprotect - yes
I haven't been following this discussion closely... what's the rationale for the inconsistencies here (feel free to refer me back to the discussion if it's elsewhere).
Cheers ---Dave
On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
[...]
My thoughts on allowing tags (quick look):
brk - no
[...]
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
[...]
mprotect - yes
I haven't been following this discussion closely... what's the rationale for the inconsistencies here (feel free to refer me back to the discussion if it's elsewhere).
_My_ rationale (feel free to disagree) is that mmap() by default would not return a tagged address (ignoring MTE for now). If it gets passed a tagged address or a "tagged NULL" (for lack of a better name) we don't have clear semantics of whether the returned address should be tagged in this ABI relaxation. I'd rather reserve this specific behaviour if we overload the non-zero tag meaning of mmap() for MTE. Similar reasoning for mremap(), at least on the new_address argument (not entirely sure about old_address).
munmap() should probably follow the mmap() rules.
As for brk(), I don't see why the user would need to pass a tagged address, we can't associate any meaning to this tag.
For the rest, since it's likely such addresses would have been tagged by malloc() in user space, we should allow tagged pointers.
On Tue, May 28, 2019 at 05:34:00PM +0100, Catalin Marinas wrote:
On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
[...]
My thoughts on allowing tags (quick look):
brk - no
[...]
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
[...]
mprotect - yes
I haven't been following this discussion closely... what's the rationale for the inconsistencies here (feel free to refer me back to the discussion if it's elsewhere).
_My_ rationale (feel free to disagree) is that mmap() by default would not return a tagged address (ignoring MTE for now). If it gets passed a tagged address or a "tagged NULL" (for lack of a better name) we don't have clear semantics of whether the returned address should be tagged in this ABI relaxation. I'd rather reserve this specific behaviour if we overload the non-zero tag meaning of mmap() for MTE. Similar reasoning for mremap(), at least on the new_address argument (not entirely sure about old_address).
munmap() should probably follow the mmap() rules.
As for brk(), I don't see why the user would need to pass a tagged address, we can't associate any meaning to this tag.
For the rest, since it's likely such addresses would have been tagged by malloc() in user space, we should allow tagged pointers.
Those arguments seem reasonable. We should try to capture this somewhere when documenting the ABI.
To be clear, I'm not sure that we should guarantee anywhere that a tagged pointer is rejected: rather the behaviour should probably be left unspecified. Then we can tidy it up incrementally.
(The behaviour is unspecified today, in any case.)
Cheers ---Dave
On Wed, May 29, 2019 at 01:42:25PM +0100, Dave P Martin wrote:
On Tue, May 28, 2019 at 05:34:00PM +0100, Catalin Marinas wrote:
On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
[...]
My thoughts on allowing tags (quick look):
brk - no
[...]
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
[...]
mprotect - yes
I haven't been following this discussion closely... what's the rationale for the inconsistencies here (feel free to refer me back to the discussion if it's elsewhere).
_My_ rationale (feel free to disagree) is that mmap() by default would not return a tagged address (ignoring MTE for now). If it gets passed a tagged address or a "tagged NULL" (for lack of a better name) we don't have clear semantics of whether the returned address should be tagged in this ABI relaxation. I'd rather reserve this specific behaviour if we overload the non-zero tag meaning of mmap() for MTE. Similar reasoning for mremap(), at least on the new_address argument (not entirely sure about old_address).
munmap() should probably follow the mmap() rules.
As for brk(), I don't see why the user would need to pass a tagged address, we can't associate any meaning to this tag.
For the rest, since it's likely such addresses would have been tagged by malloc() in user space, we should allow tagged pointers.
Those arguments seem reasonable. We should try to capture this somewhere when documenting the ABI.
To be clear, I'm not sure that we should guarantee anywhere that a tagged pointer is rejected: rather the behaviour should probably be left unspecified. Then we can tidy it up incrementally.
(The behaviour is unspecified today, in any case.)
What is specified (or rather de-facto ABI) today is that passing a user address above TASK_SIZE (e.g. non-zero top byte) would fail in most cases. If we relax this with the TBI we may end up with some de-facto ABI before we actually get MTE hardware. Tightening it afterwards may be slightly more problematic, although MTE needs to be an explicit opt-in.
IOW, I wouldn't want to unnecessarily relax the ABI if we don't need to.
On Wed, May 29, 2019 at 02:23:42PM +0100, Catalin Marinas wrote:
On Wed, May 29, 2019 at 01:42:25PM +0100, Dave P Martin wrote:
On Tue, May 28, 2019 at 05:34:00PM +0100, Catalin Marinas wrote:
On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
[...]
My thoughts on allowing tags (quick look):
brk - no
[...]
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
[...]
mprotect - yes
I haven't been following this discussion closely... what's the rationale for the inconsistencies here (feel free to refer me back to the discussion if it's elsewhere).
_My_ rationale (feel free to disagree) is that mmap() by default would not return a tagged address (ignoring MTE for now). If it gets passed a tagged address or a "tagged NULL" (for lack of a better name) we don't have clear semantics of whether the returned address should be tagged in this ABI relaxation. I'd rather reserve this specific behaviour if we overload the non-zero tag meaning of mmap() for MTE. Similar reasoning for mremap(), at least on the new_address argument (not entirely sure about old_address).
munmap() should probably follow the mmap() rules.
As for brk(), I don't see why the user would need to pass a tagged address, we can't associate any meaning to this tag.
For the rest, since it's likely such addresses would have been tagged by malloc() in user space, we should allow tagged pointers.
Those arguments seem reasonable. We should try to capture this somewhere when documenting the ABI.
To be clear, I'm not sure that we should guarantee anywhere that a tagged pointer is rejected: rather the behaviour should probably be left unspecified. Then we can tidy it up incrementally.
(The behaviour is unspecified today, in any case.)
What is specified (or rather de-facto ABI) today is that passing a user address above TASK_SIZE (e.g. non-zero top byte) would fail in most cases. If we relax this with the TBI we may end up with some de-facto
I may be being too picky, but "would fail in most cases" sounds like "unspecified" ?
ABI before we actually get MTE hardware. Tightening it afterwards may be slightly more problematic, although MTE needs to be an explicit opt-in.
IOW, I wouldn't want to unnecessarily relax the ABI if we don't need to.
So long we don't block foreseeable future developments unnecessarily either -- I agree there's a balance to be struck.
I guess this can be reviewed when we have nailed down the details a bit further.
Cheers ---Dave
On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
On Tue, May 28, 2019 at 03:54:11PM +0100, Andrew Murray wrote:
On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap, remap_file_pages, shmat and shmdt.
This is done by untagging pointers passed to these syscalls in the prologues of their handlers.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Actually, I don't think any of these wrappers get called (have you tested this patch?). Following commit 4378a7d4be30 ("arm64: implement syscall wrappers"), I think we have other macro names for overriding the sys_* ones.
What is the value in adding these wrappers?
Not much value, initially proposed just to keep the core changes small. I'm fine with moving them back in the generic code (but see below).
I think another aspect is how we define the ABI. Is allowing tags to mlock() for example something specific to arm64 or would sparc ADI need the same? In the absence of other architectures defining such ABI, my preference would be to keep the wrappers in the arch code.
Assuming sparc won't implement untagged_addr(), we can place the macros back in the generic code but, as per the review here, we need to be more restrictive on where we allow tagged addresses. For example, if mmap() gets a tagged address with MAP_FIXED, is it expected to return the tag?
I would recommend against any ABI differences between ARM64 MTE/TBI and sparc ADI unless it simply can not be helped. My understanding of MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a tagged address has no meaning until the following steps happen:
1. set the user mode PSTATE.mcde bit. This acts as the master switch to enable ADI for a process.
2. set TTE.mcd bit on TLB entries that match the address range ADI is being enabled on.
3. Store version tag for the range of addresses userspace wants ADI enabled on using "stxa" instruction. These tags are stored in physical memory by the memory controller.
Steps 1 and 2 are accomplished by userspace by calling mprotect() with PROT_ADI. Tags are set by storing tags in a loop, for example:
	version = 10;
	tmp_addr = shmaddr;
	end = shmaddr + BUFFER_SIZE;
	while (tmp_addr < end) {
		asm volatile("stxa %1, [%0]0x90\n\t"
			     : : "r" (tmp_addr), "r" (version));
		tmp_addr += adi_blksz;
	}
With these semantics, giving mmap() or shmat() a tagged address is meaningless since no tags have been stored at the addresses mmap() will allocate, and one can not store tags before the memory range has been allocated. If we choose to allow tagged addresses to come into mmap() and shmat(), sparc code can strip the tags unconditionally and that may help simplify the ABI and/or code.
My thoughts on allowing tags (quick look):
brk - no
get_mempolicy - yes
madvise - yes
mbind - yes
mincore - yes
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
mmap_pgoff - not used on arm64
mprotect - yes
mremap - yes for old_address, no for new_address (on par with mmap)
msync - yes
munmap - probably no (mmap does not return tagged ptrs)
remap_file_pages - no (also deprecated syscall)
shmat, shmdt - shall we allow tagged addresses on shared memory?
The above is only about the TBI ABI while ignoring hardware MTE. For the latter, we may want to change the mmap() to allow pre-colouring on page fault which means that munmap()/mprotect() should also support tagged pointers. Possibly mremap() as well but we need to decide whether it should allow re-colouring the page (probably no, in which case old_address and new_address should have the same tag). For some of these we'll end up with arm64 specific wrappers again, unless sparc ADI adopts exactly the same ABI restrictions.
Let us keep any restrictions common across ARM64 and sparc. Pre-coloring on sparc in the kernel would mean the kernel will have to execute stxa instructions in a loop for each page being faulted in. Not that big a deal, but doesn't that assume the entire page has the same tag, which is deduced from the upper bits of the address? Shouldn't we support tags at the same granularity level as what the hardware supports? We went through this discussion for sparc and the decision was to support tags at the same granularity as hardware. That means we can not deduce tags from the first address that points into an mmap or shmat region. Those tags and the upper bytes of a colored address could change for every cacheline-sized block (64 bytes on sparc M7). We can try to store tags for an entire region in vma but that is expensive, plus on sparc tags are set in userspace with no participation from kernel and now we need a way for userspace to communicate the tags to kernel. From sparc point of view, making kernel responsible for assigning tags to a page on page fault is full of pitfalls.
Thanks, Khalid
Hi Khalid,
On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
I think another aspect is how we define the ABI. Is allowing tags to mlock() for example something specific to arm64 or would sparc ADI need the same? In the absence of other architectures defining such ABI, my preference would be to keep the wrappers in the arch code.
Assuming sparc won't implement untagged_addr(), we can place the macros back in the generic code but, as per the review here, we need to be more restrictive on where we allow tagged addresses. For example, if mmap() gets a tagged address with MAP_FIXED, is it expected to return the tag?
I would recommend against any ABI differences between ARM64 MTE/TBI and sparc ADI unless it simply can not be helped. My understanding of MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a tagged address has no meaning until the following steps happen:
Before we go into the MTE/ADI similarities or differences, just to clarify that TBI is something that we supported from the start of the arm64 kernel port. TBI (top byte ignore) allows a user pointer to have non-zero top byte and dereference it without causing a fault (the hardware masks it out). The user/kernel ABI does not allow such tagged pointers into the kernel, nor would the kernel return any such tagged addresses.
With MTE (memory tagging extensions), the top-byte meaning is changed from no longer being ignored to actually being checked against a tag in the physical RAM (we call it allocation tag).
1. set the user mode PSTATE.mcde bit. This acts as the master switch to enable ADI for a process.
2. set TTE.mcd bit on TLB entries that match the address range ADI is being enabled on.
Something close enough for MTE, with the difference that enabling it is not a PSTATE bit but rather a system control bit (SCTLR_EL1 register), so only the kernel can turn it on/off for the user.
3. Store version tag for the range of addresses userspace wants ADI enabled on using "stxa" instruction. These tags are stored in physical memory by the memory controller.
Do you have an "ldxa" instruction to load the tags from physical memory?
Steps 1 and 2 are accomplished by userspace by calling mprotect() with PROT_ADI. Tags are set by storing tags in a loop, for example:
	version = 10;
	tmp_addr = shmaddr;
	end = shmaddr + BUFFER_SIZE;
	while (tmp_addr < end) {
		asm volatile("stxa %1, [%0]0x90\n\t"
			     : : "r" (tmp_addr), "r" (version));
		tmp_addr += adi_blksz;
	}
On arm64, a sequence similar to the above would live in the libc. So a malloc() call will tag the memory and return the tagged address to the user.
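For illustration, under plain TBI the libc side needs no special instruction at all; a made-up tag_pointer() helper shows the idea (a real allocator such as HWASan's would also record the tag in shadow memory):

#include <stdint.h>

/* Put an 8-bit tag in the top byte of a pointer. With TBI the CPU
 * masks this byte out on loads/stores, so the result still
 * dereferences the same memory. */
static inline void *tag_pointer(void *ptr, uint8_t tag)
{
	uint64_t addr = (uint64_t)ptr & ~(0xffULL << 56);

	return (void *)(addr | ((uint64_t)tag << 56));
}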
We were not planning for a PROT_ADI/MTE but rather to have MTE enabled for all user memory ranges. We may revisit this before we upstream the MTE support (probably some marginal benefit for the hardware not fetching the tags from memory if we don't need to, e.g. code sections).
Given that we already have the TBI feature and with MTE enabled the top byte is no longer ignored, we are planning for an explicit opt-in by the user via prctl() to enable MTE.
With these semantics, giving mmap() or shmat() a tagged address is meaningless since no tags have been stored at the addresses mmap() will allocate, and one can not store tags before the memory range has been allocated. If we choose to allow tagged addresses to come into mmap() and shmat(), sparc code can strip the tags unconditionally and that may help simplify the ABI and/or code.
We could say that with TBI (pre-MTE support), the top byte is actually ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address, should the user expect the same tagged address back or stripping the tag is acceptable? If we want to keep the current mmap() semantics, I'd say the same tag is returned. However, with MTE this also implies that the memory was coloured.
My thoughts on allowing tags (quick look):
brk - no
get_mempolicy - yes
madvise - yes
mbind - yes
mincore - yes
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
mmap_pgoff - not used on arm64
mprotect - yes
mremap - yes for old_address, no for new_address (on par with mmap)
msync - yes
munmap - probably no (mmap does not return tagged ptrs)
remap_file_pages - no (also deprecated syscall)
shmat, shmdt - shall we allow tagged addresses on shared memory?
The above is only about the TBI ABI while ignoring hardware MTE. For the latter, we may want to change the mmap() to allow pre-colouring on page fault which means that munmap()/mprotect() should also support tagged pointers. Possibly mremap() as well but we need to decide whether it should allow re-colouring the page (probably no, in which case old_address and new_address should have the same tag). For some of these we'll end up with arm64 specific wrappers again, unless sparc ADI adopts exactly the same ABI restrictions.
Let us keep any restrictions common across ARM64 and sparc. Pre-coloring on sparc in the kernel would mean the kernel will have to execute stxa instructions in a loop for each page being faulted in.
Since the user can probe the pre-existing colour in a faulted-in page (either with some 'ldxa' instruction or by performing a tag-checked access), the kernel should always pre-colour (even if colour 0) any allocated page. There might not be an obvious security risk but I feel uneasy about letting colours leak between address spaces (different user processes or between kernel and user).
Since we already need such loop in the kernel, we might as well allow user space to require a certain colour. This comes in handy for large malloc() and another advantage is that the C library won't be stuck trying to paint the whole range (think GB).
Not that big a deal, but doesn't that assume the entire page has the same tag, which is deduced from the upper bits of the address? Shouldn't we support tags at the same granularity level as what the hardware supports?
That's mostly about large malloc() optimisation via mmap(), the latter working on page granularity already. There is another use-case for pre-coloured thread stacks, also allocated via anonymous mmap().
We went through this discussion for sparc and the decision was to support tags at the same granularity as hardware. That means we can not deduce tags from the first address that points into an mmap or shmat region. Those tags and the upper bytes of a colored address could change for every cacheline-sized block (64 bytes on sparc M7).
It's 16-byte for arm64, so smaller than the cacheline.
We can try to store tags for an entire region in vma but that is expensive, plus on sparc tags are set in userspace with no participation from kernel and now we need a way for userspace to communicate the tags to kernel.
We can't support finer granularity through the mmap() syscall and, as you said, the vma is not the right thing to store the individual tags. With the above extension to mmap(), we'd have to store a colour per vma and prevent merging if different colours (we could as well use the pkeys mechanism we already have in the kernel but use a colour per vma instead of a key).
Of course, the user is allowed to change the in-memory colours at a finer granularity and the kernel will preserve them during swapping out/in, page migration etc. The above mmap() proposal is just for the first fault-in of a page in a given range/vma.
From sparc point of view, making kernel responsible for assigning tags to a page on page fault is full of pitfalls.
This could be just some arm64-specific but if you plan to deploy it more generically for sparc (at the C library level), you may find this useful.
On 5/29/19 8:20 AM, Catalin Marinas wrote:
Hi Khalid,
On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
I think another aspect is how we define the ABI. Is allowing tags to mlock() for example something specific to arm64 or would sparc ADI need the same? In the absence of other architectures defining such ABI, my preference would be to keep the wrappers in the arch code.
Assuming sparc won't implement untagged_addr(), we can place the macros back in the generic code but, as per the review here, we need to be more restrictive on where we allow tagged addresses. For example, if mmap() gets a tagged address with MAP_FIXED, is it expected to return the tag?
I would recommend against any ABI differences between ARM64 MTE/TBI and sparc ADI unless it simply can not be helped. My understanding of MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a tagged address has no meaning until the following steps happen:
Before we go into the MTE/ADI similarities or differences, just to clarify that TBI is something that we supported from the start of the arm64 kernel port. TBI (top byte ignore) allows a user pointer to have non-zero top byte and dereference it without causing a fault (the hardware masks it out). The user/kernel ABI does not allow such tagged pointers into the kernel, nor would the kernel return any such tagged addresses.
With MTE (memory tagging extensions), the top-byte meaning is changed from no longer being ignored to actually being checked against a tag in the physical RAM (we call it allocation tag).
1. set the user mode PSTATE.mcde bit. This acts as the master switch to enable ADI for a process.
2. set TTE.mcd bit on TLB entries that match the address range ADI is being enabled on.
Something close enough for MTE, with the difference that enabling it is not a PSTATE bit but rather a system control bit (SCTLR_EL1 register), so only the kernel can turn it on/off for the user.
3. Store version tag for the range of addresses userspace wants ADI enabled on using "stxa" instruction. These tags are stored in physical memory by the memory controller.
Do you have an "ldxa" instruction to load the tags from physical memory?
Yes, "ldxa" can be used to read current tag for any memory location. Kernel uses it to read the tags for a physical page being swapped out and restores those tags when the page is swapped back in.
Steps 1 and 2 are accomplished by userspace by calling mprotect() with PROT_ADI. Tags are set by storing tags in a loop, for example:
	version = 10;
	tmp_addr = shmaddr;
	end = shmaddr + BUFFER_SIZE;
	while (tmp_addr < end) {
		asm volatile("stxa %1, [%0]0x90\n\t"
			     : : "r" (tmp_addr), "r" (version));
		tmp_addr += adi_blksz;
	}
On arm64, a sequence similar to the above would live in the libc. So a malloc() call will tag the memory and return the tagged address to the user.
Pre-coloring could easily be done by the user.
We were not planning for a PROT_ADI/MTE but rather to have MTE enabled for all user memory ranges. We may revisit this before we upstream the MTE support (probably some marginal benefit for the hardware not fetching the tags from memory if we don't need to, e.g. code sections).
Given that we already have the TBI feature and with MTE enabled the top byte is no longer ignored, we are planning for an explicit opt-in by the user via prctl() to enable MTE.
OK. I had initially proposed enabling ADI for a process using prctl(). Feedback I got was prctl was not a desirable interface and I ended up making mprotect() with PROT_ADI enable ADI on the process instead. Just something to keep in mind.
With these semantics, giving mmap() or shmat() a tagged address is meaningless since no tags have been stored at the addresses mmap() will allocate, and one can not store tags before the memory range has been allocated. If we choose to allow tagged addresses to come into mmap() and shmat(), sparc code can strip the tags unconditionally and that may help simplify the ABI and/or code.
We could say that with TBI (pre-MTE support), the top byte is actually ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address, should the user expect the same tagged address back or stripping the tag is acceptable? If we want to keep the current mmap() semantics, I'd say the same tag is returned. However, with MTE this also implies that the memory was coloured.
Is assigning a tag a privileged operation on ARM64? I am thinking not, since you mentioned libc could do it in a loop for malloc'd memory. mmap() can return the same tagged address but I am uneasy about the kernel pre-coloring the pages. A database can mmap 100s of GB of memory. That is a lot of work being offloaded to the kernel to pre-color the pages, even if done in batches as pages are faulted in.
My thoughts on allowing tags (quick look):
brk - no
get_mempolicy - yes
madvise - yes
mbind - yes
mincore - yes
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
mmap_pgoff - not used on arm64
mprotect - yes
mremap - yes for old_address, no for new_address (on par with mmap)
msync - yes
munmap - probably no (mmap does not return tagged ptrs)
remap_file_pages - no (also deprecated syscall)
shmat, shmdt - shall we allow tagged addresses on shared memory?
The above is only about the TBI ABI while ignoring hardware MTE. For the latter, we may want to change the mmap() to allow pre-colouring on page fault which means that munmap()/mprotect() should also support tagged pointers. Possibly mremap() as well but we need to decide whether it should allow re-colouring the page (probably no, in which case old_address and new_address should have the same tag). For some of these we'll end up with arm64 specific wrappers again, unless sparc ADI adopts exactly the same ABI restrictions.
Let us keep any restrictions common across ARM64 and sparc. Pre-coloring on sparc in the kernel would mean the kernel will have to execute stxa instructions in a loop for each page being faulted in.
Since the user can probe the pre-existing colour in a faulted-in page (either with some 'ldxa' instruction or by performing a tag-checked access), the kernel should always pre-colour (even if colour 0) any allocated page. There might not be an obvious security risk but I feel uneasy about letting colours leak between address spaces (different user processes or between kernel and user).
On sparc, tags 0 and 15 are special in that 0 means untagged memory and 15 means match any tag in the address. Colour 0 is the default for any newly faulted in page on sparc.
Since we already need such loop in the kernel, we might as well allow user space to require a certain colour. This comes in handy for large malloc() and another advantage is that the C library won't be stuck trying to paint the whole range (think GB).
If kernel is going to pre-color all pages in a vma, we will need to store the default tag in the vma. It will add more time to page fault handling code. On sparc M7, kernel will need to execute additional 128 stxa instructions to set the tags on a normal page.
Not that big a deal, but doesn't that assume the entire page has the same tag, which is deduced from the upper bits of the address? Shouldn't we support tags at the same granularity level as what the hardware supports?
That's mostly about large malloc() optimisation via mmap(), the latter working on page granularity already. There is another use-case for pre-coloured thread stacks, also allocated via anonymous mmap().
We went through this discussion for sparc and the decision was to support tags at the same granularity as hardware. That means we can not deduce tags from the first address that points into an mmap or shmat region. Those tags and the upper bytes of a colored address could change for every cacheline-sized block (64 bytes on sparc M7).
It's 16-byte for arm64, so smaller than the cacheline.
We can try to store tags for an entire region in vma but that is expensive, plus on sparc tags are set in userspace with no participation from kernel and now we need a way for userspace to communicate the tags to kernel.
We can't support finer granularity through the mmap() syscall and, as you said, the vma is not the right thing to store the individual tags. With the above extension to mmap(), we'd have to store a colour per vma and prevent merging if different colours (we could as well use the pkeys mechanism we already have in the kernel but use a colour per vma instead of a key).
Since tags can change on any part of mmap region on sparc at any time without kernel being involved, I am not sure I see much reason for kernel to enforce any tag related restrictions.
Of course, the user is allowed to change the in-memory colours at a finer granularity and the kernel will preserve them during swapping out/in, page migration etc. The above mmap() proposal is just for the first fault-in of a page in a given range/vma.
From sparc point of view, making kernel responsible for assigning tags to a page on page fault is full of pitfalls.
This could be just some arm64-specific but if you plan to deploy it more generically for sparc (at the C library level), you may find this useful.
Common semantics from app developer point of view will be very useful to maintain. If arm64 says mmap with MAP_FIXED and a tagged address will return a pre-colored page, I would rather have it be the same on any architecture. Is there a use case that justifies kernel doing this extra work?
-- Khalid
On Wed, May 29, 2019 at 01:16:37PM -0600, Khalid Aziz wrote:
On 5/29/19 8:20 AM, Catalin Marinas wrote:
On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
Steps 1 and 2 are accomplished by userspace by calling mprotect() with PROT_ADI. Tags are set by storing tags in a loop, for example:
	version = 10;
	tmp_addr = shmaddr;
	end = shmaddr + BUFFER_SIZE;
	while (tmp_addr < end) {
		asm volatile("stxa %1, [%0]0x90\n\t"
			     : : "r" (tmp_addr), "r" (version));
		tmp_addr += adi_blksz;
	}
On arm64, a sequence similar to the above would live in the libc. So a malloc() call will tag the memory and return the tagged address to the user.
Pre-coloring could easily be done by the user.
We were not planning for a PROT_ADI/MTE but rather to have MTE enabled for all user memory ranges. We may revisit this before we upstream the MTE support (probably some marginal benefit for the hardware not fetching the tags from memory if we don't need to, e.g. code sections).
Given that we already have the TBI feature and with MTE enabled the top byte is no longer ignored, we are planning for an explicit opt-in by the user via prctl() to enable MTE.
OK. I had initially proposed enabling ADI for a process using prctl(). Feedback I got was prctl was not a desirable interface and I ended up making mprotect() with PROT_ADI enable ADI on the process instead. Just something to keep in mind.
Thanks for the feedback. We'll keep this in mind when adding MTE support. In the way we plan to deploy this, it would be a libc decision to invoke the mmap() with the right flag.
This could actually simplify the automatic page faulting below brk(), basically no tagged/coloured memory allowed implicitly. It needs feedback from the bionic/glibc folk.
With these semantics, giving mmap() or shmat() a tagged address is meaningless since no tags have been stored at the addresses mmap() will allocate, and one can not store tags before the memory range has been allocated. If we choose to allow tagged addresses to come into mmap() and shmat(), sparc code can strip the tags unconditionally and that may help simplify the ABI and/or code.
We could say that with TBI (pre-MTE support), the top byte is actually ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address, should the user expect the same tagged address back or stripping the tag is acceptable? If we want to keep the current mmap() semantics, I'd say the same tag is returned. However, with MTE this also implies that the memory was coloured.
Is assigning a tag a privileged operation on ARM64? I am thinking not, since you mentioned libc could do it in a loop for malloc'd memory.
Indeed it's not, the user can do it.
mmap() can return the same tagged address but I am uneasy about the kernel pre-coloring the pages. A database can mmap 100s of GB of memory. That is a lot of work being offloaded to the kernel to pre-color the pages, even if done in batches as pages are faulted in.
For anonymous mmap() for example, the kernel would have to zero the faulted in pages anyway. We can handle the colouring at the same time in clear_user_page() (as I said below, we have to clear the colour anyway from previous uses, so it's simply extending this to support something other than tag/colour 0 by default with no additional overhead).
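A rough sketch of that combined zero-and-colour pass, with TAG_GRANULE_SIZE and set_mem_tag() standing in for whatever the architecture provides (both names are made up; sparc would use stxa, arm64 the MTE tag-store instructions):

/* Zero a newly faulted-in page and give it a default colour in the
 * same pass, instead of leaving the tags from its previous use. */
static void clear_and_colour_user_page(void *page, unsigned long tag)
{
	unsigned long off;

	clear_page(page);	/* the existing zeroing step */
	for (off = 0; off < PAGE_SIZE; off += TAG_GRANULE_SIZE)
		set_mem_tag(page + off, tag);	/* hypothetical tag store */
}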
Since the user can probe the pre-existing colour in a faulted-in page (either with some 'ldxa' instruction or by performing a tag-checked access), the kernel should always pre-colour (even if colour 0) any allocated page. There might not be an obvious security risk but I feel uneasy about letting colours leak between address spaces (different user processes or between kernel and user).
On sparc, tags 0 and 15 are special in that 0 means untagged memory and 15 means match any tag in the address. Colour 0 is the default for any newly faulted in page on sparc.
With MTE we don't have match-all/any tag in memory, only in the virtual address/pointer. So if we turn on MTE for all pages and the user accesses an address with a 0 tag, the underlying memory needs to be coloured with the same 0 value.
Since we already need such loop in the kernel, we might as well allow user space to require a certain colour. This comes in handy for large malloc() and another advantage is that the C library won't be stuck trying to paint the whole range (think GB).
If kernel is going to pre-color all pages in a vma, we will need to store the default tag in the vma. It will add more time to page fault handling code. On sparc M7, kernel will need to execute additional 128 stxa instructions to set the tags on a normal page.
As I said, since the user can retrieve an old colour using ldxa, the kernel should perform this operation anyway on any newly allocated page (unless you clear the existing colour on page freeing).
We can try to store tags for an entire region in vma but that is expensive, plus on sparc tags are set in userspace with no participation from kernel and now we need a way for userspace to communicate the tags to kernel.
We can't support finer granularity through the mmap() syscall and, as you said, the vma is not the right thing to store the individual tags. With the above extension to mmap(), we'd have to store a colour per vma and prevent merging if different colours (we could as well use the pkeys mechanism we already have in the kernel but use a colour per vma instead of a key).
Since tags can change on any part of mmap region on sparc at any time without kernel being involved, I am not sure I see much reason for kernel to enforce any tag related restrictions.
It's not enforcing a tag, more like the default colour for a faulted in page. Anyway, if sparc is going with default 0/untagged, that's fine as well. We may add this mmap() option to arm64 only.
From sparc point of view, making kernel responsible for assigning tags to a page on page fault is full of pitfalls.
This could be just some arm64-specific but if you plan to deploy it more generically for sparc (at the C library level), you may find this useful.
Common semantics from app developer point of view will be very useful to maintain. If arm64 says mmap with MAP_FIXED and a tagged address will return a pre-colored page, I would rather have it be the same on any architecture. Is there a use case that justifies kernel doing this extra work?
So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB, IIUC for sparc the faulted-in pages will have random colours (on 64-byte granularity). Ignoring the information leak from prior uses of such pages, it would be the responsibility of the db program to issue the stxa. On arm64, since we also want to do this via malloc(), any large allocation would require all pages to be faulted in so that malloc() can set the right colour before being handed over to the user. That's what we want to avoid and the user is free to repaint the memory as it likes.
On 5/30/19 9:11 AM, Catalin Marinas wrote:
On Wed, May 29, 2019 at 01:16:37PM -0600, Khalid Aziz wrote:
mmap() can return the same tagged address but I am uneasy about the kernel pre-coloring the pages. A database can mmap 100s of GB of memory. That is a lot of work being offloaded to the kernel to pre-color the pages, even if done in batches as pages are faulted in.
For anonymous mmap() for example, the kernel would have to zero the faulted in pages anyway. We can handle the colouring at the same time in clear_user_page() (as I said below, we have to clear the colour anyway from previous uses, so it's simply extending this to support something other than tag/colour 0 by default with no additional overhead).
On sparc M7, clear_user_page() ends up in M7clear_user_page defined in arch/sparc/lib/M7memset.S. The M7 code uses Block Init Store (BIS) to clear the page. BIS on M7 clears the memory tags as well and no separate instructions are needed to clear the tags. As a result, when the kernel clears a page before returning it to the user, the page is not only zeroed out, its tags are also cleared to 0.
Since we already need such loop in the kernel, we might as well allow user space to require a certain colour. This comes in handy for large malloc() and another advantage is that the C library won't be stuck trying to paint the whole range (think GB).
If kernel is going to pre-color all pages in a vma, we will need to store the default tag in the vma. It will add more time to page fault handling code. On sparc M7, kernel will need to execute additional 128 stxa instructions to set the tags on a normal page.
As I said, since the user can retrieve an old colour using ldxa, the kernel should perform this operation anyway on any newly allocated page (unless you clear the existing colour on page freeing).
Tags are not cleared on sparc on freeing. They get cleared when the page is allocated again.
We can try to store tags for an entire region in vma but that is expensive, plus on sparc tags are set in userspace with no participation from kernel and now we need a way for userspace to communicate the tags to kernel.
We can't support finer granularity through the mmap() syscall and, as you said, the vma is not the right thing to store the individual tags. With the above extension to mmap(), we'd have to store a colour per vma and prevent merging if different colours (we could as well use the pkeys mechanism we already have in the kernel but use a colour per vma instead of a key).
Since tags can change on any part of mmap region on sparc at any time without kernel being involved, I am not sure I see much reason for kernel to enforce any tag related restrictions.
It's not enforcing a tag, more like the default colour for a faulted in page. Anyway, if sparc is going with default 0/untagged, that's fine as well. We may add this mmap() option to arm64 only.
From sparc point of view, making kernel responsible for assigning tags to a page on page fault is full of pitfalls.
This could be just arm64-specific, but if you plan to deploy it more generically for sparc (at the C library level), you may find this useful.
Common semantics from the app developer point of view will be very useful to maintain. If arm64 says mmap() with MAP_FIXED and a tagged address will return a pre-colored page, I would rather have it be the same on any architecture. Is there a use case that justifies the kernel doing this extra work?
So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB, IIUC for sparc the faulted-in pages will have random colours (on 64-byte granularity). Ignoring the information leak from prior uses of such pages, it would be the responsibility of the db program to issue the stxa. On arm64, since we also want to do this via malloc(), any large allocation would require all pages to be faulted in so that malloc() can set the write colour before being handed over to the user. That's what we want to avoid and the user is free to repaint the memory as it likes.
On sparc, any newly allocated page is cleared along with any old tags on it. Since clearing tags happens automatically when a page is cleared on sparc, clear_user_page() will need to execute additional stxa instructions to set a new tag. It is doable. In a way it is done already, if the page is always pre-colored with tag 0 ;) Where would the pre-defined tag be stored - as part of the address stored in vm_start, or a new field in vm_area_struct?
-- Khalid
On Thu, May 30, 2019 at 10:05:55AM -0600, Khalid Aziz wrote:
On 5/30/19 9:11 AM, Catalin Marinas wrote:
So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB, IIUC for sparc the faulted-in pages will have random colours (on 64-byte granularity). Ignoring the information leak from prior uses of such pages, it would be the responsibility of the db program to issue the stxa. On arm64, since we also want to do this via malloc(), any large allocation would require all pages to be faulted in so that malloc() can set the write colour before being handed over to the user. That's what we want to avoid and the user is free to repaint the memory as it likes.
On sparc, any newly allocated page is cleared along with any old tags on it. Since clearing tags happens automatically when a page is cleared on sparc, clear_user_page() will need to execute additional stxa instructions to set a new tag. It is doable. In a way it is done already, if the page is always pre-colored with tag 0 ;)
Ah, good to know. On arm64 we'd have to use different instructions, although the same loop.
Where would the pre-defined tag be stored - as part of the address stored in vm_start, or a new field in vm_area_struct?
I think we can discuss the details when we post the actual MTE patches. In our internal hack we overloaded the VM_HIGH_ARCH_* flags and selected CONFIG_ARCH_USES_HIGH_VMA_FLAGS (used for pkeys on x86).
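For reference, a hedged sketch of what overloading the high vma flag bits could look like; the names and bit layout below are invented for illustration and are not the actual MTE patches:

#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
/* Hypothetical: keep a 4-bit default tag in the VM_HIGH_ARCH_* bits,
 * the way x86 pkeys uses them. */
#define VM_DEFAULT_TAG_MASK	(VM_HIGH_ARCH_0 | VM_HIGH_ARCH_1 | \
				 VM_HIGH_ARCH_2 | VM_HIGH_ARCH_3)

static inline unsigned long vma_default_tag(struct vm_area_struct *vma)
{
	/* The bits are contiguous, so dividing by the lowest one is a
	 * right shift down to bit 0. */
	return (vma->vm_flags & VM_DEFAULT_TAG_MASK) / VM_HIGH_ARCH_0;
}
#endif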
For the time being, I'd rather restrict tagged addresses passed to mmap() until we agree that they have any meaning. If we allowed them now but they got ignored (though probably no-one would be doing this), I feel it would be slightly harder to change the semantics afterwards.
Thanks.
On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
 /*
  * Wrappers to pass the pt_regs argument.
  */
 #define sys_personality		sys_arm64_personality
+#define sys_mmap_pgoff		sys_arm64_mmap_pgoff
+#define sys_mremap		sys_arm64_mremap
+#define sys_munmap		sys_arm64_munmap
+#define sys_brk			sys_arm64_brk
+#define sys_get_mempolicy	sys_arm64_get_mempolicy
+#define sys_madvise		sys_arm64_madvise
+#define sys_mbind		sys_arm64_mbind
+#define sys_mlock		sys_arm64_mlock
+#define sys_mlock2		sys_arm64_mlock2
+#define sys_munlock		sys_arm64_munlock
+#define sys_mprotect		sys_arm64_mprotect
+#define sys_msync		sys_arm64_msync
+#define sys_mincore		sys_arm64_mincore
+#define sys_remap_file_pages	sys_arm64_remap_file_pages
+#define sys_shmat		sys_arm64_shmat
+#define sys_shmdt		sys_arm64_shmdt
This hunk should be (I sent a separate patch for sys_personality):
@@ -160,23 +163,23 @@ SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
 /*
  * Wrappers to pass the pt_regs argument.
  */
-#define sys_personality		sys_arm64_personality
-#define sys_mmap_pgoff		sys_arm64_mmap_pgoff
-#define sys_mremap		sys_arm64_mremap
-#define sys_munmap		sys_arm64_munmap
-#define sys_brk			sys_arm64_brk
-#define sys_get_mempolicy	sys_arm64_get_mempolicy
-#define sys_madvise		sys_arm64_madvise
-#define sys_mbind		sys_arm64_mbind
-#define sys_mlock		sys_arm64_mlock
-#define sys_mlock2		sys_arm64_mlock2
-#define sys_munlock		sys_arm64_munlock
-#define sys_mprotect		sys_arm64_mprotect
-#define sys_msync		sys_arm64_msync
-#define sys_mincore		sys_arm64_mincore
-#define sys_remap_file_pages	sys_arm64_remap_file_pages
-#define sys_shmat		sys_arm64_shmat
-#define sys_shmdt		sys_arm64_shmdt
+#define __arm64_sys_personality		__arm64_sys_arm64_personality
+#define __arm64_sys_mmap_pgoff		__arm64_sys_arm64_mmap_pgoff
+#define __arm64_sys_mremap		__arm64_sys_arm64_mremap
+#define __arm64_sys_munmap		__arm64_sys_arm64_munmap
+#define __arm64_sys_brk			__arm64_sys_arm64_brk
+#define __arm64_sys_get_mempolicy	__arm64_sys_arm64_get_mempolicy
+#define __arm64_sys_madvise		__arm64_sys_arm64_madvise
+#define __arm64_sys_mbind		__arm64_sys_arm64_mbind
+#define __arm64_sys_mlock		__arm64_sys_arm64_mlock
+#define __arm64_sys_mlock2		__arm64_sys_arm64_mlock2
+#define __arm64_sys_munlock		__arm64_sys_arm64_munlock
+#define __arm64_sys_mprotect		__arm64_sys_arm64_mprotect
+#define __arm64_sys_msync		__arm64_sys_arm64_msync
+#define __arm64_sys_mincore		__arm64_sys_arm64_mincore
+#define __arm64_sys_remap_file_pages	__arm64_sys_arm64_remap_file_pages
+#define __arm64_sys_shmat		__arm64_sys_arm64_shmat
+#define __arm64_sys_shmdt		__arm64_sys_arm64_shmdt
 asmlinkage long sys_ni_syscall(const struct pt_regs *);
 #define __arm64_sys_ni_syscall	sys_ni_syscall
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
do_pages_move() is used in the implementation of the move_pages syscall.
Untag user pointers in this function.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 mm/migrate.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/mm/migrate.c b/mm/migrate.c
index 663a5449367a..c014a07135f0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1617,6 +1617,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		if (get_user(node, nodes + i))
 			goto out_flush;
 		addr = (unsigned long)p;
+		addr = untagged_addr(addr);
 
 		err = -ENODEV;
 		if (node < 0 || node >= MAX_NUMNODES)
On Mon, May 06, 2019 at 06:30:52PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
do_pages_move() is used in the implementation of the move_pages syscall.
Untag user pointers in this function.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Reviewed-by: Catalin Marinas catalin.marinas@arm.com
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
mm/gup.c provides a kernel interface that accepts user addresses and manipulates user pages directly (for example get_user_pages, which is used by the futex syscall). Since a user can provide tagged addresses, we need to handle this case.
Add untagging to gup.c functions that use user addresses for vma lookups.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 mm/gup.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index 91819b8ad9cc..2f477a0a7180 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -696,6 +696,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	if (!nr_pages)
 		return 0;
 
+	start = untagged_addr(start);
+
 	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
 
 	/*
@@ -858,6 +860,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	vm_fault_t ret, major = 0;
 
+	address = untagged_addr(address);
+
 	if (unlocked)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
On Mon, May 06, 2019 at 06:30:53PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
mm/gup.c provides a kernel interface that accepts user addresses and manipulates user pages directly (for example get_user_pages, which is used by the futex syscall). Since a user can provide tagged addresses, we need to handle this case.
Add untagging to gup.c functions that use user addresses for vma lookups.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Reviewed-by: Catalin Marinas catalin.marinas@arm.com
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
get_vaddr_frames uses provided user pointers for vma lookups, which can only be done with untagged pointers. Instead of locating and changing all callers of this function, perform the untagging in it.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 mm/frame_vector.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index c64dca6e27c2..c431ca81dad5 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -46,6 +46,8 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 	if (WARN_ON_ONCE(nr_frames > vec->nr_allocated))
 		nr_frames = vec->nr_allocated;
 
+	start = untagged_addr(start);
+
 	down_read(&mm->mmap_sem);
 	locked = 1;
 	vma = find_vma_intersection(mm, start, start + 1);
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
In copy_mount_options a user address is subtracted from TASK_SIZE. If the address is lower than TASK_SIZE, the size is calculated so that the exact_copy_from_user() call cannot cross the TASK_SIZE boundary. However, if the address is tagged, the size will be calculated incorrectly.
Untag the address before subtracting.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index c9cab307fa77..c27e5713bf04 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2825,7 +2825,7 @@ void *copy_mount_options(const void __user * data)
 	 * the remainder of the page.
 	 */
 	/* copy_from_user cannot cross TASK_SIZE ! */
-	size = TASK_SIZE - (unsigned long)data;
+	size = TASK_SIZE - (unsigned long)untagged_addr(data);
 	if (size > PAGE_SIZE)
 		size = PAGE_SIZE;
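To spell out the failure mode, a worked illustration with made-up values (not code from the patch):

/* Illustration only (hypothetical values): why the unsigned
 * subtraction goes wrong for a tagged pointer. */
static unsigned long remaining_to_task_size(void)
{
	unsigned long task_size = 1UL << 48;		/* example TASK_SIZE */
	unsigned long tagged = 0x5a00ffffdeadb000UL;	/* tag 0x5a in top byte */

	/* tagged > task_size, so this wraps to a huge value; the caller's
	 * "if (size > PAGE_SIZE) size = PAGE_SIZE;" clamp then lets the
	 * copy run past TASK_SIZE. Untagging first gives the real
	 * remaining byte count. */
	return task_size - tagged;
}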
On Mon, May 06, 2019 at 06:30:55PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
In copy_mount_options a user address is subtracted from TASK_SIZE. If the address is lower than TASK_SIZE, the size is calculated so that the exact_copy_from_user() call cannot cross the TASK_SIZE boundary. However, if the address is tagged, the size will be calculated incorrectly.
Untag the address before subtracting.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Reviewed-by: Catalin Marinas catalin.marinas@arm.com
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
The userfaultfd code uses provided user pointers for vma lookups, which can only be done with untagged pointers.
Untag user pointers in validate_range().
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 fs/userfaultfd.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f5de1e726356..aa47ed0969dd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1261,21 +1261,23 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 }
 
 static __always_inline int validate_range(struct mm_struct *mm,
-					  __u64 start, __u64 len)
+					  __u64 *start, __u64 len)
 {
 	__u64 task_size = mm->task_size;
 
-	if (start & ~PAGE_MASK)
+	*start = untagged_addr(*start);
+
+	if (*start & ~PAGE_MASK)
 		return -EINVAL;
 	if (len & ~PAGE_MASK)
 		return -EINVAL;
 	if (!len)
 		return -EINVAL;
-	if (start < mmap_min_addr)
+	if (*start < mmap_min_addr)
 		return -EINVAL;
-	if (start >= task_size)
+	if (*start >= task_size)
 		return -EINVAL;
-	if (len > task_size - start)
+	if (len > task_size - *start)
 		return -EINVAL;
 	return 0;
 }
@@ -1325,7 +1327,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		goto out;
 	}
 
-	ret = validate_range(mm, uffdio_register.range.start,
+	ret = validate_range(mm, &uffdio_register.range.start,
 			     uffdio_register.range.len);
 	if (ret)
 		goto out;
@@ -1514,7 +1516,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
 		goto out;
 
-	ret = validate_range(mm, uffdio_unregister.start,
+	ret = validate_range(mm, &uffdio_unregister.start,
 			     uffdio_unregister.len);
 	if (ret)
 		goto out;
@@ -1665,7 +1667,7 @@ static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
 	if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
 		goto out;
 
-	ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
+	ret = validate_range(ctx->mm, &uffdio_wake.start, uffdio_wake.len);
 	if (ret)
 		goto out;
 
@@ -1705,7 +1707,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 			   sizeof(uffdio_copy)-sizeof(__s64)))
 		goto out;
 
-	ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+	ret = validate_range(ctx->mm, &uffdio_copy.dst, uffdio_copy.len);
 	if (ret)
 		goto out;
 	/*
@@ -1761,7 +1763,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 			   sizeof(uffdio_zeropage)-sizeof(__s64)))
 		goto out;
 
-	ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+	ret = validate_range(ctx->mm, &uffdio_zeropage.range.start,
 			     uffdio_zeropage.range.len);
 	if (ret)
 		goto out;
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
In amdgpu_gem_userptr_ioctl() and amdgpu_amdkfd_gpuvm.c/init_user_pages() an MMU notifier is set up with a (tagged) userspace pointer. The untagged address should be used so that MMU notifiers for the untagged address get correctly matched up with the right BO. This patch untags user pointers in amdgpu_gem_userptr_ioctl() for the GEM case and in amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu() for the KFD case. This also makes sure that an untagged pointer is passed to amdgpu_ttm_tt_get_user_pages(), which uses it for vma lookups.
Suggested-by: Kuehling, Felix Felix.Kuehling@amd.com
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c          | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 1921dec3df7a..20cac44ed449 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1121,7 +1121,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 		alloc_flags = 0;
 		if (!offset || !*offset)
 			return -EINVAL;
-		user_addr = *offset;
+		user_addr = untagged_addr(*offset);
 	} else if (flags & ALLOC_MEM_FLAGS_DOORBELL) {
 		domain = AMDGPU_GEM_DOMAIN_GTT;
 		alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index d21dd2f369da..985cb82b2aa6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -286,6 +286,8 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
 	uint32_t handle;
 	int r;
 
+	args->addr = untagged_addr(args->addr);
+
 	if (offset_in_page(args->addr | args->size))
 		return -EINVAL;
On 2019-05-06 12:30 p.m., Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
In amdgpu_gem_userptr_ioctl() and amdgpu_amdkfd_gpuvm.c/init_user_pages() an MMU notifier is set up with a (tagged) userspace pointer. The untagged address should be used so that MMU notifiers for the untagged address get correctly matched up with the right BO. This patch untags user pointers in amdgpu_gem_userptr_ioctl() for the GEM case and in amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu() for the KFD case. This also makes sure that an untagged pointer is passed to amdgpu_ttm_tt_get_user_pages(), which uses it for vma lookups.
Suggested-by: Kuehling, Felix Felix.Kuehling@amd.com Signed-off-by: Andrey Konovalov andreyknvl@google.com
Acked-by: Felix Kuehling Felix.Kuehling@amd.com
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c          | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 1921dec3df7a..20cac44ed449 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1121,7 +1121,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 		alloc_flags = 0;
 		if (!offset || !*offset)
 			return -EINVAL;
-		user_addr = *offset;
+		user_addr = untagged_addr(*offset);
 	} else if (flags & ALLOC_MEM_FLAGS_DOORBELL) {
 		domain = AMDGPU_GEM_DOMAIN_GTT;
 		alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index d21dd2f369da..985cb82b2aa6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -286,6 +286,8 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
 	uint32_t handle;
 	int r;
 
+	args->addr = untagged_addr(args->addr);
+
 	if (offset_in_page(args->addr | args->size))
 		return -EINVAL;
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
In radeon_gem_userptr_ioctl() an MMU notifier is set up with a (tagged) userspace pointer. The untagged address should be used so that MMU notifiers for the untagged address get correctly matched up with the right BO. This function also calls radeon_ttm_tt_pin_userptr(), which uses provided user pointers for vma lookups, which can only be done with untagged pointers.
This patch untags user pointers in radeon_gem_userptr_ioctl().
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 drivers/gpu/drm/radeon/radeon_gem.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 44617dec8183..90eb78fb5eb2 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -291,6 +291,8 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
 	uint32_t handle;
 	int r;
 
+	args->addr = untagged_addr(args->addr);
+
 	if (offset_in_page(args->addr | args->size))
 		return -EINVAL;
On 2019-05-06 12:30 p.m., Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
In radeon_gem_userptr_ioctl() an MMU notifier is set up with a (tagged) userspace pointer. The untagged address should be used so that MMU notifiers for the untagged address get correctly matched up with the right BO. This function also calls radeon_ttm_tt_pin_userptr(), which uses provided user pointers for vma lookups, which can only be done with untagged pointers.
This patch untags user pointers in radeon_gem_userptr_ioctl().
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Acked-by: Felix Kuehling Felix.Kuehling@amd.com
 drivers/gpu/drm/radeon/radeon_gem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 44617dec8183..90eb78fb5eb2 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -291,6 +291,8 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
 	uint32_t handle;
 	int r;
 
+	args->addr = untagged_addr(args->addr);
+
 	if (offset_in_page(args->addr | args->size))
 		return -EINVAL;
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through e.g. mlx4_get_umem_mr()), which can only be done with untagged pointers.
Untag user pointers in these functions.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 drivers/infiniband/core/uverbs_cmd.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 062a86c04123..36e7b52577d0 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -708,6 +708,8 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
 	if (ret)
 		return ret;
 
+	cmd.start = untagged_addr(cmd.start);
+
 	if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
 		return -EINVAL;
 
@@ -790,6 +792,8 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
 	if (ret)
 		return ret;
 
+	cmd.start = untagged_addr(cmd.start);
+
 	if (cmd.flags & ~IB_MR_REREG_SUPPORTED || !cmd.flags)
 		return -EINVAL;
On Mon, May 06, 2019 at 06:30:59PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through e.g. mlx4_get_umem_mr()), which can only be done with untagged pointers.
Untag user pointers in these functions.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
drivers/infiniband/core/uverbs_cmd.c | 4 ++++ 1 file changed, 4 insertions(+)
I think this is OK.. We should really get it tested though.. Leon?
Jason
On Mon, May 06, 2019 at 04:50:20PM -0300, Jason Gunthorpe wrote:
On Mon, May 06, 2019 at 06:30:59PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through e.g. mlx4_get_umem_mr()), which can only be done with untagged pointers.
Untag user pointers in these functions.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
drivers/infiniband/core/uverbs_cmd.c | 4 ++++ 1 file changed, 4 insertions(+)
I think this is OK.. We should really get it tested though.. Leon?
It can be done after v5.2-rc1.
Thanks
Jason
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
videobuf_dma_contig_user_get() uses provided user pointers for vma lookups, which can only be done with untagged pointers.
Untag the pointers in this function.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index e1bf50df4c70..8a1ddd146b17 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -160,6 +160,7 @@ static void videobuf_dma_contig_user_put(struct videobuf_dma_contig_memory *mem)
 static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 					struct videobuf_buffer *vb)
 {
+	unsigned long untagged_baddr = untagged_addr(vb->baddr);
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long prev_pfn, this_pfn;
@@ -167,22 +168,22 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	unsigned int offset;
 	int ret;
 
-	offset = vb->baddr & ~PAGE_MASK;
+	offset = untagged_baddr & ~PAGE_MASK;
 	mem->size = PAGE_ALIGN(vb->size + offset);
 	ret = -EINVAL;
 
 	down_read(&mm->mmap_sem);
 
-	vma = find_vma(mm, vb->baddr);
+	vma = find_vma(mm, untagged_baddr);
 	if (!vma)
 		goto out_up;
 
-	if ((vb->baddr + mem->size) > vma->vm_end)
+	if ((untagged_baddr + mem->size) > vma->vm_end)
 		goto out_up;
 
 	pages_done = 0;
 	prev_pfn = 0;	/* kill warning */
-	user_address = vb->baddr;
+	user_address = untagged_baddr;
 
 	while (pages_done < (mem->size >> PAGE_SHIFT)) {
 		ret = follow_pfn(vma, user_address, &this_pfn);
On Mon, 6 May 2019 18:31:00 +0200, Andrey Konovalov andreyknvl@google.com wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
videobuf_dma_contig_user_get() uses provided user pointers for vma lookups, which can only be done with untagged pointers.
Untag the pointers in this function.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
Acked-by: Mauro Carvalho Chehab mchehab+samsung@kernel.org
 drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index e1bf50df4c70..8a1ddd146b17 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -160,6 +160,7 @@ static void videobuf_dma_contig_user_put(struct videobuf_dma_contig_memory *mem)
 static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 					struct videobuf_buffer *vb)
 {
+	unsigned long untagged_baddr = untagged_addr(vb->baddr);
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long prev_pfn, this_pfn;
@@ -167,22 +168,22 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	unsigned int offset;
 	int ret;
 
-	offset = vb->baddr & ~PAGE_MASK;
+	offset = untagged_baddr & ~PAGE_MASK;
 	mem->size = PAGE_ALIGN(vb->size + offset);
 	ret = -EINVAL;
 
 	down_read(&mm->mmap_sem);
 
-	vma = find_vma(mm, vb->baddr);
+	vma = find_vma(mm, untagged_baddr);
 	if (!vma)
 		goto out_up;
 
-	if ((vb->baddr + mem->size) > vma->vm_end)
+	if ((untagged_baddr + mem->size) > vma->vm_end)
 		goto out_up;
 
 	pages_done = 0;
 	prev_pfn = 0;	/* kill warning */
-	user_address = vb->baddr;
+	user_address = untagged_baddr;
 
 	while (pages_done < (mem->size >> PAGE_SHIFT)) {
 		ret = follow_pfn(vma, user_address, &this_pfn);
Thanks, Mauro
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
tee_shm_register()->optee_shm_unregister()->check_mem_type() uses provided user pointers for vma lookups (via __check_mem_type()), which can only be done with untagged pointers.
Untag user pointers in this function.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 drivers/tee/tee_shm.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index 0b9ab1d0dd45..8e7b52ab6c63 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -263,6 +263,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, unsigned long addr,
 	shm->teedev = teedev;
 	shm->ctx = ctx;
 	shm->id = -1;
+	addr = untagged_addr(addr);
 	start = rounddown(addr, PAGE_SIZE);
 	shm->offset = addr - start;
 	shm->size = length;
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
vaddr_get_pfn() uses provided user pointers for vma lookups, which can only be done with untagged pointers.
Untag user pointers in this function.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 drivers/vfio/vfio_iommu_type1.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d0f731c9920a..5daa966d799e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -382,6 +382,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 	down_read(&mm->mmap_sem);
 
+	vaddr = untagged_addr(vaddr);
+
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch adds a simple test that calls the uname syscall with a tagged user pointer as an argument. Without the kernel accepting tagged user pointers, the test fails with EFAULT.
Signed-off-by: Andrey Konovalov andreyknvl@google.com
---
 tools/testing/selftests/arm64/.gitignore      |  1 +
 tools/testing/selftests/arm64/Makefile        | 11 ++++++++++
 .../testing/selftests/arm64/run_tags_test.sh  | 12 +++++++++++
 tools/testing/selftests/arm64/tags_test.c     | 21 +++++++++++++++++++
 4 files changed, 45 insertions(+)
 create mode 100644 tools/testing/selftests/arm64/.gitignore
 create mode 100644 tools/testing/selftests/arm64/Makefile
 create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
 create mode 100644 tools/testing/selftests/arm64/tags_test.c
diff --git a/tools/testing/selftests/arm64/.gitignore b/tools/testing/selftests/arm64/.gitignore
new file mode 100644
index 000000000000..e8fae8d61ed6
--- /dev/null
+++ b/tools/testing/selftests/arm64/.gitignore
@@ -0,0 +1 @@
+tags_test
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
new file mode 100644
index 000000000000..a61b2e743e99
--- /dev/null
+++ b/tools/testing/selftests/arm64/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# ARCH can be overridden by the user for cross compiling
+ARCH ?= $(shell uname -m 2>/dev/null || echo not)
+
+ifneq (,$(filter $(ARCH),aarch64 arm64))
+TEST_GEN_PROGS := tags_test
+TEST_PROGS := run_tags_test.sh
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/arm64/run_tags_test.sh b/tools/testing/selftests/arm64/run_tags_test.sh
new file mode 100755
index 000000000000..745f11379930
--- /dev/null
+++ b/tools/testing/selftests/arm64/run_tags_test.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+echo "--------------------"
+echo "running tags test"
+echo "--------------------"
+./tags_test
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+else
+	echo "[PASS]"
+fi
diff --git a/tools/testing/selftests/arm64/tags_test.c b/tools/testing/selftests/arm64/tags_test.c
new file mode 100644
index 000000000000..2bd1830a7ebe
--- /dev/null
+++ b/tools/testing/selftests/arm64/tags_test.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <sys/utsname.h>
+
+#define SHIFT_TAG(tag)		((uint64_t)(tag) << 56)
+#define SET_TAG(ptr, tag)	(((uint64_t)(ptr) & ~SHIFT_TAG(0xff)) | \
+				SHIFT_TAG(tag))
+
+int main(void)
+{
+	struct utsname *ptr = (struct utsname *)malloc(sizeof(*ptr));
+	void *tagged_ptr = (void *)SET_TAG(ptr, 0x42);
+	int err = uname(tagged_ptr);
+
+	free(ptr);
+	return err;
+}
On Mon, May 06, 2019 at 06:31:03PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch adds a simple test that calls the uname syscall with a tagged user pointer as an argument. Without the kernel accepting tagged user pointers, the test fails with EFAULT.
That's probably sufficient for a simple example. Something we could add to Documentation maybe is a small library that can be LD_PRELOAD'ed so that you can run a lot more tests like LTP.
We could add this to selftests but I think it's too glibc specific.
--------------------8<------------------------------------
#include <stdlib.h>

#define TAG_SHIFT	(56)
#define TAG_MASK	(0xffUL << TAG_SHIFT)

void *__libc_malloc(size_t size);
void __libc_free(void *ptr);
void *__libc_realloc(void *ptr, size_t size);
void *__libc_calloc(size_t nmemb, size_t size);

static void *tag_ptr(void *ptr)
{
	unsigned long tag = rand() & 0xff;

	if (!ptr)
		return ptr;
	return (void *)((unsigned long)ptr | (tag << TAG_SHIFT));
}

static void *untag_ptr(void *ptr)
{
	return (void *)((unsigned long)ptr & ~TAG_MASK);
}

void *malloc(size_t size)
{
	return tag_ptr(__libc_malloc(size));
}

void free(void *ptr)
{
	__libc_free(untag_ptr(ptr));
}

void *realloc(void *ptr, size_t size)
{
	return tag_ptr(__libc_realloc(untag_ptr(ptr), size));
}

void *calloc(size_t nmemb, size_t size)
{
	return tag_ptr(__libc_calloc(nmemb, size));
}
On Wed, May 22, 2019 at 4:16 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 06, 2019 at 06:31:03PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch adds a simple test that calls the uname syscall with a tagged user pointer as an argument. Without the kernel accepting tagged user pointers, the test fails with EFAULT.
That's probably sufficient for a simple example. Something we could add to Documentation maybe is a small library that can be LD_PRELOAD'ed so that you can run a lot more tests like LTP.
Should I add this into this series, or should this go into Vincenzo's patchset?
We could add this to selftests but I think it's too glibc specific.
--------------------8<------------------------------------
#include <stdlib.h>

#define TAG_SHIFT	(56)
#define TAG_MASK	(0xffUL << TAG_SHIFT)

void *__libc_malloc(size_t size);
void __libc_free(void *ptr);
void *__libc_realloc(void *ptr, size_t size);
void *__libc_calloc(size_t nmemb, size_t size);

static void *tag_ptr(void *ptr)
{
	unsigned long tag = rand() & 0xff;

	if (!ptr)
		return ptr;
	return (void *)((unsigned long)ptr | (tag << TAG_SHIFT));
}

static void *untag_ptr(void *ptr)
{
	return (void *)((unsigned long)ptr & ~TAG_MASK);
}

void *malloc(size_t size)
{
	return tag_ptr(__libc_malloc(size));
}

void free(void *ptr)
{
	__libc_free(untag_ptr(ptr));
}

void *realloc(void *ptr, size_t size)
{
	return tag_ptr(__libc_realloc(untag_ptr(ptr), size));
}

void *calloc(size_t nmemb, size_t size)
{
	return tag_ptr(__libc_calloc(nmemb, size));
}
On Fri, May 31, 2019 at 04:21:48PM +0200, Andrey Konovalov wrote:
On Wed, May 22, 2019 at 4:16 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Mon, May 06, 2019 at 06:31:03PM +0200, Andrey Konovalov wrote:
This patch is a part of a series that extends the arm64 kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments.
This patch adds a simple test that calls the uname syscall with a tagged user pointer as an argument. Without the kernel accepting tagged user pointers, the test fails with EFAULT.
That's probably sufficient for a simple example. Something we could add to Documentation maybe is a small library that can be LD_PRELOAD'ed so that you can run a lot more tests like LTP.
Should I add this into this series, or should this go into Vincenzo's patchset?
If you can tweak the selftest Makefile to build a library and force it with LD_PRELOAD, you can keep it with this patch. It would be easier to extend to other syscall tests, signal handling etc.
Hi Andrey,
On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical.
The more I look at this problem, the less convinced I am that we can solve it in a way that results in a stable ABI covering ioctls(). While for the Android kernel codebase it could be simpler as you don't upgrade the kernel version every 2.5 months, for the mainline kernel this doesn't scale. Any run-time checks are relatively limited in terms of drivers covered. Better static checking would be nice as a long term solution but we didn't get anywhere with the discussion last year.
IMO (RFC for now), I see two ways forward:
1. Make this a user space problem and do not allow tagged pointers into the syscall ABI. A libc wrapper would have to convert structures, parameters before passing them into the kernel. Note that we can still support the hardware MTE in the kernel by enabling tagged memory ranges, saving/restoring tags etc. but not allowing tagged addresses at the syscall boundary.
2. Similar shim to the above libc wrapper but inside the kernel (arch/arm64 only; most pointer arguments could be covered with an __SC_CAST similar to the s390 one). There are two differences from what we've discussed in the past:
a) this is an opt-in by the user which would have to explicitly call prctl(). If it returns -ENOTSUPP etc., the user won't be allowed to pass tagged pointers to the kernel. This would probably be the responsibility of the C lib to make sure it doesn't tag heap allocations. If the user did not opt-in, the syscalls are routed through the normal path (no untagging address shim).
b) ioctl() and other blacklisted syscalls (prctl) will not accept tagged pointers (to be documented in Vincenzo's ABI patches).
It doesn't solve the problems we are trying to address but 2.a saves us from blindly relaxing the ABI without knowing how to easily assess new code being merged (over 500K lines between kernel versions). Existing applications (which don't opt in) won't inadvertently start using the new ABI, which could otherwise risk becoming a de-facto ABI that we need to support in the long run.
Option 1 wouldn't solve the ioctl() problem either and while it makes things simpler for the kernel, I am aware that it's slightly more complicated in user space (but I really don't mind if you prefer option 1 ;)).
The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such a feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for the hwasan.
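As a purely illustrative sketch of the opt-in in 2.a — the prctl constants below are invented for this example and are not a merged ABI:

#include <stdio.h>
#include <sys/prctl.h>

/* Invented constants, illustration only. */
#define PR_SET_TAGGED_ADDR_CTRL	55
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)

/* The C library would call this once at startup; only if it succeeds
 * may heap allocations be handed out with non-zero tags. */
static int tagged_addr_abi_opt_in(void)
{
	if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0)) {
		perror("prctl");	/* e.g. a kernel without the relaxed ABI */
		return -1;
	}
	return 0;
}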
Comments welcomed.
On Fri, May 17, 2019 at 7:49 AM Catalin Marinas catalin.marinas@arm.com wrote:
Hi Andrey,
On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical.
The more I look at this problem, the less convinced I am that we can solve it in a way that results in a stable ABI covering ioctls(). While for the Android kernel codebase it could be simpler as you don't upgrade the kernel version every 2.5 months, for the mainline kernel this doesn't scale. Any run-time checks are relatively limited in terms of drivers covered. Better static checking would be nice as a long term solution but we didn't get anywhere with the discussion last year.
IMO (RFC for now), I see two ways forward:
1. Make this a user space problem and do not allow tagged pointers into the syscall ABI. A libc wrapper would have to convert structures, parameters before passing them into the kernel. Note that we can still support the hardware MTE in the kernel by enabling tagged memory ranges, saving/restoring tags etc. but not allowing tagged addresses at the syscall boundary.
2. Similar shim to the above libc wrapper but inside the kernel (arch/arm64 only; most pointer arguments could be covered with an __SC_CAST similar to the s390 one). There are two differences from what we've discussed in the past:
a) this is an opt-in by the user which would have to explicitly call prctl(). If it returns -ENOTSUPP etc., the user won't be allowed to pass tagged pointers to the kernel. This would probably be the responsibility of the C lib to make sure it doesn't tag heap allocations. If the user did not opt-in, the syscalls are routed through the normal path (no untagging address shim).
b) ioctl() and other blacklisted syscalls (prctl) will not accept tagged pointers (to be documented in Vincenzo's ABI patches).
It doesn't solve the problems we are trying to address but 2.a saves us from blindly relaxing the ABI without knowing how to easily assess new code being merged (over 500K lines between kernel versions). Existing applications (which don't opt in) won't inadvertently start using the new ABI, which could otherwise risk becoming a de-facto ABI that we need to support in the long run.
Option 1 wouldn't solve the ioctl() problem either and while it makes things simpler for the kernel, I am aware that it's slightly more complicated in user space (but I really don't mind if you prefer option 1 ;)).
The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such a feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for the hwasan.
Comments welcomed.
Any userspace shim approach is problematic for Android because of the apps that use raw system calls. AFAIK, all apps written in Go are in that camp - I'm not sure how common they are, but getting them all recompiled is probably not realistic.
The way I see it, a patch that breaks handling of tagged pointers is not that different from, say, a patch that adds a wild pointer dereference. Both are bugs; the difference is that (a) the former breaks a relatively uncommon target and (b) it's arguably an easier mistake to make. If MTE adoption goes well, (a) will not be the case for long.
This is a bit of a chicken-and-egg problem. In a world where memory allocators on one or several popular platforms generate pointers with non-zero tags, any such breakage will be caught in testing. Unfortunately to reach that state we need the kernel to start accepting tagged pointers first, and then hold on for a couple of years until userspace catches up.
Perhaps we can start by whitelisting ioctls by driver?
On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
On Fri, May 17, 2019 at 7:49 AM Catalin Marinas catalin.marinas@arm.com wrote:
IMO (RFC for now), I see two ways forward:
1. Make this a user space problem and do not allow tagged pointers into the syscall ABI. A libc wrapper would have to convert structures, parameters before passing them into the kernel. Note that we can still support the hardware MTE in the kernel by enabling tagged memory ranges, saving/restoring tags etc. but not allowing tagged addresses at the syscall boundary.
2. Similar shim to the above libc wrapper but inside the kernel (arch/arm64 only; most pointer arguments could be covered with an __SC_CAST similar to the s390 one). There are two differences from what we've discussed in the past:
a) this is an opt-in by the user which would have to explicitly call prctl(). If it returns -ENOTSUPP etc., the user won't be allowed to pass tagged pointers to the kernel. This would probably be the responsibility of the C lib to make sure it doesn't tag heap allocations. If the user did not opt-in, the syscalls are routed through the normal path (no untagging address shim).
b) ioctl() and other blacklisted syscalls (prctl) will not accept tagged pointers (to be documented in Vincenzo's ABI patches).
[...]
Any userspace shim approach is problematic for Android because of the apps that use raw system calls. AFAIK, all apps written in Go are in that camp - I'm not sure how common they are, but getting them all recompiled is probably not realistic.
That's a fair point (I wasn't expecting it would get much traction anyway ;)). OTOH, it allows upstreaming of the MTE patches while we continue the discussions around TBI.
The way I see it, a patch that breaks handling of tagged pointers is not that different from, say, a patch that adds a wild pointer dereference. Both are bugs; the difference is that (a) the former breaks a relatively uncommon target and (b) it's arguably an easier mistake to make. If MTE adoption goes well, (a) will not be the case for long.
It's also the fact that such a patch would go unnoticed for a long time until someone exercises that code path. And when they do, the user would be pretty much in the dark trying to figure out what went wrong, why a SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed all the places where it matters in the current kernel codebase (ignoring future patches).
I think we should revisit the static checking discussions we had last year. Run-time checking (even with compiler instrumentation and syzkaller fuzzing) would only cover the code paths specific to a Linux or Android installation.
This is a bit of a chicken-and-egg problem. In a world where memory allocators on one or several popular platforms generate pointers with non-zero tags, any such breakage will be caught in testing. Unfortunately to reach that state we need the kernel to start accepting tagged pointers first, and then hold on for a couple of years until userspace catches up.
Would the kernel also catch up with providing a stable ABI? Because we have two moving targets.
On one hand, you have Android or some Linux distro that sticks to a stable kernel version for some time, so they have a better chance of clearing most of the problems. On the other hand, we have the mainline kernel that gets over 500K lines every release. As maintainer, I can't rely on my testing alone as this is on a limited number of platforms. So my concern is that every kernel release has a significant chance of breaking the ABI, unless we have a better way of identifying potential issues.
Perhaps we can start by whitelisting ioctls by driver?
This was also raised by Ruben in private, but without a (static) tool to check, manually going through all the drivers doesn't scale. It's very likely that most drivers don't care; just a get_user/put_user is already handled by these patches. Searching for find_vma() was identifying one such use-case but is this sufficient? Are there other cases where we need to explicitly untag a pointer?
The other point I'd like feedback on is 2.a above. I see _some_ value in having the user opt in to this relaxed ABI rather than blindly exposing it to all applications. Dave suggested (in private) a new personality (e.g. PER_LINUX_TBI) inherited by children. It would be the responsibility of the C library to check the current personality bits and only tag pointers on allocation *if* the kernel allowed it. The kernel could provide the AT_FLAGS bit as in Vincenzo's patches if the personality was set but can't set it retrospectively if the user called sys_personality. By default, /sbin/init would not have this personality and libc would not tag pointers, so we can guarantee that your distro boots normally with a new kernel version. We could have an envp that gets caught by /sbin/init so you can pass it on the kernel command line (or a dynamic loader at run-time). But the default should be the current ABI behaviour.
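A sketch of the C library side of the personality idea; PER_LINUX_TBI is only a suggested name in this thread, so the value below is invented for illustration:

#include <sys/personality.h>

#define PER_LINUX_TBI	0x0010000	/* invented value, illustration only */

/* libc would check the inherited persona before tagging allocations. */
static int may_tag_pointers(void)
{
	int persona = personality(0xffffffff);	/* query without changing */

	return persona != -1 && (persona & PER_LINUX_TBI);
}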
We can enforce the current behaviour by having access_ok() check the personality or a TIF flag but we may relax this enforcement at some point in the future as we learn more about the implications of TBI.
Thanks.
On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
On Fri, May 17, 2019 at 7:49 AM Catalin Marinas catalin.marinas@arm.com wrote:
IMO (RFC for now), I see two ways forward: [...] 2. Similar shim to the above libc wrapper but inside the kernel (arch/arm64 only; most pointer arguments could be covered with an __SC_CAST similar to the s390 one). There are two differences from what we've discussed in the past:
a) this is an opt-in by the user which would have to explicitly call prctl(). If it returns -ENOTSUPP etc., the user won't be allowed to pass tagged pointers to the kernel. This would probably be the responsibility of the C lib to make sure it doesn't tag heap allocations. If the user did not opt-in, the syscalls are routed through the normal path (no untagging address shim).
b) ioctl() and other blacklisted syscalls (prctl) will not accept tagged pointers (to be documented in Vincenzo's ABI patches).
The way I see it, a patch that breaks handling of tagged pointers is not that different from, say, a patch that adds a wild pointer dereference. Both are bugs; the difference is that (a) the former breaks a relatively uncommon target and (b) it's arguably an easier mistake to make. If MTE adoption goes well, (a) will not be the case for long.
It's also the fact that such a patch would go unnoticed for a long time until someone exercises that code path. And when they do, the user would be pretty much in the dark trying to figure out what went wrong, why a SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed all the places where it matters in the current kernel codebase (ignoring future patches).
So, looking forward a bit, this isn't going to be an ARM-specific issue for long. In fact, I think we shouldn't have arm-specific syscall wrappers in this series: I think untagged_addr() should likely be added at the top-level and have it be a no-op for other architectures. So given this becoming a kernel-wide multi-architecture issue (under the assumption that x86, RISC-V, and others will gain similar TBI or MTE things), we should solve it in a way that we can re-use.
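A minimal sketch of that layering, under the assumption that it follows the shape of this series (the arm64 body reflects TBI's bit-55 sign extension and is illustrative):

/* include/linux/mm.h (sketch): untagging is a no-op unless the
 * architecture overrides it. */
#ifndef untagged_addr
#define untagged_addr(addr)	(addr)
#endif

/* arch/arm64/include/asm/memory.h (sketch): clear the top byte by
 * sign-extending from bit 55, so both user (TTBR0) and kernel (TTBR1)
 * addresses stay canonical after untagging. */
#define untagged_addr(addr)	\
	((__typeof__(addr))sign_extend64((u64)(addr), 55))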
We need something that is going to work everywhere. And it needs to be supported by the kernel for the simple reason that the kernel needs to do MTE checks during copy_from_user(): having that information stripped means we lose any userspace-assigned MTE protections if they get handled by the kernel, which is a total non-starter, IMO.
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
[1] https://lore.kernel.org/patchwork/patch/654481/
This is a bit of a chicken-and-egg problem. In a world where memory allocators on one or several popular platforms generate pointers with non-zero tags, any such breakage will be caught in testing. Unfortunately to reach that state we need the kernel to start accepting tagged pointers first, and then hold on for a couple of years until userspace catches up.
Would the kernel also catch up with providing a stable ABI? Because we have two moving targets.
On one hand, you have Android or some Linux distro that sticks to a stable kernel version for some time, so they have a better chance of clearing most of the problems. On the other hand, we have the mainline kernel that gets over 500K lines every release. As maintainer, I can't rely on my testing alone as this is on a limited number of platforms. So my concern is that every kernel release has a significant chance of breaking the ABI, unless we have a better way of identifying potential issues.
I just want to make sure I fully understand your concern about this being an ABI break, and I work best with examples. The closest situation I can see would be:
- some program has no idea about MTE
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work
The trouble I see with this is that it is largely theoretical and requires part of userspace to collude to start using a new CPU feature that tickles a bug in the kernel. As I understand the golden rule, this is a bug in the kernel (a missed ioctl() or such) to be fixed, not a global breaking of some userspace behavior.
I feel like I'm missing something about this being seen as an ABI break. The kernel already fails on userspace addresses that have high bits set -- are there things that _depend_ on this failure to operate?
Hi Kees,
Thanks for joining the thread ;).
On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
On Fri, May 17, 2019 at 7:49 AM Catalin Marinas catalin.marinas@arm.com wrote:
IMO (RFC for now), I see two ways forward: [...] 2. Similar shim to the above libc wrapper but inside the kernel (arch/arm64 only; most pointer arguments could be covered with an __SC_CAST similar to the s390 one). There are two differences from what we've discussed in the past:
a) this is an opt-in by the user which would have to explicitly call prctl(). If it returns -ENOTSUPP etc., the user won't be allowed to pass tagged pointers to the kernel. This would probably be the responsibility of the C lib to make sure it doesn't tag heap allocations. If the user did not opt-in, the syscalls are routed through the normal path (no untagging address shim).
b) ioctl() and other blacklisted syscalls (prctl) will not accept tagged pointers (to be documented in Vincenzo's ABI patches).
The way I see it, a patch that breaks handling of tagged pointers is not that different from, say, a patch that adds a wild pointer dereference. Both are bugs; the difference is that (a) the former breaks a relatively uncommon target and (b) it's arguably an easier mistake to make. If MTE adoption goes well, (a) will not be the case for long.
It's also the fact that such a patch would go unnoticed for a long time until someone exercises that code path. And when they do, the user would be pretty much in the dark trying to figure out what went wrong, why a SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed all the places where it matters in the current kernel codebase (ignoring future patches).
So, looking forward a bit, this isn't going to be an ARM-specific issue for long.
I do hope so.
In fact, I think we shouldn't have arm-specific syscall wrappers in this series: I think untagged_addr() should likely be added at the top-level and have it be a no-op for other architectures.
That's what the current patchset does, so we have this as a starting point. Kostya raised another potential issue with the syscall wrappers: with MTE the kernel will be forced to enable the match-all (wildcard) pointers for user space accesses since copy_from_user() would only get a 0 tag. So it has wider implications than just uaccess routines not checking the colour.
So given this becoming a kernel-wide multi-architecture issue (under the assumption that x86, RISC-V, and others will gain similar TBI or MTE things), we should solve it in a way that we can re-use.
Can we do any better to aid the untagged_addr() placement (e.g. better type annotations, better static analysis)? We have to distinguish between user pointers that may be dereferenced by the kernel (I think almost fully covered with this patchset) and user addresses represented as ulong that may:
a) be converted to a user pointer and dereferenced; I think that's the case for many overloaded ulong/u64 arguments
b) used for address space management, rbtree look-ups etc. where the tag is no longer relevant and it even gets in the way
We tried last year to identify void __user * casts to unsigned long using sparse on the assumption that pointers can be tagged while ulong is about address space management and needs to lose such tag. I think we could have pushed this further. For example, get_user_pages() takes an unsigned long but it is perfectly capable of untagging the address itself. Shall we change its first argument to void __user * (together with all its callers)?
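For illustration, the kind of prototype change being floated here (sketch only; mainline's first parameter is currently an unsigned long):

/* Sketch: take a user pointer and let gup untag it internally, instead
 * of every caller doing so. */
long get_user_pages(void __user *start, unsigned long nr_pages,
		    unsigned int gup_flags, struct page **pages,
		    struct vm_area_struct **vmas);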
find_vma(), OTOH, could untag the address but it doesn't help since vm_start/end don't have such information (that's more about the content or type that the user decided) and the callers check against it.
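The problematic caller pattern, sketched: even if find_vma() untagged its argument internally, the range checks that follow would still compare the tagged address against the untagged vm_start/vm_end:

struct vm_area_struct *vma = find_vma(mm, addr);

/* typical caller: range check against the address it searched for */
if (!vma || addr < vma->vm_start || addr + len > vma->vm_end)
        return -EINVAL; /* spuriously fails if 'addr' still carries a tag */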
Are there any other places where this matters? These patches used find_vma() as a heuristic to track down such places, but we may need better static analysis to identify other cases.
We need something that is going to work everywhere. And it needs to be supported by the kernel for the simple reason that the kernel needs to do MTE checks during copy_from_user(): having that information stripped means we lose any userspace-assigned MTE protections if they get handled by the kernel, which is a total non-starter, IMO.
Such feedback is welcomed ;).
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
I tried to drag the SPARC guys into this discussion but without much success.
This is a bit of a chicken-and-egg problem. In a world where memory allocators on one or several popular platforms generate pointers with non-zero tags, any such breakage will be caught in testing. Unfortunately to reach that state we need the kernel to start accepting tagged pointers first, and then hold on for a couple of years until userspace catches up.
Would the kernel also catch up with providing a stable ABI? Because we have two moving targets.
On one hand, you have Android or some Linux distro that sticks to a stable kernel version for some time, so they have a better chance of clearing most of the problems. On the other hand, the mainline kernel gains over 500K lines every release. As maintainer, I can't rely on my testing alone as this covers a limited number of platforms. So my concern is that every kernel release has a significant chance of breaking the ABI, unless we have a better way of identifying potential issues.
I just want to make sure I fully understand your concern about this being an ABI break, and I work best with examples. The closest situation I can see would be:
- some program has no idea about MTE
Apart from some libraries like libc (and maybe those that handle specific device ioctls), I think most programs should have no idea about MTE. I wouldn't expect programmers to have to change their app just because we have a new feature that colours heap allocations.
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work
That's one aspect, though it's probably more a case of plugging in a new device (graphics card, network etc.) where the ioctl() to the new device doesn't work.
The other is that, assuming we reach a point where the kernel entirely supports this relaxed ABI, can we guarantee that it won't break in the future? Let's say some subsequent kernel change (some refactoring) misses out an untagged_addr(). This renders a previously TBI/MTE-capable syscall unusable. Can we rely only on testing?
The trouble I see with this is that it is largely theoretical and requires part of userspace to collude to start using a new CPU feature that tickles a bug in the kernel. As I understand the golden rule, this is a bug in the kernel (a missed ioctl() or such) to be fixed, not a global breaking of some userspace behavior.
Yes, we should follow the rule that it's a kernel bug but it doesn't help the user that a newly installed kernel causes user space to no longer reach a prompt. Hence the proposal of an opt-in via personality (for MTE we would need an explicit opt-in by the user anyway since the top byte is no longer ignored but checked against the allocation tag).
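Whether via personality or prctl(), the opt-in eventually took roughly this shape (constants from the later tagged-address ABI work, shown purely for illustration; older kernels simply fail the prctl()):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
#endif

int main(void)
{
        /* Ask the kernel to accept tagged user pointers for this task. */
        if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0)) {
                perror("prctl");        /* no relaxed ABI: don't tag pointers */
                return 1;
        }
        return 0;
}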
I feel like I'm missing something about this being seen as an ABI break. The kernel already fails on userspace addresses that have high bits set -- are there things that _depend_ on this failure to operate?
It's about providing a relaxed ABI which allows non-zero top byte and breaking it later inadvertently without having something better in place to analyse the kernel changes.
Thanks.
On Wed, May 22, 2019 at 3:11 AM Catalin Marinas catalin.marinas@arm.com wrote:
Hi Kees,
Thanks for joining the thread ;).
I just want to make sure I fully understand your concern about this being an ABI break, and I work best with examples. The closest situation I can see would be:
- some program has no idea about MTE
Apart from some libraries like libc (and maybe those that handle specific device ioctls), I think most programs should have no idea about MTE. I wouldn't expect programmers to have to change their app just because we have a new feature that colours heap allocations.
obviously i'm biased as a libc maintainer, but...
i don't think it helps to move this to libc --- now you just have an extra dependency where to have a guaranteed working system you need to update your kernel and libc together. (or at least update your libc to understand new ioctls etc _before_ you can update your kernel.)
The trouble I see with this is that it is largely theoretical and requires part of userspace to collude to start using a new CPU feature that tickles a bug in the kernel. As I understand the golden rule, this is a bug in the kernel (a missed ioctl() or such) to be fixed, not a global breaking of some userspace behavior.
Yes, we should follow the rule that it's a kernel bug but it doesn't help the user that a newly installed kernel causes user space to no longer reach a prompt. Hence the proposal of an opt-in via personality (for MTE we would need an explicit opt-in by the user anyway since the top byte is no longer ignored but checked against the allocation tag).
but realistically would this actually get used in this way? or would any given system either be MTE or non-MTE? in which case a kernel configuration option would seem to make more sense. (because either way, the hypothetical user basically needs to recompile the kernel to get back on their feet. or all of userspace.)
i'm not sure i see this new way for a kernel update to break my system and need to be fixed forward/rolled back as any different from any of the existing ways in which this can happen :-) as an end-user i have to rely on whoever's sending me software updates to test adequately enough that they find the problems. as an end user, there isn't any difference between "my phone rebooted when i tried to take a photo because of a kernel/driver leak", say, and "my phone rebooted when i tried to take a photo because of missing untagging of a pointer passed via ioctl".
i suspect you and i have very different people in mind when we say "user" :-)
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
obviously i'm biased as a libc maintainer, but...
i don't think it helps to move this to libc --- now you just have an extra dependency where to have a guaranteed working system you need to update your kernel and libc together. (or at least update your libc to understand new ioctls etc _before_ you can update your kernel.)
That's not what I meant (or I misunderstood you). If we have a relaxed ABI in the kernel and a libc that returns tagged pointers on malloc() I wouldn't expect the programmer to do anything different in the application code like explicit untagging. Basically the program would continue to run unmodified irrespective of whether you use an old libc without tagged pointers or a new one which tags heap allocations.
What I do expect is that the libc checks for the presence of the relaxed ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a HWCAP_MTE), and only tag the malloc() pointers if the kernel supports the relaxed ABI. As you said, you shouldn't expect that the C library and kernel are upgraded together, so they should be able to work in any new/old version combination.
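A sketch of that allocator-side check (the AT_FLAGS bit position is purely illustrative; the real value was still being settled in the ABI series):

#include <stdbool.h>
#include <sys/auxv.h>

#define TBI_ABI_FLAG    (1UL << 0)      /* hypothetical bit position */

static bool kernel_accepts_tagged_pointers(void)
{
        /* Tag malloc() results only if the kernel advertises the
         * relaxed ABI; otherwise hand out plain pointers. */
        return getauxval(AT_FLAGS) & TBI_ABI_FLAG;
}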
Yes, we should follow the rule that it's a kernel bug but it doesn't help the user that a newly installed kernel causes user space to no longer reach a prompt. Hence the proposal of an opt-in via personality (for MTE we would need an explicit opt-in by the user anyway since the top byte is no longer ignored but checked against the allocation tag).
but realistically would this actually get used in this way? or would any given system either be MTE or non-MTE? in which case a kernel configuration option would seem to make more sense. (because either way, the hypothetical user basically needs to recompile the kernel to get back on their feet. or all of userspace.)
The two hard requirements I have for supporting any new hardware feature in Linux are (1) a single kernel image binary continues to run on old hardware while making use of the new feature if available and (2) old user space continues to run on new hardware while new user space can take advantage of the new feature.
The distro user space usually has a hard requirement that it continues to run on (certain) old hardware. We can't enforce this in the kernel but we offer the option to user space developers of checking feature availability through HWCAP bits.
The Android story may be different as you have more control over which kernel configurations are deployed on specific SoCs. I'm looking more from a Linux distro angle, where you just get an off-the-shelf OS image and install it on your hardware, either taking advantage of new features or just not using them if the software was not updated. Or, if updated software is installed on old hardware, it would just run.
For MTE, we just can't enable it by default since there are applications that use the top byte of a pointer and expect it to be ignored rather than failing with a mismatched tag. Just think of a hwasan compiled binary where TBI is expected to work and you try to run it with MTE turned on.
I would also expect the C library or dynamic loader to check for the presence of a HWCAP_MTE bit before starting to tag memory allocations, otherwise it would get SIGILL on the first MTE instruction it tries to execute.
i'm not sure i see this new way for a kernel update to break my system and need to be fixed forward/rolled back as any different from any of the existing ways in which this can happen :-) as an end-user i have to rely on whoever's sending me software updates to test adequately enough that they find the problems. as an end user, there isn't any difference between "my phone rebooted when i tried to take a photo because of a kernel/driver leak", say, and "my phone rebooted when i tried to take a photo because of missing untagging of a pointer passed via ioctl".
i suspect you and i have very different people in mind when we say "user" :-)
Indeed, I think we have different users in mind. I didn't mean the end user who doesn't really care which C library version their phone is running, but rather advanced users (not necessarily kernel developers) who prefer to build their own kernels with every release. We could extend this to kernel developers who don't have time to track down why a new kernel triggers lots of SIGSEGVs during boot.
On Wed, May 22, 2019 at 9:35 AM Catalin Marinas catalin.marinas@arm.com wrote:
That's not what I meant (or I misunderstood you). If we have a relaxed ABI in the kernel and a libc that returns tagged pointers on malloc() I wouldn't expect the programmer to do anything different in the application code like explicit untagging. Basically the program would continue to run unmodified irrespective of whether you use an old libc without tagged pointers or a new one which tags heap allocations.
What I do expect is that the libc checks for the presence of the relaxed ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a HWCAP_MTE), and only tag the malloc() pointers if the kernel supports the relaxed ABI. As you said, you shouldn't expect that the C library and kernel are upgraded together, so they should be able to work in any new/old version combination.
yes, that part makes sense. i do think we'd use the AT_FLAGS bit, for exactly this.
i was questioning the argument about the ioctl issues, and saying that from my perspective, untagging bugs are not really any different than any other kind of kernel bug.
I would also expect the C library or dynamic loader to check for the presence of a HWCAP_MTE bit before starting to tag memory allocations, otherwise it would get SIGILL on the first MTE instruction it tries to execute.
(a bit off-topic, but i thought the MTE instructions were encoded in the no-op space, to avoid this?)
Indeed, I think we have different users in mind. I didn't mean the end user who doesn't really care which C library version their phone is running, but rather advanced users (not necessarily kernel developers) who prefer to build their own kernels with every release. We could extend this to kernel developers who don't have time to track down why a new kernel triggers lots of SIGSEGVs during boot.
i still don't see how this isn't just a regular testing/CI issue, the same as any other kind of kernel bug. it's already the case that i can get a bad kernel...
On Wed, May 22, 2019 at 09:58:22AM -0700, enh wrote:
i was questioning the argument about the ioctl issues, and saying that from my perspective, untagging bugs are not really any different than any other kind of kernel bug.
Once this series gets in, they are indeed just kernel bugs. What I want is an easier way to identify them, ideally before they trigger in the field.
i still don't see how this isn't just a regular testing/CI issue, the same as any other kind of kernel bug. it's already the case that i can get a bad kernel...
The testing would have smaller code coverage in terms of drivers and filesystems than something like a static checker (though one does not exclude the other).
On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
The two hard requirements I have for supporting any new hardware feature in Linux are (1) a single kernel image binary continues to run on old hardware while making use of the new feature if available and (2) old user space continues to run on new hardware while new user space can take advantage of the new feature.
Agreed! And I think the series meets these requirements, yes?
For MTE, we just can't enable it by default since there are applications who use the top byte of a pointer and expect it to be ignored rather than failing with a mismatched tag. Just think of a hwasan compiled binary where TBI is expected to work and you try to run it with MTE turned on.
Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI conflicting with MTE. And anything that starts using TBI suddenly can't run in the future because it's being interpreted as MTE bits? (Is that the ABI concern? I feel like we got into the weeds about ioctl()s and one-off bugs...)
So there needs to be some way to let the kernel know which of three things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
3- wiping the top bits for most things, but retaining them for MTE as needed (the future)
I expect MTE to be the "default" in the future. Once a system's libc has grown support for it, everything will be trying to use MTE. TBI will be the special case (but TBI is effectively a prerequisite).
AFAICT, the only difference I see between 2 and 3 will be the tag handling in usercopy (all other places will continue to ignore the top bits). Is that accurate?
Is "1" a per-process state we want to keep? (I assume not, but rather it is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
To choose between "2" and "3", it seems we need a per-process flag to opt into TBI (and out of MTE). For userspace, how would a future binary choose TBI over MTE? If it's a library issue, we can't use an ELF bit, since the choice may be "late", after ELF load (this implies the need for a prctl()). If it's binary-only ("built with HWASan") then an ELF bit seems sufficient. And without the marking, I'd expect the kernel to enforce MTE when there are high bits.
I would also expect the C library or dynamic loader to check for the presence of a HWCAP_MTE bit before starting to tag memory allocations, otherwise it would get SIGILL on the first MTE instruction it tries to execute.
I've got the same question as Elliot: aren't MTE instructions just NOPs to older CPUs? I.e. if the CPU (or kernel) doesn't support it, it just gets entirely ignored: checking is only needed to satisfy curiosity or behavioral expectations.
To me, the conflict seems to be using TBI in the face of expecting MTE to be the default state of the future. (But the internal changes needed for TBI -- this series -- is a prereq for MTE.)
On Wed, May 22, 2019 at 1:47 PM Kees Cook keescook@chromium.org wrote:
I've got the same question as Elliot: aren't MTE instructions just NOPs to older CPUs? I.e. if the CPU (or kernel) doesn't support it, it just gets entirely ignored: checking is only needed to satisfy curiosity or behavioral expectations.
MTE instructions are not NOPs. Most of them have side effects (changing register values, zeroing memory). This only matters for stack tagging, though. Heap tagging is a runtime decision in the allocator.
If an image needs to run on old hardware, it will have to do heap tagging only.
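That runtime decision could be sketched with the ACLE MTE intrinsics (assumes a compiler targeting armv8.5-a+memtag; mte_available would be filled in once at startup from the HWCAP):

#include <arm_acle.h>   /* __arm_mte_* intrinsics */
#include <stdbool.h>

extern bool mte_available;      /* set at init from getauxval() */

/* Allocator hook: colour a fresh 16-byte granule only when MTE exists. */
static void *maybe_tag_granule(void *p)
{
        if (!mte_available)
                return p;       /* old hardware: no MTE instructions run */
        p = __arm_mte_create_random_tag(p, 0);  /* pick a random tag (IRG) */
        __arm_mte_set_tag(p);                   /* store it to memory (STG) */
        return p;
}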
On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov eugenis@google.com wrote:
MTE instructions are not NOPs. Most of them have side effects (changing register values, zeroing memory).
no, i meant "they're encoded in a space that was previously no-ops, so running MTE code on old hardware doesn't cause SIGILL".
On Wed, May 22, 2019 at 04:09:31PM -0700, enh wrote:
no, i meant "they're encoded in a space that was previously no-ops, so running MTE code on old hardware doesn't cause SIGILL".
It does result in SIGILL; there wasn't enough encoding space left in the NOP space for old/current CPU implementations (in hindsight, we should have reserved a bigger NOP space).
As Evgenii said, the libc needs to be careful when tagging the heap as it would cause a SIGILL if HWCAP_MTE is not set. The standard application doesn't need to be recompiled as it would not issue MTE colouring instructions, just standard LDR/STR.
Stack tagging is problematic if you want to colour each frame individually, as the function prologue would need the non-NOP MTE instructions. The best we can do here is to give the (thread) stacks different colours.
On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
The two hard requirements I have for supporting any new hardware feature in Linux are (1) a single kernel image binary continues to run on old hardware while making use of the new feature if available and (2) old user space continues to run on new hardware while new user space can take advantage of the new feature.
Agreed! And I think the series meets these requirements, yes?
Yes. I mentioned this just to make sure people don't expect different kernel builds for different hardware features.
There is also the obvious requirement which I didn't mention: new user space continues to run on new/subsequent kernel versions. That's one of the points of contention for this series (ignoring MTE), with the maintainers having to guarantee this without much effort. IOW, do the 500K+ new lines in a subsequent kernel version break any user space out there? I'm only talking about the relaxed TBI ABI. Are the usual LTP and syzkaller runs sufficient? Better static analysis would definitely help.
Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI conflicting with MTE. And anything that starts using TBI suddenly can't run in the future because it's being interpreted as MTE bits? (Is that the ABI concern?
That's another aspect to figure out when we add the MTE support. I don't think we'd be able to do this without an explicit opt-in by the user.
Or, if we ever want MTE to be turned on by default (i.e. tag checking), even if everything is tagged with 0, we have to disallow TBI for user space, and this includes hwasan. There were a small number of programs using TBI (I think some JavaScript compilers tried this). But now we are bringing in the hwasan support and this can be a large user base. Shall we add an ELF note for such binaries that use TBI/hwasan?
This series is still required for MTE but we may decide not to relax the ABI blindly, therefore the opt-in (prctl) or personality idea.
I feel like we got into the weeds about ioctl()s and one-off bugs...)
This needs solving as well. Most driver developers won't know why untagged_addr() is needed unless we have more rigorous types or type annotations and a tool to check them (we should probably revive the old sparse thread).
So there needs to be some way to let the kernel know which of three things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
(I'd say tolerating rather than wiping since get_user still uses the tag in the current series)
The current series does not allow any choice between 1 and 2, the default ABI basically becomes option 2.
3- wiping the top bits for most things, but retaining them for MTE as needed (the future)
2 and 3 are not entirely compatible as a tagged pointer may be checked against the memory colour by the hardware. So you can't have a hwasan binary with MTE enabled.
I expect MTE to be the "default" in the future. Once a system's libc has grown support for it, everything will be trying to use MTE. TBI will be the special case (but TBI is effectively a prerequisite).
The kernel handling of tagged pointers is indeed a prerequisite. The ABI distinction between the above 2 and 3 needs to be solved.
AFAICT, the only difference I see between 2 and 3 will be the tag handling in usercopy (all other places will continue to ignore the top bits). Is that accurate?
Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan binary, it will SEGFAULT (either in user space or in kernel uaccess). How does the kernel choose between 2 and 3?
Is "1" a per-process state we want to keep? (I assume not, but rather it is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
Possibly, though not necessarily per process. For testing or if something goes wrong during boot, a command line option with a static label would do. The AT_FLAGS bit needs to be checked by user space. My preference would be per-process.
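One plausible shape for that, combining the per-process flag with the existing limit check (a simplified sketch; the flag and config names are assumptions, not final code):

/* arm64 uaccess sketch: untag only for tasks that opted in */
static inline bool range_ok(const void __user *addr, unsigned long size)
{
        unsigned long a = (unsigned long)addr;

        if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI) &&
            test_thread_flag(TIF_TAGGED_ADDR))
                a = untagged_addr(a);

        /* overflow handling elided for brevity */
        return a + size <= current_thread_info()->addr_limit;
}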
To choose between "2" and "3", it seems we need a per-process flag to opt into TBI (and out of MTE).
Or leave option 2 the default and get it to opt in to MTE.
For userspace, how would a future binary choose TBI over MTE? If it's a library issue, we can't use an ELF bit, since the choice may be "late", after ELF load (this implies the need for a prctl()). If it's binary-only ("built with HWASan") then an ELF bit seems sufficient. And without the marking, I'd expect the kernel to enforce MTE when there are high bits.
The current plan is that a future binary issues a prctl(), after checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions are not in the current NOP space). I'd expect this to be done by the libc or dynamic loader under the assumption that the binaries it loads do _not_ use the top pointer byte for anything else. With hwasan compiled objects this gets more confusing (any ELF note to identify them?).
(there is also the risk of existing applications using TBI already but I'm not aware of any still using this feature other than hwasan)
On Thu, May 23, 2019 at 7:45 AM Catalin Marinas catalin.marinas@arm.com wrote:
The current plan is that a future binary issues a prctl(), after checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions are not in the current NOP space). I'd expect this to be done by the libc or dynamic loader under the assumption that the binaries it loads do _not_ use the top pointer byte for anything else.
yeah, it sounds like to support hwasan and MTE, the dynamic linker will need to not use either itself.
With hwasan compiled objects this gets more confusing (any ELF note to identify them?).
no, at the moment code that wants to know checks for the presence of __hwasan_init. (and bionic doesn't actually look at any ELF notes right now.) but we can always add something if we need to.
On Thu, May 23, 2019 at 08:44:12AM -0700, enh wrote:
no, at the moment code that wants to know checks for the presence of __hwasan_init. (and bionic doesn't actually look at any ELF notes right now.) but we can always add something if we need to.
It's a userspace decision to make. In the kernel, we are proposing that bionic calls a prctl() to enable MTE explicitly. It could first check for the presence of __hwasan_init.
On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
There is also the obvious requirement which I didn't mention: new user space continues to run on new/subsequent kernel versions. That's one of the points of contention for this series (ignoring MTE), with the maintainers having to guarantee this without much effort. IOW, do the 500K+ new lines in a subsequent kernel version break any user space out there? I'm only talking about the relaxed TBI ABI. Are the usual LTP and syzkaller runs sufficient? Better static analysis would definitely help.
We can't have perfect coverage of people actively (or accidentally) working to trick static analyzers (and the compiler) into "forgetting" about a __user annotation. We can certainly improve analysis (I see the other sub-thread on that), but I'd like that work not to block this series.
What on this front would you be comfortable with? Given it's a new feature, isn't it sufficient to have a CONFIG (and/or boot option)?
Or, if we ever want MTE to be turned on by default (i.e. tag checking), even if everything is tagged with 0, we have to disallow TBI for user space, and this includes hwasan. There were a small number of programs using TBI (I think some JavaScript compilers tried this). But now we are bringing in the hwasan support and this can be a large user base. Shall we add an ELF note for such binaries that use TBI/hwasan?
Just to be clear, you say "disallow TBI for user" -- you mean a particular process, yes? i.e. there is no architectural limitation that says once we're using MTE nothing can switch to TBI. i.e. a process is either doing MTE or TBI (or nothing, but that's the same as TBI).
This needs solving as well. Most driver developers won't know why untagged_addr() is needed unless we have more rigorous types or type annotations and a tool to check them (we should probably revive the old sparse thread).
This seems like a parallel concern: we can do that separately from this series. Without landing it, it is much harder for people to test it, look for bugs, help with types/annotations, etc.
So there needs to be some way to let the kernel know which of three things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
(I'd say tolerating rather than wiping since get_user still uses the tag in the current series)
The current series does not allow any choice between 1 and 2, the default ABI basically becomes option 2.
What about testing tools that intentionally insert high bits for syscalls and are _expecting_ them to fail? It seems the TBI series will break them? In that case, do we need to opt into TBI as well?
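Such a test would look something like this sketch: pass a deliberately tagged pointer and assert the old EFAULT behaviour, which the relaxed ABI would change:

#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 2;

        char *tagged = (char *)((uint64_t)p | (0x5bULL << 56));

        /* Pre-TBI kernels reject the tagged address with EFAULT; with
         * the relaxed ABI this write would succeed instead. */
        ssize_t n = write(STDOUT_FILENO, tagged, 16);
        return (n < 0 && errno == EFAULT) ? 0 : 1;
}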
3- wiping the top bits for most things, but retaining them for MTE as needed (the future)
2 and 3 are not entirely compatible as a tagged pointer may be checked against the memory colour by the hardware. So you can't have a hwasan binary with MTE enabled.
Right: a process must be either MTE or TBI, not both.
I expect MTE to be the "default" in the future. Once a system's libc has grown support for it, everything will be trying to use MTE. TBI will be the special case (but TBI is effectively a prerequisite).
The kernel handling of tagged pointers is indeed a prerequisite. The ABI distinction between the above 2 and 3 needs to be solved.
Does that need solving now or when the MTE series appears? As there is no reason to distinguish between "normal" and "TBI", that doesn't seem to need solving at this point?
AFAICT, the only difference I see between 2 and 3 will be the tag handling in usercopy (all other places will continue to ignore the top bits). Is that accurate?
Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan binary, it will SEGFAULT (either in user space or in kernel uaccess). How does the kernel choose between 2 and 3?
Right -- that was my question as well.
Is "1" a per-process state we want to keep? (I assume not, but rather it is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
Possibly, though not necessarily per process. For testing or if something goes wrong during boot, a command line option with a static label would do. The AT_FLAGS bit needs to be checked by user space. My preference would be per-process.
I would agree.
To choose between "2" and "3", it seems we need a per-process flag to opt into TBI (and out of MTE).
Or leave option 2 the default and get it to opt in to MTE.
Given that MTE has to "start" at some point in the binary lifetime, I'm fine with opting into MTE. I do expect, though, this will feel redundant in a couple years as everything will immediately opt-in. But, okay, this is therefore an issue for the MTE series.
The current plan is that a future binary issues a prctl(), after checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions are not in the current NOP space). I'd expect this to be done by the libc or dynamic loader under the assumption that the binaries it loads do _not_ use the top pointer byte for anything else. With hwasan compiled objects this gets more confusing (any ELF note to identify them?).
Okay, sounds fine.
(there is also the risk of existing applications using TBI already but I'm not aware of any still using this feature other than hwasan)
Correct.
Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the kernel more easily. (But this seems like a parallel problem.)
- we might need to opt in to TBI with a prctl()
- all other concerns are for the future MTE series (though it sounds like HWCAP_MTE and a prctl() solve those issues too).
Is this accurate? What do you see as the blockers for this series at this point?
On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
We can't have perfect coverage of people actively (or accidentally) working to trick static analyzers (and the compiler) into "forgetting" about a __user annotation. We can certainly improve analysis (I see the other sub-thread on that), but I'd like that work not to block this series.
I don't want to block this series either as it's a prerequisite for MTE but at the moment I'm not confident we tracked down all the places where things can break and what the future effort will be to analyse new kernel changes.
We've made lots of progress since March last year (I think when the first version was posted) by both identifying new places in the kernel that need untagging and limiting the ranges that user space can tag (e.g. an mmap() on a device should not allow tagged pointers). Can we say we are done, or is more investigation needed?
What on this front would you be comfortable with? Given it's a new feature, isn't it sufficient to have a CONFIG (and/or boot option)?
I'd rather avoid re-building kernels. A boot option would do, unless we see value in a per-process (inherited) personality or prctl. The per-process option has the slight advantage that I can boot my machine normally while being able to run LTP with a relaxed ABI (and a new libc which tags pointers). I admit I don't have a very strong argument on a per-process opt-in.
Another option would be a sysctl control together with a cmdline option.
Or, if we ever want MTE to be turned on by default (i.e. tag checking), even if everything is tagged with 0, we have to disallow TBI for user space, and this includes hwasan. There were a small number of programs using TBI (I think some JavaScript compilers tried it). But now we are bringing in hwasan support and this can be a large user base. Shall we add an ELF note for such binaries that use TBI/hwasan?
Just to be clear, you say "disallow TBI for user" -- you mean a particular process, yes? i.e. there is no architectural limitation that says once we're using MTE nothing can switch to TBI. i.e. a process is either doing MTE or TBI (or nothing, but that's the same as TBI).
I may have answered this below. The architecture does not allow TBI (i.e. faults) if MTE is enabled for a process and address range.
So there needs to be some way to let the kernel know which of three things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
(I'd say tolerating rather than wiping since get_user still uses the tag in the current series)
The current series does not allow any choice between 1 and 2; the default ABI basically becomes option 2.
What about testing tools that intentionally insert high bits for syscalls and are _expecting_ them to fail? It seems the TBI series will break them? In that case, do we need to opt into TBI as well?
If there are such tools, then we may need a per-process control. It's basically an ABI change.
3- wiping the top bits for most things, but retaining them for MTE as needed (the future)
2 and 3 are not entirely compatible as a tagged pointer may be checked against the memory colour by the hardware. So you can't have hwasan binary with MTE enabled.
Right: a process must be either MTE or TBI, not both.
Indeed.
I expect MTE to be the "default" in the future. Once a system's libc has grown support for it, everything will be trying to use MTE. TBI will be the special case (but TBI is effectively a prerequisite).
The kernel handling of tagged pointers is indeed a prerequisite. The ABI distinction between the above 2 and 3 needs to be solved.
Does that need solving now or when the MTE series appears? As there is no reason to distinguish between "normal" and "TBI", that doesn't seem to need solving at this point?
We don't need to solve 2 vs 3 as long as option 3 is an explicit opt-in (prctl) by the user. Otherwise the kernel cannot guess whether the user is using TBI or not (and if it does and still opts in to MTE, we can say it's a user problem ;)).
To choose between "2" and "3", it seems we need a per-process flag to opt into TBI (and out of MTE).
Or leave option 2 the default and get it to opt in to MTE.
Given that MTE has to "start" at some point in the binary lifetime, I'm fine with opting into MTE. I do expect, though, this will feel redundant in a couple years as everything will immediately opt-in. But, okay, this is therefore an issue for the MTE series.
Unless Google phases out hwasan soon ;), they may live in parallel for a while, so a choice of TBI vs MTE.
Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the kernel more easily. (But this seems like a parallel problem.)
Yes, and that we found all (most) cases now. The reason I don't see it as a parallel problem is that, as maintainer, I promise an ABI to user and I'd rather stick to it. I don't want, for example, ncurses to stop working because of some ioctl() rejecting tagged pointers.
If it's just the occasional one-off bug I'm fine to deal with it. But no-one convinced me yet that this is the case.
As for the generic driver code (filesystems or other subsystems), without some clear direction for developers, together with static checking/sparse, on how user pointers are cast to longs (one example), it would become my responsibility to identify and fix them up with any kernel release. This series is not providing such guidance, just adding untagged_addr() in some places that we think matter.
- we might need to opt in to TBI with a prctl()
Yes, although still up for discussion.
- all other concerns are for the future MTE series (though it sounds like HWCAP_MTE and a prctl() solve those issues too).
Indeed.
Is this accurate? What do you see as the blockers for this series at this point?
Well, see my comment on your first bullet point above ;).
And thanks for summarising it.
On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
What on this front would you be comfortable with? Given it's a new feature isn't it sufficient to have a CONFIG (and/or boot option)?
I'd rather avoid re-building kernels. A boot option would do, unless we see value in a per-process (inherited) personality or prctl. [...]
I think I've convinced myself that the normal<->TBI ABI control should be a boot parameter. More below...
What about testing tools that intentionally insert high bits for syscalls and are _expecting_ them to fail? It seems the TBI series will break them? In that case, do we need to opt into TBI as well?
If there are such tools, then we may need a per-process control. It's basically an ABI change.
syzkaller already attempts to randomly inject non-canonical and 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was able to write directly to kernel memory[1].
It seems that using TBI by default and not allowing a switch back to "normal" ABI without a reboot actually means that userspace cannot inject kernel pointers into syscalls any more, since they'll get universally stripped now. Is my understanding correct, here? i.e. exploiting CVE-2017-5123 would be impossible under TBI?
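A quick sketch of that kind of negative test (hypothetical userspace code, not from syzkaller; note that even with TBI untagging the top byte, bits 55..48 keep this address well above TASK_SIZE, so EFAULT is still the expected result):

#include <assert.h>
#include <errno.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	char *bad = (char *)0xffffffffffff0000UL;	/* kernel-range address */

	assert(pipe(fds) == 0);
	assert(write(fds[1], "x", 1) == 1);
	/* With data available, read() reaches its copy_to_user() and the
	 * kernel must reject the bad buffer rather than write through it. */
	assert(read(fds[0], bad, 1) == -1 && errno == EFAULT);
	return 0;
}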
If so, then I think we should commit to the TBI ABI and have a boot flag to disable it, but NOT have a process flag, as that would allow attackers to bypass the masking. The only flag should be "TBI or MTE".
If so, can I get top byte masking for other architectures too? Like, just to strip high bits off userspace addresses? ;)
(Oh, in looking I see this is implemented with sign-extension... why not just a mask? So it'll either be a valid userspace address or forced into the non-canonical range?)
[1] https://salls.github.io/Linux-Kernel-CVE-2017-5123/
Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the kernel more easily. (But this seems like a parallel problem.)
Yes, and that we found all (most) cases now. The reason I don't see it as a parallel problem is that, as maintainer, I promise an ABI to user and I'd rather stick to it. I don't want, for example, ncurses to stop working because of some ioctl() rejecting tagged pointers.
But this is what I don't understand: it would need to be ncurses _using TBI_ that would stop working (having started to work before, but then regressed due to a newly added one-off bug). Regular ncurses will be fine because it's not using TBI. So The Golden Rule isn't violated, and by definition, it's a specific regression caused by some bug (since TBI would have had to have worked _before_ in the situation to be considered a regression now). Which describes the normal path for kernel development... add feature, find corner cases where it doesn't work, fix them, encounter new regressions, fix those, repeat forever.
If it's just the occasional one-off bug I'm fine to deal with it. But no-one convinced me yet that this is the case.
You believe there are still some systemic cases that haven't been found yet? And even if so -- isn't it better to work on that incrementally?
As for the generic driver code (filesystems or other subsystems), without some clear direction for developers, together with static checking/sparse, on how user pointers are cast to longs (one example), it would become my responsibility to identify and fix them up with any kernel release. This series is not providing such guidance, just adding untagged_addr() in some places that we think matter.
What about adding a nice bit of .rst documentation that describes the situation and shows how to use untagged_addr(). This is the kind of kernel-wide change that "everyone" needs to know about, and shouldn't be the arch maintainer's sole responsibility to fix.
- we might need to opt in to TBI with a prctl()
Yes, although still up for discussion.
I think I've talked myself out of it. I say boot param only! :)
So what do you say to these next steps:
- change untagged_addr() to use a static branch that is controlled with a boot parameter.
- add, say, Documentation/core-api/user-addresses.rst to describe proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series.
- continue work to improve static analysis.
Thanks for wading through this with me! :)
On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
What about testing tools that intentionally insert high bits for syscalls and are _expecting_ them to fail? It seems the TBI series will break them? In that case, do we need to opt into TBI as well?
If there are such tools, then we may need a per-process control. It's basically an ABI change.
syzkaller already attempts to randomly inject non-canonical and 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was able to write directly to kernel memory[1].
It seems that using TBI by default and not allowing a switch back to "normal" ABI without a reboot actually means that userspace cannot inject kernel pointers into syscalls any more, since they'll get universally stripped now. Is my understanding correct, here? i.e. exploiting CVE-2017-5123 would be impossible under TBI?
Unless the kernel is also using TBI (khwasan), in which case masking out the top byte wouldn't help. Anyway, as per this discussion, we want the tagged pointer to remain intact all the way to put_user(), so nothing gets masked out. I don't think this would have helped with the waitid() bug.
If so, then I think we should commit to the TBI ABI and have a boot flag to disable it, but NOT have a process flag, as that would allow attackers to bypass the masking. The only flag should be "TBI or MTE".
If so, can I get top byte masking for other architectures too? Like, just to strip high bits off userspace addresses? ;)
But you didn't like my option 2 shim proposal which strips the tag on kernel entry because it lowers the value of MTE ;).
(Oh, in looking I see this is implemented with sign-extension... why not just a mask? So it'll either be a valid userspace address or forced into the non-canonical range?)
The TTBR0/1 selection on memory accesses is done based on bit 63 if TBI is disabled and bit 55 when enabled. Sign-extension allows us to use the same macro for both user and kernel tagged pointers. With MTE tag 0 would be match-all for TTBR0 and 0xff for TTBR1 (so that we don't modify the virtual address space of the kernel; I need to check the latest spec to be sure). Note that the VA space for both user and kernel is limited to 52-bit architecturally so, on access, bits 55..52 must be the same, 0 or 1, otherwise you get a fault.
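To make the sign-extension concrete, a minimal sketch of the untagging operation as a plain C helper (not the kernel's actual macro):

static inline unsigned long untag_addr(unsigned long addr)
{
	/* Drop bits 63..56, then arithmetic-shift right so that bit 55 is
	 * replicated into bits 63..56: user addresses (bit 55 == 0) stay
	 * canonical and kernel-half addresses (bit 55 == 1) stay in the
	 * TTBR1 range, instead of everything being forced into TTBR0. */
	return (unsigned long)((long)(addr << 8) >> 8);
}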
Since the syzkaller tests would also need to set bits 55-52 (actually 48 for kernel addresses, we haven't merged the 52-bit kernel VA patches yet) to hit a valid kernel address, I don't think ignoring the top byte makes much difference to the expected failure scenario.
Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the kernel more easily. (But this seems like a parallel problem.)
Yes, and that we found all (most) cases now. The reason I don't see it as a parallel problem is that, as maintainer, I promise an ABI to user and I'd rather stick to it. I don't want, for example, ncurses to stop working because of some ioctl() rejecting tagged pointers.
But this is what I don't understand: it would need to be ncurses _using TBI_ that would stop working (having started to work before, but then regressed due to a newly added one-off bug). Regular ncurses will be fine because it's not using TBI. So The Golden Rule isn't violated,
Once we introduce TBI and the libc starts tagging heap allocations, this becomes the new "regular" user space behaviour (i.e. using TBI). So a new bug would break the golden rule. It could also be an old bug that went unnoticed (e.g. you change the graphics card and its driver gets confused by tagged pointers coming from user space).
and by definition, it's a specific regression caused by some bug (since TBI would have had to have worked _before_ in the situation to be considered a regression now). Which describes the normal path for kernel development... add feature, find corner cases where it doesn't work, fix them, encounter new regressions, fix those, repeat forever.
If it's just the occasional one-off bug I'm fine to deal with it. But no-one convinced me yet that this is the case.
You believe there are still some systemic cases that haven't been found yet? And even if so -- isn't it better to work on that incrementally?
I want some way to systematically identify potential issues (sparse?). Since problems are most likely in drivers, I don't have all devices to check and not all users have the knowledge to track down why something failed.
I think we can do this incrementally as long as the TBI ABI is not the default. Even better if we made it per process.
As for the generic driver code (filesystems or other subsystems), without some clear direction for developers, together with static checking/sparse, on how user pointers are cast to longs (one example), it would become my responsibility to identify and fix them up with any kernel release. This series is not providing such guidance, just adding untagged_addr() in some places that we think matter.
What about adding a nice bit of .rst documentation that describes the situation and shows how to use untagged_addr(). This is the kind of kernel-wide change that "everyone" needs to know about, and shouldn't be the arch maintainer's sole responsibility to fix.
This works (if people read it) but we also need to be more prescriptive in how casting is done and how we differentiate between a pointer for dereference (T __user *) and address space management (usually unsigned long). On top of that, we'd get sparse to check for such conversions and maybe even checkpatch for some low-hanging fruit.
- we might need to opt in to TBI with a prctl()
Yes, although still up for discussion.
I think I've talked myself out of it. I say boot param only! :)
I hope I talked you in again ;). I don't see TBI as improving kernel security.
So what do you say to these next steps:
- change untagged_addr() to use a static branch that is controlled with a boot parameter.
access_ok() as well.
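A rough sketch of how that could look (the parameter name and exact shape are hypothetical):

#include <linux/bitops.h>
#include <linux/init.h>
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(tbi_relaxed_abi);

static int __init tbi_abi_setup(char *str)
{
	static_branch_enable(&tbi_relaxed_abi);
	return 1;
}
__setup("relaxed_tbi", tbi_abi_setup);	/* hypothetical parameter name */

/* Nearly free when the key is off; sign-extends from bit 55 when on. */
#define untagged_addr(addr)						\
	(static_branch_unlikely(&tbi_relaxed_abi) ?			\
		(u64)sign_extend64((u64)(addr), 55) : (u64)(addr))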
- add, say, Documentation/core-api/user-addresses.rst to describe proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series.
We have u64_to_user_ptr(). What about the reverse? And maybe changing get_user_pages() to take void __user *.
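Such a reverse helper could be as small as the following (the name is made up; as with u64_to_user_ptr(), the temporary exists so sparse complains if a non-__user pointer is passed in):

#define user_ptr_to_u64(ptr)				\
({							\
	const void __user *__uptr = (ptr);		\
	(u64)(unsigned long)__uptr;			\
})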
- continue work to improve static analysis.
Andrew Murray in the ARM kernel team started revisiting the old sparse threads, let's see how it goes.
On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
syzkaller already attempts to randomly inject non-canonical and 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was able to write directly to kernel memory[1].
It seems that using TBI by default and not allowing a switch back to "normal" ABI without a reboot actually means that userspace cannot inject kernel pointers into syscalls any more, since they'll get universally stripped now. Is my understanding correct, here? i.e. exploiting CVE-2017-5123 would be impossible under TBI?
If so, then I think we should commit to the TBI ABI and have a boot flag to disable it, but NOT have a process flag, as that would allow attackers to bypass the masking. The only flag should be "TBI or MTE".
If so, can I get top byte masking for other architectures too? Like, just to strip high bits off userspace addresses? ;)
Just for fun, hack/attempt at your idea which should not interfere with TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is pretty weird ;)):
--------------------------8<---------------------------------
diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h
index 63b46e30b2c3..338455a74eff 100644
--- a/arch/s390/include/asm/compat.h
+++ b/arch/s390/include/asm/compat.h
@@ -11,9 +11,6 @@

 #include <asm-generic/compat.h>

-#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \
-				typeof(0?(__force t)0:0ULL), u64))
-
 #define __SC_DELOUSE(t,v) ({ \
	BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \
	(__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..b1b9fe8502da 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -119,8 +119,15 @@ struct io_uring_params;
 #define __TYPE_IS_L(t)	(__TYPE_AS(t, 0L))
 #define __TYPE_IS_UL(t)	(__TYPE_AS(t, 0UL))
 #define __TYPE_IS_LL(t)	(__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
+#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64))
 #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
+#ifdef CONFIG_64BIT
+#define __SC_CAST(t, a)	(__TYPE_IS_PTR(t)				\
+			 ? (__force t) ((__u64)a & ~(1UL << 55))	\
+			 : (__force t) a)
+#else
 #define __SC_CAST(t, a)	(__force t) a
+#endif
 #define __SC_ARGS(t, a)	a
 #define __SC_TEST(t, a)	(void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))
On Tue, May 28, 2019 at 06:02:45PM +0100, Catalin Marinas wrote:
On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
syzkaller already attempts to randomly inject non-canonical and 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was able to write directly to kernel memory[1].
It seems that using TBI by default and not allowing a switch back to "normal" ABI without a reboot actually means that userspace cannot inject kernel pointers into syscalls any more, since they'll get universally stripped now. Is my understanding correct, here? i.e. exploiting CVE-2017-5123 would be impossible under TBI?
If so, then I think we should commit to the TBI ABI and have a boot flag to disable it, but NOT have a process flag, as that would allow attackers to bypass the masking. The only flag should be "TBI or MTE".
If so, can I get top byte masking for other architectures too? Like, just to strip high bits off userspace addresses? ;)
Just for fun, hack/attempt at your idea which should not interfere with TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is pretty weird ;)):
OMG, this is amazing and bonkers. I love it.
--------------------------8<---------------------------------
diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h
index 63b46e30b2c3..338455a74eff 100644
--- a/arch/s390/include/asm/compat.h
+++ b/arch/s390/include/asm/compat.h
@@ -11,9 +11,6 @@

 #include <asm-generic/compat.h>

-#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \
-				typeof(0?(__force t)0:0ULL), u64))
-
 #define __SC_DELOUSE(t,v) ({ \
	BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \
	(__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..b1b9fe8502da 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -119,8 +119,15 @@ struct io_uring_params;
 #define __TYPE_IS_L(t)	(__TYPE_AS(t, 0L))
 #define __TYPE_IS_UL(t)	(__TYPE_AS(t, 0UL))
 #define __TYPE_IS_LL(t)	(__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
+#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64))
 #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
+#ifdef CONFIG_64BIT
+#define __SC_CAST(t, a)	(__TYPE_IS_PTR(t)				\
+			 ? (__force t) ((__u64)a & ~(1UL << 55))	\
+			 : (__force t) a)
+#else
 #define __SC_CAST(t, a)	(__force t) a
+#endif
 #define __SC_ARGS(t, a)	a
 #define __SC_TEST(t, a)	(void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))
I'm laughing, I'm crying. Now I have to go look at the disassembly to see how this actually looks. (I mean, it _does_ solve my specific case of the waitid() flaw, but wouldn't help with pointers deeper in structs, etc., though TBI, I think, does still help with that. I have to catch back up on the thread...) Anyway, if it's not too expensive it'd block reachability for those kinds of flaws.
I wonder what my chances are of actually getting this landed? :) (Though, I guess I need to find a "VA width" macro instead of a raw 55.)
Thanks for thinking of __SC_CAST() and __TYPE_IS_PTR() together. Really made my day. :)
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
On Wed, May 22, 2019 at 3:11 AM Catalin Marinas catalin.marinas@arm.com wrote:
On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
I just want to make sure I fully understand your concern about this being an ABI break, and I work best with examples. The closest situation I can see would be:
- some program has no idea about MTE
Apart from some libraries like libc (and maybe those that handle specific device ioctls), I think most programs should have no idea about MTE. I wouldn't expect programmers to have to change their app just because we have a new feature that colours heap allocations.
Right -- things should Just Work from the application perspective.
obviously i'm biased as a libc maintainer, but...
i don't think it helps to move this to libc --- now you just have an extra dependency where to have a guaranteed working system you need to update your kernel and libc together. (or at least update your libc to understand new ioctls etc _before_ you can update your kernel.)
I think (hope?) we've all agreed that we shouldn't pass this off to userspace. At the very least, it reduces the utility of MTE, and at worst it complicates userspace when this is clearly a kernel/architecture issue.
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work
That's one aspect, though it's probably more a case of plugging in a new device (graphics card, network, etc.) where the ioctl to the new device doesn't work.
I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will be glitches, and we can disable stuff either via CONFIG or (as is more common now) via a kernel commandline with untagged_addr() containing a static branch, etc. But I actually don't think we need to go this route (see below...)
The other is that, assuming we reach a point where the kernel entirely supports this relaxed ABI, can we guarantee that it won't break in the future. Let's say some subsequent kernel change (some refactoring) misses out an untagged_addr(). This renders a previously TBI/MTE-capable syscall unusable. Can we rely only on testing?
The trouble I see with this is that it is largely theoretical and requires part of userspace to collude to start using a new CPU feature that tickles a bug in the kernel. As I understand the golden rule, this is a bug in the kernel (a missed ioctl() or such) to be fixed, not a global breaking of some userspace behavior.
Yes, we should follow the rule that it's a kernel bug but it doesn't help the user that a newly installed kernel causes user space to no longer reach a prompt. Hence the proposal of an opt-in via personality (for MTE we would need an explicit opt-in by the user anyway since the top byte is no longer ignored but checked against the allocation tag).
but realistically would this actually get used in this way? or would any given system either be MTE or non-MTE. in which case a kernel configuration option would seem to make more sense. (because either way, the hypothetical user basically needs to recompile the kernel to get back on their feet. or all of userspace.)
Right: the point is to design things so that we do our best to not break userspace that is using the new feature (which I think this series has done well). But supporting MTE/TBI is just like supporting PAN: if someone refactors a driver and swaps a copy_from_user() to a memcpy(), it's going to break under PAN. There will be the same long tail of these bugs like any other, but my sense is that they are small and rare. But I agree: they're going to be pretty weird bugs to track down. The final result, however, will be excellent annotation in the kernel for where userspace addresses get used and people make assumptions about them.
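Spelled out, the refactoring hazard looks something like this (hypothetical driver code; struct foo_dev and friends are made up):

#include <linux/uaccess.h>

struct foo_config { int mode; };
struct foo_dev { struct foo_config config; };

static long foo_set_config(struct foo_dev *dev, void __user *argp)
{
	struct foo_config cfg;

	/* Correct: the checked copy works under PAN and goes through the
	 * usual access_ok()/untagging machinery. */
	if (copy_from_user(&cfg, argp, sizeof(cfg)))
		return -EFAULT;

	/*
	 * Broken refactor: a direct dereference of the user pointer,
	 * which faults under PAN and bypasses every check:
	 *	memcpy(&cfg, (__force void *)argp, sizeof(cfg));
	 */
	dev->config = cfg;
	return 0;
}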
The sooner we get the series landed and gain QEMU support (or real hardware), the faster we can hammer out these missed corner-cases. What's the timeline for either of those things, BTW?
I feel like I'm missing something about this being seen as an ABI break. The kernel already fails on userspace addresses that have high bits set -- are there things that _depend_ on this failure to operate?
It's about providing a relaxed ABI which allows a non-zero top byte, and the risk of breaking it later inadvertently without having something better in place to analyse the kernel changes.
It sounds like the question is how to switch a process in or out of this ABI (but I don't think that's the real issue: I think it's just a matter of whether or not a process uses tags at all). Doing it at the prctl() level doesn't make sense to me, except maybe to detect MTE support or something. ("Should I tag allocations?") And that state is controlled by the kernel: the kernel does it or it doesn't.
If a process wants to not tag, that's also up to the allocator where it can decide not to ask the kernel, and just not tag. Nothing breaks in userspace if a process is NOT tagging and untagged_addr() exists or is missing. This, I think, is the core way this doesn't trip over the golden rule: an old system image will run fine (because it's not tagging). A *new* system may encounter bugs with tagging because it's a new feature: this is The Way Of Things. But we don't break old userspace because old userspace isn't using tags.
So the agreement appears to be between the kernel and the allocator. Kernel says "I support this" or not. Telling the allocator to not tag if something breaks sounds like an entirely userspace decision, yes?
On Wed, May 22, 2019 at 12:21 PM Kees Cook keescook@chromium.org wrote:
On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
On Wed, May 22, 2019 at 3:11 AM Catalin Marinas catalin.marinas@arm.com wrote:
On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
I just want to make sure I fully understand your concern about this being an ABI break, and I work best with examples. The closest situation I can see would be:
- some program has no idea about MTE
Apart from some libraries like libc (and maybe those that handle specific device ioctls), I think most programs should have no idea about MTE. I wouldn't expect programmers to have to change their app just because we have a new feature that colours heap allocations.
Right -- things should Just Work from the application perspective.
obviously i'm biased as a libc maintainer, but...
i don't think it helps to move this to libc --- now you just have an extra dependency where to have a guaranteed working system you need to update your kernel and libc together. (or at least update your libc to understand new ioctls etc _before_ you can update your kernel.)
I think (hope?) we've all agreed that we shouldn't pass this off to userspace. At the very least, it reduces the utility of MTE, and at worst it complicates userspace when this is clearly a kernel/architecture issue.
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work
That's one aspect, though it's probably more a case of plugging in a new device (graphics card, network, etc.) where the ioctl to the new device doesn't work.
I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will be glitches, and we can disable stuff either via CONFIG or (as is more common now) via a kernel commandline with untagged_addr() containing a static branch, etc. But I actually don't think we need to go this route (see below...)
The other is that, assuming we reach a point where the kernel entirely supports this relaxed ABI, can we guarantee that it won't break in the future. Let's say some subsequent kernel change (some refactoring) misses out an untagged_addr(). This renders a previously TBI/MTE-capable syscall unusable. Can we rely only on testing?
The trouble I see with this is that it is largely theoretical and requires part of userspace to collude to start using a new CPU feature that tickles a bug in the kernel. As I understand the golden rule, this is a bug in the kernel (a missed ioctl() or such) to be fixed, not a global breaking of some userspace behavior.
Yes, we should follow the rule that it's a kernel bug but it doesn't help the user that a newly installed kernel causes user space to no longer reach a prompt. Hence the proposal of an opt-in via personality (for MTE we would need an explicit opt-in by the user anyway since the top byte is no longer ignored but checked against the allocation tag).
but realistically would this actually get used in this way? or would any given system either be MTE or non-MTE. in which case a kernel configuration option would seem to make more sense. (because either way, the hypothetical user basically needs to recompile the kernel to get back on their feet. or all of userspace.)
Right: the point is to design things so that we do our best to not break userspace that is using the new feature (which I think this series has done well). But supporting MTE/TBI is just like supporting PAN: if someone refactors a driver and swaps a copy_from_user() to a memcpy(), it's going to break under PAN. There will be the same long tail of these bugs like any other, but my sense is that they are small and rare. But I agree: they're going to be pretty weird bugs to track down. The final result, however, will be excellent annotation in the kernel for where userspace addresses get used and people make assumptions about them.
The sooner we get the series landed and gain QEMU support (or real hardware), the faster we can hammer out these missed corner-cases. What's the timeline for either of those things, BTW?
I feel like I'm missing something about this being seen as an ABI break. The kernel already fails on userspace addresses that have high bits set -- are there things that _depend_ on this failure to operate?
It's about providing a relaxed ABI which allows a non-zero top byte, and the risk of breaking it later inadvertently without having something better in place to analyse the kernel changes.
It sounds like the question is how to switch a process in or out of this ABI (but I don't think that's the real issue: I think it's just a matter of whether or not a process uses tags at all). Doing it at the prctl() level doesn't make sense to me, except maybe to detect MTE support or something. ("Should I tag allocations?") And that state is controlled by the kernel: the kernel does it or it doesn't.
If a process wants to not tag, that's also up to the allocator where it can decide not to ask the kernel, and just not tag. Nothing breaks in userspace if a process is NOT tagging and untagged_addr() exists or is missing. This, I think, is the core way this doesn't trip over the golden rule: an old system image will run fine (because it's not tagging). A *new* system may encounter bugs with tagging because it's a new feature: this is The Way Of Things. But we don't break old userspace because old userspace isn't using tags.
So the agreement appears to be between the kernel and the allocator. Kernel says "I support this" or not. Telling the allocator to not tag if something breaks sounds like an entirely userspace decision, yes?
sgtm, and the AT_FLAGS suggestion sounds fine for our needs in that regard.
-- Kees Cook
On Wed, May 22, 2019 at 12:21:27PM -0700, Kees Cook wrote:
If a process wants to not tag, that's also up to the allocator where it can decide not to ask the kernel, and just not tag. Nothing breaks in userspace if a process is NOT tagging and untagged_addr() exists or is missing. This, I think, is the core way this doesn't trip over the golden rule: an old system image will run fine (because it's not tagging). A *new* system may encounter bugs with tagging because it's a new feature: this is The Way Of Things. But we don't break old userspace because old userspace isn't using tags.
With this series and hwasan binaries, at some point in the future they will be considered "old userspace", and they do use pointer tags which they expect to be ignored by both the hardware and the kernel. MTE breaks this assumption.
On 5/21/19 6:04 PM, Kees Cook wrote:
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel.
That part "Kernel does not enable ADI for kernel code." is correct. It is a possible enhancement for future.
-- Khalid
Hi Khalid,
On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
On 5/21/19 6:04 PM, Kees Cook wrote:
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel.
I tried to figure it out but I'm not familiar with the sparc port. How did you solve tagged addresses going into the various syscall implementations in the kernel (e.g. sys_write)? Is the tag removed on kernel entry, or does it end up deeper in the core code?
Thanks.
On 5/23/19 2:11 PM, Catalin Marinas wrote:
Hi Khalid,
On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
On 5/21/19 6:04 PM, Kees Cook wrote:
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel.
I tried to figure it out but I'm not familiar with the sparc port. How did you solve tagged addresses going into the various syscall implementations in the kernel (e.g. sys_write)? Is the tag removed on kernel entry, or does it end up deeper in the core code?
Tag is not removed from the user addresses. Kernel passes tagged addresses to copy_from_user and copy_to_user. MMU checks the tag embedded in the address when kernel accesses userspace addresses. This maintains the ADI integrity even when userspace attempts to access any userspace addresses through system calls.
On sparc, access_ok() is defined as:
#define access_ok(addr, size) __access_ok((unsigned long)(addr), size)
#define __access_ok(addr, size) (__user_ok((addr) & get_fs().seg, (size)))
#define __user_ok(addr, size) ({ (void)(size); (addr) < STACK_TOP; })
STACK_TOP for M7 processor (which is the first sparc processor to support ADI) is 0xfff8000000000000UL. Tagged addresses pass the access_ok() check fine. Any tag mismatches that happen during kernel access to userspace addresses are handled by do_mcd_err().
-- Khalid
On 5/23/19 2:11 PM, Catalin Marinas wrote:
Hi Khalid,
On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
On 5/21/19 6:04 PM, Kees Cook wrote:
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel.
I tried to figure it out but I'm not familiar with the sparc port. How did you solve tagged addresses going into the various syscall implementations in the kernel (e.g. sys_write)? Is the tag removed on kernel entry, or does it end up deeper in the core code?
Another spot I should point out in the ADI patch - tags are not stored in VMAs and the IOMMU does not support ADI tags on M7. ADI tags are stripped before userspace addresses are passed to the IOMMU in the following snippet from the patch:
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 5335ba3c850e..357b6047653a 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
	pgd_t *pgdp;
	int nr = 0;

+#ifdef CONFIG_SPARC64
+	if (adi_capable()) {
+		long addr = start;
+
+		/* If userspace has passed a versioned address, kernel
+		 * will not find it in the VMAs since it does not store
+		 * the version tags in the list of VMAs. Storing version
+		 * tags in list of VMAs is impractical since they can be
+		 * changed any time from userspace without dropping into
+		 * kernel. Any address search in VMAs will be done with
+		 * non-versioned addresses. Ensure the ADI version bits
+		 * are dropped here by sign extending the last bit before
+		 * ADI bits. IOMMU does not implement version tags.
+		 */
+		addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
+		start = addr;
+	}
+#endif
	start &= PAGE_MASK;
	addr = start;
	len = (unsigned long) nr_pages << PAGE_SHIFT;
-- Khalid
On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
On 5/23/19 2:11 PM, Catalin Marinas wrote:
On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
On 5/21/19 6:04 PM, Kees Cook wrote:
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel.
I tried to figure it out but I'm not familiar with the sparc port. How did you solve tagged addresses going into the various syscall implementations in the kernel (e.g. sys_write)? Is the tag removed on kernel entry, or does it end up deeper in the core code?
Another spot I should point out in the ADI patch - tags are not stored in VMAs and the IOMMU does not support ADI tags on M7. ADI tags are stripped before userspace addresses are passed to the IOMMU in the following snippet from the patch:
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 5335ba3c850e..357b6047653a 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
	pgd_t *pgdp;
	int nr = 0;

+#ifdef CONFIG_SPARC64
+	if (adi_capable()) {
+		long addr = start;
+
+		/* If userspace has passed a versioned address, kernel
+		 * will not find it in the VMAs since it does not store
+		 * the version tags in the list of VMAs. Storing version
+		 * tags in list of VMAs is impractical since they can be
+		 * changed any time from userspace without dropping into
+		 * kernel. Any address search in VMAs will be done with
+		 * non-versioned addresses. Ensure the ADI version bits
+		 * are dropped here by sign extending the last bit before
+		 * ADI bits. IOMMU does not implement version tags.
+		 */
+		addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
+		start = addr;
+	}
+#endif
	start &= PAGE_MASK;
	addr = start;
	len = (unsigned long) nr_pages << PAGE_SHIFT;
Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so you fix this case here. If we add the generic untagged_addr() macro in the generic code, I think sparc can start making use of it rather than open-coding the shifts.
There are a few other places where tags can leak and the core code would get confused (for example, madvise()). I presume your user space doesn't exercise them. On arm64 we plan to just allow the C library to tag any new memory allocation, so those core code paths would need to be covered.
And similarly, devices, IOMMU, any DMA would ignore tags.
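The generic hook being discussed could be as simple as this (a sketch of the pattern, not the eventual patch):

/* In a generic header: a no-op unless the architecture overrides it,
 * so common code (gup, madvise(), find_vma() callers) can use it
 * unconditionally. */
#ifndef untagged_addr
#define untagged_addr(addr)	(addr)
#endif

/* arm64 would then provide something like:
 *	#define untagged_addr(addr)	sign_extend64((u64)(addr), 55)
 * and sparc could express its adi_nbits() shifts the same way. */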
On 5/24/19 4:11 AM, Catalin Marinas wrote:
On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
On 5/23/19 2:11 PM, Catalin Marinas wrote:
On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
On 5/21/19 6:04 PM, Kees Cook wrote:
As an aside: I think Sparc ADI support in Linux actually side-stepped this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI for kernel code.") I think this was a mistake we should not repeat for arm64 (we do seem to be at least in agreement about this, I think).
That is a very early version of the sparc ADI patch. Support for tagged addresses in syscalls was added in later versions and is in the patch that is in the kernel.
I tried to figure it out but I'm not familiar with the sparc port. How did you solve tagged addresses going into the various syscall implementations in the kernel (e.g. sys_write)? Is the tag removed on kernel entry, or does it end up deeper in the core code?
Another spot I should point out in the ADI patch - tags are not stored in VMAs and the IOMMU does not support ADI tags on M7. ADI tags are stripped before userspace addresses are passed to the IOMMU in the following snippet from the patch:
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 5335ba3c850e..357b6047653a 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
	pgd_t *pgdp;
	int nr = 0;

+#ifdef CONFIG_SPARC64
+	if (adi_capable()) {
+		long addr = start;
+
+		/* If userspace has passed a versioned address, kernel
+		 * will not find it in the VMAs since it does not store
+		 * the version tags in the list of VMAs. Storing version
+		 * tags in list of VMAs is impractical since they can be
+		 * changed any time from userspace without dropping into
+		 * kernel. Any address search in VMAs will be done with
+		 * non-versioned addresses. Ensure the ADI version bits
+		 * are dropped here by sign extending the last bit before
+		 * ADI bits. IOMMU does not implement version tags.
+		 */
+		addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
+		start = addr;
+	}
+#endif
	start &= PAGE_MASK;
	addr = start;
	len = (unsigned long) nr_pages << PAGE_SHIFT;
Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so you fix this case here. If we add the generic untagged_addr() macro in the generic code, I think sparc can start making use of it rather than open-coding the shifts.
Hi Catalin,
Yes, that will be good. Right now addresses are untagged in sparc code in only two spots, but that will expand as we expand the use of tags. A scalable solution is definitely better.
There are a few other places where tags can leak and the core code would get confused (for example, madvise()). I presume your user space doesn't exercise them. On arm64 we plan to just allow the C library to tag any new memory allocation, so those core code paths would need to be covered.
And similarly, devices, IOMMU, any DMA would ignore tags.
Right. You are doing a lot more with tags than the sparc code intended to do. I had looked into implementing just malloc(), mmap() and possibly shmat() in a library that automatically tags pointers. Expanding tags to any pointers in the C library will require covering a lot more paths in the kernel.
-- Khalid
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
1. Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
2. Should I make untagging opt-in and controlled by a command line argument?
3. Should I "add Documentation/core-api/user-addresses.rst to describe proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series"? Which examples specifically should it cover?
Is there something else?
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
It absolutely needs to move to common code. Having arch code leads to pointless (often unintentional) semantic differences between architectures, and lots of boilerplate code.
Btw, can anyone from the arm crowd or Khalid comment on the linux-mm thread on generic gup where I'm dealing with the pre-existing ADI case of pointer untagging?
On Tue, May 28, 2019 at 11:11:26PM -0700, Christoph Hellwig wrote:
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
It absolutely needs to move to common code. Having arch code leads to pointless (often unintentional) semantic differences between architectures, and lots of boilerplate code.
That's fine by me as long as we agree on the semantics (which shouldn't be hard; Khalid is already following up). We should probably also move the proposed ABI document [1] into a common place (or part of it, since we'll have arm64-specifics like prctl() calls to explicitly opt in to memory tagging).
[1] https://lore.kernel.org/lkml/20190318163533.26838-1-vincenzo.frascino@arm.co...
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
Keep them generic again but make sure we get agreement with Khalid on the actual ABI implications for sparc.
- Should I make untagging opt-in and controlled by a command line argument?
Opt-in, yes, but per task rather than kernel command line option. prctl() is a possibility of opting in.
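For concreteness, the opt-in could look like this from userspace (constant names and values are purely illustrative; no such interface exists yet):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TAGGED_ADDR_CTRL		/* hypothetical interface */
#define PR_SET_TAGGED_ADDR_CTRL	55
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
#endif

int main(void)
{
	/* Opt this task in to the relaxed tagged-address ABI. */
	if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0))
		perror("prctl(PR_SET_TAGGED_ADDR_CTRL)");
	return 0;
}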
- Should I "add Documentation/core-api/user-addresses.rst to describe
proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series"? Which examples specifically should it cover?
I think we can leave 3 for now as not too urgent. What I'd like is for Vincenzo's TBI user ABI document to go into a more common place since we can expand it to cover both sparc and arm64. We'd need an arm64-specific doc as well for things like prctl() and later MTE that sparc may support differently.
On Thu, May 30, 2019 at 7:15 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
Keep them generic again but make sure we get agreement with Khalid on the actual ABI implications for sparc.
OK, will do. I find it hard to understand what the ABI implications are. I'll post the next version without untagging in brk, mmap, munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat and shmdt.
- Should I make untagging opt-in and controlled by a command line argument?
Opt-in, yes, but per task rather than kernel command line option. prctl() is a possibility of opting in.
OK. Should I store a flag somewhere in task_struct? Should it be inheritable on clone?
- Should I "add Documentation/core-api/user-addresses.rst to describe
proper care and handling of user space pointers with untagged_addr(), with examples based on all the cases seen so far in this series"? Which examples specifically should it cover?
I think we can leave 3 for now as not too urgent. What I'd like is for Vincenzo's TBI user ABI document to go into a more common place since we can expand it to cover both sparc and arm64. We'd need an arm64-specific doc as well for things like prctl() and later MTE that sparc may support differently.
OK.
-- Catalin
On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
On Thu, May 30, 2019 at 7:15 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
Keep them generic again but make sure we get agreement with Khalid on the actual ABI implications for sparc.
OK, will do. I find it hard to understand what the ABI implications are. I'll post the next version without untagging in brk, mmap, munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat and shmdt.
It's more about not relaxing the ABI to accept non-zero top-byte unless we have a use-case for it. For mmap() etc., I don't think that's needed but if you think otherwise, please raise it.
- Should I make untagging opt-in and controlled by a command line argument?
Opt-in, yes, but per task rather than kernel command line option. prctl() is a possibility of opting in.
OK. Should I store a flag somewhere in task_struct? Should it be inheritable on clone?
A TIF flag would do but I'd say leave it out for now (default opted in) until we figure out the best way to do this (can be a patch on top of this series).
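A per-task gate along those lines might look like this (TIF_TAGGED_ADDR is a hypothetical name at this point; the flag would be copied with thread_info on clone(), hence inherited):

#define untagged_addr(addr)						\
	(test_thread_flag(TIF_TAGGED_ADDR) ?				\
		(u64)sign_extend64((u64)(addr), 55) : (u64)(addr))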
Thanks.
On Fri, May 31, 2019 at 6:20 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
On Thu, May 30, 2019 at 7:15 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
Keep them generic again but make sure we get agreement with Khalid on the actual ABI implications for sparc.
OK, will do. I find it hard to understand what the ABI implications are. I'll post the next version without untagging in brk, mmap, munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat and shmdt.
It's more about not relaxing the ABI to accept non-zero top-byte unless we have a use-case for it. For mmap() etc., I don't think that's needed but if you think otherwise, please raise it.
- Should I make untagging opt-in and controlled by a command line argument?
Opt-in, yes, but per task rather than kernel command line option. prctl() is a possibility of opting in.
OK. Should I store a flag somewhere in task_struct? Should it be inheritable on clone?
A TIF flag would do but I'd say leave it out for now (default opted in) until we figure out the best way to do this (can be a patch on top of this series).
You mean leave the whole opt-in/prctl part out? So the only change would be to move untagging for memory syscalls into generic code?
Thanks.
-- Catalin
On Fri, May 31, 2019 at 06:24:06PM +0200, Andrey Konovalov wrote:
On Fri, May 31, 2019 at 6:20 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
On Thu, May 30, 2019 at 7:15 PM Catalin Marinas catalin.marinas@arm.com wrote:
On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
Thanks for a lot of valuable input! I've read through all the replies and got somewhat lost. What are the changes I need to do to this series?
- Should I move untagging for memory syscalls back to the generic code so other arches would make use of it as well, or should I keep the arm64 specific memory syscalls wrappers and address the comments on that patch?
Keep them generic again but make sure we get agreement with Khalid on the actual ABI implications for sparc.
OK, will do. I find it hard to understand what the ABI implications are. I'll post the next version without untagging in brk, mmap, munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat and shmdt.
It's more about not relaxing the ABI to accept non-zero top-byte unless we have a use-case for it. For mmap() etc., I don't think that's needed but if you think otherwise, please raise it.
- Should I make untagging opt-in and controlled by a command line argument?
Opt-in, yes, but per task rather than kernel command line option. prctl() is a possibility of opting in.
OK. Should I store a flag somewhere in task_struct? Should it be inheritable on clone?
A TIF flag would do but I'd say leave it out for now (default opted in) until we figure out the best way to do this (can be a patch on top of this series).
You mean leave the whole opt-in/prctl part out? So the only change would be to move untagging for memory syscalls into generic code?
Yes (or just wait until next week to see if the discussion settles down).
On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such a feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for hwasan.
I'm not sure such a dire view is warranted...
The ioctl situation is not so bad, other than a few special cases, most drivers just take a 'void __user *' and pass it as an argument to some function that accepts a 'void __user *'. sparse et al verify this.
As long as the core functions do the right thing the drivers will be OK.
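For instance, the common shape is something like this (hypothetical driver; FOO_GET_INFO and foo_fill_info() are made up):

static long foo_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	void __user *argp = (void __user *)arg;

	switch (cmd) {
	case FOO_GET_INFO:
		/* The __user annotation rides along on every hop, so
		 * sparse can verify the whole chain. */
		return foo_fill_info(filp->private_data, argp);
	default:
		return -ENOTTY;
	}
}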
The only place things get dicey is if someone casts to unsigned long (ie for vma work) but I think that reflects that our driver facing APIs for VMAs aren't compatible with static analysis (ie I have no earthly idea why get_user_pages() accepts an unsigned long), not that this is too hard.
Jason
On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such a feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for hwasan.
I'm not sure such a dire view is warrented..
The ioctl situation is not so bad, other than a few special cases, most drivers just take a 'void __user *' and pass it as an argument to some function that accepts a 'void __user *'. sparse et al verify this.
As long as the core functions do the right thing the drivers will be OK.
The only place things get dicy is if someone casts to unsigned long (ie for vma work) but I think that reflects that our driver facing APIs for VMAs are compatible with static analysis (ie I have no earthly idea why get_user_pages() accepts an unsigned long), not that this is too hard.
If multiple people care about this, perhaps we should try to annotate types more explicitly in SYSCALL_DEFINEx() and ABI data structures.
For example, we could have a couple of mutually exclusive modifiers:
T __object *
T __vaddr * (or U __vaddr)
In the first case the pointer points to an object (in the C sense) that the call may dereference but not use for any other purpose.
In the latter case the pointer (or other type) is a virtual address that the call does not dereference but may do other things with.
Also:
U __really(T)
to tell static analysers the real type of pointers smuggled through UAPI disguised as other types (*cough* KVM, etc.)
We could gradually make sparse more strict about the presence of annotations and allowed conversions, add get/put_user() variants that demand explicit annotation, etc.
find_vma() wouldn't work with a __object pointer, for example. A get_user_pages_for_dereference() might be needed for __object pointers (embodying a promise from the caller that only the object will be dereferenced within the mapped pages).
Thoughts?
This kind of thing would need widespread buy-in in order to be viable.
Cheers ---Dave
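A rough sketch of how Dave's proposal might look in headers. This is entirely hypothetical: neither __object nor __vaddr exists in the kernel, they are defined as no-ops here, and real enforcement would need new sparse support since they must compose with __user's address space.

/* hypothetical annotations sketching the proposal above */
#define __object	/* may be dereferenced, nothing else */
#define __vaddr		/* address arithmetic only, never dereferenced */

/* read() copies from the buffer, so it takes an object pointer ... */
asmlinkage long sys_read(unsigned int fd, char __object __user *buf,
			 size_t count);

/* ... while munmap() only uses the argument as a virtual address */
asmlinkage long sys_munmap(void __vaddr __user *addr, size_t len);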
On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer, but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such a feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for the hwasan.
I'm not sure such a dire view is warranted...
The ioctl situation is not so bad: other than a few special cases, most drivers just take a 'void __user *' and pass it as an argument to some function that accepts a 'void __user *'; sparse et al. verify this.
As long as the core functions do the right thing the drivers will be OK.
The only place things get dicey is if someone casts to unsigned long (i.e. for vma work), but I think that reflects that our driver-facing APIs for VMAs are not compatible with static analysis (i.e. I have no earthly idea why get_user_pages() accepts an unsigned long), not that this is too hard.
If multiple people care about this, perhaps we should try to annotate types more explicitly in SYSCALL_DEFINEx() and ABI data structures.
For example, we could have a couple of mutually exclusive modifiers:
T __object *
T __vaddr * (or U __vaddr)
In the first case the pointer points to an object (in the C sense) that the call may dereference but not use for any other purpose.
How would you use these two differently?
So far the kernel has worked on the principle that __user should tag any pointer that comes from userspace, and that you can't do anything with it until you transform it into a kernel type.
U __really(T)
to tell static analysers the real type of pointers smuggled through UAPI disguised as other types (*cough* KVM, etc.)
Yes, that would help a lot; we often have to pass pointers through a u64 in the uAPI, and there is no static-checker support to make sure they are run through the u64_to_user_ptr() helper.
Jason
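The helper Jason mentions does exist (u64_to_user_ptr() in include/linux/kernel.h); what is missing is anything forcing its use. A sketch of the usual pattern (struct foo_cmd and foo_copy_in are made up for illustration):

#include <linux/kernel.h>	/* u64_to_user_ptr() */
#include <linux/types.h>
#include <linux/uaccess.h>

/* typical uAPI layout: the pointer travels as a __u64 so the struct
 * has the same layout for 32-bit and 64-bit userspace */
struct foo_cmd {
	__u64 data_ptr;		/* really a user pointer */
	__u32 data_len;
	__u32 pad;
};

static int foo_copy_in(const struct foo_cmd *cmd, void *buf)
{
	void __user *udata = u64_to_user_ptr(cmd->data_ptr);

	/* nothing stops a driver from open-coding the cast instead,
	 * which is exactly the static-checking gap discussed here */
	return copy_from_user(buf, udata, cmd->data_len) ? -EFAULT : 0;
}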
On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
The tagged pointers (whether hwasan or MTE) should ideally be a transparent feature for the application writer, but I don't think we can solve it entirely and make it seamless for the multitude of ioctls(). I'd say you only opt in to such a feature if you know what you are doing and the user code takes care of specific cases like ioctl(), hence the prctl() proposal even for the hwasan.
I'm not sure such a dire view is warranted...
The ioctl situation is not so bad: other than a few special cases, most drivers just take a 'void __user *' and pass it as an argument to some function that accepts a 'void __user *'; sparse et al. verify this.
As long as the core functions do the right thing the drivers will be OK.
The only place things get dicey is if someone casts to unsigned long (i.e. for vma work), but I think that reflects that our driver-facing APIs for VMAs are not compatible with static analysis (i.e. I have no earthly idea why get_user_pages() accepts an unsigned long), not that this is too hard.
If multiple people care about this, perhaps we should try to annotate types more explicitly in SYSCALL_DEFINEx() and ABI data structures.
For example, we could have a couple of mutually exclusive modifiers:
T __object *
T __vaddr * (or U __vaddr)
In the first case the pointer points to an object (in the C sense) that the call may dereference but not use for any other purpose.
How would you use these two differently?
So far the kernel has worked on the principle that __user should tag any pointer that comes from userspace, and that you can't do anything with it until you transform it into a kernel type.
Ultimately it would be good to disallow casting __object pointers except to compatible __object pointer types, and to make get_user etc. demand __object.
__vaddr pointers / addresses would be freely castable, but not to __object and so would not be dereferenceable even indirectly.
Or that's the general idea. Figuring out a sane set of rules that we could actually check / enforce would require a bit of work.
(Whether the __vaddr base type is a pointer or an integer type is probably moot, due to the restrictions we would place on the use of these anyway.)
U __really(T)
to tell static analysers the real type of pointers smuggled through UAPI disguised as other types (*cough* KVM, etc.)
Yes, that would help a lot; we often have to pass pointers through a u64 in the uAPI, and there is no static-checker support to make sure they are run through the u64_to_user_ptr() helper.
Agreed.
Cheers ---Dave
On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
If multiple people care about this, perhaps we should try to annotate types more explicitly in SYSCALL_DEFINEx() and ABI data structures.
For example, we could have a couple of mutually exclusive modifiers:
T __object *
T __vaddr * (or U __vaddr)
In the first case the pointer points to an object (in the C sense) that the call may dereference but not use for any other purpose.
How would you use these two differently?
So far the kernel has worked on the principle that __user should tag any pointer that comes from userspace, and that you can't do anything with it until you transform it into a kernel type.
Ultimately it would be good to disallow casting __object pointers except to compatible __object pointer types, and to make get_user etc. demand __object.
__vaddr pointers / addresses would be freely castable, but not to __object and so would not be dereferenceable even indirectly.
I think it gets too complicated and there are ambiguous cases that we may not be able to distinguish. For example, copy_from_user() may be used to copy a user data structure into the kernel, in which case __object would work, while the same function may be used to copy opaque data to a file, where __vaddr may be the better option (unless I misunderstood your proposal).
We currently have T __user * and I think it's a good starting point. The prior attempt [1] was shut down because it was just hiding the cast using __force. We'd need to work through those cases again, but rather than hiding casts we should change the function prototypes to avoid unnecessary casting in the callers (e.g. make get_user_pages() take void __user *, or come up with a new type), and turn the remaining explicit casts into a macro that makes it obvious we are converting a user pointer, potentially typed (tagged), to an untyped address range. We may need a user_ptr_to_ulong() macro or similar (it seems that we already have u64_to_user_ptr(); I wasn't aware of it).
It may actually not be far from what you suggested but I'd keep the current T __user * to denote possible dereference.
[1] https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.153562...
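A sketch of the kind of helper Catalin suggests (user_ptr_to_ulong() and lookup_user_vma() are hypothetical; u64_to_user_ptr() is the existing counterpart for the opposite direction):

#include <linux/mm.h>

/* hypothetical: make "user pointer -> untyped address" conversions
 * explicit and greppable instead of scattering bare casts */
#define user_ptr_to_ulong(p)	((__force unsigned long)(p))

static struct vm_area_struct *lookup_user_vma(struct mm_struct *mm,
					      void __user *uptr)
{
	/* the conversion happens exactly once, at a visible point */
	return find_vma(mm, user_ptr_to_ulong(uptr));
}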
On Thu, May 23, 2019 at 05:57:09PM +0100, Catalin Marinas wrote:
On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
If multiple people care about this, perhaps we should try to annotate types more explicitly in SYSCALL_DEFINEx() and ABI data structures.
For example, we could have a couple of mutually exclusive modifiers:
T __object *
T __vaddr * (or U __vaddr)
In the first case the pointer points to an object (in the C sense) that the call may dereference but not use for any other purpose.
How would you use these two differently?
So far the kernel has worked on the principle that __user should tag any pointer that comes from userspace, and that you can't do anything with it until you transform it into a kernel type.
Ultimately it would be good to disallow casting __object pointers except to compatible __object pointer types, and to make get_user etc. demand __object.
__vaddr pointers / addresses would be freely castable, but not to __object and so would not be dereferenceable even indirectly.
I think it gets too complicated and there are ambiguous cases that we may not be able to distinguish. For example, copy_from_user() may be used to copy a user data structure into the kernel, in which case __object would work, while the same function may be used to copy opaque data to a file, where __vaddr may be the better option (unless I misunderstood your proposal).
Can you illustrate? I'm not sure of your point here.
We currently have T __user * and I think it's a good starting point. The prior attempt [1] was shut down because it was just hiding the cast using __force. We'd need to work through those cases again, but rather than hiding casts we should change the function prototypes to avoid unnecessary casting in the callers (e.g. make get_user_pages() take void __user *, or come up with a new type), and turn the remaining explicit casts into a macro that makes it obvious we are converting a user pointer, potentially typed (tagged), to an untyped address range. We may need a user_ptr_to_ulong() macro or similar (it seems that we already have u64_to_user_ptr(); I wasn't aware of it).
It may actually not be far from what you suggested but I'd keep the current T __user * to denote possible dereference.
This may not have been clear, but __object and __vaddr would be orthogonal to __user. Since __object and __vaddr strictly constrain what can be done with an lvalue, they could be cast on, but not cast off without __force.
Syscall arguments and pointers in ioctl structs etc. would typically be annotated as __object __user * or __vaddr __user *. Plain old __user * would work as before, but would be more permissive and give static analysers less information to go on.
Conversion or use of __object or __vaddr pointers would require specific APIs in the kernel, so that we can be clear about the semantics.
Doing things this way would allow migration to annotation of most or all ABI pointers with __object or __vaddr over time, but we wouldn't have to do it all in one go. Problem cases (which won't be the majority) could continue to be plain __user.
This does not magically solve the challenges of MTE, but might provide tools that are useful to help avoid bitrot and regressions over time.
I agree though that there might be a fair number of cases that don't conveniently fall under __object or __vaddr semantics. It's hard to know without trying it.
_Most_ syscall arguments seem to be fairly obviously one or the other though, and this approach has some possibility of scaling to ioctls and other odd interfaces.
Cheers ---Dave
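To make the copy_from_user() ambiguity discussed above concrete, here is the same call in its two roles (get_req and copy_payload are made up for illustration):

#include <linux/types.h>
#include <linux/uaccess.h>

struct req { __u32 op; __u32 flags; };

/* case 1: a typed user object is copied in -- __object semantics */
static int get_req(struct req *r, const void __user *uptr)
{
	return copy_from_user(r, uptr, sizeof(*r)) ? -EFAULT : 0;
}

/* case 2: the very same call moves opaque bytes for a write() path,
 * arguably closer to __vaddr -- the ambiguity Catalin raises.  Under
 * Dave's scheme, such callers could simply stay plain __user. */
static int copy_payload(void *kbuf, const char __user *ubuf, size_t n)
{
	return copy_from_user(kbuf, ubuf, n) ? -EFAULT : 0;
}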