arm64 has a feature called Top Byte Ignore, which allows to embed pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged pointers, due to these patches:
1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a tagged pointer") 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged pointers") 3. 276e9327 ("arm64: entry: improve data abort handling of tagged pointers")
When passing tagged pointers to syscalls, there's a special case of such a pointer being passed to one of the memory syscalls (mmap, mprotect, etc.). These syscalls don't do memory accesses but rather deal with memory ranges, hence an untagged pointer is better suited.
This patchset extends tagged pointer support to non-memory syscalls. This is done by reusing the untagged_addr macro to untag user pointers when the kernel performs pointer checking to find out whether the pointer comes from userspace (most notably in access_ok). The untagging is done only when the pointer is being checked, the tag is preserved as the pointer makes its way through the kernel.
One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical.
The following testing approaches has been taken to find potential issues with user pointer untagging:
1. Static testing (with sparse [2] and separately with a custom static analyzer based on Clang) to track casts of __user pointers to integer types to find places where untagging needs to be done.
2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added to the patchset.
This patchset has been merged into the Pixel 2 kernel tree and is now being used to enable testing of Pixel 2 phones with HWASan.
This patchset is a prerequisite for ARM's memory tagging hardware feature support [3].
Thanks!
[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e...
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture...
Changes in v9: - Rebased onto 4.20-rc6. - Used u64 instead of __u64 in type casts in the untagged_addr macro for arm64. - Added braces around (addr) in the untagged_addr macro for other arches.
Changes in v8: - Rebased onto 65102238 (4.20-rc1). - Added a note to the cover letter on why syscall wrappers/shims that untag user pointers won't work. - Added a note to the cover letter that this patchset has been merged into the Pixel 2 kernel tree. - Documentation fixes, in particular added a list of syscalls that don't support tagged user pointers.
Changes in v7: - Rebased onto 17b57b18 (4.19-rc6). - Dropped the "arm64: untag user address in __do_user_fault" patch, since the existing patches already handle user faults properly. - Dropped the "usb, arm64: untag user addresses in devio" patch, since the passed pointer must come from a vma and therefore be untagged. - Dropped the "arm64: annotate user pointers casts detected by sparse" patch (see the discussion to the replies of the v6 of this patchset). - Added more context to the cover letter. - Updated Documentation/arm64/tagged-pointers.txt.
Changes in v6: 1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001 - Added annotations for user pointer casts found by sparse. 1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001 - Rebased onto 050cdc6c (4.19-rc1+). 1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001
Changes in v5: - Added 3 new patches that add untagging to places found with static analysis. - Rebased onto 44c929e1 (4.18-rc8).
Changes in v4: - Added a selftest for checking that passing tagged pointers to the kernel succeeds. - Rebased onto 81e97f013 (4.18-rc1+).
Changes in v3: - Rebased onto e5c51f30 (4.17-rc6+). - Added linux-arch@ to the list of recipients.
Changes in v2: - Rebased onto 2d618bdf (4.17-rc3+). - Removed excessive untagging in gup.c. - Removed untagging pointers returned from __uaccess_mask_ptr.
Changes in v1: - Rebased onto 4.17-rc1.
Changes in RFC v2: - Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of defining it for each arch individually. - Updated Documentation/arm64/tagged-pointers.txt. - Dropped "mm, arm64: untag user addresses in memory syscalls". - Rebased onto 3eb2ce82 (4.16-rc7).
Reviewed-by: Luc Van Oostenryck luc.vanoostenryck@gmail.com Signed-off-by: Andrey Konovalov andreyknvl@google.com
Andrey Konovalov (8): arm64: add type casts to untagged_addr macro uaccess: add untagged_addr definition for other arches arm64: untag user addresses in access_ok and __uaccess_mask_ptr mm, arm64: untag user addresses in mm/gup.c lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user fs, arm64: untag user address in copy_mount_options arm64: update Documentation/arm64/tagged-pointers.txt selftests, arm64: add a selftest for passing tagged pointers to kernel
Documentation/arm64/tagged-pointers.txt | 25 +++++++++++-------- arch/arm64/include/asm/uaccess.h | 14 +++++++---- fs/namespace.c | 2 +- include/linux/uaccess.h | 4 +++ lib/strncpy_from_user.c | 2 ++ lib/strnlen_user.c | 2 ++ mm/gup.c | 4 +++ tools/testing/selftests/arm64/.gitignore | 1 + tools/testing/selftests/arm64/Makefile | 11 ++++++++ .../testing/selftests/arm64/run_tags_test.sh | 12 +++++++++ tools/testing/selftests/arm64/tags_test.c | 19 ++++++++++++++ 11 files changed, 80 insertions(+), 16 deletions(-) create mode 100644 tools/testing/selftests/arm64/.gitignore create mode 100644 tools/testing/selftests/arm64/Makefile create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh create mode 100644 tools/testing/selftests/arm64/tags_test.c
This patch makes the untagged_addr macro accept all kinds of address types (void *, unsigned long, etc.) and allows not to specify type casts in each place where it is used. This is done by using __typeof__.
Acked-by: Catalin Marinas catalin.marinas@arm.com Signed-off-by: Andrey Konovalov andreyknvl@google.com --- arch/arm64/include/asm/uaccess.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 07c34087bd5e..3c3864ba3cc1 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -101,7 +101,8 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si * up with a tagged userland pointer. Clear the tag to get a sane pointer to * pass on to access_ok(), for instance. */ -#define untagged_addr(addr) sign_extend64(addr, 55) +#define untagged_addr(addr) \ + ((__typeof__(addr))sign_extend64((u64)(addr), 55))
#define access_ok(type, addr, size) __range_ok(addr, size) #define user_addr_max get_fs
To allow arm64 syscalls accept tagged pointers from userspace, we must untag them when they are passed to the kernel. Since untagging is done in generic parts of the kernel, the untagged_addr macro needs to be defined for all architectures.
Define it as a noop for other architectures besides arm64.
Acked-by: Catalin Marinas catalin.marinas@arm.com Signed-off-by: Andrey Konovalov andreyknvl@google.com --- include/linux/uaccess.h | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index efe79c1cdd47..42b7a4ac65e2 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -13,6 +13,10 @@
#include <asm/uaccess.h>
+#ifndef untagged_addr +#define untagged_addr(addr) (addr) +#endif + /* * Architectures should provide two primitives (raw_copy_{to,from}_user()) * and get rid of their private instances of copy_{to,from}_user() and
copy_from_user (and a few other similar functions) are used to copy data from user memory into the kernel memory or vice versa. Since a user can provided a tagged pointer to one of the syscalls that use copy_from_user, we need to correctly handle such pointers.
Do this by untagging user pointers in access_ok and in __uaccess_mask_ptr, before performing access validity checks.
Reviewed-by: Catalin Marinas catalin.marinas@arm.com Signed-off-by: Andrey Konovalov andreyknvl@google.com --- arch/arm64/include/asm/uaccess.h | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 3c3864ba3cc1..d28c3b1314ce 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -104,7 +104,8 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si #define untagged_addr(addr) \ ((__typeof__(addr))sign_extend64((u64)(addr), 55))
-#define access_ok(type, addr, size) __range_ok(addr, size) +#define access_ok(type, addr, size) \ + __range_ok(untagged_addr(addr), size) #define user_addr_max get_fs
#define _ASM_EXTABLE(from, to) \ @@ -236,7 +237,8 @@ static inline void uaccess_enable_not_uao(void)
/* * Sanitise a uaccess pointer such that it becomes NULL if above the - * current addr_limit. + * current addr_limit. In case the pointer is tagged (has the top byte set), + * untag the pointer before checking. */ #define uaccess_mask_ptr(ptr) (__typeof__(ptr))__uaccess_mask_ptr(ptr) static inline void __user *__uaccess_mask_ptr(const void __user *ptr) @@ -244,10 +246,11 @@ static inline void __user *__uaccess_mask_ptr(const void __user *ptr) void __user *safe_ptr;
asm volatile( - " bics xzr, %1, %2\n" + " bics xzr, %3, %2\n" " csel %0, %1, xzr, eq\n" : "=&r" (safe_ptr) - : "r" (ptr), "r" (current_thread_info()->addr_limit) + : "r" (ptr), "r" (current_thread_info()->addr_limit), + "r" (untagged_addr(ptr)) : "cc");
csdb();
mm/gup.c provides a kernel interface that accepts user addresses and manipulates user pages directly (for example get_user_pages, that is used by the futex syscall). Since a user can provided tagged addresses, we need to handle such case.
Add untagging to gup.c functions that use user addresses for vma lookup.
Signed-off-by: Andrey Konovalov andreyknvl@google.com --- mm/gup.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c index 8cb68a50dbdf..409aedb1e2d5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -683,6 +683,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, if (!nr_pages) return 0;
+ start = untagged_addr(start); + VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
/* @@ -845,6 +847,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, struct vm_area_struct *vma; vm_fault_t ret, major = 0;
+ address = untagged_addr(address); + if (unlocked) fault_flags |= FAULT_FLAG_ALLOW_RETRY;
strncpy_from_user and strnlen_user accept user addresses as arguments, and do not go through the same path as copy_from_user and others, so here we need to handle the case of tagged user addresses separately.
Untag user pointers passed to these functions.
Signed-off-by: Andrey Konovalov andreyknvl@google.com --- lib/strncpy_from_user.c | 2 ++ lib/strnlen_user.c | 2 ++ 2 files changed, 4 insertions(+)
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c index b53e1b5d80f4..97467cd2bc59 100644 --- a/lib/strncpy_from_user.c +++ b/lib/strncpy_from_user.c @@ -106,6 +106,8 @@ long strncpy_from_user(char *dst, const char __user *src, long count) if (unlikely(count <= 0)) return 0;
+ src = untagged_addr(src); + max_addr = user_addr_max(); src_addr = (unsigned long)src; if (likely(src_addr < max_addr)) { diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c index 60d0bbda8f5e..8b5f56466e00 100644 --- a/lib/strnlen_user.c +++ b/lib/strnlen_user.c @@ -108,6 +108,8 @@ long strnlen_user(const char __user *str, long count) if (unlikely(count <= 0)) return 0;
+ str = untagged_addr(str); + max_addr = user_addr_max(); src_addr = (unsigned long)str; if (likely(src_addr < max_addr)) {
In copy_mount_options a user address is being subtracted from TASK_SIZE. If the address is lower than TASK_SIZE, the size is calculated to not allow the exact_copy_from_user() call to cross TASK_SIZE boundary. However if the address is tagged, then the size will be calculated incorrectly.
Untag the address before subtracting.
Signed-off-by: Andrey Konovalov andreyknvl@google.com --- fs/namespace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c index a7f91265ea67..694dcedb7e7d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2686,7 +2686,7 @@ void *copy_mount_options(const void __user * data) * the remainder of the page. */ /* copy_from_user cannot cross TASK_SIZE ! */ - size = TASK_SIZE - (unsigned long)data; + size = TASK_SIZE - (unsigned long)untagged_addr(data); if (size > PAGE_SIZE) size = PAGE_SIZE;
Document the changes in Documentation/arm64/tagged-pointers.txt.
Signed-off-by: Andrey Konovalov andreyknvl@google.com --- Documentation/arm64/tagged-pointers.txt | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/Documentation/arm64/tagged-pointers.txt b/Documentation/arm64/tagged-pointers.txt index a25a99e82bb1..f4cf1f5cf362 100644 --- a/Documentation/arm64/tagged-pointers.txt +++ b/Documentation/arm64/tagged-pointers.txt @@ -17,13 +17,22 @@ this byte for application use. Passing tagged addresses to the kernel --------------------------------------
-All interpretation of userspace memory addresses by the kernel assumes -an address tag of 0x00. +The kernel supports tags in pointer arguments (including pointers in +structures) for a limited set of syscalls, the exceptions are:
-This includes, but is not limited to, addresses found in: + - memory syscalls: brk, madvise, mbind, mincore, mlock, mlock2, move_pages, + mprotect, mremap, msync, munlock, munmap, pkey_mprotect, process_vm_readv, + process_vm_writev, remap_file_pages;
- - pointer arguments to system calls, including pointers in structures - passed to system calls, + - ioctls that accept user pointers that describe virtual memory ranges; + + - TCP_ZEROCOPY_RECEIVE setsockopt. + +The kernel supports tags in user fault addresses. However the fault_address +field in the sigcontext struct will contain an untagged address. + +All other interpretations of userspace memory addresses by the kernel +assume an address tag of 0x00, in particular:
- the stack pointer (sp), e.g. when interpreting it to deliver a signal, @@ -33,11 +42,7 @@ This includes, but is not limited to, addresses found in:
Using non-zero address tags in any of these locations may result in an error code being returned, a (fatal) signal being raised, or other modes -of failure. - -For these reasons, passing non-zero address tags to the kernel via -system calls is forbidden, and using a non-zero address tag for sp is -strongly discouraged. +of failure. Using a non-zero address tag for sp is strongly discouraged.
Programs maintaining a frame pointer and frame records that use non-zero address tags may suffer impaired or inaccurate debug and profiling
This patch adds a simple test, that calls the uname syscall with a tagged user pointer as an argument. Without the kernel accepting tagged user pointers the test fails with EFAULT.
Signed-off-by: Andrey Konovalov andreyknvl@google.com --- tools/testing/selftests/arm64/.gitignore | 1 + tools/testing/selftests/arm64/Makefile | 11 +++++++++++ .../testing/selftests/arm64/run_tags_test.sh | 12 ++++++++++++ tools/testing/selftests/arm64/tags_test.c | 19 +++++++++++++++++++ 4 files changed, 43 insertions(+) create mode 100644 tools/testing/selftests/arm64/.gitignore create mode 100644 tools/testing/selftests/arm64/Makefile create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh create mode 100644 tools/testing/selftests/arm64/tags_test.c
diff --git a/tools/testing/selftests/arm64/.gitignore b/tools/testing/selftests/arm64/.gitignore new file mode 100644 index 000000000000..e8fae8d61ed6 --- /dev/null +++ b/tools/testing/selftests/arm64/.gitignore @@ -0,0 +1 @@ +tags_test diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile new file mode 100644 index 000000000000..a61b2e743e99 --- /dev/null +++ b/tools/testing/selftests/arm64/Makefile @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0 + +# ARCH can be overridden by the user for cross compiling +ARCH ?= $(shell uname -m 2>/dev/null || echo not) + +ifneq (,$(filter $(ARCH),aarch64 arm64)) +TEST_GEN_PROGS := tags_test +TEST_PROGS := run_tags_test.sh +endif + +include ../lib.mk diff --git a/tools/testing/selftests/arm64/run_tags_test.sh b/tools/testing/selftests/arm64/run_tags_test.sh new file mode 100755 index 000000000000..745f11379930 --- /dev/null +++ b/tools/testing/selftests/arm64/run_tags_test.sh @@ -0,0 +1,12 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +echo "--------------------" +echo "running tags test" +echo "--------------------" +./tags_test +if [ $? -ne 0 ]; then + echo "[FAIL]" +else + echo "[PASS]" +fi diff --git a/tools/testing/selftests/arm64/tags_test.c b/tools/testing/selftests/arm64/tags_test.c new file mode 100644 index 000000000000..1452ed7d33f9 --- /dev/null +++ b/tools/testing/selftests/arm64/tags_test.c @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <stdio.h> +#include <unistd.h> +#include <stdint.h> +#include <sys/utsname.h> + +#define SHIFT_TAG(tag) ((uint64_t)(tag) << 56) +#define SET_TAG(ptr, tag) (((uint64_t)(ptr) & ~SHIFT_TAG(0xff)) | \ + SHIFT_TAG(tag)) + +int main(void) +{ + struct utsname utsname; + void *ptr = &utsname; + void *tagged_ptr = (void *)SET_TAG(ptr, 0x42); + int err = uname(tagged_ptr); + return err; +}
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
This patchset proposes a relaxation of the ABI and a mechanism to advertise it to the userspace via an AT_FLAGS.
The rationale behind the choice of AT_FLAGS is that the Unix System V ABI defines AT_FLAGS as "flags", leaving some degree of freedom in interpretation. There are two previous attempts of using AT_FLAGS in the Linux Kernel for different reasons: the first was more generic and was used to expose the support for the GNU STACK NX feature [1] and the second was done for the MIPS architecture and was used to expose the support of "MIPS ABI Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. Both the changes are currently _not_ merged in mainline. The only architecture that reserves some of the bits in AT_FLAGS is currently MIPS, which introduced the concept of platform specific ABI (psABI) reserving the top-byte [3].
When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising to the userspace that a relaxed ABI is supported hence this type of pointers are now allowed to be passed to the syscalls when they are in memory ranges obtained by anonymous mmap() or brk().
The userspace _must_ verify that the flag is set before passing tagged pointers to the syscalls allowed by this relaxation.
More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating to the software to check that the feature is present, before using the associated functionality, it provides a degree of control on the decision of disabling such a feature in future without consequently breaking the userspace.
The change required a modification of the elf common code, because in Linux the AT_FLAGS are currently set to zero by default by the kernel.
The newly added flag has been verified on arm64 using the code below. #include <stdio.h> #include <stdbool.h> #include <sys/auxv.h>
#define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0)
bool arm64_syscall_tbi_is_present(void) { unsigned long at_flags = getauxval(AT_FLAGS); if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI) return true;
return false; }
void main() { if (arm64_syscall_tbi_is_present()) printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n"); }
This patchset should be merged together with [4].
[1] https://patchwork.ozlabs.org/patch/579578/ [2] https://lore.kernel.org/patchwork/cover/618280/ [3] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf [4] https://patchwork.kernel.org/cover/10674351/
ABI References: --------------- Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html PowerPC AUXV: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/db... AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf MIPS ABI: ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf ARM ABI: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com Cc: Mark Rutland mark.rutland@arm.com Cc: Robin Murphy robin.murphy@arm.com Cc: Kees Cook keescook@chromium.org Cc: Kate Stewart kstewart@linuxfoundation.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Ingo Molnar mingo@kernel.org Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Shuah Khan shuah@kernel.org Cc: Chintan Pandya cpandya@codeaurora.org Cc: Jacob Bramley Jacob.Bramley@arm.com Cc: Ruben Ayrapetyan Ruben.Ayrapetyan@arm.com Cc: Andrey Konovalov andreyknvl@google.com Cc: Lee Smith Lee.Smith@arm.com Cc: Kostya Serebryany kcc@google.com Cc: Dmitry Vyukov dvyukov@google.com, Cc: Ramana Radhakrishnan Ramana.Radhakrishnan@arm.com Cc: Luc Van Oostenryck luc.vanoostenryck@gmail.com Cc: Evgeniy Stepanov eugenis@google.com CC: Alexander Viro viro@zeniv.linux.org.uk Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com
Vincenzo Frascino (3): elf: Make AT_FLAGS arch configurable arm64: Define Documentation/arm64/elf_at_flags.txt arm64: elf: Advertise relaxed ABI
Documentation/arm64/elf_at_flags.txt | 111 ++++++++++++++++++++++++++ arch/arm64/include/asm/atflags.h | 7 ++ arch/arm64/include/asm/elf.h | 5 ++ arch/arm64/include/uapi/asm/atflags.h | 8 ++ fs/binfmt_elf.c | 6 +- fs/binfmt_elf_fdpic.c | 6 +- fs/compat_binfmt_elf.c | 5 ++ 7 files changed, 146 insertions(+), 2 deletions(-) create mode 100644 Documentation/arm64/elf_at_flags.txt create mode 100644 arch/arm64/include/asm/atflags.h create mode 100644 arch/arm64/include/uapi/asm/atflags.h
Currently, the AT_FLAGS in the elf auxiliary vector are set to 0 by default by the kernel. Some architectures might need to expose to the userspace a non-zero value to advertise some platform specific ABI functionalities.
This patch makes AT_FLAGS configurable by the architectures that require it.
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com CC: Andrey Konovalov andreyknvl@google.com CC: Alexander Viro viro@zeniv.linux.org.uk Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com --- fs/binfmt_elf.c | 6 +++++- fs/binfmt_elf_fdpic.c | 6 +++++- fs/compat_binfmt_elf.c | 5 +++++ 3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 54207327f98f..9fa20cc4a437 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -86,6 +86,10 @@ static int elf_core_dump(struct coredump_params *cprm); #define ELF_CORE_EFLAGS 0 #endif
+#ifndef ELF_AT_FLAGS +#define ELF_AT_FLAGS 0 +#endif + #define ELF_PAGESTART(_v) ((_v) & ~(unsigned long)(ELF_MIN_ALIGN-1)) #define ELF_PAGEOFFSET(_v) ((_v) & (ELF_MIN_ALIGN-1)) #define ELF_PAGEALIGN(_v) (((_v) + ELF_MIN_ALIGN - 1) & ~(ELF_MIN_ALIGN - 1)) @@ -251,7 +255,7 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec, NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr)); NEW_AUX_ENT(AT_PHNUM, exec->e_phnum); NEW_AUX_ENT(AT_BASE, interp_load_addr); - NEW_AUX_ENT(AT_FLAGS, 0); + NEW_AUX_ENT(AT_FLAGS, ELF_AT_FLAGS); NEW_AUX_ENT(AT_ENTRY, exec->e_entry); NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid)); NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid)); diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index b53bb3729ac1..cf1e680a6b88 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -82,6 +82,10 @@ static int elf_fdpic_map_file_by_direct_mmap(struct elf_fdpic_params *, static int elf_fdpic_core_dump(struct coredump_params *cprm); #endif
+#ifndef ELF_AT_FLAGS +#define ELF_AT_FLAGS 0 +#endif + static struct linux_binfmt elf_fdpic_format = { .module = THIS_MODULE, .load_binary = load_elf_fdpic_binary, @@ -651,7 +655,7 @@ static int create_elf_fdpic_tables(struct linux_binprm *bprm, NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr)); NEW_AUX_ENT(AT_PHNUM, exec_params->hdr.e_phnum); NEW_AUX_ENT(AT_BASE, interp_params->elfhdr_addr); - NEW_AUX_ENT(AT_FLAGS, 0); + NEW_AUX_ENT(AT_FLAGS, ELF_AT_FLAGS); NEW_AUX_ENT(AT_ENTRY, exec_params->entry_addr); NEW_AUX_ENT(AT_UID, (elf_addr_t) from_kuid_munged(cred->user_ns, cred->uid)); NEW_AUX_ENT(AT_EUID, (elf_addr_t) from_kuid_munged(cred->user_ns, cred->euid)); diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c index 15f6e96b3bd9..a21cf99701ae 100644 --- a/fs/compat_binfmt_elf.c +++ b/fs/compat_binfmt_elf.c @@ -79,6 +79,11 @@ #define ELF_HWCAP2 COMPAT_ELF_HWCAP2 #endif
+#ifdef COMPAT_ELF_AT_FLAGS +#undef ELF_AT_FLAGS +#define ELF_AT_FLAGS COMPAT_ELF_AT_FLAGS +#endif + #ifdef COMPAT_ARCH_DLINFO #undef ARCH_DLINFO #define ARCH_DLINFO COMPAT_ARCH_DLINFO
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
With the relaxed ABI proposed through this document, it is now possible to pass tagged pointers to the syscalls, when these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap() or brk().
This change in the ABI requires a mechanism to inform the userspace that such an option is available.
This patch specifies and documents the way on which AT_FLAGS can be used to advertise this feature to the userspace.
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com CC: Andrey Konovalov andreyknvl@google.com Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com --- Documentation/arm64/elf_at_flags.txt | 111 +++++++++++++++++++++++++++ 1 file changed, 111 insertions(+) create mode 100644 Documentation/arm64/elf_at_flags.txt
diff --git a/Documentation/arm64/elf_at_flags.txt b/Documentation/arm64/elf_at_flags.txt new file mode 100644 index 000000000000..153e657c058a --- /dev/null +++ b/Documentation/arm64/elf_at_flags.txt @@ -0,0 +1,111 @@ +ARM64 ELF AT_FLAGS +================== + +This document describes the usage and semantics of AT_FLAGS on arm64. + +1. Introduction +--------------- + +AT_FLAGS is part of the Auxiliary Vector, contains the flags and it +is currently set to zero by the kernel on arm64. + +The auxiliary vector can be accessed by the userspace using the +getauxval() API provided by the C library. +getauxval() returns an unsigned long and when a flag is present in +the AT_FLAGS, the corresponding bit in the returned value is set to 1. + +The AT_FLAGS with a "defined semantic" on arm64 are exposed to the +userspace via user API (uapi/asm/atflags.h). +The AT_FLAGS bits with "undefined semantics" are set to zero by default. +This means that the AT_FLAGS bits to which this document does not assign +an explicit meaning are to be intended reserved for future use. +The kernel will populate all such bits with zero until meanings are +assigned to them. If and when meanings are assigned, it is guaranteed +that they will not impact the functional operation of existing userspace +software. Userspace software should ignore any AT_FLAGS bit whose meaning +is not defined when the software is written. + +The userspace software can test for features by acquiring the AT_FLAGS +entry of the auxiliary vector, and testing whether a relevant flag +is set. + +Example of a userspace test function: + +bool feature_x_is_present(void) +{ + unsigned long at_flags = getauxval(AT_FLAGS); + if (at_flags & FEATURE_X) + return true; + + return false; +} + +Where the software relies on a feature advertised by AT_FLAGS, it +should check that the feature is present before attempting to +use it. + +2. Features exposed via AT_FLAGS +-------------------------------- + +bit[0]: ARM64_AT_FLAGS_SYSCALL_TBI + + On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence + the userspace (EL0) is allowed to set a non-zero value in the top + byte but the resulting pointers are not allowed at the user-kernel + syscall ABI boundary. + When bit[0] is set to 1 the kernel is advertising to the userspace + that a relaxed ABI is supported hence this type of pointers are now + allowed to be passed to the syscalls, when these pointers are in + memory ranges obtained by anonymous (MAP_ANONYMOUS) mmap() or brk(). + In these cases the tag is preserved as the pointer goes through the + kernel. Only when the kernel needs to check if a pointer is coming + from userspace (i.e. access_ok()) an untag operation is required. + +3. ARM64_AT_FLAGS_SYSCALL_TBI +----------------------------- + +When ARM64_AT_FLAGS_SYSCALL_TBI is enabled every syscalls can accept tagged +pointers, when these pointers are in memory ranges obtained by an anonymous +(MAP_ANONYMOUS) mmap() or brk(). + +A definition of the meaning of tagged pointers on arm64 can be found in: +Documentation/arm64/tagged-pointers.txt. + +When a pointer does not are in a memory range obtained by an anonymous mmap() +or brk(), this can not be passed to a syscall if it is tagged. + +To be more explicit: a syscall can accept pointers whose memory range is +obtained by a non-anonymous mmap() or brk() if and only if the tag encoded in +the top-byte is 0x00. + +When a new syscall is added, this can accept tagged pointers if and only if +these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS) +mmap() or brk(). In all the other cases, the tag encoded in the top-byte is +expected to be 0x00. + +Example of correct usage (pseudo-code) for a userspace application: + +bool arm64_syscall_tbi_is_present(void) +{ + unsigned long at_flags = getauxval(AT_FLAGS); + if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI) + return true; + + return false; +} + +void main(void) +{ + char *addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS, -1, 0); + + /* Check if the relaxed ABI is supported */ + if (arm64_syscall_tbi_is_present()) { + /* Add a tag to the pointer and to the memory */ + addr = tag_pointer_and_memory(addr); + } + + /* Write to memory */ + strcpy("Hello World\n", addr); +} +
On Mon, Dec 10, 2018 at 02:30:43PM +0000, Vincenzo Frascino wrote:
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
With the relaxed ABI proposed through this document, it is now possible to pass tagged pointers to the syscalls, when these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap() or brk().
What about other anonymous memory such as the process stack?
What about MAP_PRIVATE mappings of /dev/zero (i.e., oldskool "anonymous" mappings)?
I wonder whether this should really say MAP_PRIVATE rather than MAP_ANONYMOUS. There are two requirements here:
* the memory must be the exclusive property of a single process, otherwise tagging it on-the-fly could break some other process;
* the memory should be regular memory, i.e., not something like a mapped device. Since copy-on-write mappings of devices make little sense, and we want writes to devices to propagate to the hardware directly, MAP_PRIVATE doesn't make a lot of sense for such mappings.
MAP_ANONYMOUS | MAP_SHARED is theoretically possible, and we should make sure that it does the expected thing. Tagging such memory is probably a bad idea, just like any other MAP_SHARED memory.
I'm assuming here that tagging some currently shared copy-on-write pages would throw a page fault and trigger a copy, so that we end up tagging the calling process's private copy of the page.
I also don't see how the above requirements conflict with regular file- backed mappings (which would need to work if you want to be able to tag objects in .bss or .data etc.)
This change in the ABI requires a mechanism to inform the userspace that such an option is available.
This patch specifies and documents the way on which AT_FLAGS can be used to advertise this feature to the userspace.
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com CC: Andrey Konovalov andreyknvl@google.com Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com
Documentation/arm64/elf_at_flags.txt | 111 +++++++++++++++++++++++++++ 1 file changed, 111 insertions(+) create mode 100644 Documentation/arm64/elf_at_flags.txt
diff --git a/Documentation/arm64/elf_at_flags.txt b/Documentation/arm64/elf_at_flags.txt new file mode 100644 index 000000000000..153e657c058a --- /dev/null +++ b/Documentation/arm64/elf_at_flags.txt @@ -0,0 +1,111 @@ +ARM64 ELF AT_FLAGS +==================
+This document describes the usage and semantics of AT_FLAGS on arm64.
+1. Introduction +---------------
+AT_FLAGS is part of the Auxiliary Vector, contains the flags and it +is currently set to zero by the kernel on arm64.
+The auxiliary vector can be accessed by the userspace using the +getauxval() API provided by the C library. +getauxval() returns an unsigned long and when a flag is present in +the AT_FLAGS, the corresponding bit in the returned value is set to 1.
+The AT_FLAGS with a "defined semantic" on arm64 are exposed to the +userspace via user API (uapi/asm/atflags.h). +The AT_FLAGS bits with "undefined semantics" are set to zero by default. +This means that the AT_FLAGS bits to which this document does not assign +an explicit meaning are to be intended reserved for future use. +The kernel will populate all such bits with zero until meanings are +assigned to them. If and when meanings are assigned, it is guaranteed +that they will not impact the functional operation of existing userspace +software. Userspace software should ignore any AT_FLAGS bit whose meaning +is not defined when the software is written.
+The userspace software can test for features by acquiring the AT_FLAGS +entry of the auxiliary vector, and testing whether a relevant flag +is set.
+Example of a userspace test function:
+bool feature_x_is_present(void) +{
- unsigned long at_flags = getauxval(AT_FLAGS);
- if (at_flags & FEATURE_X)
return true;
- return false;
+}
+Where the software relies on a feature advertised by AT_FLAGS, it +should check that the feature is present before attempting to +use it.
+2. Features exposed via AT_FLAGS +--------------------------------
+bit[0]: ARM64_AT_FLAGS_SYSCALL_TBI
- On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
- the userspace (EL0) is allowed to set a non-zero value in the top
- byte but the resulting pointers are not allowed at the user-kernel
- syscall ABI boundary.
- When bit[0] is set to 1 the kernel is advertising to the userspace
- that a relaxed ABI is supported hence this type of pointers are now
- allowed to be passed to the syscalls, when these pointers are in
- memory ranges obtained by anonymous (MAP_ANONYMOUS) mmap() or brk().
"TBI" is a slightly odd name.
The kernel seems not to be ignoring the top byte, otherwise how could it make a difference whehter the memory is anonymous or something else?
(With memory tagging enabled, the top byte is also not architecturally ignored.)
- In these cases the tag is preserved as the pointer goes through the
- kernel. Only when the kernel needs to check if a pointer is coming
- from userspace (i.e. access_ok()) an untag operation is required.
Does the last sentence belong here? That's about kernel internals, whereas the rest all seems user-facing.
+3. ARM64_AT_FLAGS_SYSCALL_TBI +-----------------------------
+When ARM64_AT_FLAGS_SYSCALL_TBI is enabled every syscalls can accept tagged +pointers, when these pointers are in memory ranges obtained by an anonymous +(MAP_ANONYMOUS) mmap() or brk().
+A definition of the meaning of tagged pointers on arm64 can be found in: +Documentation/arm64/tagged-pointers.txt.
+When a pointer does not are in a memory range obtained by an anonymous mmap() +or brk(), this can not be passed to a syscall if it is tagged.
+To be more explicit: a syscall can accept pointers whose memory range is +obtained by a non-anonymous mmap() or brk() if and only if the tag encoded in +the top-byte is 0x00.
+When a new syscall is added, this can accept tagged pointers if and only if +these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS) +mmap() or brk(). In all the other cases, the tag encoded in the top-byte is +expected to be 0x00.
Does this apply to kernel interfaces that are not syscalls?
And does it apply to ioctls in general (I think from discussions elsewhere that it can't).
What about things that flow through the kernel, like an si_value.sival_ptr that propagates from sigqueue(2) to the signal frame of the signalled thread, or registered with the kernel via aio_read(2)?
This kind of thing is why I would like to define a set of rules for making an educated guess about how the kernel should interpret arbitrary arguments (in an ideal world).
With issues like this in the mix, it seems difficult to extract general guarantees from ARM64_AT_FLAGS_SYSCALL_TBI. If it means that just certain specific uses of a few specific syscalls work with tagged pointers, that may not be very useful by ifself? It sounds like if you are tagging memory at all, you suddenly need to port every random library you're using.
[...]
Cheers ---Dave
On 12/12/2018 17:34, Dave Martin wrote:
On Mon, Dec 10, 2018 at 02:30:43PM +0000, Vincenzo Frascino wrote:
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
With the relaxed ABI proposed through this document, it is now possible to pass tagged pointers to the syscalls, when these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap() or brk().
What about other anonymous memory such as the process stack?
The process stack is set up by the kernel as a consequence of the exec* family of system calls in a way that should be transparent to the userspace. For what concerns the userspace applications, the ones that are aware of the AT_FLAGS[0] should assume that the stack is always tagged.
What about MAP_PRIVATE mappings of /dev/zero (i.e., oldskool "anonymous" mappings)?
MAP_PRIVATE of /dev/zero is equivalent to MAP_ANONYMOUS | MAP_PRIVATE, hence it should follow the same rule of MAP_ANONYMOUS obtained memory ranges.
I wonder whether this should really say MAP_PRIVATE rather than MAP_ANONYMOUS. There are two requirements here:
- the memory must be the exclusive property of a single process, otherwise tagging it on-the-fly could break some other process;
I think that it is responsibility of the userspace to tag or not a shared area. I do not think that we should exclude MAP_ANONYMOUS | MAP_SHARED because if such an area is tagged improperly by a userspace process, the kernel will be able to detect the issue and report it.
- the memory should be regular memory, i.e., not something like a mapped device. Since copy-on-write mappings of devices make little sense, and we want writes to devices to propagate to the hardware directly, MAP_PRIVATE doesn't make a lot of sense for such mappings.
What we are trying to target with this ABI change are only pointers in NORMAL memory ranges. DEVICE memory should be untagged.
...
I'm assuming here that tagging some currently shared copy-on-write pages would throw a page fault and trigger a copy, so that we end up tagging the calling process's private copy of the page.
I also don't see how the above requirements conflict with regular file- backed mappings (which would need to work if you want to be able to tag objects in .bss or .data etc.)
I agree that we should look at the more general cases to make sure that we will not face any incompatibility in future, but, what I am proposing here, assumes that the memory areas that can be tagged are only the stack and the heap. This is because the features that will use it will be mostly looking for memory related errors in these areas.
This change in the ABI requires a mechanism to inform the userspace that such an option is available.
This patch specifies and documents the way on which AT_FLAGS can be used to advertise this feature to the userspace.
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com CC: Andrey Konovalov andreyknvl@google.com Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com
Documentation/arm64/elf_at_flags.txt | 111 +++++++++++++++++++++++++++ 1 file changed, 111 insertions(+) create mode 100644 Documentation/arm64/elf_at_flags.txt
diff --git a/Documentation/arm64/elf_at_flags.txt b/Documentation/arm64/elf_at_flags.txt new file mode 100644 index 000000000000..153e657c058a --- /dev/null +++ b/Documentation/arm64/elf_at_flags.txt @@ -0,0 +1,111 @@ +ARM64 ELF AT_FLAGS +==================
+This document describes the usage and semantics of AT_FLAGS on arm64.
+1. Introduction +---------------
+AT_FLAGS is part of the Auxiliary Vector, contains the flags and it +is currently set to zero by the kernel on arm64.
+The auxiliary vector can be accessed by the userspace using the +getauxval() API provided by the C library. +getauxval() returns an unsigned long and when a flag is present in +the AT_FLAGS, the corresponding bit in the returned value is set to 1.
+The AT_FLAGS with a "defined semantic" on arm64 are exposed to the +userspace via user API (uapi/asm/atflags.h). +The AT_FLAGS bits with "undefined semantics" are set to zero by default. +This means that the AT_FLAGS bits to which this document does not assign +an explicit meaning are to be intended reserved for future use. +The kernel will populate all such bits with zero until meanings are +assigned to them. If and when meanings are assigned, it is guaranteed +that they will not impact the functional operation of existing userspace +software. Userspace software should ignore any AT_FLAGS bit whose meaning +is not defined when the software is written.
+The userspace software can test for features by acquiring the AT_FLAGS +entry of the auxiliary vector, and testing whether a relevant flag +is set.
+Example of a userspace test function:
+bool feature_x_is_present(void) +{
- unsigned long at_flags = getauxval(AT_FLAGS);
- if (at_flags & FEATURE_X)
return true;
- return false;
+}
+Where the software relies on a feature advertised by AT_FLAGS, it +should check that the feature is present before attempting to +use it.
+2. Features exposed via AT_FLAGS +--------------------------------
+bit[0]: ARM64_AT_FLAGS_SYSCALL_TBI
- On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
- the userspace (EL0) is allowed to set a non-zero value in the top
- byte but the resulting pointers are not allowed at the user-kernel
- syscall ABI boundary.
- When bit[0] is set to 1 the kernel is advertising to the userspace
- that a relaxed ABI is supported hence this type of pointers are now
- allowed to be passed to the syscalls, when these pointers are in
- memory ranges obtained by anonymous (MAP_ANONYMOUS) mmap() or brk().
"TBI" is a slightly odd name.
The kernel seems not to be ignoring the top byte, otherwise how could it make a difference whehter the memory is anonymous or something else?
(With memory tagging enabled, the top byte is also not architecturally ignored.)
The top-byte is ignored for MMU translations hence it can carry information that are relevant for other aspects. This means that the kernel can not ignore that the top-byte could be set.
- In these cases the tag is preserved as the pointer goes through the
- kernel. Only when the kernel needs to check if a pointer is coming
- from userspace (i.e. access_ok()) an untag operation is required.
Does the last sentence belong here? That's about kernel internals, whereas the rest all seems user-facing.
I put it here just to give an idea to the user on when the kernel is going to perform an untag operation. But if we want to keep the document completely user-facing, I am fine with removing it.
+3. ARM64_AT_FLAGS_SYSCALL_TBI +-----------------------------
+When ARM64_AT_FLAGS_SYSCALL_TBI is enabled every syscalls can accept tagged +pointers, when these pointers are in memory ranges obtained by an anonymous +(MAP_ANONYMOUS) mmap() or brk().
+A definition of the meaning of tagged pointers on arm64 can be found in: +Documentation/arm64/tagged-pointers.txt.
+When a pointer does not are in a memory range obtained by an anonymous mmap() +or brk(), this can not be passed to a syscall if it is tagged.
+To be more explicit: a syscall can accept pointers whose memory range is +obtained by a non-anonymous mmap() or brk() if and only if the tag encoded in +the top-byte is 0x00.
+When a new syscall is added, this can accept tagged pointers if and only if +these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS) +mmap() or brk(). In all the other cases, the tag encoded in the top-byte is +expected to be 0x00.
Does this apply to kernel interfaces that are not syscalls?
Not at the moment, but it does not seem incompatible with a future extension.
And does it apply to ioctls in general (I think from discussions elsewhere that it can't).
Requiring that the pointers accepted by the syscalls belong to ranges that are obtained by anonymous mappings, should cover the drivers mmap() cases.
What about things that flow through the kernel, like an si_value.sival_ptr that propagates from sigqueue(2) to the signal frame of the signalled thread, or registered with the kernel via aio_read(2)?
This kind of thing is why I would like to define a set of rules for making an educated guess about how the kernel should interpret arbitrary arguments (in an ideal world).
I think that to make an educated choice we need to restrict the problem we are trying to target (making sure that we do not take any assumption that could be incompatible with the general case) as a first step. Once we are confident that it works as expected we can try to extend it progressively.
With issues like this in the mix, it seems difficult to extract general guarantees from ARM64_AT_FLAGS_SYSCALL_TBI. If it means that just certain specific uses of a few specific syscalls work with tagged pointers, that may not be very useful by ifself? It sounds like if you are tagging memory at all, you suddenly need to port every random library you're using.
The ABI presented in this document seems quite general (it is not confined to specific syscalls). This means that an application that tags the memory can use any non tag aware random library as is (without porting it), as far as it sanitizes the pointers before and after calling it.
[...]
Cheers ---Dave
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
This patch sets ARM64_AT_FLAGS_SYSCALL_TBI (bit[0]) in the AT_FLAGS to advertise the relaxation of the ABI to the userspace.
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com CC: Andrey Konovalov andreyknvl@google.com Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com --- arch/arm64/include/asm/atflags.h | 7 +++++++ arch/arm64/include/asm/elf.h | 5 +++++ arch/arm64/include/uapi/asm/atflags.h | 8 ++++++++ 3 files changed, 20 insertions(+) create mode 100644 arch/arm64/include/asm/atflags.h create mode 100644 arch/arm64/include/uapi/asm/atflags.h
diff --git a/arch/arm64/include/asm/atflags.h b/arch/arm64/include/asm/atflags.h new file mode 100644 index 000000000000..b20093d61bf2 --- /dev/null +++ b/arch/arm64/include/asm/atflags.h @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ATFLAGS_H +#define __ASM_ATFLAGS_H + +#include <uapi/asm/atflags.h> + +#endif diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h index 433b9554c6a1..da5a6d310ff4 100644 --- a/arch/arm64/include/asm/elf.h +++ b/arch/arm64/include/asm/elf.h @@ -16,6 +16,7 @@ #ifndef __ASM_ELF_H #define __ASM_ELF_H
+#include <asm/atflags.h> #include <asm/hwcap.h>
/* @@ -163,6 +164,10 @@ do { \ NEW_AUX_ENT(AT_IGNORE, 0); \ } while (0)
+/* Platform specific AT_FLAGS */ +#define ELF_AT_FLAGS ARM64_AT_FLAGS_SYSCALL_TBI +#define COMPAT_ELF_AT_FLAGS 0 + #define ARCH_HAS_SETUP_ADDITIONAL_PAGES struct linux_binprm; extern int arch_setup_additional_pages(struct linux_binprm *bprm, diff --git a/arch/arm64/include/uapi/asm/atflags.h b/arch/arm64/include/uapi/asm/atflags.h new file mode 100644 index 000000000000..1cf25692ffd6 --- /dev/null +++ b/arch/arm64/include/uapi/asm/atflags.h @@ -0,0 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __UAPI_ASM_ATFLAGS_H +#define __UAPI_ASM_ATFLAGS_H + +/* Platform specific AT_FLAGS */ +#define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0) + +#endif
On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino vincenzo.frascino@arm.com wrote:
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
This patchset proposes a relaxation of the ABI and a mechanism to advertise it to the userspace via an AT_FLAGS.
The rationale behind the choice of AT_FLAGS is that the Unix System V ABI defines AT_FLAGS as "flags", leaving some degree of freedom in interpretation. There are two previous attempts of using AT_FLAGS in the Linux Kernel for different reasons: the first was more generic and was used to expose the support for the GNU STACK NX feature [1] and the second was done for the MIPS architecture and was used to expose the support of "MIPS ABI Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. Both the changes are currently _not_ merged in mainline. The only architecture that reserves some of the bits in AT_FLAGS is currently MIPS, which introduced the concept of platform specific ABI (psABI) reserving the top-byte [3].
When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising to the userspace that a relaxed ABI is supported hence this type of pointers are now allowed to be passed to the syscalls when they are in memory ranges obtained by anonymous mmap() or brk().
The userspace _must_ verify that the flag is set before passing tagged pointers to the syscalls allowed by this relaxation.
More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating to the software to check that the feature is present, before using the associated functionality, it provides a degree of control on the decision of disabling such a feature in future without consequently breaking the userspace.
The change required a modification of the elf common code, because in Linux the AT_FLAGS are currently set to zero by default by the kernel.
The newly added flag has been verified on arm64 using the code below. #include <stdio.h> #include <stdbool.h> #include <sys/auxv.h>
#define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0)
bool arm64_syscall_tbi_is_present(void) { unsigned long at_flags = getauxval(AT_FLAGS); if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI) return true;
return false;
}
void main() { if (arm64_syscall_tbi_is_present()) printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n"); }
This patchset should be merged together with [4].
[1] https://patchwork.ozlabs.org/patch/579578/ [2] https://lore.kernel.org/patchwork/cover/618280/ [3] ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf [4] https://patchwork.kernel.org/cover/10674351/
ABI References:
Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html PowerPC AUXV: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/db... AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf MIPS ABI: ftp://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf ARM ABI: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf
Cc: Catalin Marinas catalin.marinas@arm.com Cc: Will Deacon will.deacon@arm.com Cc: Mark Rutland mark.rutland@arm.com Cc: Robin Murphy robin.murphy@arm.com Cc: Kees Cook keescook@chromium.org Cc: Kate Stewart kstewart@linuxfoundation.org Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Andrew Morton akpm@linux-foundation.org Cc: Ingo Molnar mingo@kernel.org Cc: "Kirill A . Shutemov" kirill.shutemov@linux.intel.com Cc: Shuah Khan shuah@kernel.org Cc: Chintan Pandya cpandya@codeaurora.org Cc: Jacob Bramley Jacob.Bramley@arm.com Cc: Ruben Ayrapetyan Ruben.Ayrapetyan@arm.com Cc: Andrey Konovalov andreyknvl@google.com Cc: Lee Smith Lee.Smith@arm.com Cc: Kostya Serebryany kcc@google.com Cc: Dmitry Vyukov dvyukov@google.com, Cc: Ramana Radhakrishnan Ramana.Radhakrishnan@arm.com Cc: Luc Van Oostenryck luc.vanoostenryck@gmail.com Cc: Evgeniy Stepanov eugenis@google.com CC: Alexander Viro viro@zeniv.linux.org.uk Signed-off-by: Vincenzo Frascino vincenzo.frascino@arm.com
Vincenzo Frascino (3): elf: Make AT_FLAGS arch configurable arm64: Define Documentation/arm64/elf_at_flags.txt arm64: elf: Advertise relaxed ABI
Documentation/arm64/elf_at_flags.txt | 111 ++++++++++++++++++++++++++ arch/arm64/include/asm/atflags.h | 7 ++ arch/arm64/include/asm/elf.h | 5 ++ arch/arm64/include/uapi/asm/atflags.h | 8 ++ fs/binfmt_elf.c | 6 +- fs/binfmt_elf_fdpic.c | 6 +- fs/compat_binfmt_elf.c | 5 ++ 7 files changed, 146 insertions(+), 2 deletions(-) create mode 100644 Documentation/arm64/elf_at_flags.txt create mode 100644 arch/arm64/include/asm/atflags.h create mode 100644 arch/arm64/include/uapi/asm/atflags.h
-- 2.19.2
Acked-by: Andrey Konovalov andreyknvl@google.com
Hi Andrey,
On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino vincenzo.frascino@arm.com wrote:
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
This patchset proposes a relaxation of the ABI and a mechanism to advertise it to the userspace via an AT_FLAGS.
The rationale behind the choice of AT_FLAGS is that the Unix System V ABI defines AT_FLAGS as "flags", leaving some degree of freedom in interpretation. There are two previous attempts of using AT_FLAGS in the Linux Kernel for different reasons: the first was more generic and was used to expose the support for the GNU STACK NX feature [1] and the second was done for the MIPS architecture and was used to expose the support of "MIPS ABI Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. Both the changes are currently _not_ merged in mainline. The only architecture that reserves some of the bits in AT_FLAGS is currently MIPS, which introduced the concept of platform specific ABI (psABI) reserving the top-byte [3].
When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising to the userspace that a relaxed ABI is supported hence this type of pointers are now allowed to be passed to the syscalls when they are in memory ranges obtained by anonymous mmap() or brk().
The userspace _must_ verify that the flag is set before passing tagged pointers to the syscalls allowed by this relaxation.
More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating to the software to check that the feature is present, before using the associated functionality, it provides a degree of control on the decision of disabling such a feature in future without consequently breaking the userspace.
[...]
Acked-by: Andrey Konovalov andreyknvl@google.com
Thanks for the ack. However, if we go ahead with this ABI proposal it means that your patches need to be reworked to allow a non-zero top byte in all syscalls, including mmap() and friends, ioctl(). There are ABI concerns in either case but I'd rather have this discussion in the open. It doesn't necessarily mean that I endorse this proposal, I would like feedback and not just from kernel developers but user space ones.
The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations.
(sorry for not being clear on what Vincenzo's proposal implies)
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas catalin.marinas@arm.com wrote:
Hi Andrey,
On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino vincenzo.frascino@arm.com wrote:
On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence the userspace (EL0) is allowed to set a non-zero value in the top byte but the resulting pointers are not allowed at the user-kernel syscall ABI boundary.
This patchset proposes a relaxation of the ABI and a mechanism to advertise it to the userspace via an AT_FLAGS.
The rationale behind the choice of AT_FLAGS is that the Unix System V ABI defines AT_FLAGS as "flags", leaving some degree of freedom in interpretation. There are two previous attempts of using AT_FLAGS in the Linux Kernel for different reasons: the first was more generic and was used to expose the support for the GNU STACK NX feature [1] and the second was done for the MIPS architecture and was used to expose the support of "MIPS ABI Extension for IEEE Std 754 Non-Compliant Interlinking" [2]. Both the changes are currently _not_ merged in mainline. The only architecture that reserves some of the bits in AT_FLAGS is currently MIPS, which introduced the concept of platform specific ABI (psABI) reserving the top-byte [3].
When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising to the userspace that a relaxed ABI is supported hence this type of pointers are now allowed to be passed to the syscalls when they are in memory ranges obtained by anonymous mmap() or brk().
The userspace _must_ verify that the flag is set before passing tagged pointers to the syscalls allowed by this relaxation.
More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating to the software to check that the feature is present, before using the associated functionality, it provides a degree of control on the decision of disabling such a feature in future without consequently breaking the userspace.
[...]
Acked-by: Andrey Konovalov andreyknvl@google.com
Thanks for the ack. However, if we go ahead with this ABI proposal it means that your patches need to be reworked to allow a non-zero top byte in all syscalls, including mmap() and friends, ioctl(). There are ABI concerns in either case but I'd rather have this discussion in the open. It doesn't necessarily mean that I endorse this proposal, I would like feedback and not just from kernel developers but user space ones.
The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations.
(sorry for not being clear on what Vincenzo's proposal implies)
OK, I see. So I need to make the following changes to my patchset AFAIU.
1. Make sure that we only allow tagged user addresses that originate from an anonymous mmap() or below sbrk(0). How exactly should this check be performed?
2. Allow tagged addressed passed to memory syscalls (as long as (1) is satisfied). Do I understand correctly that this means that I need to locate all find_vma() callers outside of mm/ and fix them up as well?
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas catalin.marinas@arm.com wrote:
The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations.
(sorry for not being clear on what Vincenzo's proposal implies)
OK, I see. So I need to make the following changes to my patchset AFAIU.
- Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this check be performed?
I don't think we should perform such checks. That's rather stating that the kernel only guarantees that the tagged pointers work if they originated from these memory ranges.
- Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to locate all find_vma() callers outside of mm/ and fix them up as well?
Yes (unless anyone as a better idea or objections to this approach).
BTW, I'll be off until the new year, so won't be able to follow up.
On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas catalin.marinas@arm.com wrote:
The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations.
(sorry for not being clear on what Vincenzo's proposal implies)
OK, I see. So I need to make the following changes to my patchset AFAIU.
- Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this check be performed?
I don't think we should perform such checks. That's rather stating that the kernel only guarantees that the tagged pointers work if they originated from these memory ranges.
I concur.
Really, the kernel should do the expected thing with all "non-weird" memory.
In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded:
OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero
Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ...
In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined.
I would also like to see advice for userspace developers, particularly things like (strawman, please challenge!):
* Userspace should set tags at the point of allocation only.
* If you don't know how an object was allocated, you cannot modify the tag, period.
* A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime.
* Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied).
* When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace).
* Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel.
(This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.)
* The kernel shall not extend the lifetime of user pointers in ways that are not clear from the specification of the syscall or interface to which the pointer is passed (and in any case shall not extend pointer lifetimes without good reason).
So no clever transparent caching between syscalls, unless it _really_ is transparent in the presence of tags.
* For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
We would need to get some cross-arch buy-in for this, otherwise core maintainers might just refuse to maintain the necessary guarantees.
- Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to locate all find_vma() callers outside of mm/ and fix them up as well?
Yes (unless anyone as a better idea or objections to this approach).
Also, watch out for code that pokes about inside struct vma directly.
I'm wondering, could we define an explicit type, say,
struct user_vaddr { unsigned long addr; };
to replace the unsigned longs in struct vma the mm API? This would turn ad-hoc (unsigned long) casts into build breaks. We could have an explicit conversion functions, say,
struct user_vaddr __user_vaddr_unsafe(void __user *); void __user *__user_ptr_unsafe(struct user_vaddr);
that we robotically insert in all the relevant places to mark unaudited code.
This allows us to keep the kernel buildable, while flagging things that will need review. We would also need to warn the mm folks to reject any new code using these unsafe conversions.
Of course, it would be a non-trivial effort...
BTW, I'll be off until the new year, so won't be able to follow up.
Cheers ---Dave
On 19/12/2018 12:52, Dave Martin wrote:
On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinascatalin.marinas@arm.com wrote:
The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations.
(sorry for not being clear on what Vincenzo's proposal implies)
OK, I see. So I need to make the following changes to my patchset AFAIU.
- Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this check be performed?
I don't think we should perform such checks. That's rather stating that the kernel only guarantees that the tagged pointers work if they originated from these memory ranges.
I concur.
Really, the kernel should do the expected thing with all "non-weird" memory.
In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded:
OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero
Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ...
In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined.
Agreed on the general idea: a process should be able to pass tagged pointers at the syscall interface, as long as they point to memory privately owned by the process. I think it would be possible to simplify the definition of "non-weird" memory by using only this "OK" list: - mmap() done by the process itself, where either: Â * flags = MAP_PRIVATE | MAP_ANONYMOUS Â * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of device files (like /dev/zero) - brk() done by the process itself - Any memory mapped by the kernel in the new process's address space during execve(), with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
I would also like to see advice for userspace developers, particularly things like (strawman, please challenge!):
To some extent, one could call me a userspace developer, so I'll try to help :)
- Userspace should set tags at the point of allocation only.
Yes, tags are only supposed to be set at the point of either allocation or deallocation/reallocation. However, allocators can in principle be nested, so an allocator could take a region allocated by malloc() as input and subdivide it (changing tags in the process). That said, this suballocator must not free() that region until all the suballocations themselves have been freed (thereby restoring the tags initially set by malloc()).
- If you don't know how an object was allocated, you cannot modify the tag, period.
Agreed, allocators that tag memory can only operate on memory with a well-defined provenance (for instance anonymous mmap() or malloc()).
- A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime.
Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory.
- Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied).
Tags can only be changed at the point of deallocation/reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage.
When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace).
Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel.
(This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.)
I think both points are reasonable. It is very valuable for the kernel to access userspace memory using the user-provided tag, because it enables kernel accesses to be checked in the same way as user accesses, allowing to detect bugs that are potentially hard to find. For instance, if a pointer to an object is passed to the kernel after it has been deallocated, this is invalid and should be detected. However, you are absolutely right that the kernel cannot *guarantee* that such a check is carried out for the entire memory range (or in fact at all); it should be a best-effort approach.
The kernel shall not extend the lifetime of user pointers in ways that are not clear from the specification of the syscall or interface to which the pointer is passed (and in any case shall not extend pointer lifetimes without good reason).
So no clever transparent caching between syscalls, unless it _really_ is transparent in the presence of tags.
Do you have any particular case in mind? If such caching is really valuable, it is always possible to access the object while ignoring the tag. For sure, the user-provided tag can only be used during the syscall handling itself, not asynchronously later on, unless otherwise specified.
For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
This is a disputed point :) In my opinion, this is the the most reasonable approach.
Cheers, Kevin
We would need to get some cross-arch buy-in for this, otherwise core maintainers might just refuse to maintain the necessary guarantees.
- Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to locate all find_vma() callers outside of mm/ and fix them up as well?
Yes (unless anyone as a better idea or objections to this approach).
Also, watch out for code that pokes about inside struct vma directly.
I'm wondering, could we define an explicit type, say,
struct user_vaddr { unsigned long addr; };
to replace the unsigned longs in struct vma the mm API? This would turn ad-hoc (unsigned long) casts into build breaks. We could have an explicit conversion functions, say,
struct user_vaddr __user_vaddr_unsafe(void __user *); void __user *__user_ptr_unsafe(struct user_vaddr);
that we robotically insert in all the relevant places to mark unaudited code.
This allows us to keep the kernel buildable, while flagging things that will need review. We would also need to warn the mm folks to reject any new code using these unsafe conversions.
Of course, it would be a non-trivial effort...
BTW, I'll be off until the new year, so won't be able to follow up.
Cheers ---Dave
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky kevin.brodsky@arm.com wrote:
On 19/12/2018 12:52, Dave Martin wrote:
On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinascatalin.marinas@arm.com wrote:
The summary of our internal discussions (mostly between kernel developers) is that we can't properly describe a user ABI that covers future syscalls or syscall extensions while not all syscalls accept tagged pointers. So we tweaked the requirements slightly to only allow tagged pointers back into the kernel *if* the originating address is from an anonymous mmap() or below sbrk(0). This should cover some of the ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a pointer to a buffer obtained via mmap() on the device operations.
(sorry for not being clear on what Vincenzo's proposal implies)
OK, I see. So I need to make the following changes to my patchset AFAIU.
- Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this check be performed?
I don't think we should perform such checks. That's rather stating that the kernel only guarantees that the tagged pointers work if they originated from these memory ranges.
I concur.
Really, the kernel should do the expected thing with all "non-weird" memory.
In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded:
OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero
Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ...
In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined.
Agreed on the general idea: a process should be able to pass tagged pointers at the syscall interface, as long as they point to memory privately owned by the process. I think it would be possible to simplify the definition of "non-weird" memory by using only this "OK" list:
- mmap() done by the process itself, where either:
- flags = MAP_PRIVATE | MAP_ANONYMOUS
- flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
device files (like /dev/zero)
- brk() done by the process itself
- Any memory mapped by the kernel in the new process's address space during execve(),
with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
I would also like to see advice for userspace developers, particularly things like (strawman, please challenge!):
To some extent, one could call me a userspace developer, so I'll try to help :)
- Userspace should set tags at the point of allocation only.
Yes, tags are only supposed to be set at the point of either allocation or deallocation/reallocation. However, allocators can in principle be nested, so an allocator could take a region allocated by malloc() as input and subdivide it (changing tags in the process). That said, this suballocator must not free() that region until all the suballocations themselves have been freed (thereby restoring the tags initially set by malloc()).
- If you don't know how an object was allocated, you cannot modify the tag, period.
Agreed, allocators that tag memory can only operate on memory with a well-defined provenance (for instance anonymous mmap() or malloc()).
- A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime.
Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory.
- Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied).
Tags can only be changed at the point of deallocation/reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage.
When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace).
Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel.
(This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.)
I think both points are reasonable. It is very valuable for the kernel to access userspace memory using the user-provided tag, because it enables kernel accesses to be checked in the same way as user accesses, allowing to detect bugs that are potentially hard to find. For instance, if a pointer to an object is passed to the kernel after it has been deallocated, this is invalid and should be detected. However, you are absolutely right that the kernel cannot *guarantee* that such a check is carried out for the entire memory range (or in fact at all); it should be a best-effort approach.
It would also be valuable to narrow down the set of "relaxed" (i.e. not tag-checking) syscalls as reasonably possible. We would want to provide tag-checking userspace wrappers for any important calls that are not checked in the kernel. Is it correct to assume that anything that goes through copy_from_user / copy_to_user is checked?
The kernel shall not extend the lifetime of user pointers in ways that are not clear from the specification of the syscall or interface to which the pointer is passed (and in any case shall not extend pointer lifetimes without good reason).
So no clever transparent caching between syscalls, unless it _really_ is transparent in the presence of tags.
Do you have any particular case in mind? If such caching is really valuable, it is always possible to access the object while ignoring the tag. For sure, the user-provided tag can only be used during the syscall handling itself, not asynchronously later on, unless otherwise specified.
For aio* operations it would be nice if the tag was checked at the time of the actual userspace read/write, either instead of or in addition to at the time of the system call.
For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
This is a disputed point :) In my opinion, this is the the most reasonable approach.
Yes, it would be nice if the kernel explicitly promised, ex. mprotect() over a range of differently tagged pages to be allowed (i.e. address tag should be unchecked).
Cheers, Kevin
We would need to get some cross-arch buy-in for this, otherwise core maintainers might just refuse to maintain the necessary guarantees.
- Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to locate all find_vma() callers outside of mm/ and fix them up as well?
Yes (unless anyone as a better idea or objections to this approach).
Also, watch out for code that pokes about inside struct vma directly.
I'm wondering, could we define an explicit type, say,
struct user_vaddr { unsigned long addr; };
to replace the unsigned longs in struct vma the mm API? This would turn ad-hoc (unsigned long) casts into build breaks. We could have an explicit conversion functions, say,
struct user_vaddr __user_vaddr_unsafe(void __user *); void __user *__user_ptr_unsafe(struct user_vaddr);
that we robotically insert in all the relevant places to mark unaudited code.
This allows us to keep the kernel buildable, while flagging things that will need review. We would also need to warn the mm folks to reject any new code using these unsafe conversions.
Of course, it would be a non-trivial effort...
BTW, I'll be off until the new year, so won't be able to follow up.
Cheers ---Dave
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky kevin.brodsky@arm.com wrote:
On 19/12/2018 12:52, Dave Martin wrote:
Really, the kernel should do the expected thing with all "non-weird" memory.
In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded:
OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero
Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ...
In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined.
Agreed on the general idea: a process should be able to pass tagged pointers at the syscall interface, as long as they point to memory privately owned by the process. I think it would be possible to simplify the definition of "non-weird" memory by using only this "OK" list:
- mmap() done by the process itself, where either:
- flags = MAP_PRIVATE | MAP_ANONYMOUS
- flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
device files (like /dev/zero)
- brk() done by the process itself
- Any memory mapped by the kernel in the new process's address space during execve(),
with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
Sounds reasonable.
- Userspace should set tags at the point of allocation only.
Yes, tags are only supposed to be set at the point of either allocation or deallocation/reallocation. However, allocators can in principle be nested, so an allocator could take a region allocated by malloc() as input and subdivide it (changing tags in the process). That said, this suballocator must not free() that region until all the suballocations themselves have been freed (thereby restoring the tags initially set by malloc()).
- If you don't know how an object was allocated, you cannot modify the tag, period.
Agreed, allocators that tag memory can only operate on memory with a well-defined provenance (for instance anonymous mmap() or malloc()).
- A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime.
Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory.
- Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied).
Tags can only be changed at the point of deallocation/reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage.
All the above sound well but that's mostly a guideline on what a C library can do. It doesn't help much with defining the kernel ABI. Anyway, it's good to clarify the use-cases.
When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace).
Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel.
(This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.)
I think both points are reasonable. It is very valuable for the kernel to access userspace memory using the user-provided tag, because it enables kernel accesses to be checked in the same way as user accesses, allowing to detect bugs that are potentially hard to find. For instance, if a pointer to an object is passed to the kernel after it has been deallocated, this is invalid and should be detected. However, you are absolutely right that the kernel cannot *guarantee* that such a check is carried out for the entire memory range (or in fact at all); it should be a best-effort approach.
It would also be valuable to narrow down the set of "relaxed" (i.e. not tag-checking) syscalls as reasonably possible. We would want to provide tag-checking userspace wrappers for any important calls that are not checked in the kernel. Is it correct to assume that anything that goes through copy_from_user / copy_to_user is checked?
I lost track of the context of this thread but if it's just about relaxing the ABI for hwasan, the kernel has no idea of the compiler generated structures in user space, so nothing is checked.
If we talk about tags in the context of MTE, than yes, with the current proposal the tag would be checked by copy_*_user() functions.
The kernel shall not extend the lifetime of user pointers in ways that are not clear from the specification of the syscall or interface to which the pointer is passed (and in any case shall not extend pointer lifetimes without good reason).
So no clever transparent caching between syscalls, unless it _really_ is transparent in the presence of tags.
Do you have any particular case in mind? If such caching is really valuable, it is always possible to access the object while ignoring the tag. For sure, the user-provided tag can only be used during the syscall handling itself, not asynchronously later on, unless otherwise specified.
For aio* operations it would be nice if the tag was checked at the time of the actual userspace read/write, either instead of or in addition to at the time of the system call.
With aio* (and synchronous iovec-based syscalls), the kernel may access the memory while the corresponding user process is scheduled out. Given that such access is not done in the context of the user process (and using the user VA like copy_*_user), the kernel cannot handle potential tag faults. Moreover, the transfer may be done by DMA and the device does not understand tags.
I'd like to keep tags as a property of the pointer in a specific virtual address space. The moment you convert it to a different address space (e.g. kernel linear map, physical address), the tag property is stripped and I don't think we should re-build it (and have it checked).
For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
This is a disputed point :) In my opinion, this is the the most reasonable approach.
Yes, it would be nice if the kernel explicitly promised, ex. mprotect() over a range of differently tagged pages to be allowed (i.e. address tag should be unchecked).
I don't think mprotect() over differently tagged pages was ever a problem. I originally asked that mprotect() and friends do not accept tagged pointers since these functions deal with memory ranges rather than dereferencing such pointer (the reason being minimal kernel changes). However, given how complicated it is to specify an ABI, I came to the conclusion that a pointer passed to such function should be allowed to have non-zero top byte. It would be the kernel's responsibility to strip it out as appropriate.
On Tue, Feb 12, 2019 at 06:02:24PM +0000, Catalin Marinas wrote:
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky kevin.brodsky@arm.com wrote:
On 19/12/2018 12:52, Dave Martin wrote:
[...]
- A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime.
Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory.
- Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied).
Tags can only be changed at the point of deallocation/ reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage.
All the above sound well but that's mostly a guideline on what a C library can do. It doesn't help much with defining the kernel ABI. Anyway, it's good to clarify the use-cases.
My aim was to clarify the use case in userspace, since I wasn't directly involved in that. The kernel ABI needs to be compatible with the the use case, but doesn't need to specify must of it.
I'm wondering whether we can piggy-back on existing concepts.
We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined.
Hopefully there are friendly fuzzers testing this kind of thing.
[...]
It would also be valuable to narrow down the set of "relaxed" (i.e. not tag-checking) syscalls as reasonably possible. We would want to provide tag-checking userspace wrappers for any important calls that are not checked in the kernel. Is it correct to assume that anything that goes through copy_from_user / copy_to_user is checked?
I lost track of the context of this thread but if it's just about relaxing the ABI for hwasan, the kernel has no idea of the compiler generated structures in user space, so nothing is checked.
If we talk about tags in the context of MTE, than yes, with the current proposal the tag would be checked by copy_*_user() functions.
Also put_user() and friends?
It might be reasonable to do the check in access_ok() and skip it in __put_user() etc.
(I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.)
For aio* operations it would be nice if the tag was checked at the time of the actual userspace read/write, either instead of or in addition to at the time of the system call.
With aio* (and synchronous iovec-based syscalls), the kernel may access the memory while the corresponding user process is scheduled out. Given that such access is not done in the context of the user process (and using the user VA like copy_*_user), the kernel cannot handle potential tag faults. Moreover, the transfer may be done by DMA and the device does not understand tags.
I'd like to keep tags as a property of the pointer in a specific virtual address space. The moment you convert it to a different address space (e.g. kernel linear map, physical address), the tag property is stripped and I don't think we should re-build it (and have it checked).
This is probably reasonable.
Ideally we would check the tag at the point of stripping it off, but most likely it's going to be rather best-effort.
If memory tagging is essential a debugging feature then this seems an acceptable compromise.
For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
This is a disputed point :) In my opinion, this is the the most reasonable approach.
Yes, it would be nice if the kernel explicitly promised, ex. mprotect() over a range of differently tagged pages to be allowed (i.e. address tag should be unchecked).
I don't think mprotect() over differently tagged pages was ever a problem. I originally asked that mprotect() and friends do not accept tagged pointers since these functions deal with memory ranges rather than dereferencing such pointer (the reason being minimal kernel changes). However, given how complicated it is to specify an ABI, I came to the conclusion that a pointer passed to such function should be allowed to have non-zero top byte. It would be the kernel's responsibility to strip it out as appropriate.
I think that if the page range is all the same colour then it should be legitimate to pass a matching tag.
But it doesn't seem reasonable for the kernel to require this. If free() calls munmap(), the page(s) will contain possibly randomly- coloured garbage. There's no correct tag to pass in such a case.
The most obvious solution is just to ignore the tags passed by userspace to such syscalls. This would imply that the kernel must explicitly strip it out, as you suggest.
The number of affected syscalls is relatively small though.
Cheers ---Dave
(+Cc other people with MTE experience: Branislav, Ruben)
On 13/02/2019 14:58, Dave Martin wrote:
On Tue, Feb 12, 2019 at 06:02:24PM +0000, Catalin Marinas wrote:
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky kevin.brodsky@arm.com wrote:
On 19/12/2018 12:52, Dave Martin wrote:
[...]
- A single C object should be accessed using a single, fixed pointer tag throughout its entire lifetime.
Agreed. Allocators themselves may need to be excluded though, depending on how they represent their managed memory.
- Tags can be changed only when there are no outstanding pointers to the affected object or region that may be used to access the object or region (i.e., if the object were allocated from the C heap and is it safe to realloc() it, then it is safe to change the tag; for other types of allocation, analogous arguments can be applied).
Tags can only be changed at the point of deallocation/ reallocation. Pointers to the object become invalid and cannot be used after it has been deallocated; memory tagging allows to catch such invalid usage.
All the above sound well but that's mostly a guideline on what a C library can do. It doesn't help much with defining the kernel ABI. Anyway, it's good to clarify the use-cases.
My aim was to clarify the use case in userspace, since I wasn't directly involved in that. The kernel ABI needs to be compatible with the the use case, but doesn't need to specify must of it.
I'm wondering whether we can piggy-back on existing concepts.
We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined.
Is that a sufficient requirement? I don't think that anything prevents you from using mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean here.
Hopefully there are friendly fuzzers testing this kind of thing.
[...]
It would also be valuable to narrow down the set of "relaxed" (i.e. not tag-checking) syscalls as reasonably possible. We would want to provide tag-checking userspace wrappers for any important calls that are not checked in the kernel. Is it correct to assume that anything that goes through copy_from_user / copy_to_user is checked?
I lost track of the context of this thread but if it's just about relaxing the ABI for hwasan, the kernel has no idea of the compiler generated structures in user space, so nothing is checked.
If we talk about tags in the context of MTE, than yes, with the current proposal the tag would be checked by copy_*_user() functions.
Also put_user() and friends?
It might be reasonable to do the check in access_ok() and skip it in __put_user() etc.
(I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.)
Keep in mind that with MTE, there is no need to do any explicit check when accessing user memory via a user-provided pointer. The tagged user pointer is directly passed to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is handled just like a page fault (i.e. invoking the fixup handler). As far as I can tell, there's no need to do anything special in access_ok() in that case.
[The above applies to precise mode. In imprecise mode, some more work will be needed after the load/store to check whether a tag fault happened.]
For aio* operations it would be nice if the tag was checked at the time of the actual userspace read/write, either instead of or in addition to at the time of the system call.
With aio* (and synchronous iovec-based syscalls), the kernel may access the memory while the corresponding user process is scheduled out. Given that such access is not done in the context of the user process (and using the user VA like copy_*_user), the kernel cannot handle potential tag faults. Moreover, the transfer may be done by DMA and the device does not understand tags.
I'd like to keep tags as a property of the pointer in a specific virtual address space. The moment you convert it to a different address space (e.g. kernel linear map, physical address), the tag property is stripped and I don't think we should re-build it (and have it checked).
This is probably reasonable.
Ideally we would check the tag at the point of stripping it off, but most likely it's going to be rather best-effort.
If memory tagging is essential a debugging feature then this seems an acceptable compromise.
There are many possible ways to deploy MTE, and debugging is just one of them. For instance, you may want to turn on heap colouring for some processes in the system, including in production.
Regarding those cases where it is impossible to check tags at the point of accessing user memory, it is indeed possible to check the memory tags at the point of stripping the tag from the user pointer. Given that some MTE use-cases favour performance over tag check coverage, the ideal approach would be to make these checks configurable (e.g. check one granule, check all of them, or check none). I don't know how feasible this is in practice.
Kevin
For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
This is a disputed point :) In my opinion, this is the the most reasonable approach.
Yes, it would be nice if the kernel explicitly promised, ex. mprotect() over a range of differently tagged pages to be allowed (i.e. address tag should be unchecked).
I don't think mprotect() over differently tagged pages was ever a problem. I originally asked that mprotect() and friends do not accept tagged pointers since these functions deal with memory ranges rather than dereferencing such pointer (the reason being minimal kernel changes). However, given how complicated it is to specify an ABI, I came to the conclusion that a pointer passed to such function should be allowed to have non-zero top byte. It would be the kernel's responsibility to strip it out as appropriate.
I think that if the page range is all the same colour then it should be legitimate to pass a matching tag.
But it doesn't seem reasonable for the kernel to require this. If free() calls munmap(), the page(s) will contain possibly randomly- coloured garbage. There's no correct tag to pass in such a case.
The most obvious solution is just to ignore the tags passed by userspace to such syscalls. This would imply that the kernel must explicitly strip it out, as you suggest.
The number of affected syscalls is relatively small though.
Cheers ---Dave
On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
(+Cc other people with MTE experience: Branislav, Ruben)
[...]
I'm wondering whether we can piggy-back on existing concepts.
We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined.
Is that a sufficient requirement? I don't think that anything prevents you from using mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean here.
I think the origin rules have to apply too: [vvar] is not a regular, private page but a weird, shared thing mapped for you by the kernel.
Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
I'm also assuming that userspace cannot recolour memory in read-only pages. That sounds bad if there's no way to prevent it.
[...]
It might be reasonable to do the check in access_ok() and skip it in __put_user() etc.
(I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.)
Keep in mind that with MTE, there is no need to do any explicit check when accessing user memory via a user-provided pointer. The tagged user pointer is directly passed to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is handled just like a page fault (i.e. invoking the fixup handler). As far as I can tell, there's no need to do anything special in access_ok() in that case.
[The above applies to precise mode. In imprecise mode, some more work will be needed after the load/store to check whether a tag fault happened.]
Fair enough, I'm a bit hazy on the details as of right now..
[...]
There are many possible ways to deploy MTE, and debugging is just one of them. For instance, you may want to turn on heap colouring for some processes in the system, including in production.
To implement enforceable protection, or as a diagnostic tool for when something goes wrong?
In the latter case it's still OK for the kernel's tag checking not to be exhaustive.
Regarding those cases where it is impossible to check tags at the point of accessing user memory, it is indeed possible to check the memory tags at the point of stripping the tag from the user pointer. Given that some MTE use-cases favour performance over tag check coverage, the ideal approach would be to make these checks configurable (e.g. check one granule, check all of them, or check none). I don't know how feasible this is in practice.
Check all granules of a massive DMA buffer?
That doesn't sounds feasible without explicit support in the hardware to have the DMA check tags itself as the memory is accessed. MTE by itself doesn't provide for this IIUC (at least, it would require support in the platform, not just the CPU).
We do not want to bake any assumptions into the ABI about whether a given data transfer may or may not be offloaded to DMA. That feels like a slippery slope.
Providing we get the checks for free in put_user/get_user/ copy_{to,from}_user(), those will cover a lot of cases though, for non-bulk-IO cases.
My assumption has been that at this point in time we are mainly aiming to support the debug/diagnostic use cases today.
At least, those are the low(ish)-hanging fruit.
Others are better placed than me to comment on the goals here.
Cheers ---Dave
On Wed, Feb 13, 2019 at 9:43 AM Dave Martin Dave.Martin@arm.com wrote:
On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
(+Cc other people with MTE experience: Branislav, Ruben)
[...]
I'm wondering whether we can piggy-back on existing concepts.
We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined.
Is that a sufficient requirement? I don't think that anything prevents you from using mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean here.
I think the origin rules have to apply too: [vvar] is not a regular, private page but a weird, shared thing mapped for you by the kernel.
Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
I'm also assuming that userspace cannot recolour memory in read-only pages. That sounds bad if there's no way to prevent it.
That sounds like something we would like to do to catch out of bounds read of .rodata globals. Another potentially interesting use case for MTE is infinite hardware watchpoints - that would require trapping reads for individual tagging granules, include those in read-only binary segment.
[...]
It might be reasonable to do the check in access_ok() and skip it in __put_user() etc.
(I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.)
Keep in mind that with MTE, there is no need to do any explicit check when accessing user memory via a user-provided pointer. The tagged user pointer is directly passed to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is handled just like a page fault (i.e. invoking the fixup handler). As far as I can tell, there's no need to do anything special in access_ok() in that case.
[The above applies to precise mode. In imprecise mode, some more work will be needed after the load/store to check whether a tag fault happened.]
Fair enough, I'm a bit hazy on the details as of right now..
[...]
There are many possible ways to deploy MTE, and debugging is just one of them. For instance, you may want to turn on heap colouring for some processes in the system, including in production.
To implement enforceable protection, or as a diagnostic tool for when something goes wrong?
In the latter case it's still OK for the kernel's tag checking not to be exhaustive.
Regarding those cases where it is impossible to check tags at the point of accessing user memory, it is indeed possible to check the memory tags at the point of stripping the tag from the user pointer. Given that some MTE use-cases favour performance over tag check coverage, the ideal approach would be to make these checks configurable (e.g. check one granule, check all of them, or check none). I don't know how feasible this is in practice.
Check all granules of a massive DMA buffer?
That doesn't sounds feasible without explicit support in the hardware to have the DMA check tags itself as the memory is accessed. MTE by itself doesn't provide for this IIUC (at least, it would require support in the platform, not just the CPU).
We do not want to bake any assumptions into the ABI about whether a given data transfer may or may not be offloaded to DMA. That feels like a slippery slope.
Providing we get the checks for free in put_user/get_user/ copy_{to,from}_user(), those will cover a lot of cases though, for non-bulk-IO cases.
My assumption has been that at this point in time we are mainly aiming to support the debug/diagnostic use cases today.
At least, those are the low(ish)-hanging fruit.
Others are better placed than me to comment on the goals here.
Cheers ---Dave
On 13/02/2019 21:41, Evgenii Stepanov wrote:
On Wed, Feb 13, 2019 at 9:43 AM Dave Martin Dave.Martin@arm.com wrote:
On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
(+Cc other people with MTE experience: Branislav, Ruben)
[...]
I'm wondering whether we can piggy-back on existing concepts.
We could say that recolouring memory is safe when and only when unmapping of the page or removing permissions on the page (via munmap/mremap/mprotect) would be safe. Otherwise, the resulting behaviour of the process is undefined.
Is that a sufficient requirement? I don't think that anything prevents you from using mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean here.
I think the origin rules have to apply too: [vvar] is not a regular, private page but a weird, shared thing mapped for you by the kernel.
Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
I'm also assuming that userspace cannot recolour memory in read-only pages. That sounds bad if there's no way to prevent it.
That sounds like something we would like to do to catch out of bounds read of .rodata globals. Another potentially interesting use case for MTE is infinite hardware watchpoints - that would require trapping reads for individual tagging granules, include those in read-only binary segment.
I think we should keep this discussion for a later, separate thread. Vincenzo's proposal is about allowing userspace to pass tags at the syscall interface. The set of mappings allowed to be tagged by userspace (in MTE) should be contained in the set of mappings that userspace can pass tagged pointers to (at the syscall interface), but they are not necessarily the same. Private read-only mappings are an edge case (you can pass tagged pointers to them, the memory may or may not be mapped as tagged, but in any case it is not possible to change the memory tags via such mapping).
[...]
It might be reasonable to do the check in access_ok() and skip it in __put_user() etc.
(I seem to remember some separate discussion about abolishing __put_user() and friends though, due to the accident risk they pose.)
Keep in mind that with MTE, there is no need to do any explicit check when accessing user memory via a user-provided pointer. The tagged user pointer is directly passed to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is handled just like a page fault (i.e. invoking the fixup handler). As far as I can tell, there's no need to do anything special in access_ok() in that case.
[The above applies to precise mode. In imprecise mode, some more work will be needed after the load/store to check whether a tag fault happened.]
Fair enough, I'm a bit hazy on the details as of right now..
[...]
There are many possible ways to deploy MTE, and debugging is just one of them. For instance, you may want to turn on heap colouring for some processes in the system, including in production.
To implement enforceable protection, or as a diagnostic tool for when something goes wrong?
In the latter case it's still OK for the kernel's tag checking not to be exhaustive.
Regarding those cases where it is impossible to check tags at the point of accessing user memory, it is indeed possible to check the memory tags at the point of stripping the tag from the user pointer. Given that some MTE use-cases favour performance over tag check coverage, the ideal approach would be to make these checks configurable (e.g. check one granule, check all of them, or check none). I don't know how feasible this is in practice.
Check all granules of a massive DMA buffer?
That doesn't sounds feasible without explicit support in the hardware to have the DMA check tags itself as the memory is accessed. MTE by itself doesn't provide for this IIUC (at least, it would require support in the platform, not just the CPU).
We do not want to bake any assumptions into the ABI about whether a given data transfer may or may not be offloaded to DMA. That feels like a slippery slope.
Providing we get the checks for free in put_user/get_user/ copy_{to,from}_user(), those will cover a lot of cases though, for non-bulk-IO cases.
My assumption has been that at this point in time we are mainly aiming to support the debug/diagnostic use cases today.
MTE can be used both for diagnostics (imprecise mode is especially suitable for that), and to halt execution when something wrong is detected. Even in the latter case, one cannot expect exhaustive checking from MTE, because the way it works is fundamentally statistical; an invalid pointer may by chance have the right tag to access the given location. So again, I think that a best-effort approach is appropriate when the kernel accesses user memory, in terms of checking that tags match.
More specifically, different use-cases come with different tradeoffs (performance / tag check coverage). That's why I am suggesting that in the cases where tag checks would need to be done _explicitly_ (before losing the user-provided tag), it would be nice to be able to choose how much should be checked. I am not suggesting that always checking all the granules by default is sane. Maybe checking just the first granule is the right default.
I don't think we need to get to the bottom of this specific aspect at this point. This ABI proposal is not about memory tagging, so there is no need to specify how or when tag checking is done. As long as this ABI allows tagged pointers, pointing to mappings that could be potentially tagged, to be passed to syscalls, I don't think further relaxations are needed to enable memory tagging.
Kevin
At least, those are the low(ish)-hanging fruit.
Others are better placed than me to comment on the goals here.
Cheers ---Dave
On 12/02/2019 18:02, Catalin Marinas wrote:
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky kevin.brodsky@arm.com wrote:
On 19/12/2018 12:52, Dave Martin wrote:
Really, the kernel should do the expected thing with all "non-weird" memory.
In lieu of a proper definition of "non-weird", I think we should have some lists of things that are explicitly included, and also excluded:
OK: kernel-allocated process stack brk area MAP_ANONYMOUS | MAP_PRIVATE MAP_PRIVATE mappings of /dev/zero
Not OK: MAP_SHARED mmaps of non-memory-like devices mmaps of anything that is not a regular file the VDSO ...
In general, userspace can tag memory that it "owns", and we do not assume a transfer of ownership except in the "OK" list above. Otherwise, it's the kernel's memory, or the owner is simply not well defined.
Agreed on the general idea: a process should be able to pass tagged pointers at the syscall interface, as long as they point to memory privately owned by the process. I think it would be possible to simplify the definition of "non-weird" memory by using only this "OK" list:
- mmap() done by the process itself, where either:
- flags = MAP_PRIVATE | MAP_ANONYMOUS
- flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
device files (like /dev/zero)
- brk() done by the process itself
- Any memory mapped by the kernel in the new process's address space during execve(),
with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
Sounds reasonable.
OK. this non-weird memory definition works for me too.
rule 1: if weird memory pointers are passed to the kernel with top byte set then the behaviour is undefined.
When the kernel dereferences a pointer on userspace's behalf, it shall behave equivalently to userspace dereferencing the same pointer, including use of the same tag (where passed by userspace).
Where the pointer tag affects pointer dereference behaviour (i.e., with hardware memory colouring) the kernel makes no guarantee to honour pointer tags correctly for every location a buffer based on a pointer passed by userspace to the kernel.
(This means for example that for a read(fd, buf, size), we can check the tag for a single arbitrary location in *(char (*)[size])buf before passing the buffer to get_user_pages(). Hopefully this could be done in get_user_pages() itself rather than hunting call sites. For userspace, it means that you're on your own if you ask the kernel to operate on a buffer than spans multiple, independently- allocated objects, or a deliberately striped single object.)
I think both points are reasonable. It is very valuable for the kernel to access userspace memory using the user-provided tag, because it enables kernel accesses to be checked in the same way as user accesses, allowing to detect bugs that are potentially hard to find. For instance, if a pointer to an object is passed to the kernel after it has been deallocated, this is invalid and should be detected. However, you are absolutely right that the kernel cannot *guarantee* that such a check is carried out for the entire memory range (or in fact at all); it should be a best-effort approach.
It would also be valuable to narrow down the set of "relaxed" (i.e. not tag-checking) syscalls as reasonably possible. We would want to provide tag-checking userspace wrappers for any important calls that are not checked in the kernel. Is it correct to assume that anything that goes through copy_from_user / copy_to_user is checked?
I lost track of the context of this thread but if it's just about relaxing the ABI for hwasan, the kernel has no idea of the compiler generated structures in user space, so nothing is checked.
If we talk about tags in the context of MTE, than yes, with the current proposal the tag would be checked by copy_*_user() functions.
rule 2: kernel derefs as if user derefs when non-weird memory pointers are passed to the kernel.
note that the important bit is what happens on valid pointer derefs, invalid pointer deref is usually undefined for user programs, so what happens in case of mte tag failures is more of a QoI issue than abi i think.
(e.g. EFAULT is not guaranteed by the kernel currently, i can successfully do write(open("/dev/null",O_WRONLY), 0, 1), or get a crash when passing invalid pointer to a vdso function, so userspace should not rely on some strict EFAULT behaviour).
The kernel shall not extend the lifetime of user pointers in ways that are not clear from the specification of the syscall or interface to which the pointer is passed (and in any case shall not extend pointer lifetimes without good reason).
So no clever transparent caching between syscalls, unless it _really_ is transparent in the presence of tags.
Do you have any particular case in mind? If such caching is really valuable, it is always possible to access the object while ignoring the tag. For sure, the user-provided tag can only be used during the syscall handling itself, not asynchronously later on, unless otherwise specified.
For aio* operations it would be nice if the tag was checked at the time of the actual userspace read/write, either instead of or in addition to at the time of the system call.
With aio* (and synchronous iovec-based syscalls), the kernel may access the memory while the corresponding user process is scheduled out. Given that such access is not done in the context of the user process (and using the user VA like copy_*_user), the kernel cannot handle potential tag faults. Moreover, the transfer may be done by DMA and the device does not understand tags.
I'd like to keep tags as a property of the pointer in a specific virtual address space. The moment you convert it to a different address space (e.g. kernel linear map, physical address), the tag property is stripped and I don't think we should re-build it (and have it checked).
OK.
i don't think the new abi needs special rules about pointer lifetime.
For purposes other than dereference, the kernel shall accept any legitimately tagged pointer (according to the above rules) as identifying the associated memory location.
So, mprotect(some_page_aligned_object, ...); is valid irrespective of where page_aligned_object() came from. There is no implicit derefence by the kernel here, hence no tag check.
The kernel does not guarantee to work correctly if the wrong tag is used, but there is not always a well-defined "right" tag, so we can't really guarantee to check it. So a pointer derived by any reasonable means by userspace has to be treated as equally valid.
This is a disputed point :) In my opinion, this is the the most reasonable approach.
Yes, it would be nice if the kernel explicitly promised, ex. mprotect() over a range of differently tagged pages to be allowed (i.e. address tag should be unchecked).
I don't think mprotect() over differently tagged pages was ever a problem. I originally asked that mprotect() and friends do not accept tagged pointers since these functions deal with memory ranges rather than dereferencing such pointer (the reason being minimal kernel changes). However, given how complicated it is to specify an ABI, I came to the conclusion that a pointer passed to such function should be allowed to have non-zero top byte. It would be the kernel's responsibility to strip it out as appropriate.
OK.
rule 3: kernel accepts legitimately tagged non-weird memory pointers and untags them before usage other than deref.
this is relevant if a syscall uses pointers for address range specification, instead of deref. (mprotect, madvise,...)
i also propose:
rule 4: kernel keeps legitimate tags on non-weird memory pointers that it returns to the user.
e.g. clone passes stack/arg/tls pointers on without dropping tags, same for set/get_robust_list. i'm not sure if there are pointer values observable in /proc etc but those should keep tags too.
"legitimately tagged" may not always be obvious, but the illegitimately tagged case can be left unspecified i think, so dropping tags is ok, but not required if tbi is off and mte is not used (i.e. tag is illegitimate).
i think these rules work for the cases i care about, a more tricky question is when/how to check for the new syscall abi and when/how the TCR_EL1.TBI0 setting may be turned off. consider the following cases (tb == top byte):
binary 1: user tb = any, syscall tb = 0 tbi is on, "legacy binary"
binary 2: user tb = any, syscall tb = any tbi is on, "new binary using tb" for backward compat it needs to check for new syscall abi.
binary 3: user tb = 0, syscall tb = 0 tbi can be off, "new binary", binary is marked to indicate unused tb, kernel may turn tbi off: additional pac bits.
binary 4: user tb = mte, syscall tb = mte like binary 3, but with mte, "new binary using mte" does it have to check for new syscall abi? or MTE HWCAP would imply it? (is it possible to use mte without new syscall abi?)
in userspace we want most binaries to be like binary 3 and 4 eventually, i.e. marked as not-relying-on-tbi, if a dso is loaded that is unmarked (legacy or new tb user), then either the load fails (e.g. if mte is already used? or can we turn mte off at runtime?) or tbi has to be enabled (prctl? does this work with pac? or multi-threads?).
as for checking the new syscall abi: i don't see much semantic difference between AT_HWCAP and AT_FLAGS (either way, the user has to check a feature flag before using the feature of the underlying system and it does not matter much if it's a syscall abi feature or cpu feature), but i don't see anything wrong with AT_FLAGS if the kernel prefers that.
the discussion here was mostly about binary 2, but for me the open question is if we can make binary 3/4 work. (which requires some elf binary marking, that is recognised by the kernel and dynamic loader, and efficient handling of the TBI0 bit, ..if it's not possible, then i don't see how mte will be deployed).
and i guess on the kernel side the open question is if the rules 1/2/3/4 can be made to work in corner cases e.g. when pointers embedded into structs are passed down in ioctl.
Hi Szabolcs,
Thanks for looking into this. Comments below.
On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
i think these rules work for the cases i care about, a more tricky question is when/how to check for the new syscall abi and when/how the TCR_EL1.TBI0 setting may be turned off.
I don't think turning TBI0 off is critical (it's handy for PAC with 52-bit VA but then it's short-lived if you want more security features like MTE).
consider the following cases (tb == top byte):
binary 1: user tb = any, syscall tb = 0 tbi is on, "legacy binary"
binary 2: user tb = any, syscall tb = any tbi is on, "new binary using tb" for backward compat it needs to check for new syscall abi.
binary 3: user tb = 0, syscall tb = 0 tbi can be off, "new binary", binary is marked to indicate unused tb, kernel may turn tbi off: additional pac bits.
binary 4: user tb = mte, syscall tb = mte like binary 3, but with mte, "new binary using mte" does it have to check for new syscall abi? or MTE HWCAP would imply it? (is it possible to use mte without new syscall abi?)
I think MTE HWCAP should imply it.
in userspace we want most binaries to be like binary 3 and 4 eventually, i.e. marked as not-relying-on-tbi, if a dso is loaded that is unmarked (legacy or new tb user), then either the load fails (e.g. if mte is already used? or can we turn mte off at runtime?) or tbi has to be enabled (prctl? does this work with pac? or multi-threads?).
We could enable it via prctl. That's the plan for MTE as well (in addition maybe to some ELF flag).
as for checking the new syscall abi: i don't see much semantic difference between AT_HWCAP and AT_FLAGS (either way, the user has to check a feature flag before using the feature of the underlying system and it does not matter much if it's a syscall abi feature or cpu feature), but i don't see anything wrong with AT_FLAGS if the kernel prefers that.
The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the relaxation of the syscall ABI to accept tb = any. The MTE support will have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably, AT_FLAGS is either redundant here if MTE implies it (and no harm in keeping it around) or the meaning is different: a tb != 0 may be checked by the kernel against the allocation tag (i.e. get_user() could fail, the tag is not entirely ignored).
the discussion here was mostly about binary 2,
That's because passing tb != 0 into the syscall ABI is the main blocker here that needs clearing out before merging the MTE support. There is, of course, a variation of binary 1 for MTE:
binary 5: user tb = mte, syscall tb = 0
but this requires a lot of C lib changes to support properly.
but for me the open question is if we can make binary 3/4 work. (which requires some elf binary marking, that is recognised by the kernel and dynamic loader, and efficient handling of the TBI0 bit, ..if it's not possible, then i don't see how mte will be deployed).
If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have MTE or not.
and i guess on the kernel side the open question is if the rules 1/2/3/4 can be made to work in corner cases e.g. when pointers embedded into structs are passed down in ioctl.
We've been trying to track these down since last summer and we came to the conclusion that it should be (mostly) fine for the non-weird memory described above.
On 25/02/2019 16:57, Catalin Marinas wrote:
On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
i think these rules work for the cases i care about, a more tricky question is when/how to check for the new syscall abi and when/how the TCR_EL1.TBI0 setting may be turned off.
I don't think turning TBI0 off is critical (it's handy for PAC with 52-bit VA but then it's short-lived if you want more security features like MTE).
yes, i made a mistake assuming TBI0 off is required for (or at least compatible with) MTE.
if TBI0 needs to be on for MTE then some of my analysis is wrong, and i expect TBI0 to be on in the foreseeable future.
consider the following cases (tb == top byte):
binary 1: user tb = any, syscall tb = 0 tbi is on, "legacy binary"
binary 2: user tb = any, syscall tb = any tbi is on, "new binary using tb" for backward compat it needs to check for new syscall abi.
binary 3: user tb = 0, syscall tb = 0 tbi can be off, "new binary", binary is marked to indicate unused tb, kernel may turn tbi off: additional pac bits.
binary 4: user tb = mte, syscall tb = mte like binary 3, but with mte, "new binary using mte"
so this should be "like binary 2, but with mte".
does it have to check for new syscall abi? or MTE HWCAP would imply it? (is it possible to use mte without new syscall abi?)
I think MTE HWCAP should imply it.
in userspace we want most binaries to be like binary 3 and 4 eventually, i.e. marked as not-relying-on-tbi, if a dso is loaded that is unmarked (legacy or new tb user), then either the load fails (e.g. if mte is already used? or can we turn mte off at runtime?) or tbi has to be enabled (prctl? does this work with pac? or multi-threads?).
We could enable it via prctl. That's the plan for MTE as well (in addition maybe to some ELF flag).
as for checking the new syscall abi: i don't see much semantic difference between AT_HWCAP and AT_FLAGS (either way, the user has to check a feature flag before using the feature of the underlying system and it does not matter much if it's a syscall abi feature or cpu feature), but i don't see anything wrong with AT_FLAGS if the kernel prefers that.
The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the relaxation of the syscall ABI to accept tb = any. The MTE support will have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably, AT_FLAGS is either redundant here if MTE implies it (and no harm in keeping it around) or the meaning is different: a tb != 0 may be checked by the kernel against the allocation tag (i.e. get_user() could fail, the tag is not entirely ignored).
the discussion here was mostly about binary 2,
That's because passing tb != 0 into the syscall ABI is the main blocker here that needs clearing out before merging the MTE support. There is, of course, a variation of binary 1 for MTE:
binary 5: user tb = mte, syscall tb = 0
but this requires a lot of C lib changes to support properly.
yes, i don't think we want to do that.
but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.
but for me the open question is if we can make binary 3/4 work. (which requires some elf binary marking, that is recognised by the kernel and dynamic loader, and efficient handling of the TBI0 bit, ..if it's not possible, then i don't see how mte will be deployed).
If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have MTE or not.
and i guess on the kernel side the open question is if the rules 1/2/3/4 can be made to work in corner cases e.g. when pointers embedded into structs are passed down in ioctl.
We've been trying to track these down since last summer and we came to the conclusion that it should be (mostly) fine for the non-weird memory described above.
i think an interesting case is when userspace passes a pointer to the kernel and later gets it back, which is why i proposed rule 4 (kernel has to keep the tag then).
but i wonder what's the right thing to do for sp (user can malloc thread/sigalt/makecontext stack which will be mte tagged in practice with mte on) does tagged sp work? should userspace untag the stack memory before setting it up as a stack? (but then user pointers to that allocation may get broken..)
On 25/02/2019 18:02, Szabolcs Nagy wrote:
On 25/02/2019 16:57, Catalin Marinas wrote:
On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
i think these rules work for the cases i care about, a more tricky question is when/how to check for the new syscall abi and when/how the TCR_EL1.TBI0 setting may be turned off.
I don't think turning TBI0 off is critical (it's handy for PAC with 52-bit VA but then it's short-lived if you want more security features like MTE).
yes, i made a mistake assuming TBI0 off is required for (or at least compatible with) MTE.
if TBI0 needs to be on for MTE then some of my analysis is wrong, and i expect TBI0 to be on in the foreseeable future.
consider the following cases (tb == top byte):
binary 1: user tb = any, syscall tb = 0 tbi is on, "legacy binary"
binary 2: user tb = any, syscall tb = any tbi is on, "new binary using tb" for backward compat it needs to check for new syscall abi.
binary 3: user tb = 0, syscall tb = 0 tbi can be off, "new binary", binary is marked to indicate unused tb, kernel may turn tbi off: additional pac bits.
binary 4: user tb = mte, syscall tb = mte like binary 3, but with mte, "new binary using mte"
so this should be "like binary 2, but with mte".
does it have to check for new syscall abi? or MTE HWCAP would imply it? (is it possible to use mte without new syscall abi?)
I think MTE HWCAP should imply it.
in userspace we want most binaries to be like binary 3 and 4 eventually, i.e. marked as not-relying-on-tbi, if a dso is loaded that is unmarked (legacy or new tb user), then either the load fails (e.g. if mte is already used? or can we turn mte off at runtime?) or tbi has to be enabled (prctl? does this work with pac? or multi-threads?).
We could enable it via prctl. That's the plan for MTE as well (in addition maybe to some ELF flag).
as for checking the new syscall abi: i don't see much semantic difference between AT_HWCAP and AT_FLAGS (either way, the user has to check a feature flag before using the feature of the underlying system and it does not matter much if it's a syscall abi feature or cpu feature), but i don't see anything wrong with AT_FLAGS if the kernel prefers that.
The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the relaxation of the syscall ABI to accept tb = any. The MTE support will have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably, AT_FLAGS is either redundant here if MTE implies it (and no harm in keeping it around) or the meaning is different: a tb != 0 may be checked by the kernel against the allocation tag (i.e. get_user() could fail, the tag is not entirely ignored).
the discussion here was mostly about binary 2,
That's because passing tb != 0 into the syscall ABI is the main blocker here that needs clearing out before merging the MTE support. There is, of course, a variation of binary 1 for MTE:
binary 5: user tb = mte, syscall tb = 0
but this requires a lot of C lib changes to support properly.
yes, i don't think we want to do that.
but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.
but for me the open question is if we can make binary 3/4 work. (which requires some elf binary marking, that is recognised by the kernel and dynamic loader, and efficient handling of the TBI0 bit, ..if it's not possible, then i don't see how mte will be deployed).
If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have MTE or not.
and i guess on the kernel side the open question is if the rules 1/2/3/4 can be made to work in corner cases e.g. when pointers embedded into structs are passed down in ioctl.
We've been trying to track these down since last summer and we came to the conclusion that it should be (mostly) fine for the non-weird memory described above.
i think an interesting case is when userspace passes a pointer to the kernel and later gets it back, which is why i proposed rule 4 (kernel has to keep the tag then).
but i wonder what's the right thing to do for sp (user can malloc thread/sigalt/makecontext stack which will be mte tagged in practice with mte on) does tagged sp work? should userspace untag the stack memory before setting it up as a stack? (but then user pointers to that allocation may get broken..)
Tagged SP does work, and it is actually a good idea (it avoids using the default tag for the stack). It would be quite easy for the kernel to tag the initial SP and the stack on execve(). For other stacks, it is up to userspace, as you say, and would be made easier by making it possible to choose how a mapping should be tagged by the kernel via a new mmap() flag. Some software that makes too many assumptions on the address of stack variables will be disturbed by a tagged SP, but this should be fairly rare.
In any case, I don't think this impacts this ABI proposal (beyond the fact that passing tagged pointers to the stack needs to be allowed).
Kevin
On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
arm64 has a feature called Top Byte Ignore, which allows to embed pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged pointers, due to these patches:
- 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a tagged pointer")
- 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged pointers")
- 276e9327 ("arm64: entry: improve data abort handling of tagged pointers")
When passing tagged pointers to syscalls, there's a special case of such a pointer being passed to one of the memory syscalls (mmap, mprotect, etc.). These syscalls don't do memory accesses but rather deal with memory ranges, hence an untagged pointer is better suited.
This patchset extends tagged pointer support to non-memory syscalls. This is done by reusing the untagged_addr macro to untag user pointers when the kernel performs pointer checking to find out whether the pointer comes from userspace (most notably in access_ok). The untagging is done only when the pointer is being checked, the tag is preserved as the pointer makes its way through the kernel.
One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical.
The following testing approaches has been taken to find potential issues with user pointer untagging:
Static testing (with sparse [2] and separately with a custom static analyzer based on Clang) to track casts of __user pointers to integer types to find places where untagging needs to be done.
Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added to the patchset.
This patchset has been merged into the Pixel 2 kernel tree and is now being used to enable testing of Pixel 2 phones with HWASan.
Do you have an idea of how much of the user/kernel interface is covered by this workload?
This patchset is a prerequisite for ARM's memory tagging hardware feature support [3].
It looks like there's been a lot of progress made here towards smoking out most of the sites in the kernel where pointers need to be untagged.
However, I do think that we need a clear policy for how existing kernel interfaces are to be interpreted in the presence of tagged pointers. Unless we have that nailed down, we are likely to be able to make only vague guarantees to userspace about what works, and the ongoing risk of ABI regressions and inconsistencies seems high.
I don't really see how we can advertise a full system interface if we know some subset of it doesn't work for foreseeable userspace environments. I feel that presenting the current changes as an ABI relaxation may be a recipe for future problems, since the forwards compatibility guarantees we're able to make today are few and rather vague.
Can we define an opt-in for tagged-pointer userspace, that rejects all syscalls that we haven't checked and whitelisted (or that are uncheckable like ioctl)? This reflects the reality that we don't have a regular userspace environment in which standards-compliant software that uses address tags in a reasonable way will just work.
It might be feasible to promote this to be enabled by default later on, if it becomes sufficiently complete.
In the meantime, I think we really need to nail down the kernel's policies on
* in the default configuration (without opt-in), is the presence of non-address bits in pointers exchanged with the kernel simply considered broken? (Even with this series, the de factor answer generally seems to be "yes", although many specific things will now work fine)
* if not, how do we tighten syscall / interface specifications to describe what happens with pointers containing non-address bits, while keeping the existing behaviour for untagged pointers?
We would want a general recipe that gives clear guidance on what userspace should expect an arbitrarily chosen syscall to do with its pointers, without having to enumerate each and every case.
To be sustainable, we would also need to solve that in a way that doesn't need to be reintented per-arch.
There may already be some background on these topics -- can you throw me a link if so?
Cheers ---Dave
On Wed, Dec 12, 2018 at 6:01 PM Dave Martin Dave.Martin@arm.com wrote:
On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
arm64 has a feature called Top Byte Ignore, which allows to embed pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.
Right now the kernel is already able to handle user faults with tagged pointers, due to these patches:
- 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a tagged pointer")
- 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged pointers")
- 276e9327 ("arm64: entry: improve data abort handling of tagged pointers")
When passing tagged pointers to syscalls, there's a special case of such a pointer being passed to one of the memory syscalls (mmap, mprotect, etc.). These syscalls don't do memory accesses but rather deal with memory ranges, hence an untagged pointer is better suited.
This patchset extends tagged pointer support to non-memory syscalls. This is done by reusing the untagged_addr macro to untag user pointers when the kernel performs pointer checking to find out whether the pointer comes from userspace (most notably in access_ok). The untagging is done only when the pointer is being checked, the tag is preserved as the pointer makes its way through the kernel.
One of the alternative approaches to untagging that was considered is to completely strip the pointer tag as the pointer enters the kernel with some kind of a syscall wrapper, but that won't work with the countless number of different ioctl calls. With this approach we would need a custom wrapper for each ioctl variation, which doesn't seem practical.
The following testing approaches has been taken to find potential issues with user pointer untagging:
Static testing (with sparse [2] and separately with a custom static analyzer based on Clang) to track casts of __user pointers to integer types to find places where untagging needs to be done.
Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running a modified syzkaller version that passes tagged pointers to the kernel.
Based on the results of the testing the requried patches have been added to the patchset.
This patchset has been merged into the Pixel 2 kernel tree and is now being used to enable testing of Pixel 2 phones with HWASan.
Hi, Dave,
Do you have an idea of how much of the user/kernel interface is covered by this workload?
Not really. I don't even know what kind of measurements can be used to obtain this estimate. But Pixel 2 kernel with these patches + Android runtime instrumented with HWASan works.
This patchset is a prerequisite for ARM's memory tagging hardware feature support [3].
It looks like there's been a lot of progress made here towards smoking out most of the sites in the kernel where pointers need to be untagged.
However, I do think that we need a clear policy for how existing kernel interfaces are to be interpreted in the presence of tagged pointers. Unless we have that nailed down, we are likely to be able to make only vague guarantees to userspace about what works, and the ongoing risk of ABI regressions and inconsistencies seems high.
I don't really see how we can advertise a full system interface if we know some subset of it doesn't work for foreseeable userspace environments. I feel that presenting the current changes as an ABI relaxation may be a recipe for future problems, since the forwards compatibility guarantees we're able to make today are few and rather vague.
Can we define an opt-in for tagged-pointer userspace, that rejects all syscalls that we haven't checked and whitelisted (or that are uncheckable like ioctl)? This reflects the reality that we don't have a regular userspace environment in which standards-compliant software that uses address tags in a reasonable way will just work.
It might be feasible to promote this to be enabled by default later on, if it becomes sufficiently complete.
In the meantime, I think we really need to nail down the kernel's policies on
- in the default configuration (without opt-in), is the presence of
non-address bits in pointers exchanged with the kernel simply considered broken? (Even with this series, the de factor answer generally seems to be "yes", although many specific things will now work fine)
- if not, how do we tighten syscall / interface specifications to
describe what happens with pointers containing non-address bits, while keeping the existing behaviour for untagged pointers?
We would want a general recipe that gives clear guidance on what userspace should expect an arbitrarily chosen syscall to do with its pointers, without having to enumerate each and every case.
To be sustainable, we would also need to solve that in a way that doesn't need to be reintented per-arch.
As I understand your main concern is userspace/kernel ABI changes these patches introduce. This concern was already pointed out by Catalin, and working out the details is still in progress.
There may already be some background on these topics -- can you throw me a link if so?
I don't have a single link, I would suggest to look at the comments for all the previous versions of this patchset. I see you saw the pathset by Vincenzo, it also has some information about this.
Cheers ---Dave
Thanks!
Hi Dave,
On Wed, Dec 12, 2018 at 05:01:12PM +0000, Dave P Martin wrote:
On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
arm64 has a feature called Top Byte Ignore, which allows to embed pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.
[...]
It looks like there's been a lot of progress made here towards smoking out most of the sites in the kernel where pointers need to be untagged.
In summary, based on last summer's analysis, there are two main (and rather broad) scenarios of __user pointers use in the kernel: (a) uaccess macros, together with access_ok() checks and (b) identifying of user address ranges (find_vma() and related, some ioctls). The patches here handle the former by allowing sign-extension in access_ok() and subsequent uaccess routines work fine with tagged pointers. Identifying the latter is a bit more problematic and the approach we took was tracking down pointer to long conversion which seems to cover the majority of cases. However, this approach doesn't scale as, for example, we'd rather change get_user_pages() to sign-extend the address rather than all the callers. In lots of other cases we don't even need untagging as we don't expect user space to tag such pointers (i.e. mmap() of device memory).
We might be able to improve the static analysis by introducing a virt_addr_t but that's significant effort and we still won't cover all cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which wouldn't use a pointer, just a u64 for address).
However, I do think that we need a clear policy for how existing kernel interfaces are to be interpreted in the presence of tagged pointers. Unless we have that nailed down, we are likely to be able to make only vague guarantees to userspace about what works, and the ongoing risk of ABI regressions and inconsistencies seems high.
I agree.
Can we define an opt-in for tagged-pointer userspace, that rejects all syscalls that we haven't checked and whitelisted (or that are uncheckable like ioctl)?
Defining an opt-in is not a problem, however, rejecting all syscalls that we haven't whitelisted is not feasible. We can have an opt-in per process (that's what we were going to do with MTE) but the only thing we can reasonably do is change the behaviour of access_ok(). That's too big a knob and a new syscall that we haven't got around to whitelist may just work. This eventually leads to de-facto ABI and our whitelist would simply be ignored.
I'm not really keen on a big syscall shim in the arm64 kernel which checks syscall arguments, including in-struct values. If we are to do this, I'd rather keep it in user space as part of the C library.
In the meantime, I think we really need to nail down the kernel's policies on
- in the default configuration (without opt-in), is the presence of
non-address bits in pointers exchanged with the kernel simply considered broken? (Even with this series, the de factor answer generally seems to be "yes", although many specific things will now work fine)
Without these patches, passing non-address bits in pointers is considered broken. I couldn't find a case where it would still work with non-zero tag but maybe I haven't looked hard enough.
- if not, how do we tighten syscall / interface specifications to
describe what happens with pointers containing non-address bits, while keeping the existing behaviour for untagged pointers?
We would want a general recipe that gives clear guidance on what userspace should expect an arbitrarily chosen syscall to do with its pointers, without having to enumerate each and every case.
That's what we are aiming with the pointer origins, to move away from a syscall whitelist to a generic definition. That said, the two approaches are orthogonal, we can use the pointer origins as the base rule for which syscalls can be whitelisted.
If we step back a bit to look at the use-case for TBI (and MTE), the normal application programmer shouldn't really care about this ABI (well, most of the time). The app gets a tagged pointer from the C library as a result of a malloc()/realloc() (possibly alloca()) call and it expects to be able to pass it back into the kernel (usually via the C library) without any awareness of the non-address bits. Now, we can't define a user/kernel ABI based on the provenance of the pointer in user space (i.e. we only support tags for heap and stack), so we are trying to generalise this based where the pointer originated from in the kernel (e.g. anonymous mmap()).
There may already be some background on these topics -- can you throw me a link if so?
That's an interesting sub-thread to read:
https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.153562...
On Mon, Feb 11, 2019 at 11:35:12AM +0000, Catalin Marinas wrote:
Hi Dave,
On Wed, Dec 12, 2018 at 05:01:12PM +0000, Dave P Martin wrote:
On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
arm64 has a feature called Top Byte Ignore, which allows to embed pointer tags into the top byte of each pointer. Userspace programs (such as HWASan, a memory debugging tool [1]) might use this feature and pass tagged user pointers to the kernel through syscalls or other interfaces.
[...]
It looks like there's been a lot of progress made here towards smoking out most of the sites in the kernel where pointers need to be untagged.
In summary, based on last summer's analysis, there are two main (and rather broad) scenarios of __user pointers use in the kernel: (a) uaccess macros, together with access_ok() checks and (b) identifying of user address ranges (find_vma() and related, some ioctls). The patches here handle the former by allowing sign-extension in access_ok() and subsequent uaccess routines work fine with tagged pointers. Identifying the latter is a bit more problematic and the approach we took was tracking down pointer to long conversion which seems to cover the majority of cases. However, this approach doesn't scale as, for example, we'd rather change get_user_pages() to sign-extend the address rather than all the callers. In lots of other cases we don't even need untagging as we don't expect user space to tag such pointers (i.e. mmap() of device memory).
We might be able to improve the static analysis by introducing a virt_addr_t but that's significant effort and we still won't cover all cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which wouldn't use a pointer, just a u64 for address).
However, I do think that we need a clear policy for how existing kernel interfaces are to be interpreted in the presence of tagged pointers. Unless we have that nailed down, we are likely to be able to make only vague guarantees to userspace about what works, and the ongoing risk of ABI regressions and inconsistencies seems high.
I agree.
Can we define an opt-in for tagged-pointer userspace, that rejects all syscalls that we haven't checked and whitelisted (or that are uncheckable like ioctl)?
Defining an opt-in is not a problem, however, rejecting all syscalls that we haven't whitelisted is not feasible. We can have an opt-in per process (that's what we were going to do with MTE) but the only thing we can reasonably do is change the behaviour of access_ok(). That's too big a knob and a new syscall that we haven't got around to whitelist may just work. This eventually leads to de-facto ABI and our whitelist would simply be ignored.
I'm not really keen on a big syscall shim in the arm64 kernel which checks syscall arguments, including in-struct values. If we are to do this, I'd rather keep it in user space as part of the C library.
In the meantime, I think we really need to nail down the kernel's policies on
- in the default configuration (without opt-in), is the presence of
non-address bits in pointers exchanged with the kernel simply considered broken? (Even with this series, the de factor answer generally seems to be "yes", although many specific things will now work fine)
Without these patches, passing non-address bits in pointers is considered broken. I couldn't find a case where it would still work with non-zero tag but maybe I haven't looked hard enough.
- if not, how do we tighten syscall / interface specifications to
describe what happens with pointers containing non-address bits, while keeping the existing behaviour for untagged pointers?
We would want a general recipe that gives clear guidance on what userspace should expect an arbitrarily chosen syscall to do with its pointers, without having to enumerate each and every case.
That's what we are aiming with the pointer origins, to move away from a syscall whitelist to a generic definition. That said, the two approaches are orthogonal, we can use the pointer origins as the base rule for which syscalls can be whitelisted.
If we step back a bit to look at the use-case for TBI (and MTE), the normal application programmer shouldn't really care about this ABI (well, most of the time). The app gets a tagged pointer from the C library as a result of a malloc()/realloc() (possibly alloca()) call and it expects to be able to pass it back into the kernel (usually via the C library) without any awareness of the non-address bits. Now, we can't define a user/kernel ABI based on the provenance of the pointer in user space (i.e. we only support tags for heap and stack), so we are trying to generalise this based where the pointer originated from in the kernel (e.g. anonymous mmap()).
This sounds generally reasonable.
It is not adequate for describing changing the tag on already-tagged memory (which a memory allocator will definitely do), but we may be able to come up with some weasel words to cover that.
It is also not adequete for describing tagging (and retagging) regions of the stack -- but as you say, we can rule that use-case out for now in the interest of simplicity, since we know we wouldn't be able to deploy it widely for now anyway due to the incompability with non-MTE- capable hardware.
Ideally we would clarify user/kernel interface semantics in terms of object and pointer lifetimes and accessibility, but that's a larger project that should be pursued separately (if at all).
I could also quibble about whether "anonymous mmap" is the right thing here -- we should still give specific examples of things that do / don't qualify, to make it clear what we mean.
Cheers ---Dave
linux-kselftest-mirror@lists.linaro.org