Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag gives applications a way to specify exactly how many bits they want left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system in order to reason about which addresses mmap is allowed to return.
--- riscv made this feature of mmap returning addresses less than the hint address the default behavior. This was in contrast to the implementation of x86/arm64, which have a single boundary at the 5-level page table region. However, this restriction proved too great: the reduced address space available when a hint address was supplied was too small.
A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.
I have only tested on riscv and x86. There is a tremendous amount of duplicated code in mmap, so I believe the implementations across architectures should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior of arch_get_mmap_end()/arch_get_mmap_base().
Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.... [1]
To: Arnd Bergmann <arnd@arndb.de>
To: Paul Walmsley <paul.walmsley@sifive.com>
To: Palmer Dabbelt <palmer@dabbelt.com>
To: Albert Ou <aou@eecs.berkeley.edu>
To: Catalin Marinas <catalin.marinas@arm.com>
To: Will Deacon <will@kernel.org>
To: Michael Ellerman <mpe@ellerman.id.au>
To: Nicholas Piggin <npiggin@gmail.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Naveen N Rao <naveen@kernel.org>
To: Muchun Song <muchun.song@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@redhat.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: x86@kernel.org
To: H. Peter Anvin <hpa@zytor.com>
To: Huacai Chen <chenhuacai@kernel.org>
To: WANG Xuerui <kernel@xen0n.name>
To: Russell King <linux@armlinux.org.uk>
To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
To: Helge Deller <deller@gmx.de>
To: Alexander Gordeev <agordeev@linux.ibm.com>
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Heiko Carstens <hca@linux.ibm.com>
To: Vasily Gorbik <gor@linux.ibm.com>
To: Christian Borntraeger <borntraeger@linux.ibm.com>
To: Sven Schnelle <svens@linux.ibm.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Rich Felker <dalias@libc.org>
To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
To: David S. Miller <davem@davemloft.net>
To: Andreas Larsson <andreas@gaisler.com>
To: Shuah Khan <shuah@kernel.org>
To: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: linux-riscv@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
Charlie Jenkins (16):
      mm: Add MAP_BELOW_HINT
      riscv: mm: Do not restrict mmap address based on hint
      mm: Add flag and len param to arch_get_mmap_base()
      mm: Add generic MAP_BELOW_HINT
      riscv: mm: Support MAP_BELOW_HINT
      arm64: mm: Support MAP_BELOW_HINT
      powerpc: mm: Support MAP_BELOW_HINT
      x86: mm: Support MAP_BELOW_HINT
      loongarch: mm: Support MAP_BELOW_HINT
      arm: mm: Support MAP_BELOW_HINT
      mips: mm: Support MAP_BELOW_HINT
      parisc: mm: Support MAP_BELOW_HINT
      s390: mm: Support MAP_BELOW_HINT
      sh: mm: Support MAP_BELOW_HINT
      sparc: mm: Support MAP_BELOW_HINT
      selftests/mm: Create MAP_BELOW_HINT test
 arch/arm/mm/mmap.c                           | 10 ++++++++
 arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
 arch/loongarch/mm/mmap.c                     | 11 +++++++++
 arch/mips/mm/mmap.c                          |  9 +++++++
 arch/parisc/include/uapi/asm/mman.h          |  1 +
 arch/parisc/kernel/sys_parisc.c              |  9 +++++++
 arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
 arch/riscv/include/asm/processor.h           | 32 -------------------------
 arch/s390/mm/mmap.c                          | 10 ++++++++
 arch/sh/mm/mmap.c                            | 10 ++++++++
 arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
 arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
 fs/hugetlbfs/inode.c                         |  2 +-
 include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  2 +-
 tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
 20 files changed, 216 insertions(+), 50 deletions(-)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
To make this behavior explicit and more versatile across all architectures, introduce a mmap flag that lets users set an arbitrary upper limit on the addresses returned by mmap.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 include/uapi/asm-generic/mman-common.h       | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
 2 files changed, 2 insertions(+)
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
					 * uninitialized */
+#define MAP_BELOW_HINT	0x8000000	/* give out address that is below (inclusive) hint address */

 /*
  * Flags for mlock
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
					 * uninitialized */
+#define MAP_BELOW_HINT	0x8000000	/* give out address that is below (inclusive) hint address */

 /*
  * Flags for mlock
The hint address should not forcefully restrict the addresses returned by mmap, as this causes mmap to report ENOMEM while there is still memory available.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Fixes: b5b4287accd7 ("riscv: mm: Use hint address in mmap if available")
Fixes: add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57")
Closes: https://lore.kernel.org/linux-kernel/ZbxTNjQPFKBatMq+@ghost/T/#mccb1890466bf...
---
This patch is pulled from [1] for ease of review of this series.
Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.... [1]
---
 arch/riscv/include/asm/processor.h | 28 +++-------------------------
 1 file changed, 3 insertions(+), 25 deletions(-)
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 8702b8721a27..1015f2a49917 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -14,36 +14,14 @@
#include <asm/ptrace.h>
-/*
- * addr is a hint to the maximum userspace address that mmap should provide, so
- * this macro needs to return the largest address space available so that
- * mmap_end < addr, being mmap_end the top of that address space.
- * See Documentation/arch/riscv/vm-layout.rst for more details.
- */
 #define arch_get_mmap_end(addr, len, flags)			\
 ({								\
-	unsigned long mmap_end;					\
-	typeof(addr) _addr = (addr);				\
-	if ((_addr) == 0 || is_compat_task() ||			\
-	    ((_addr + len) > BIT(VA_BITS - 1)))			\
-		mmap_end = STACK_TOP_MAX;			\
-	else							\
-		mmap_end = (_addr + len);			\
-	mmap_end;						\
+	STACK_TOP_MAX;						\
 })

-#define arch_get_mmap_base(addr, base)				\
+#define arch_get_mmap_base(addr, base, flags)			\
 ({								\
-	unsigned long mmap_base;				\
-	typeof(addr) _addr = (addr);				\
-	typeof(base) _base = (base);				\
-	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base);	\
-	if ((_addr) == 0 || is_compat_task() ||			\
-	    ((_addr + len) > BIT(VA_BITS - 1)))			\
-		mmap_base = (_base);				\
-	else							\
-		mmap_base = (_addr + len) - rnd_gap;		\
-	mmap_base;						\
+	base;							\
 })
#ifdef CONFIG_64BIT
The flag and len parameters are required in arch_get_mmap_base() to implement MAP_BELOW_HINT.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/arm64/include/asm/processor.h      | 2 +-
 arch/powerpc/include/asm/task_size_64.h | 2 +-
 arch/riscv/include/asm/processor.h      | 2 +-
 fs/hugetlbfs/inode.c                    | 2 +-
 include/linux/sched/mm.h                | 2 +-
 mm/mmap.c                               | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f77371232d8c..a67ca119bb91 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -95,7 +95,7 @@
 #define arch_get_mmap_end(addr, len, flags) \
		(((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)

-#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
+#define arch_get_mmap_base(addr, len, base, flags) ((addr > DEFAULT_MAP_WINDOW) ? \
					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
					base)
 #endif /* CONFIG_ARM64_FORCE_52BIT */
diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h
index 5a709951c901..239b363841aa 100644
--- a/arch/powerpc/include/asm/task_size_64.h
+++ b/arch/powerpc/include/asm/task_size_64.h
@@ -72,7 +72,7 @@
 #define STACK_TOP_MAX	TASK_SIZE_USER64
 #define STACK_TOP	(is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64)

-#define arch_get_mmap_base(addr, base) \
+#define arch_get_mmap_base(addr, len, base, flags) \
	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))

 #define arch_get_mmap_end(addr, len, flags) \
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 1015f2a49917..7ff559bf46f2 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -19,7 +19,7 @@
 	STACK_TOP_MAX;						\
 })

-#define arch_get_mmap_base(addr, base, flags)			\
+#define arch_get_mmap_base(addr, len, base, flags)		\
 ({								\
 	base;							\
 })
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9f6cff356796..05a52f85dba9 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -195,7 +195,7 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
+	info.high_limit = arch_get_mmap_base(addr, len, current->mm->mmap_base, flags);
 	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
 	addr = vm_unmapped_area(&info);

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 91546493c43d..265b43855d0b 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -174,7 +174,7 @@ static inline void mm_update_next_owner(struct mm_struct *mm)
 #endif

 #ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, base) (base)
+#define arch_get_mmap_base(addr, len, base, flags) (base)
 #endif

 extern void arch_pick_mmap_layout(struct mm_struct *mm,
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..27a7f2be3f68 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1861,7 +1861,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr,
 	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
-	info.high_limit = arch_get_mmap_base(addr, mm->mmap_base);
+	info.high_limit = arch_get_mmap_base(addr, len, mm->mmap_base, flags);
 	addr = vm_unmapped_area(&info);
/*
Make the generic implementation of arch_get_mmap_base() and arch_get_mmap_end() support MAP_BELOW_HINT.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 include/linux/sched/mm.h | 34 ++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 265b43855d0b..c350bb5ac0a2 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -9,6 +9,7 @@
 #include <linux/gfp.h>
 #include <linux/sync_core.h>
 #include <linux/sched/coredump.h>
+#include <uapi/asm/mman.h>

 /*
  * Routines for handling mm_structs
@@ -170,11 +171,40 @@ static inline void mm_update_next_owner(struct mm_struct *mm)

 #ifdef CONFIG_MMU
 #ifndef arch_get_mmap_end
-#define arch_get_mmap_end(addr, len, flags)	(TASK_SIZE)
+#define arch_get_mmap_end(addr, len, flags)			\
+({								\
+	unsigned long mmap_end;					\
+	typeof(flags) _flags = (flags);				\
+	typeof(addr) _addr = (addr);				\
+	typeof(len) _len = (len);				\
+	mmap_end = TASK_SIZE;					\
+	if (_flags & MAP_BELOW_HINT && _addr != 0)		\
+		mmap_end = MIN(mmap_end, _addr + _len);		\
+	mmap_end;						\
+})
 #endif

 #ifndef arch_get_mmap_base
-#define arch_get_mmap_base(addr, len, base, flags) (base)
+/*
+ * rnd_gap is defined to be (STACK_TOP - _base) due to the definition of
+ * mmap_base in mm/util.c
+ *
+ * Assumes ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT, which all architectures that
+ * implement generic mmap use
+ */
+#define arch_get_mmap_base(addr, len, base, flags)		\
+({								\
+	unsigned long mmap_base;				\
+	typeof(flags) _flags = (flags);				\
+	typeof(addr) _addr = (addr);				\
+	typeof(base) _base = (base);				\
+	typeof(len) _len = (len);				\
+	unsigned long rnd_gap = STACK_TOP - _base;		\
+	mmap_base = _base;					\
+	if (_flags & MAP_BELOW_HINT && _addr != 0)		\
+		mmap_base = MIN(mmap_base, (_addr + _len) - rnd_gap); \
+	mmap_base;						\
+})
 #endif
extern void arch_pick_mmap_layout(struct mm_struct *mm,
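The macro bodies in this patch reduce to simple clamping arithmetic. As a plain-C sketch (hypothetical function names, with TASK_SIZE and STACK_TOP passed as parameters so the math is visible outside the kernel):

```c
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Sketch of the generic arch_get_mmap_end(): clamp the bottom-up search
 * ceiling to addr + len when MAP_BELOW_HINT is set and a hint was given. */
unsigned long sketch_mmap_end(unsigned long addr, unsigned long len,
			      unsigned long flags, unsigned long map_below_hint,
			      unsigned long task_size)
{
	unsigned long mmap_end = task_size;

	if ((flags & map_below_hint) && addr != 0)
		mmap_end = min_ul(mmap_end, addr + len);
	return mmap_end;
}

/* Sketch of the generic arch_get_mmap_base(): mmap_base sits rnd_gap
 * below STACK_TOP, so the top-down limit is clamped to
 * (addr + len) - rnd_gap to keep the whole gap below the hint. */
unsigned long sketch_mmap_base(unsigned long addr, unsigned long len,
			       unsigned long base, unsigned long flags,
			       unsigned long map_below_hint,
			       unsigned long stack_top)
{
	unsigned long rnd_gap = stack_top - base;
	unsigned long mmap_base = base;

	if ((flags & map_below_hint) && addr != 0)
		mmap_base = min_ul(mmap_base, (addr + len) - rnd_gap);
	return mmap_base;
}
```

For example, with STACK_TOP = 0xA0000 and base = 0x90000, rnd_gap is 0x10000, so a hint of addr + len = 0x51000 pulls the top-down base down to 0x41000.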
When adding support for MAP_BELOW_HINT, the riscv implementation becomes identical to the default implementation, so arch_get_mmap_base() and arch_get_mmap_end() can be removed.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/riscv/include/asm/processor.h | 10 ----------
 1 file changed, 10 deletions(-)
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 7ff559bf46f2..20b4ba7d32be 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -14,16 +14,6 @@
 #include <asm/ptrace.h>

-#define arch_get_mmap_end(addr, len, flags)			\
-({								\
-	STACK_TOP_MAX;						\
-})
-
-#define arch_get_mmap_base(addr, len, base, flags)		\
-({								\
-	base;							\
-})
-
 #ifdef CONFIG_64BIT
 #define DEFAULT_MAP_WINDOW	(UL(1) << (MMAP_VA_BITS - 1))
 #define STACK_TOP_MAX		TASK_SIZE_64
Add support for MAP_BELOW_HINT to arch_get_mmap_base() and arch_get_mmap_end().
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/arm64/include/asm/processor.h | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index a67ca119bb91..39aabb1619f6 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -92,12 +92,36 @@
 #endif /* CONFIG_COMPAT */

 #ifndef CONFIG_ARM64_FORCE_52BIT
-#define arch_get_mmap_end(addr, len, flags) \
-		(((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)
+#define arch_get_mmap_end(addr, len, flags) \
+({ \
+	unsigned long mmap_end; \
+	typeof(flags) _flags = (flags); \
+	typeof(addr) _addr = (addr); \
+	typeof(len) _len = (len); \
+	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \
+		mmap_end = (_addr + _len); \
+	else \
+		mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW); \
+	mmap_end \
+})
+
+#define arch_get_mmap_base(addr, len, base, flags) \
+({ \
+	unsigned long mmap_base; \
+	typeof(flags) _flags = (flags); \
+	typeof(addr) _addr = (addr); \
+	typeof(base) _base = (base); \
+	typeof(len) _len = (len); \
+	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
+	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\
+		mmap_base = (_addr + _len) - rnd_gap; \
+	else \
+		mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? \
+			    _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \
+			    _base); \
+	mmap_end \
+})

-#define arch_get_mmap_base(addr, len, base, flags) ((addr > DEFAULT_MAP_WINDOW) ? \
-					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
-					base)
 #endif /* CONFIG_ARM64_FORCE_52BIT */
extern phys_addr_t arm64_dma_phys_limit;
Add support for MAP_BELOW_HINT to arch_get_mmap_base() and arch_get_mmap_end().
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h
index 239b363841aa..a37a5a81365d 100644
--- a/arch/powerpc/include/asm/task_size_64.h
+++ b/arch/powerpc/include/asm/task_size_64.h
@@ -72,12 +72,36 @@
 #define STACK_TOP_MAX	TASK_SIZE_USER64
 #define STACK_TOP	(is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64)

-#define arch_get_mmap_base(addr, len, base, flags) \
-	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))
+#define arch_get_mmap_base(addr, len, base, flags) \
+({ \
+	unsigned long mmap_base; \
+	typeof(flags) _flags = (flags); \
+	typeof(addr) _addr = (addr); \
+	typeof(base) _base = (base); \
+	typeof(len) _len = (len); \
+	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
+	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\
+		mmap_base = (_addr + _len) - rnd_gap; \
+	else \
+		mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? \
+			    _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \
+			    _base); \
+	mmap_end; \
+})

-#define arch_get_mmap_end(addr, len, flags) \
-	(((addr) > DEFAULT_MAP_WINDOW) || \
-	 (((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? TASK_SIZE : \
-	 DEFAULT_MAP_WINDOW)
+#define arch_get_mmap_end(addr, len, flags) \
+({ \
+	unsigned long mmap_end; \
+	typeof(flags) _flags = (flags); \
+	typeof(addr) _addr = (addr); \
+	typeof(len) _len = (len); \
+	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \
+		mmap_end = (_addr + _len); \
+	else \
+		mmap_end = (((_addr) > DEFAULT_MAP_WINDOW) || \
+			    (((_flags) & MAP_FIXED) && ((_addr) + (_len) > DEFAULT_MAP_WINDOW))\
+			    ? TASK_SIZE : DEFAULT_MAP_WINDOW) \
+	mmap_end; \
+})
#endif /* _ASM_POWERPC_TASK_SIZE_64_H */
Hi Charlie,

On 28/08/2024 07:49, Charlie Jenkins wrote:
> Add support for MAP_BELOW_HINT to arch_get_mmap_base() and
> arch_get_mmap_end().
>
> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> ---
>  arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++++++------
>  1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h
> index 239b363841aa..a37a5a81365d 100644
> --- a/arch/powerpc/include/asm/task_size_64.h
> +++ b/arch/powerpc/include/asm/task_size_64.h
> @@ -72,12 +72,36 @@
>  #define STACK_TOP_MAX	TASK_SIZE_USER64
>  #define STACK_TOP	(is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64)
>
> -#define arch_get_mmap_base(addr, len, base, flags) \
> -	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))
> +#define arch_get_mmap_base(addr, len, base, flags) \

This macro looks quite big for a macro, can it be a static inline
function instead ? Same for the other macro below.

> +({ \
> +	unsigned long mmap_base; \
> +	typeof(flags) _flags = (flags); \
> +	typeof(addr) _addr = (addr); \
> +	typeof(base) _base = (base); \
> +	typeof(len) _len = (len); \
> +	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
> +	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\
> +		mmap_base = (_addr + _len) - rnd_gap; \
> +	else \
> +		mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? \
> +			    _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \
> +			    _base); \
> +	mmap_end; \

mmap_end doesn't exist, did you mean mmap_base ?

> +})
>
> -#define arch_get_mmap_end(addr, len, flags) \
> -	(((addr) > DEFAULT_MAP_WINDOW) || \
> -	 (((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? TASK_SIZE : \
> -	 DEFAULT_MAP_WINDOW)
> +#define arch_get_mmap_end(addr, len, flags) \
> +({ \
> +	unsigned long mmap_end; \
> +	typeof(flags) _flags = (flags); \
> +	typeof(addr) _addr = (addr); \
> +	typeof(len) _len = (len); \
> +	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \
> +		mmap_end = (_addr + _len); \
> +	else \
> +		mmap_end = (((_addr) > DEFAULT_MAP_WINDOW) || \
> +			    (((_flags) & MAP_FIXED) && ((_addr) + (_len) > DEFAULT_MAP_WINDOW))\
> +			    ? TASK_SIZE : DEFAULT_MAP_WINDOW) \
> +	mmap_end; \
> +})
>
>  #endif /* _ASM_POWERPC_TASK_SIZE_64_H */
On Wed, Aug 28, 2024 at 08:34:49AM +0200, Christophe Leroy wrote:
> Hi Charlie,
>
> On 28/08/2024 07:49, Charlie Jenkins wrote:
> > Add support for MAP_BELOW_HINT to arch_get_mmap_base() and
> > arch_get_mmap_end().
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > ---
> >  arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++++++------
> >  1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h
> > index 239b363841aa..a37a5a81365d 100644
> > --- a/arch/powerpc/include/asm/task_size_64.h
> > +++ b/arch/powerpc/include/asm/task_size_64.h
> > @@ -72,12 +72,36 @@
> >  #define STACK_TOP_MAX	TASK_SIZE_USER64
> >  #define STACK_TOP	(is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64)
> >
> > -#define arch_get_mmap_base(addr, len, base, flags) \
> > -	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))
> > +#define arch_get_mmap_base(addr, len, base, flags) \
>
> This macro looks quite big for a macro, can it be a static inline
> function instead ? Same for the other macro below.

I had overlooked that possibility, I think that's a great solution, I
will change that.

> > +({ \
> > +	unsigned long mmap_base; \
> > +	typeof(flags) _flags = (flags); \
> > +	typeof(addr) _addr = (addr); \
> > +	typeof(base) _base = (base); \
> > +	typeof(len) _len = (len); \
> > +	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
> > +	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\
> > +		mmap_base = (_addr + _len) - rnd_gap; \
> > +	else \
> > +		mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? \
> > +			    _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \
> > +			    _base); \
> > +	mmap_end; \
>
> mmap_end doesn't exist, did you mean mmap_base ?

Oh whoops, thank you!

- Charlie

> > +})
> >
> > -#define arch_get_mmap_end(addr, len, flags) \
> > -	(((addr) > DEFAULT_MAP_WINDOW) || \
> > -	 (((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? TASK_SIZE : \
> > -	 DEFAULT_MAP_WINDOW)
> > +#define arch_get_mmap_end(addr, len, flags) \
> > +({ \
> > +	unsigned long mmap_end; \
> > +	typeof(flags) _flags = (flags); \
> > +	typeof(addr) _addr = (addr); \
> > +	typeof(len) _len = (len); \
> > +	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \
> > +		mmap_end = (_addr + _len); \
> > +	else \
> > +		mmap_end = (((_addr) > DEFAULT_MAP_WINDOW) || \
> > +			    (((_flags) & MAP_FIXED) && ((_addr) + (_len) > DEFAULT_MAP_WINDOW))\
> > +			    ? TASK_SIZE : DEFAULT_MAP_WINDOW) \
> > +	mmap_end; \
> > +})
> >
> >  #endif /* _ASM_POWERPC_TASK_SIZE_64_H */
From: Christophe Leroy
> Sent: 28 August 2024 07:35
>
> Hi Charlie,
>
> On 28/08/2024 07:49, Charlie Jenkins wrote:
> > Add support for MAP_BELOW_HINT to arch_get_mmap_base() and
> > arch_get_mmap_end().
> >
> > Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
> > ---
> >  arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++++++------
> >  1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/task_size_64.h b/arch/powerpc/include/asm/task_size_64.h
> > index 239b363841aa..a37a5a81365d 100644
> > --- a/arch/powerpc/include/asm/task_size_64.h
> > +++ b/arch/powerpc/include/asm/task_size_64.h
> > @@ -72,12 +72,36 @@
> >  #define STACK_TOP_MAX	TASK_SIZE_USER64
> >  #define STACK_TOP	(is_32bit_task() ? STACK_TOP_USER32 : STACK_TOP_USER64)
> >
> > -#define arch_get_mmap_base(addr, len, base, flags) \
> > -	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))
> > +#define arch_get_mmap_base(addr, len, base, flags) \
>
> This macro looks quite big for a macro, can it be a static inline
> function instead ? Same for the other macro below.

Or even a real function?
Given the actual cost of mapping memory an extra function call isn't
going to be measurable.

	David

> > +({ \
> > +	unsigned long mmap_base; \
> > +	typeof(flags) _flags = (flags); \
> > +	typeof(addr) _addr = (addr); \
> > +	typeof(base) _base = (base); \
> > +	typeof(len) _len = (len); \
> > +	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
> > +	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1)))\
> > +		mmap_base = (_addr + _len) - rnd_gap; \
> > +	else \
> > +		mmap_end = ((_addr > DEFAULT_MAP_WINDOW) ? \
> > +			    _base + TASK_SIZE - DEFAULT_MAP_WINDOW : \
> > +			    _base); \
> > +	mmap_end; \
>
> mmap_end doesn't exist, did you mean mmap_base ?
>
> > +})
> >
> > -#define arch_get_mmap_end(addr, len, flags) \
> > -	(((addr) > DEFAULT_MAP_WINDOW) || \
> > -	 (((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? TASK_SIZE : \
> > -	 DEFAULT_MAP_WINDOW)
> > +#define arch_get_mmap_end(addr, len, flags) \
> > +({ \
> > +	unsigned long mmap_end; \
> > +	typeof(flags) _flags = (flags); \
> > +	typeof(addr) _addr = (addr); \
> > +	typeof(len) _len = (len); \
> > +	if (_flags & MAP_BELOW_HINT && _addr != 0 && ((_addr + _len) > BIT(VA_BITS - 1))) \
> > +		mmap_end = (_addr + _len); \
> > +	else \
> > +		mmap_end = (((_addr) > DEFAULT_MAP_WINDOW) || \
> > +			    (((_flags) & MAP_FIXED) && ((_addr) + (_len) > DEFAULT_MAP_WINDOW))\
> > +			    ? TASK_SIZE : DEFAULT_MAP_WINDOW) \
> > +	mmap_end; \
> > +})
> >
> >  #endif /* _ASM_POWERPC_TASK_SIZE_64_H */

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 01d7cd85ef97..fa16b38f3702 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -86,7 +86,7 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
 }

-static void find_start_end(unsigned long addr, unsigned long flags,
+static void find_start_end(unsigned long addr, unsigned long len, unsigned long flags,
 		unsigned long *begin, unsigned long *end)
 {
 	if (!in_32bit_syscall() && (flags & MAP_32BIT)) {
@@ -106,10 +106,14 @@ static void find_start_end(unsigned long addr, unsigned long flags,
 	}

 	*begin	= get_mmap_base(1);
+
 	if (in_32bit_syscall())
 		*end = task_size_32bit();
 	else
 		*end = task_size_64bit(addr > DEFAULT_MAP_WINDOW);
+
+	if (flags & MAP_BELOW_HINT)
+		*end = MIN(*end, addr + len);
 }

 static inline unsigned long stack_guard_placement(vm_flags_t vm_flags)
@@ -132,7 +136,7 @@ arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, unsigned l
 	if (flags & MAP_FIXED)
 		return addr;

-	find_start_end(addr, flags, &begin, &end);
+	find_start_end(addr, len, flags, &begin, &end);

 	if (len > end)
 		return -ENOMEM;
@@ -166,6 +170,7 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0,
 	struct mm_struct *mm = current->mm;
 	unsigned long addr = addr0;
 	struct vm_unmapped_area_info info = {};
+	unsigned long task_size, mmap_base;

 	/* requested length too big for entire address space */
 	if (len > TASK_SIZE)
@@ -198,7 +203,8 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0,
 	else
 		info.low_limit = PAGE_SIZE;

-	info.high_limit = get_mmap_base(0);
+	mmap_base = get_mmap_base(0);
+	info.high_limit = mmap_base;
 	info.start_gap = stack_guard_placement(vm_flags);

 	/*
@@ -210,6 +216,19 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0,
 	 */
 	if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
 		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
+	if (flags & MAP_BELOW_HINT) {
+#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
+		task_size = task_size_32bit();
+#else
+		task_size = task_size_64bit(0);
+#endif
+		/*
+		 * mmap_base is defined by PAGE_ALIGN(task_size - gap - rnd) so
+		 * subtract out the task_size to isolate the gap + rnd.
+		 */
+		info.high_limit = MIN(info.high_limit,
+				      (addr + len) - (task_size - mmap_base));
+	}

 	info.align_offset = pgoff << PAGE_SHIFT;
 	if (filp) {
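The top-down clamp in the hunk above amounts to the following arithmetic. This is a standalone sketch with a hypothetical function name and illustrative values, not the kernel's real layout constants: since mmap_base was computed as roughly task_size - (gap + rnd), the quantity (task_size - mmap_base) recovers gap + rnd, and subtracting it from addr + len leaves room for the same gap below the hint.

```c
/* Sketch of the MAP_BELOW_HINT top-down limit computation. */
unsigned long sketch_topdown_high_limit(unsigned long addr, unsigned long len,
					unsigned long task_size,
					unsigned long mmap_base)
{
	unsigned long high_limit = mmap_base;
	unsigned long gap_plus_rnd = task_size - mmap_base;
	unsigned long below_hint = (addr + len) - gap_plus_rnd;

	/* MIN(info.high_limit, (addr + len) - (task_size - mmap_base)) */
	return below_hint < high_limit ? below_hint : high_limit;
}
```

For instance, with task_size = 0x1000000 and mmap_base = 0xF00000, gap + rnd is 0x100000; a hint of addr + len = 0x500000 lowers the limit to 0x400000, while a hint above mmap_base + gap leaves it at mmap_base.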
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/loongarch/mm/mmap.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c
index 889030985135..66a5badf849b 100644
--- a/arch/loongarch/mm/mmap.c
+++ b/arch/loongarch/mm/mmap.c
@@ -70,6 +70,13 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
 		info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 		info.low_limit = PAGE_SIZE;
 		info.high_limit = mm->mmap_base;
+		if (flags & MAP_BELOW_HINT)
+			/*
+			 * Subtract (STACK_TOP - mm->mmap_base) to get random
+			 * offset defined in mmap_base() in mm/util.c
+			 */
+			info.high_limit = MIN(mm->mmap_base,
+					      (addr + len) - (STACK_TOP - mm->mmap_base))
 		addr = vm_unmapped_area(&info);

 		if (!(addr & ~PAGE_MASK))
@@ -84,7 +91,11 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
 	}

 	info.low_limit = mm->mmap_base;
+
 	info.high_limit = TASK_SIZE;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
+
 	return vm_unmapped_area(&info);
 }
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/arm/mm/mmap.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index d65d0e6ed10a..fa0c79447b78 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -71,6 +71,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = mm->mmap_base;
 	info.high_limit = TASK_SIZE;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	return vm_unmapped_area(&info);
@@ -122,6 +124,12 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = FIRST_USER_ADDRESS;
 	info.high_limit = mm->mmap_base;
+	if (flags & MAP_BELOW_HINT)
+		/*
+		 * Subtract (STACK_TOP - mm->mmap_base) to get random
+		 * offset defined in mmap_base() in mm/util.c
+		 */
+		info.high_limit = MIN(info.high_limit, (addr + len) - (STACK_TOP - mm->mmap_base));
 	info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
@@ -137,6 +145,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = mm->mmap_base;
 		info.high_limit = TASK_SIZE;
+		if (flags & MAP_BELOW_HINT)
+			info.high_limit = MIN(info.high_limit, addr + len);
 		addr = vm_unmapped_area(&info);
 	}
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/mips/mm/mmap.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index 7e11d7b58761..1595fda284cd 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -79,6 +79,13 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
 		info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 		info.low_limit = PAGE_SIZE;
 		info.high_limit = mm->mmap_base;
+		if (flags & MAP_BELOW_HINT)
+			/*
+			 * Subtract (STACK_TOP - mm->mmap_base) to get random
+			 * offset defined in mmap_base() in mm/util.c
+			 */
+			info.high_limit = MIN(info.high_limit,
+					      (addr + len) - (STACK_TOP - mm->mmap_base));
 		addr = vm_unmapped_area(&info);

 		if (!(addr & ~PAGE_MASK))
@@ -94,6 +101,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,

 	info.low_limit = mm->mmap_base;
 	info.high_limit = TASK_SIZE;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	return vm_unmapped_area(&info);
 }
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/parisc/include/uapi/asm/mman.h       | 1 +
 arch/parisc/kernel/sys_parisc.c           | 9 +++++++++
 tools/arch/parisc/include/uapi/asm/mman.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..44925ef8ac44 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -26,6 +26,7 @@
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
 #define MAP_UNINITIALIZED 0		/* uninitialized anonymous mmap */
+#define MAP_BELOW_HINT	0x200000	/* give out address that is below (inclusive) hint address */

 #define MS_SYNC		1		/* synchronous memory sync */
 #define MS_ASYNC	2		/* sync memory asynchronously */

diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c
index f7722451276e..feccb60cf746 100644
--- a/arch/parisc/kernel/sys_parisc.c
+++ b/arch/parisc/kernel/sys_parisc.c
@@ -148,6 +148,13 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
 		info.flags = VM_UNMAPPED_AREA_TOPDOWN;
 		info.low_limit = PAGE_SIZE;
 		info.high_limit = mm->mmap_base;
+		if (flags & MAP_BELOW_HINT)
+			/*
+			 * Subtract (STACK_TOP - mm->mmap_base) to get random
+			 * offset defined in mmap_base() in mm/util.c
+			 */
+			info.high_limit = MIN(info.high_limit,
+					      (addr + len) - (STACK_TOP - mm->mmap_base));
 		addr = vm_unmapped_area(&info);
 		if (!(addr & ~PAGE_MASK))
 			return addr;
@@ -163,6 +170,8 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,

 	info.low_limit = mm->mmap_base;
 	info.high_limit = mmap_upper_limit(NULL);
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	return vm_unmapped_area(&info);
 }

diff --git a/tools/arch/parisc/include/uapi/asm/mman.h b/tools/arch/parisc/include/uapi/asm/mman.h
index 4cc88a642e10..297acc0f7b2a 100644
--- a/tools/arch/parisc/include/uapi/asm/mman.h
+++ b/tools/arch/parisc/include/uapi/asm/mman.h
@@ -40,4 +40,5 @@
 /* MAP_32BIT is undefined on parisc, fix it for perf */
 #define MAP_32BIT	0
 #define MAP_UNINITIALIZED 0
+#define MAP_BELOW_HINT	0x200000
 #endif
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/s390/mm/mmap.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
index 206756946589..29e20351ca85 100644
--- a/arch/s390/mm/mmap.c
+++ b/arch/s390/mm/mmap.c
@@ -105,6 +105,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = mm->mmap_base;
 	info.high_limit = TASK_SIZE;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	info.align_mask = get_align_mask(filp, flags);
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
@@ -143,6 +145,12 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = mm->mmap_base;
+	if (flags & MAP_BELOW_HINT)
+		/*
+		 * Subtract (STACK_TOP - mm->mmap_base) to get random
+		 * offset defined in mmap_base() in mm/util.c
+		 */
+		info.high_limit = MIN(info.high_limit, (addr + len) - (STACK_TOP - mm->mmap_base));
 	info.align_mask = get_align_mask(filp, flags);
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
@@ -158,6 +166,8 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad
 	info.flags = 0;
 	info.low_limit = TASK_UNMAPPED_BASE;
 	info.high_limit = TASK_SIZE;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(TASK_SIZE, addr + len);
 	addr = vm_unmapped_area(&info);
 	if (offset_in_page(addr))
 		return addr;
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/sh/mm/mmap.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c
index bee329d4149a..867f6598b7d0 100644
--- a/arch/sh/mm/mmap.c
+++ b/arch/sh/mm/mmap.c
@@ -91,6 +91,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	info.length = len;
 	info.low_limit = TASK_UNMAPPED_BASE;
 	info.high_limit = TASK_SIZE;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	return vm_unmapped_area(&info);
@@ -141,6 +143,12 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = mm->mmap_base;
+	if (flags & MAP_BELOW_HINT)
+		/*
+		 * Subtract (STACK_TOP - mm->mmap_base) to get random
+		 * offset defined in mmap_base() in mm/util.c
+		 */
+		info.high_limit = MIN(info.high_limit, (addr + len) - (STACK_TOP - mm->mmap_base));
 	info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
@@ -156,6 +164,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = TASK_SIZE;
+		if (flags & MAP_BELOW_HINT)
+			info.high_limit = MIN(info.high_limit, addr + len);
 		addr = vm_unmapped_area(&info);
 	}
Add support for MAP_BELOW_HINT to mmap by restricting high_limit to addr when the flag is enabled.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/sparc/kernel/sys_sparc_64.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index d9c3b34ca744..b9ce9988551a 100644
--- a/arch/sparc/kernel/sys_sparc_64.c
+++ b/arch/sparc/kernel/sys_sparc_64.c
@@ -129,6 +129,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 	info.length = len;
 	info.low_limit = TASK_UNMAPPED_BASE;
 	info.high_limit = min(task_size, VA_EXCLUDE_START);
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
@@ -137,6 +139,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
 		VM_BUG_ON(addr != -ENOMEM);
 		info.low_limit = VA_EXCLUDE_END;
 		info.high_limit = task_size;
+		if (flags & MAP_BELOW_HINT)
+			info.high_limit = MIN(info.high_limit, addr + len);
 		addr = vm_unmapped_area(&info);
 	}
@@ -192,6 +196,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 	info.length = len;
 	info.low_limit = PAGE_SIZE;
 	info.high_limit = mm->mmap_base;
+	if (flags & MAP_BELOW_HINT)
+		info.high_limit = MIN(info.high_limit, addr + len);
 	info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
 	info.align_offset = pgoff << PAGE_SHIFT;
 	addr = vm_unmapped_area(&info);
@@ -207,6 +213,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = STACK_TOP32;
+		if (flags & MAP_BELOW_HINT)
+			info.high_limit = MIN(info.high_limit, addr + len);
 		addr = vm_unmapped_area(&info);
 	}
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 tools/testing/selftests/mm/Makefile         |  1 +
 tools/testing/selftests/mm/map_below_hint.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..4e2de85267b5 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_below_hint
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate

diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c
new file mode 100644
index 000000000000..305274c5af49
--- /dev/null
+++ b/tools/testing/selftests/mm/map_below_hint.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the MAP_BELOW_HINT mmap flag.
+ */
+#include <sys/mman.h>
+#include "../kselftest.h"
+
+#define ADDR 0x1000000UL
+#define LENGTH (ADDR / 100)
+
+#define MAP_BELOW_HINT 0x8000000 /* Not defined in all libc */
+
+/*
+ * Map memory with MAP_BELOW_HINT until no memory left. Ensure that all returned
+ * addresses are below the hint.
+ */
+int main(int argc, char **argv)
+{
+	void *addr;
+
+	do {
+		addr = mmap((void *)ADDR, LENGTH, MAP_ANONYMOUS, MAP_BELOW_HINT, -1, 0);
+	} while (addr == MAP_FAILED && (unsigned long)addr <= ADDR);
+
+	if (addr != MAP_FAILED && (unsigned long)addr > ADDR)
+		ksft_exit_fail_msg("mmap returned address above hint with MAP_BELOW_HINT\n");
+
+	ksft_test_result_pass("MAP_BELOW_HINT works\n");
+}
On Tue, Aug 27, 2024 at 10:49:22PM GMT, Charlie Jenkins wrote:
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..4e2de85267b5 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += hugepage-vmemmap TEST_GEN_FILES += khugepaged TEST_GEN_FILES += madv_populate +TEST_GEN_FILES += map_below_hint TEST_GEN_FILES += map_fixed_noreplace TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c new file mode 100644 index 000000000000..305274c5af49 --- /dev/null +++ b/tools/testing/selftests/mm/map_below_hint.c @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- Test the MAP_BELOW_HINT mmap flag.
- */
+#include <sys/mman.h> +#include "../kselftest.h"
+#define ADDR 0x1000000UL +#define LENGTH (ADDR / 100)
+#define MAP_BELOW_HINT 0x8000000 /* Not defined in all libc */
+/*
- Map memory with MAP_BELOW_HINT until no memory left. Ensure that all returned
- addresses are below the hint.
- */
+int main(int argc, char **argv) +{
- void *addr;
- do {
addr = mmap((void *)ADDR, LENGTH, MAP_ANONYMOUS, MAP_BELOW_HINT, -1, 0);
How can this be correct? mmap() has parameters:
void *mmap(void addr[.length], size_t length, int prot, int flags, int fd, off_t offset);
You're setting prot = MAP_ANONYMOUS, flags = MAP_BELOW_HINT...
This surely should be:
mmap(..., PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_BELOW_HINT, -1, 0);
- } while (addr == MAP_FAILED && (unsigned long)addr <= ADDR);
How can addr == MAP_FAILED (i.e. ~0) and addr <= ADDR? This will just loop through once...
If you want to make sure you're getting mappings only below the hint until you start getting MAP_FAILED's you'll need to handle this more robustly.
- if (addr != MAP_FAILED && (unsigned long)addr > ADDR)
ksft_exit_fail_msg("mmap returned address above hint with MAP_BELOW_HINT\n");
This is just going to fail because your flags are wrong then wrongly claim to have passed...
- ksft_test_result_pass("MAP_BELOW_HINT works\n");
+}
-- 2.45.0
On Wed, Aug 28, 2024 at 06:48:33PM +0100, Lorenzo Stoakes wrote:
On Tue, Aug 27, 2024 at 10:49:22PM GMT, Charlie Jenkins wrote:
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 30 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index cfad627e8d94..4e2de85267b5 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += hugepage-vmemmap TEST_GEN_FILES += khugepaged TEST_GEN_FILES += madv_populate +TEST_GEN_FILES += map_below_hint TEST_GEN_FILES += map_fixed_noreplace TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c new file mode 100644 index 000000000000..305274c5af49 --- /dev/null +++ b/tools/testing/selftests/mm/map_below_hint.c @@ -0,0 +1,29 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- Test the MAP_BELOW_HINT mmap flag.
- */
+#include <sys/mman.h> +#include "../kselftest.h"
+#define ADDR 0x1000000UL +#define LENGTH (ADDR / 100)
+#define MAP_BELOW_HINT 0x8000000 /* Not defined in all libc */
+/*
- Map memory with MAP_BELOW_HINT until no memory left. Ensure that all returned
- addresses are below the hint.
- */
+int main(int argc, char **argv) +{
- void *addr;
- do {
addr = mmap((void *)ADDR, LENGTH, MAP_ANONYMOUS, MAP_BELOW_HINT, -1, 0);
How can this be correct? mmap() has parameters:
void *mmap(void addr[.length], size_t length, int prot, int flags, int fd, off_t offset);
You're setting prot = MAP_ANONYMOUS, flags = MAP_BELOW_HINT...
This surely should be:
mmap(..., PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE | MAP_BELOW_HINT, -1, 0);
- } while (addr == MAP_FAILED && (unsigned long)addr <= ADDR);
How can addr == MAP_FAILED (i.e. ~0) and addr <= ADDR? This will just loop through once...
If you want to make sure you're getting mappings only below the hint until you start getting MAP_FAILED's you'll need to handle this more robustly.
- if (addr != MAP_FAILED && (unsigned long)addr > ADDR)
ksft_exit_fail_msg("mmap returned address above hint with MAP_BELOW_HINT\n");
This is just going to fail because your flags are wrong then wrongly claim to have passed...
I obviously didn't spend enough time thinking about this test case... You are correct that I wrote this incorrectly. I will make a proper test case and send out a new version.
- Charlie
- ksft_test_result_pass("MAP_BELOW_HINT works\n");
+}
-- 2.45.0
On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag allows applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason which addresses mmap will be allowed to return.
riscv made this feature of mmap returning addresses less than the hint address the default behavior. This was in contrast to the implementation of x86/arm64 that have a single boundary at the 5-level page table region. However this restriction proved too great -- the reduced address space when using a hint address was too small.
A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.
I'm a little confused as to the justification for this - you broke RISC V by doing this, and have now reverted it, but now offer the same behaviour that broke RISC V to all other architectures?
I mean this is how this reads, so I might be being ungenerous here :) but would be good to clarify what the value-add is here.
I also wonder at use of a new MAP_ flag, they're a limited resource and we should only ever add them if we _really_ need to. This seems a bit niche and specific to be making such a big change for including touching a bunch of pretty sensitive arch-specific code.
We have the ability to change how mmap() functions through 'personalities' though of course this would impact every mmap() call in the process.
Overall I'm really not hugely convinced by this, it feels like userland could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to reserve a domain and mprotect() it on allocation or mmap() over it).
So I just struggle to see the purpose myself. BUT absolutely I may be missing context/others may have a view on the value of this. So happy to stand corrected.
I have only tested on riscv and x86. There is a tremendous amount of
Yeah, OK this is crazy, you can't really submit something as non-RFC that touches every single arch and not test it.
I also feel like we need more justification than 'this is a neat thing that we use in RISC V sometimes' conceptually for such a big change.
Also your test program is currently completely broken afaict (have commented on it directly). I also feel like your test program is a little rudimentary, and should test some edge cases close to the limit etc.
So I think this is a NACK until there is testing across the board and a little more justification.
Feel free to respin, but I think any future revisions should be RFC until we're absolutely sure on testing/justification.
I appreciate your efforts here so sorry to be negative, but just obviously want to make sure this is functional and trades off added complexity for value for the kernel and userland :)
Thanks!
duplicated code in mmap so the implementations across architectures I believe should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.... [1]
To: Arnd Bergmann arnd@arndb.de To: Paul Walmsley paul.walmsley@sifive.com To: Palmer Dabbelt palmer@dabbelt.com To: Albert Ou aou@eecs.berkeley.edu To: Catalin Marinas catalin.marinas@arm.com To: Will Deacon will@kernel.org To: Michael Ellerman mpe@ellerman.id.au To: Nicholas Piggin npiggin@gmail.com To: Christophe Leroy christophe.leroy@csgroup.eu To: Naveen N Rao naveen@kernel.org To: Muchun Song muchun.song@linux.dev To: Andrew Morton akpm@linux-foundation.org To: Liam R. Howlett Liam.Howlett@oracle.com To: Vlastimil Babka vbabka@suse.cz To: Lorenzo Stoakes lorenzo.stoakes@oracle.com To: Thomas Gleixner tglx@linutronix.de To: Ingo Molnar mingo@redhat.com To: Borislav Petkov bp@alien8.de To: Dave Hansen dave.hansen@linux.intel.com To: x86@kernel.org To: H. Peter Anvin hpa@zytor.com To: Huacai Chen chenhuacai@kernel.org To: WANG Xuerui kernel@xen0n.name To: Russell King linux@armlinux.org.uk To: Thomas Bogendoerfer tsbogend@alpha.franken.de To: James E.J. Bottomley James.Bottomley@HansenPartnership.com To: Helge Deller deller@gmx.de To: Alexander Gordeev agordeev@linux.ibm.com To: Gerald Schaefer gerald.schaefer@linux.ibm.com To: Heiko Carstens hca@linux.ibm.com To: Vasily Gorbik gor@linux.ibm.com To: Christian Borntraeger borntraeger@linux.ibm.com To: Sven Schnelle svens@linux.ibm.com To: Yoshinori Sato ysato@users.sourceforge.jp To: Rich Felker dalias@libc.org To: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de To: David S. 
Miller davem@davemloft.net To: Andreas Larsson andreas@gaisler.com To: Shuah Khan shuah@kernel.org To: Alexandre Ghiti alexghiti@rivosinc.com Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Palmer Dabbelt palmer@rivosinc.com Cc: linux-riscv@lists.infradead.org Cc: linux-arm-kernel@lists.infradead.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-mm@kvack.org Cc: loongarch@lists.linux.dev Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Charlie Jenkins (16): mm: Add MAP_BELOW_HINT riscv: mm: Do not restrict mmap address based on hint mm: Add flag and len param to arch_get_mmap_base() mm: Add generic MAP_BELOW_HINT riscv: mm: Support MAP_BELOW_HINT arm64: mm: Support MAP_BELOW_HINT powerpc: mm: Support MAP_BELOW_HINT x86: mm: Support MAP_BELOW_HINT loongarch: mm: Support MAP_BELOW_HINT arm: mm: Support MAP_BELOW_HINT mips: mm: Support MAP_BELOW_HINT parisc: mm: Support MAP_BELOW_HINT s390: mm: Support MAP_BELOW_HINT sh: mm: Support MAP_BELOW_HINT sparc: mm: Support MAP_BELOW_HINT selftests/mm: Create MAP_BELOW_HINT test
arch/arm/mm/mmap.c | 10 ++++++++ arch/arm64/include/asm/processor.h | 34 ++++++++++++++++++++++---- arch/loongarch/mm/mmap.c | 11 +++++++++ arch/mips/mm/mmap.c | 9 +++++++ arch/parisc/include/uapi/asm/mman.h | 1 + arch/parisc/kernel/sys_parisc.c | 9 +++++++ arch/powerpc/include/asm/task_size_64.h | 36 +++++++++++++++++++++++----- arch/riscv/include/asm/processor.h | 32 ------------------------- arch/s390/mm/mmap.c | 10 ++++++++ arch/sh/mm/mmap.c | 10 ++++++++ arch/sparc/kernel/sys_sparc_64.c | 8 +++++++ arch/x86/kernel/sys_x86_64.c | 25 ++++++++++++++++--- fs/hugetlbfs/inode.c | 2 +- include/linux/sched/mm.h | 34 ++++++++++++++++++++++++-- include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c | 2 +- tools/arch/parisc/include/uapi/asm/mman.h | 1 + tools/include/uapi/asm-generic/mman-common.h | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/map_below_hint.c | 29 ++++++++++++++++++++++ 20 files changed, 216 insertions(+), 50 deletions(-)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55 --
- Charlie
linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv
On Wed, Aug 28, 2024 at 07:19:36PM +0100, Lorenzo Stoakes wrote:
On Tue, Aug 27, 2024 at 10:49:06PM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag allows applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason which addresses mmap will be allowed to return.
riscv made this feature of mmap returning addresses less than the hint address the default behavior. This was in contrast to the implementation of x86/arm64 that have a single boundary at the 5-level page table region. However this restriction proved too great -- the reduced address space when using a hint address was too small.
A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.
I'm a little confused as to the justification for this - you broke RISC V by doing this, and have now reverted it, but now offer the same behaviour that broke RISC V to all other architectures?
I mean this is how this reads, so I might be being ungenerous here :) but would be good to clarify what the value-add is here.
Yeah, I did not do a good job of explaining this! What was broken was having this be the default behavior, not the feature in general.
I also wonder at use of a new MAP_ flag, they're a limited resource and we should only ever add them if we _really_ need to. This seems a bit niche and specific to be making such a big change for including touching a bunch of pretty sensitive arch-specific code.
We have the ability to change how mmap() functions through 'personalities' though of course this would impact every mmap() call in the process.
Overall I'm really not hugely convinced by this, it feels like userland could find better ways of doing this (mostly you'd do a PROT_NONE mmap() to reserve a domain and mprotect() it on allocation or mmap() over it).
So I just struggle to see the purpose myself. BUT absolutely I may be missing context/others may have a view on the value of this. So happy to stand corrected.
I have only tested on riscv and x86. There is a tremendous amount of
Yeah, OK this is crazy, you can't really submit something as non-RFC that touches every single arch and not test it.
I also feel like we need more justification than 'this is a neat thing that we use in RISC V sometimes' conceptually for such a big change.
I will send out a new version that does a much better job at explaining! This is not something that is done on riscv ever currently. This is something that is done on other architectures such as x86 and arm64. This flag is to make similar behavior (the ability to force mmap to return addresses that have a constrained address space) available to all architectures.
Also your test program is currently completely broken afaict (have commented on it directly). I also feel like your test program is a little rudimentary, and should test some edge cases close to the limit etc.
So I think this is a NACK until there is testing across the board and a little more justification.
Feel free to respin, but I think any future revisions should be RFC until we're absolutely sure on testing/justification.
I appreciate your efforts here so sorry to be negative, but just obviously want to make sure this is functional and trades off added complexity for value for the kernel and userland :)
Totally understand thank you! After reviewing comments I have realized that I made this much more complicated than it needs to be. This should be able to be done without changing any architecture specific code. That will mostly eliminate all of the complexity, but still has the downside of consuming a MAP_ flag.
- Charlie
On 8/27/24 22:49, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
Which applications are these, btw?
Is this the same crowd as the folks who are using the address tagging features like X86_FEATURE_LAM?
Even if they are different, I also wonder if a per-mmap() thing MAP_BELOW_HINT is really what we want. Or should the applications you're trying to service here use a similar mechanism to how LAM affects the *whole* address space as opposed to an individual mmap().
On Wed, Aug 28, 2024 at 11:29:56AM -0700, Dave Hansen wrote:
On 8/27/24 22:49, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
Which applications are these, btw?
Java and Go require this feature. These applications store flags that represent the type of data a pointer holds in the upper bits of the pointer itself.
Is this the same crowd as the folks who are using the address tagging features like X86_FEATURE_LAM?
Yes it is. LAM helps to mask the bits out on x86, and this feature could be used to ensure that mmap() doesn't return an address with bits that would be masked out. I chose not to tie this feature to x86 LAM, which only has masking boundaries at 57 and 48 bits, so that it can be independent of architecture-specific address masking.
Even if they are different, I also wonder if a per-mmap() thing like MAP_BELOW_HINT is really what we want. Or should the applications you're trying to service here use a mechanism that, like LAM, affects the *whole* address space as opposed to an individual mmap()?
LAM is required to be enabled for entire address spaces because the hardware needs to be configured to mask out the bits. It is not possible to influence the granularity of LAM in the current implementation. However mmap() does not require any of this hardware configuration so it is possible to have finer granularity.
A way to restrict mmap() to return LAM-compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there is additional performance saving that could be achieved by having this be on a per-address-space basis.
Link: https://cdrdv2.intel.com/v1/dl/getContent/671368 [1]
- Charlie
On 8/28/24 13:15, Charlie Jenkins wrote:
A way to restrict mmap() to return LAM-compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there is additional performance saving that could be achieved by having this be on a per-address-space basis.
I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. But it's also rather complicated.
My _hope_ would be that a per-address-space property could share at least some infrastructure with what x86/LAM and arm/TBI do to the address space. Basically put the restrictions in place for purely software reasons instead of the mostly hardware reasons for LAM/TBI.
Lorenzo also raised some very valid points about having a generic address-restriction ABI. I'm certainly not discounting those concerns. It's not something that can be done lightly.
* Dave Hansen dave.hansen@intel.com [240829 12:54]:
On 8/28/24 13:15, Charlie Jenkins wrote:
A way to restrict mmap() to return LAM-compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there is additional performance saving that could be achieved by having this be on a per-address-space basis.
I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. But it's also rather complicated.
There is a (seldom used?) mmap_min_addr feature; it seems like we could have an mmap_max_addr. Would something like that work for your use case? Perhaps it would be less intrusive to do something in this way? I haven't looked at it in depth, and this affects all address spaces as well (new allocations only).
There is a note on mmap_min_addr about applications that require the lower addresses, would this mean we'll now have a note about upper limits?
I really don't understand why you need this at all, to be honest. If you know the upper limit you could just MAP_FIXED map a huge guard at the top of your address space then do whatever you want with those bits.
This will create an entry in the vma tree that no one else will be able to use, and you can do this in any process you want, for as many bits as you want.
My _hope_ would be that a per-address-space property could share at least some infrastructure with what x86/LAM and arm/TBI do to the address space. Basically put the restrictions in place for purely software reasons instead of the mostly hardware reasons for LAM/TBI.
Lorenzo also raised some very valid points about having a generic address-restriction ABI. I'm certainly not discounting those concerns. It's not something that can be done lightly.
Yes, I am concerned about supporting this (probably forever) and dancing around special code that may cause issues, perhaps on an arch that few have for testing. I already have so many qemu images for testing, some of which no longer have valid install media - and basically none of them use the same code in this area (or have special cases already). I think you understand what we are dealing with considering your comments in your cover letter.
Thanks, Liam
On Thu, Aug 29, 2024 at 03:36:43PM -0400, Liam R. Howlett wrote:
- Dave Hansen dave.hansen@intel.com [240829 12:54]:
On 8/28/24 13:15, Charlie Jenkins wrote:
A way to restrict mmap() to return LAM-compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there is additional performance saving that could be achieved by having this be on a per-address-space basis.
I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. But it's also rather complicated.
There is a (seldom used?) mmap_min_addr feature; it seems like we could have an mmap_max_addr. Would something like that work for your use case? Perhaps it would be less intrusive to do something in this way? I haven't looked at it in depth, and this affects all address spaces as well (new allocations only).
There is a note on mmap_min_addr about applications that require the lower addresses, would this mean we'll now have a note about upper limits?
I don't think that's a viable solution because that would change the mmap behavior for all applications running on the system, and wouldn't allow individual applications to have different configurations.
I really don't understand why you need this at all, to be honest. If you know the upper limit you could just MAP_FIXED map a huge guard at the top of your address space then do whatever you want with those bits.
This will create an entry in the vma tree that no one else will be able to use, and you can do this in any process you want, for as many bits as you want.
Oh, that's an interesting idea. I am not sure how that could work in practice though. The application would need to know it allocated all of the addresses in the upper address space; how would it be able to do that?
My _hope_ would be that a per-address-space property could share at least some infrastructure with what x86/LAM and arm/TBI do to the address space. Basically put the restrictions in place for purely software reasons instead of the mostly hardware reasons for LAM/TBI.
Lorenzo also raised some very valid points about having a generic address-restriction ABI. I'm certainly not discounting those concerns. It's not something that can be done lightly.
Yes, I am concerned about supporting this (probably forever) and dancing around special code that may cause issues, perhaps on an arch that few have for testing. I already have so many qemu images for testing, some of which no longer have valid install media - and basically none of them use the same code in this area (or have special cases already). I think you understand what we are dealing with considering your comments in your cover letter.
It is definitely not something to be taken lightly. Version 2 of this is completely generic, so that should eliminate most of the concern of "special code" on various architectures. Unless I am misunderstanding something.
- Charlie
Thanks, Liam
On Thu, Aug 29, 2024 at 09:54:08AM -0700, Dave Hansen wrote:
On 8/28/24 13:15, Charlie Jenkins wrote:
A way to restrict mmap() to return LAM-compliant addresses in an entire address space also doesn't have to be mutually exclusive with this flag. This flag allows for the greatest degree of control from applications. I don't believe there is additional performance saving that could be achieved by having this be on a per-address-space basis.
I agree with you in general. The MAP_BELOW_HINT _is_ the most flexible. But it's also rather complicated.
Can you expand upon what you mean by it being complicated? Complicated for the kernel or complicated for a user?
My _hope_ would be that a per-address-space property could share at least some infrastructure with what x86/LAM and arm/TBI do to the address space. Basically put the restrictions in place for purely software reasons instead of the mostly hardware reasons for LAM/TBI.
That is a good point, perhaps that would be a way to hook this into LAM, TBI, and any other architecture's specific address masking feature.
- Charlie
Lorenzo also raised some very valid points about having a generic address-restriction ABI. I'm certainly not discounting those concerns. It's not something that can be done lightly.
* Charlie Jenkins charlie@rivosinc.com [240828 01:49]:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
Wait, what arch(s) allows for greater than the max? The passed hint should be where we start searching, but we go to the lower limit then start at the hint and search up (or vice-versa on the directions).
I don't understand how unmapping works on a higher address; we would fail to free it on termination of the application.
Also, there are archs that map outside of the VMAs, which are freed by freeing from the prev->vm_end to next->vm_start, so I don't understand what that looks like in this reality as well.
On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag allows applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason which addresses mmap will be allowed to return.
But why do they need to know today? We have a limit for this, don't we?
Also, these upper limits are how some archs use the upper bits that you are trying to use.
riscv made this behavior, mmap returning addresses less than the hint address, the default. This was in contrast to the implementation of x86/arm64, which have a single boundary at the 5-level page table region. However this restriction proved too great -- the reduced address space when using a hint address was too small.
Yes, the hint is used to group things close together so it would literally be random chance on if you have enough room or not (aslr and all).
A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.
I don't fully understand this statement; you say it broke userspace, so now you are porting it to everyone? This reads as if you are breaking userspace on all architectures :)
If you fail to find room below, then your application fails as there is no way to get the upper bits you need. It would be better to fix this in userspace - if your application is returned too high an address, then free it and exit because it's going to fail anyways.
I have only tested on riscv and x86.
This should be an RFC then.
There is a tremendous amount of duplicated code in mmap, so I believe the implementations across architectures should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
Way too much duplicate code. We should be figuring out how to make this all work with the same code.
This is going to make the cloned code problem worse.
Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.... [1]
Charlie Jenkins (16):
  mm: Add MAP_BELOW_HINT
  riscv: mm: Do not restrict mmap address based on hint
  mm: Add flag and len param to arch_get_mmap_base()
  mm: Add generic MAP_BELOW_HINT
  riscv: mm: Support MAP_BELOW_HINT
  arm64: mm: Support MAP_BELOW_HINT
  powerpc: mm: Support MAP_BELOW_HINT
  x86: mm: Support MAP_BELOW_HINT
  loongarch: mm: Support MAP_BELOW_HINT
  arm: mm: Support MAP_BELOW_HINT
  mips: mm: Support MAP_BELOW_HINT
  parisc: mm: Support MAP_BELOW_HINT
  s390: mm: Support MAP_BELOW_HINT
  sh: mm: Support MAP_BELOW_HINT
  sparc: mm: Support MAP_BELOW_HINT
  selftests/mm: Create MAP_BELOW_HINT test
- Charlie
On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote:
- Charlie Jenkins charlie@rivosinc.com [240828 01:49]:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
Wait, what arch(s) allows for greater than the max? The passed hint should be where we start searching, but we go to the lower limit then start at the hint and search up (or vice-versa on the directions).
I worded this awkwardly. On arm64 there is a page-table boundary at 48 bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits. The max value mmap is able to return on arm64 is 48 bits if the hint address uses 48 bits or less, even if the architecture supports 5-level paging and thus addresses can be 52 bits. Applications can opt-in to using up to 52-bits in an address by using a hint address greater than 48 bits. x86 has the same behavior but with 57 bits instead of 52.
The reason this exists is that some applications arbitrarily replace bits in virtual addresses with data, with the assumption that the address will not be using any of the bits above bit 48 in the virtual address. As hardware with larger address spaces was released, x86 decided to build safety guards into the kernel to allow the applications that made these assumptions to continue to work on this different hardware.
This causes all applications that use a hint address to silently be restricted to 48-bit addresses. The goal of this flag is to have a way for applications to explicitly request how many bits they want mmap to use.
I don't understand how unmapping works on a higher address; we would fail to free it on termination of the application.
Also, there are archs that map outside of the VMAs, which are freed by freeing from the prev->vm_end to next->vm_start, so I don't understand what that looks like in this reality as well.
On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag allows applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason which addresses mmap will be allowed to return.
But why do they need to know today? We have a limit for this, don't we?
The limit is different for different architectures. On x86 the limit is 57 bits, and on arm64 it is 52 bits. So in the theoretical case that an application requires 10 bits free in a virtual address, the application would always work on arm64 regardless of the hint address, but on x86 if the hint address is greater than 48 bits then the application will not work.
The goal of this flag is to have consistent and tunable behavior of mmap() when it is desired to ensure that mmap() only returns addresses that use some number of bits.
Also, these upper limits are how some archs use the upper bits that you are trying to use.
It does not eliminate the existing behavior of the architectures to place these upper limits; it instead provides a way to have consistent behavior across all architectures.
riscv made this behavior, mmap returning addresses less than the hint address, the default. This was in contrast to the implementation of x86/arm64, which have a single boundary at the 5-level page table region. However this restriction proved too great -- the reduced address space when using a hint address was too small.
Yes, the hint is used to group things close together so it would literally be random chance on if you have enough room or not (aslr and all).
A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.
I don't fully understand this statement; you say it broke userspace, so now you are porting it to everyone? This reads as if you are breaking userspace on all architectures :)
It was the default for mmap on riscv. The difference here is that it is now enabled by a flag instead. Instead of making the flag specific to riscv, I figured that other architectures might find it useful as well.
If you fail to find room below, then your application fails as there is no way to get the upper bits you need. It would be better to fix this in userspace - if your application is returned too high an address, then free it and exit because it's going to fail anyways.
This flag is trying to define an API that is more robust than the current behavior of x86 and arm64, which implicitly restricts mmap() addresses to 48 bits. A solution could be to just write in the docs that mmap() will always exhaust all addresses below the hint address before returning an address that is above the hint address. However a flag that defines this behavior seems more intuitive.
I have only tested on riscv and x86.
This should be an RFC then.
Fair enough.
There is a tremendous amount of duplicated code in mmap, so I believe the implementations across architectures should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
Way too much duplicate code. We should be figuring out how to make this all work with the same code.
This is going to make the cloned code problem worse.
That would require standardizing every architecture with the generic mmap() framework that arm64 has developed. That is far outside the scope of this patch, but would be a great area to research for each of the architectures that do not use the generic framework.
- Charlie
Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.... [1]
- Charlie
On Wed, Aug 28, 2024 at 01:59:18PM -0700, Charlie Jenkins wrote:
On Wed, Aug 28, 2024 at 02:31:42PM -0400, Liam R. Howlett wrote:
- Charlie Jenkins charlie@rivosinc.com [240828 01:49]:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the maximum address space, unless the hint address is greater than this value.
Wait, what arch(s) allows for greater than the max? The passed hint should be where we start searching, but we go to the lower limit then start at the hint and search up (or vice-versa on the directions).
I worded this awkwardly. On arm64 there is a page-table boundary at 48 bits and at 52 bits. On x86 the boundaries are at 48 bits and 57 bits. The max value mmap is able to return on arm64 is 48 bits if the hint address uses 48 bits or less, even if the architecture supports 5-level paging and thus addresses can be 52 bits. Applications can opt-in to using up to 52-bits in an address by using a hint address greater than 48 bits. x86 has the same behavior but with 57 bits instead of 52.
The reason this exists is that some applications arbitrarily replace bits in virtual addresses with data, with the assumption that the address will not be using any of the bits above bit 48 in the virtual address. As hardware with larger address spaces was released, x86 decided to build safety guards into the kernel to allow the applications that made these assumptions to continue to work on this different hardware.
This causes all applications that use a hint address to silently be restricted to 48-bit addresses. The goal of this flag is to have a way for applications to explicitly request how many bits they want mmap to use.
I don't understand how unmapping works on a higher address; we would fail to free it on termination of the application.
Also, there are archs that map outside of the VMAs, which are freed by freeing from the prev->vm_end to next->vm_start, so I don't understand what that looks like in this reality as well.
On arm64 this barrier is at 52 bits and on x86 it is at 56 bits. This flag allows applications a way to specify exactly how many bits they want to be left unused by mmap. This eliminates the need for applications to know the page table hierarchy of the system to be able to reason which addresses mmap will be allowed to return.
But why do they need to know today? We have a limit for this, don't we?
The limit is different for different architectures. On x86 the limit is 57 bits, and on arm64 it is 52 bits. So in the theoretical case that an application requires 10 bits free in a virtual address, the application would always work on arm64 regardless of the hint address, but on x86 if the hint address is greater than 48 bits then the application will not work.
The goal of this flag is to have consistent and tunable behavior of mmap() when it is desired to ensure that mmap() only returns addresses that use some number of bits.
Also, these upper limits are how some archs use the upper bits that you are trying to use.
It does not eliminate the existing behavior of the architectures to place these upper limits; it instead provides a way to have consistent behavior across all architectures.
riscv made this behavior, mmap returning addresses less than the hint address, the default. This was in contrast to the implementation of x86/arm64, which have a single boundary at the 5-level page table region. However this restriction proved too great -- the reduced address space when using a hint address was too small.
Yes, the hint is used to group things close together so it would literally be random chance on if you have enough room or not (aslr and all).
A patch for riscv [1] reverts the behavior that broke userspace. This series serves to make this feature available to all architectures.
I don't fully understand this statement; you say it broke userspace, so now you are porting it to everyone? This reads as if you are breaking userspace on all architectures :)
It was the default for mmap on riscv. The difference here is that it is now enabled by a flag instead. Instead of making the flag specific to riscv, I figured that other architectures might find it useful as well.
If you fail to find room below, then your application fails as there is no way to get the upper bits you need. It would be better to fix this in userspace - if your application is returned too high an address, then free it and exit because it's going to fail anyways.
This flag is trying to define an API that is more robust than the current behavior of x86 and arm64, which implicitly restricts mmap() addresses to 48 bits. A solution could be to just write in the docs that mmap() will always exhaust all addresses below the hint address before returning an address that is above the hint address. However a flag that defines this behavior seems more intuitive.
I have only tested on riscv and x86.
This should be an RFC then.
Fair enough.
There is a tremendous amount of duplicated code in mmap, so I believe the implementations across architectures should be mostly consistent. I added this feature to all architectures that implement either arch_get_mmap_end()/arch_get_mmap_base() or arch_get_unmapped_area_topdown()/arch_get_unmapped_area(). I also added it to the default behavior for arch_get_mmap_end()/arch_get_mmap_base().
Way too much duplicate code. We should be figuring out how to make this all work with the same code.
This is going to make the cloned code problem worse.
That would require standardizing every architecture with the generic mmap() framework that arm64 has developed. That is far outside the scope of this patch, but would be a great area to research for each of the architectures that do not use the generic framework.
Thinking about this again, I could drop support for all architectures that do not implement arch_get_mmap_base()/arch_get_mmap_end().
- Charlie
Link: https://lore.kernel.org/lkml/20240826-riscv_mmap-v1-2-cd8962afe47f@rivosinc.... [1]
Charlie Jenkins (16):
  mm: Add MAP_BELOW_HINT
  riscv: mm: Do not restrict mmap address based on hint
  mm: Add flag and len param to arch_get_mmap_base()
  mm: Add generic MAP_BELOW_HINT
  riscv: mm: Support MAP_BELOW_HINT
  arm64: mm: Support MAP_BELOW_HINT
  powerpc: mm: Support MAP_BELOW_HINT
  x86: mm: Support MAP_BELOW_HINT
  loongarch: mm: Support MAP_BELOW_HINT
  arm: mm: Support MAP_BELOW_HINT
  mips: mm: Support MAP_BELOW_HINT
  parisc: mm: Support MAP_BELOW_HINT
  s390: mm: Support MAP_BELOW_HINT
  sh: mm: Support MAP_BELOW_HINT
  sparc: mm: Support MAP_BELOW_HINT
  selftests/mm: Create MAP_BELOW_HINT test
 arch/arm/mm/mmap.c                           | 10 ++++++++
 arch/arm64/include/asm/processor.h           | 34 ++++++++++++++++++++++----
 arch/loongarch/mm/mmap.c                     | 11 +++++++++
 arch/mips/mm/mmap.c                          |  9 +++++++
 arch/parisc/include/uapi/asm/mman.h          |  1 +
 arch/parisc/kernel/sys_parisc.c              |  9 +++++++
 arch/powerpc/include/asm/task_size_64.h      | 36 +++++++++++++++++++++++-----
 arch/riscv/include/asm/processor.h           | 32 -------------------------
 arch/s390/mm/mmap.c                          | 10 ++++++++
 arch/sh/mm/mmap.c                            | 10 ++++++++
 arch/sparc/kernel/sys_sparc_64.c             |  8 +++++++
 arch/x86/kernel/sys_x86_64.c                 | 25 ++++++++++++++++---
 fs/hugetlbfs/inode.c                         |  2 +-
 include/linux/sched/mm.h                     | 34 ++++++++++++++++++++++++--
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  2 +-
 tools/arch/parisc/include/uapi/asm/mman.h    |  1 +
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 29 ++++++++++++++++++++++
 20 files changed, 216 insertions(+), 50 deletions(-)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
--
- Charlie