Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be below the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK, an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to 47-bit addresses when the hint address contains fewer than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out, as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
This flag will also allow seamless compatibility across all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that it does not have to be handled individually by each architecture.
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
To: Arnd Bergmann arnd@arndb.de
To: Richard Henderson richard.henderson@linaro.org
To: Ivan Kokshaysky ink@jurassic.park.msu.ru
To: Matt Turner mattst88@gmail.com
To: Vineet Gupta vgupta@kernel.org
To: Russell King linux@armlinux.org.uk
To: Guo Ren guoren@kernel.org
To: Huacai Chen chenhuacai@kernel.org
To: WANG Xuerui kernel@xen0n.name
To: Thomas Bogendoerfer tsbogend@alpha.franken.de
To: James E.J. Bottomley James.Bottomley@HansenPartnership.com
To: Helge Deller deller@gmx.de
To: Michael Ellerman mpe@ellerman.id.au
To: Nicholas Piggin npiggin@gmail.com
To: Christophe Leroy christophe.leroy@csgroup.eu
To: Naveen N Rao naveen@kernel.org
To: Alexander Gordeev agordeev@linux.ibm.com
To: Gerald Schaefer gerald.schaefer@linux.ibm.com
To: Heiko Carstens hca@linux.ibm.com
To: Vasily Gorbik gor@linux.ibm.com
To: Christian Borntraeger borntraeger@linux.ibm.com
To: Sven Schnelle svens@linux.ibm.com
To: Yoshinori Sato ysato@users.sourceforge.jp
To: Rich Felker dalias@libc.org
To: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de
To: David S. Miller davem@davemloft.net
To: Andreas Larsson andreas@gaisler.com
To: Thomas Gleixner tglx@linutronix.de
To: Ingo Molnar mingo@redhat.com
To: Borislav Petkov bp@alien8.de
To: Dave Hansen dave.hansen@linux.intel.com
To: x86@kernel.org
To: H. Peter Anvin hpa@zytor.com
To: Andy Lutomirski luto@kernel.org
To: Peter Zijlstra peterz@infradead.org
To: Muchun Song muchun.song@linux.dev
To: Andrew Morton akpm@linux-foundation.org
To: Liam R. Howlett Liam.Howlett@oracle.com
To: Vlastimil Babka vbabka@suse.cz
To: Lorenzo Stoakes lorenzo.stoakes@oracle.com
To: Shuah Khan shuah@kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture-specific code and was able to factor this out into generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
---
Charlie Jenkins (4):
  mm: Add MAP_BELOW_HINT
  mm: Add hint and mmap_flags to struct vm_unmapped_area_info
  mm: Support MAP_BELOW_HINT in vm_unmapped_area()
  selftests/mm: Create MAP_BELOW_HINT test
 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be below the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
To make this behavior explicit and more versatile across all architectures, define a mmap flag that allows users to define an arbitrary upper limit on addresses returned by mmap.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
---
 include/uapi/asm-generic/mman-common.h       | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
 2 files changed, 2 insertions(+)
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
					 * uninitialized */
+#define MAP_BELOW_HINT	  0x8000000	/* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
					 * uninitialized */
+#define MAP_BELOW_HINT	  0x8000000	/* give out address that is below (inclusive) hint address */
 
 /*
  * Flags for mlock
Charlie Jenkins charlie@rivosinc.com writes:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be below the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
To make this behavior explicit and more versatile across all architectures, define a mmap flag that allows users to define an arbitrary upper limit on addresses returned by mmap.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
 include/uapi/asm-generic/mman-common.h       | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
You're not meant to update the headers in tools/ directly. There's a mail from acme somewhere describing the proper process, but the tl;dr is leave it up to him.
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
					 * uninitialized */
+#define MAP_BELOW_HINT	  0x8000000	/* give out address that is below (inclusive) hint address */
IMHO the API would be clearer if this actually forced the address to be below the hint. That's what the flag name implies after all.
It would also mean the application doesn't need to take into account the length of the mapping when passing the hint.
cheers
On Thu, Aug 29, 2024 at 06:26:41PM +1000, Michael Ellerman wrote:
Charlie Jenkins charlie@rivosinc.com writes:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be below the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
To make this behavior explicit and more versatile across all architectures, define a mmap flag that allows users to define an arbitrary upper limit on addresses returned by mmap.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
 include/uapi/asm-generic/mman-common.h       | 1 +
 tools/include/uapi/asm-generic/mman-common.h | 1 +
You're not meant to update the headers in tools/ directly. There's a mail from acme somewhere describing the proper process, but the tl;dr is leave it up to him.
Oh okay, thank you.
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..03ac13d9aa37 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -32,6 +32,7 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
					 * uninitialized */
+#define MAP_BELOW_HINT	  0x8000000	/* give out address that is below (inclusive) hint address */
IMHO the API would be clearer if this actually forced the address to be below the hint. That's what the flag name implies after all.
It would also mean the application doesn't need to take into account the length of the mapping when passing the hint.
cheers
That's a good point. The reason I did it this way was to allow mmap the possibility of returning the same address as the hint. If it must be strictly less than the hint then the hint address can never be returned. Maybe that doesn't matter though.
- Charlie
The hint address and mmap_flags are necessary to determine if MAP_BELOW_HINT requirements are satisfied.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
---
 arch/alpha/kernel/osf_sys.c      | 2 ++
 arch/arc/mm/mmap.c               | 3 +++
 arch/arm/mm/mmap.c               | 7 +++++++
 arch/csky/abiv1/mmap.c           | 3 +++
 arch/loongarch/mm/mmap.c         | 3 +++
 arch/mips/mm/mmap.c              | 3 +++
 arch/parisc/kernel/sys_parisc.c  | 3 +++
 arch/powerpc/mm/book3s64/slice.c | 7 +++++++
 arch/s390/mm/hugetlbpage.c       | 4 ++++
 arch/s390/mm/mmap.c              | 6 ++++++
 arch/sh/mm/mmap.c                | 6 ++++++
 arch/sparc/kernel/sys_sparc_32.c | 3 +++
 arch/sparc/kernel/sys_sparc_64.c | 6 ++++++
 arch/sparc/mm/hugetlbpage.c      | 4 ++++
 arch/x86/kernel/sys_x86_64.c     | 6 ++++++
 arch/x86/mm/hugetlbpage.c        | 4 ++++
 fs/hugetlbfs/inode.c             | 4 ++++
 include/linux/mm.h               | 2 ++
 mm/mmap.c                        | 6 ++++++
 19 files changed, 82 insertions(+)
diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c index e5f881bc8288..6903700afd12 100644 --- a/arch/alpha/kernel/osf_sys.c +++ b/arch/alpha/kernel/osf_sys.c @@ -1223,6 +1223,8 @@ arch_get_unmapped_area_1(unsigned long addr, unsigned long len, info.length = len; info.low_limit = addr; info.high_limit = limit; + info.hint = addr; + info.mmap_flags = flags; return vm_unmapped_area(&info); }
diff --git a/arch/arc/mm/mmap.c b/arch/arc/mm/mmap.c index 69a915297155..5922cb51e029 100644 --- a/arch/arc/mm/mmap.c +++ b/arch/arc/mm/mmap.c @@ -29,6 +29,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_area_struct *vma; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + /* * We enforce the MAP_FIXED case. */ diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c index d65d0e6ed10a..04d9234f049a 100644 --- a/arch/arm/mm/mmap.c +++ b/arch/arm/mm/mmap.c @@ -36,6 +36,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, int aliasing = cache_is_vipt_aliasing(); struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. @@ -56,6 +59,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, if (len > TASK_SIZE) return -ENOMEM;
+ if (addr) { if (do_align) addr = COLOUR_ALIGN(addr, pgoff); @@ -88,6 +92,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, int aliasing = cache_is_vipt_aliasing(); struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. diff --git a/arch/csky/abiv1/mmap.c b/arch/csky/abiv1/mmap.c index 7f826331d409..0be4913c6cf3 100644 --- a/arch/csky/abiv1/mmap.c +++ b/arch/csky/abiv1/mmap.c @@ -35,6 +35,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr, .align_offset = pgoff << PAGE_SHIFT };
+ info.hint = addr; + info.mmap_flags = flags; + /* * We only need to do colour alignment if either the I or D * caches alias. diff --git a/arch/loongarch/mm/mmap.c b/arch/loongarch/mm/mmap.c index 889030985135..7d1e8be20519 100644 --- a/arch/loongarch/mm/mmap.c +++ b/arch/loongarch/mm/mmap.c @@ -27,6 +27,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, int do_color_align; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM;
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c index 7e11d7b58761..22e8f9c8eaa0 100644 --- a/arch/mips/mm/mmap.c +++ b/arch/mips/mm/mmap.c @@ -36,6 +36,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, int do_color_align; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM;
diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c index f7722451276e..2ac53f148624 100644 --- a/arch/parisc/kernel/sys_parisc.c +++ b/arch/parisc/kernel/sys_parisc.c @@ -108,6 +108,9 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp, .length = len };
+ info.hint = addr; + info.mmap_flags = flags; + if (unlikely(len > TASK_SIZE)) return -ENOMEM;
diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ef3ce37f1bb3..f0e2550af6d0 100644 --- a/arch/powerpc/mm/book3s64/slice.c +++ b/arch/powerpc/mm/book3s64/slice.c @@ -286,6 +286,10 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm, .length = len, .align_mask = PAGE_MASK & ((1ul << pshift) - 1), }; + + info.hint = addr; + info.mmap_flags = flags; + /* * Check till the allow max value for this mmap request */ @@ -331,6 +335,9 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm, }; unsigned long min_addr = max(PAGE_SIZE, mmap_min_addr);
+ info.hint = addr; + info.mmap_flags = flags; + /* * If we are trying to allocate above DEFAULT_MAP_WINDOW * Add the different to the mmap_base. diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c index ded0eff58a19..dc38f7022037 100644 --- a/arch/s390/mm/hugetlbpage.c +++ b/arch/s390/mm/hugetlbpage.c @@ -254,6 +254,8 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file, info.low_limit = current->mm->mmap_base; info.high_limit = TASK_SIZE; info.align_mask = PAGE_MASK & ~huge_page_mask(h); + info.hint = addr; + info.mmap_flags = flags; return vm_unmapped_area(&info); }
@@ -270,6 +272,8 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file, info.low_limit = PAGE_SIZE; info.high_limit = current->mm->mmap_base; info.align_mask = PAGE_MASK & ~huge_page_mask(h); + info.hint = addr; + info.mmap_flags = flags; addr = vm_unmapped_area(&info);
/* diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c index 206756946589..001345df61b6 100644 --- a/arch/s390/mm/mmap.c +++ b/arch/s390/mm/mmap.c @@ -88,6 +88,9 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_area_struct *vma; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (len > TASK_SIZE - mmap_min_addr) return -ENOMEM;
@@ -123,6 +126,9 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad struct mm_struct *mm = current->mm; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + /* requested length too big for entire address space */ if (len > TASK_SIZE - mmap_min_addr) return -ENOMEM; diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c index bee329d4149a..ab1f5613b5e2 100644 --- a/arch/sh/mm/mmap.c +++ b/arch/sh/mm/mmap.c @@ -59,6 +59,9 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, int do_colour_align; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (flags & MAP_FIXED) { /* We do not accept a shared mapping if it would violate * cache aliasing constraints. @@ -107,6 +110,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, int do_colour_align; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (flags & MAP_FIXED) { /* We do not accept a shared mapping if it would violate * cache aliasing constraints. diff --git a/arch/sparc/kernel/sys_sparc_32.c b/arch/sparc/kernel/sys_sparc_32.c index 08a19727795c..bcc3242e003d 100644 --- a/arch/sparc/kernel/sys_sparc_32.c +++ b/arch/sparc/kernel/sys_sparc_32.c @@ -43,6 +43,9 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi { struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (flags & MAP_FIXED) { /* We do not accept a shared mapping if it would violate * cache aliasing constraints. diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c index d9c3b34ca744..05de444d66f6 100644 --- a/arch/sparc/kernel/sys_sparc_64.c +++ b/arch/sparc/kernel/sys_sparc_64.c @@ -95,6 +95,9 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi int do_color_align; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + if (flags & MAP_FIXED) { /* We do not accept a shared mapping if it would violate * cache aliasing constraints. @@ -155,6 +158,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, int do_color_align; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + /* This should only ever run for 32-bit processes. */ BUG_ON(!test_thread_flag(TIF_32BIT));
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index cc91ca7a1e18..148064142628 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -40,6 +40,8 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *filp, info.low_limit = TASK_UNMAPPED_BASE; info.high_limit = min(task_size, VA_EXCLUDE_START); info.align_mask = PAGE_MASK & ~huge_page_mask(h); + info.hint = addr; + info.mmap_flags = flags; addr = vm_unmapped_area(&info);
if ((addr & ~PAGE_MASK) && task_size > VA_EXCLUDE_END) { @@ -71,6 +73,8 @@ hugetlb_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0, info.low_limit = PAGE_SIZE; info.high_limit = mm->mmap_base; info.align_mask = PAGE_MASK & ~huge_page_mask(h); + info.hint = addr; + info.mmap_flags = flags; addr = vm_unmapped_area(&info);
/* diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c index 01d7cd85ef97..248a210c2b0b 100644 --- a/arch/x86/kernel/sys_x86_64.c +++ b/arch/x86/kernel/sys_x86_64.c @@ -129,6 +129,9 @@ arch_get_unmapped_area_vmflags(struct file *filp, unsigned long addr, unsigned l struct vm_unmapped_area_info info = {}; unsigned long begin, end;
+ info.hint = addr; + info.mmap_flags = flags; + if (flags & MAP_FIXED) return addr;
@@ -167,6 +170,9 @@ arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0, unsigned long addr = addr0; struct vm_unmapped_area_info info = {};
+ info.hint = addr; + info.mmap_flags = flags; + /* requested length too big for entire address space */ if (len > TASK_SIZE) return -ENOMEM; diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c index 807a5859a3c4..d3f8cec4895d 100644 --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -29,6 +29,8 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
info.length = len; info.low_limit = get_mmap_base(1); + info.hint = addr; + info.mmap_flags = flags;
/* * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area @@ -52,6 +54,8 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file, info.length = len; info.low_limit = PAGE_SIZE; info.high_limit = get_mmap_base(0); + info.hint = addr; + info.mmap_flags = flags;
/* * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 9f6cff356796..4b35eadd2ad8 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -182,6 +182,8 @@ hugetlb_get_unmapped_area_bottomup(struct file *file, unsigned long addr, info.low_limit = current->mm->mmap_base; info.high_limit = arch_get_mmap_end(addr, len, flags); info.align_mask = PAGE_MASK & ~huge_page_mask(h); + info.hint = addr; + info.mmap_flags = flags; return vm_unmapped_area(&info); }
@@ -197,6 +199,8 @@ hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr, info.low_limit = PAGE_SIZE; info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base); info.align_mask = PAGE_MASK & ~huge_page_mask(h); + info.hint = addr; + info.mmap_flags = flags; addr = vm_unmapped_area(&info);
/* diff --git a/include/linux/mm.h b/include/linux/mm.h index 6549d0979b28..b5fcb9dfb942 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3445,6 +3445,8 @@ struct vm_unmapped_area_info { unsigned long align_mask; unsigned long align_offset; unsigned long start_gap; + unsigned long hint; + unsigned long mmap_flags; };
extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info); diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..34ba0db23678 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_unmapped_area_info info = {}; const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+ info.hint = addr; + info.mmap_flags = flags; + if (len > mmap_end - mmap_min_addr) return -ENOMEM;
@@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, struct vm_unmapped_area_info info = {}; const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+ info.hint = addr; + info.mmap_flags = flags; + /* requested length too big for entire address space */ if (len > mmap_end - mmap_min_addr) return -ENOMEM;
On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote:
[snip]
diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..34ba0db23678 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_unmapped_area_info info = {}; const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+	info.hint = addr;
+	info.mmap_flags = flags;
+
 	if (len > mmap_end - mmap_min_addr)
 		return -ENOMEM;
@@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, struct vm_unmapped_area_info info = {}; const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+	info.hint = addr;
+	info.mmap_flags = flags;
+
 	/* requested length too big for entire address space */
 	if (len > mmap_end - mmap_min_addr)
 		return -ENOMEM;
These line numbers suggest you're working against Linus's tree, mm/mmap.c has changed a lot recently, so to avoid conflicts please base your changes on mm-unstable in Andrew's tree (if looking to go through that) or at least -next.
-- 2.45.0
On Thu, Aug 29, 2024 at 09:48:44AM +0100, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 12:15:59AM GMT, Charlie Jenkins wrote:
[snip]
diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..34ba0db23678 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1796,6 +1796,9 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr, struct vm_unmapped_area_info info = {}; const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+	info.hint = addr;
+	info.mmap_flags = flags;
+
 	if (len > mmap_end - mmap_min_addr)
 		return -ENOMEM;
@@ -1841,6 +1844,9 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, struct vm_unmapped_area_info info = {}; const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+	info.hint = addr;
+	info.mmap_flags = flags;
+
 	/* requested length too big for entire address space */
 	if (len > mmap_end - mmap_min_addr)
 		return -ENOMEM;
These line numbers suggest you're working against Linus's tree, mm/mmap.c has changed a lot recently, so to avoid conflicts please base your changes on mm-unstable in Andrew's tree (if looking to go through that) or at least -next.
I will make sure that I base off of mm-unstable for future revisions.
- Charlie
-- 2.45.0
Hi Charlie,
Le 29/08/2024 à 09:15, Charlie Jenkins a écrit :
The hint address and mmap_flags are necessary to determine if MAP_BELOW_HINT requirements are satisfied.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
 arch/alpha/kernel/osf_sys.c      | 2 ++
 arch/arc/mm/mmap.c               | 3 +++
 arch/arm/mm/mmap.c               | 7 +++++++
 arch/csky/abiv1/mmap.c           | 3 +++
 arch/loongarch/mm/mmap.c         | 3 +++
 arch/mips/mm/mmap.c              | 3 +++
 arch/parisc/kernel/sys_parisc.c  | 3 +++
 arch/powerpc/mm/book3s64/slice.c | 7 +++++++
 arch/s390/mm/hugetlbpage.c       | 4 ++++
 arch/s390/mm/mmap.c              | 6 ++++++
 arch/sh/mm/mmap.c                | 6 ++++++
 arch/sparc/kernel/sys_sparc_32.c | 3 +++
 arch/sparc/kernel/sys_sparc_64.c | 6 ++++++
 arch/sparc/mm/hugetlbpage.c      | 4 ++++
 arch/x86/kernel/sys_x86_64.c     | 6 ++++++
 arch/x86/mm/hugetlbpage.c        | 4 ++++
 fs/hugetlbfs/inode.c             | 4 ++++
 include/linux/mm.h               | 2 ++
 mm/mmap.c                        | 6 ++++++
 19 files changed, 82 insertions(+)
diff --git a/arch/powerpc/mm/book3s64/slice.c b/arch/powerpc/mm/book3s64/slice.c index ef3ce37f1bb3..f0e2550af6d0 100644 --- a/arch/powerpc/mm/book3s64/slice.c +++ b/arch/powerpc/mm/book3s64/slice.c @@ -286,6 +286,10 @@ static unsigned long slice_find_area_bottomup(struct mm_struct *mm, .length = len, .align_mask = PAGE_MASK & ((1ul << pshift) - 1), };
+	info.hint = addr;
+	info.mmap_flags = flags;
+
 	/*
 	 * Check till the allow max value for this mmap request
 	 */
@@ -331,6 +335,9 @@ static unsigned long slice_find_area_topdown(struct mm_struct *mm, }; unsigned long min_addr = max(PAGE_SIZE, mmap_min_addr);
+	info.hint = addr;
+	info.mmap_flags = flags;
+
 	/*
 	 * If we are trying to allocate above DEFAULT_MAP_WINDOW
 	 * Add the different to the mmap_base.
ppc64_defconfig:
CC arch/powerpc/mm/book3s64/slice.o arch/powerpc/mm/book3s64/slice.c: In function 'slice_find_area_bottomup': arch/powerpc/mm/book3s64/slice.c:291:27: error: 'flags' undeclared (first use in this function) 291 | info.mmap_flags = flags; | ^~~~~ arch/powerpc/mm/book3s64/slice.c:291:27: note: each undeclared identifier is reported only once for each function it appears in arch/powerpc/mm/book3s64/slice.c: In function 'slice_find_area_topdown': arch/powerpc/mm/book3s64/slice.c:339:27: error: 'flags' undeclared (first use in this function) 339 | info.mmap_flags = flags; | ^~~~~ make[5]: *** [scripts/Makefile.build:244: arch/powerpc/mm/book3s64/slice.o] Error 1
To ensure that all memory allocations comply with the new MAP_BELOW_HINT flag, cap the high_limit in vm_unmapped_area() at the hint address plus the length. All callers of this function set the high_limit to something reasonable, usually leaving space for a random offset and a gap for the stack. To respect the provided high_limit, take the minimum of hint + length and the given high_limit.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
---
 mm/mmap.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 34ba0db23678..459ad380c673 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
 {
 	unsigned long addr;
 
+	if (info->hint != 0 && info->mmap_flags & MAP_BELOW_HINT)
+		info->high_limit = MIN(info->high_limit, info->hint + info->length);
+
 	if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
 		addr = unmapped_area_topdown(info);
 	else
Add a selftest for MAP_BELOW_HINT that maps until it runs out of space below the hint address.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
---
 tools/testing/selftests/mm/Makefile         |  1 +
 tools/testing/selftests/mm/map_below_hint.c | 32 ++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..4e2de85267b5 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_below_hint
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
diff --git a/tools/testing/selftests/mm/map_below_hint.c b/tools/testing/selftests/mm/map_below_hint.c
new file mode 100644
index 000000000000..55d6cbf90645
--- /dev/null
+++ b/tools/testing/selftests/mm/map_below_hint.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the MAP_BELOW_HINT mmap flag.
+ */
+#include <sys/mman.h>
+#include <errno.h>
+#include <string.h>
+#include "../kselftest.h"
+
+#define ADDR (1 << 20)
+#define LENGTH (ADDR / 10000)
+
+#define MAP_BELOW_HINT 0x8000000 /* Not defined in all libc */
+
+/*
+ * Map memory with MAP_BELOW_HINT until no memory is left. Ensure that all
+ * returned addresses are below the hint.
+ */
+int main(int argc, char **argv)
+{
+	void *addr;
+
+	do {
+		addr = mmap((void *)ADDR, LENGTH, PROT_READ | PROT_WRITE,
+			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_BELOW_HINT, -1, 0);
+	} while (addr != MAP_FAILED && (unsigned long)addr <= ADDR);
+
+	if (errno == ENOMEM)
+		ksft_test_result_pass("MAP_BELOW_HINT works\n");
+	else
+		ksft_test_result_fail("mmap returned address above hint with MAP_BELOW_HINT with error: %s\n",
+				      strerror(errno));
+
+	return 0;
+}
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be below the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK, an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to 47-bit addresses when the hint address contains fewer than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out, as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
IIRC this was discussed at length when 5-level page table support was proposed for x86. Sorry, I do not have a link handy, but lore should help you. Linus was not really convinced and in the end vetoed it, preferring that the few applications that benefit from a greater address space opt in explicitly rather than the other way around.
On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to a 47-bit address space when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
IIRC this has been discussed at length when 5-level page table support was proposed for x86. Sorry, I do not have a link handy but lore should help you. Linus was not really convinced and in the end vetoed it, preferring that the few applications that benefit from a greater address space opt in explicitly rather than the other way around.
I believe I found the conversation you were referring to. Ingo Molnar recommended a flag similar to what I have proposed [1]. Catalin recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up MPX [3].
However these conversations are tangential to what I am proposing. arm64 and x86 decided to make the default address space 48 bits. However, this was done on a per-architecture basis, with no way for applications to get guarantees across architectures. This 48-bit restriction does not appear in the man pages, so reading the kernel source code is required to learn that the feature exists. Then, to opt in to larger address spaces, applications have to know to provide a hint address greater than 47 bits; mmap() will then return an address that contains up to 56 bits on x86 and 52 bits on arm64. This difference of 4 bits causes inconsistency and is part of the problem I am trying to solve with this flag.
I am not proposing to change x86 and arm64 away from using their opt-out feature; I am instead proposing a standard ABI for applications that need some guarantees of the bits used in pointers.
- Charlie
Link: https://lore.kernel.org/lkml/20161209050130.GC2595@gmail.com/ [1] Link: https://lore.kernel.org/lkml/20161209105120.GA3705@e104818-lin.cambridge.arm... [2] Link: https://lore.kernel.org/lkml/a2f86495-b55f-fda0-40d2-242c45d3c1f3@intel.com/ [3]
-- Michal Hocko SUSE Labs
On Thu 29-08-24 10:33:22, Charlie Jenkins wrote:
On Thu, Aug 29, 2024 at 10:30:56AM +0200, Michal Hocko wrote:
On Thu 29-08-24 00:15:57, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to a 47-bit address space when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
IIRC this has been discussed at length when 5-level page table support was proposed for x86. Sorry, I do not have a link handy but lore should help you. Linus was not really convinced and in the end vetoed it, preferring that the few applications that benefit from a greater address space opt in explicitly rather than the other way around.
I believe I found the conversation you were referring to. Ingo Molnar recommended a flag similar to what I have proposed [1]. Catalin recommended to make 52-bit opt-in on arm64 [2]. Dave Hansen brought up MPX [3].
However these conversations are tangential to what I am proposing. arm64 and x86 decided to make the default address space 48 bits. However, this was done on a per-architecture basis, with no way for applications to get guarantees across architectures. This 48-bit restriction does not appear in the man pages, so reading the kernel source code is required to learn that the feature exists. Then, to opt in to larger address spaces, applications have to know to provide a hint address greater than 47 bits; mmap() will then return an address that contains up to 56 bits on x86 and 52 bits on arm64. This difference of 4 bits causes inconsistency and is part of the problem I am trying to solve with this flag.
Yes, I guess I do understand where you are heading. Our existing model assumes that anybody requiring more address space knows what they are doing and deals with the reality. This is the way Linus has pushed this and I am not really convinced it is the right way TBH. On the other hand it is true that this allows a safe(r) transition to larger address spaces.
I am not proposing to change x86 and arm64 away from using their opt-out feature; I am instead proposing a standard ABI for applications that need some guarantees of the bits used in pointers.
Right, but this is not really different from earlier attempts to achieve this IIRC. Extending mmap for that purpose seems quite tricky as already pointed out in other sub-threads. Quite honestly, I am not really sure what the right and backwards-compatible way is. I just wanted to make you aware this has been discussed at length in the past.
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
I'm still confused as to why, if an mmap flag is desired, and thus programs are having to be heavily modified and controlled to be able to do this, you can't just do an mmap() with PROT_NONE early, at a hinted address that sits below the required limit, and then mprotect() or mmap() over it?
Your feature is a major adjustment to mmap(); it needs to be pretty significantly justified, especially if taking up a new flag.
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to a 47-bit address space when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format :) this will not be the case with 5-level page tables though...
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
I raised this last time and you didn't seem to address it so to be more blunt:
I don't understand why this needs to be an mmap() flag. From this it seems the whole process needs allocations to be below a certain limit.
That _could_ be achieved through a 'personality' or similar (though a personality is on/off, rather than allowing configuration so maybe something else would be needed).
From what you're saying 57-bit is all you really need right? So maybe ADDR_LIMIT_57BIT?
I don't see how you're going to actually enforce this in a process either via an mmap flag, as a library might decide not to use it, so you'd need to control the allocator, the thread library implementation, and everything that might allocate.
Liam also raised various points about VMA particulars that I'm not sure are addressed either.
I just find it hard to believe that everything will fit together.
I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm just not.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
I'm still very unconvinced and feel the bar needs to be high for making changes like this that carry maintainership burden.
So for me, it's a no really as an overall concept.
Happy to be convinced otherwise, however... (I may be missing details or context that provide more justification).
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1] Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
Charlie Jenkins (4):
  mm: Add MAP_BELOW_HINT
  mm: Add hint and mmap_flags to struct vm_unmapped_area_info
  mm: Support MAP_BELOW_HINT in vm_unmapped_area()
  selftests/mm: Create MAP_BELOW_HINT test
 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
--
- Charlie
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
I'm still confused as to why, if an mmap flag is desired, and thus programs are having to be heavily modified and controlled to be able to do this, you can't just do an mmap() with PROT_NONE early, at a hinted address that sits below the required limit, and then mprotect() or mmap() over it?
Your feature is a major adjustment to mmap(); it needs to be pretty significantly justified, especially if taking up a new flag.
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to a 47-bit address space when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format :) this will not be the case with 5-level page tables though...
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
I raised this last time and you didn't seem to address it so to be more blunt:
I don't understand why this needs to be an mmap() flag. From this it seems the whole process needs allocations to be below a certain limit.
That _could_ be achieved through a 'personality' or similar (though a personality is on/off, rather than allowing configuration so maybe something else would be needed).
From what you're saying 57-bit is all you really need right? So maybe ADDR_LIMIT_57BIT?
I don't see how you're going to actually enforce this in a process either via an mmap flag, as a library might decide not to use it, so you'd need to control the allocator, the thread library implementation, and everything that might allocate.
Liam also raised various points about VMA particulars that I'm not sure are addressed either.
I just find it hard to believe that everything will fit together.
I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm just not.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
I'm still very unconvinced and feel the bar needs to be high for making changes like this that carry maintainership burden.
So for me, it's a no really as an overall concept.
Happy to be convinced otherwise, however... (I may be missing details or context that provide more justification).
Some more thoughts:
* If you absolutely must keep allocations below a certain limit, you'd probably need to actually associate this information with the VMA so the memory can't be mremap()'d somewhere invalid (you might not control all code so you can't guarantee this won't happen).
* Keeping a map limit associated with a VMA would be horrid and keeping VMAs as small as possible is a key aim, so that'd be a no go. VMA flags are in limited supply also.
* If we did implement a per-process thing, but it were arbitrary, we'd then have to handle all kinds of corner cases forever (this is UAPI, can't break it etc.) with crazy-low values, or determine a minimum that might vary by arch...
* If we did this we'd absolutely have to implement a check in the brk() implementation, which is a very very sensitive bit of code. And of course, in mmap() and mremap()... and any arch-specific code that might interface with this stuff (these functions are hooked).
* A fixed address limit would make more sense, but it seems difficult to know what would work for everybody, and again we'd have to deal with edge cases and having a permanent maintenance burden.
* If you did have a map flag what about merging between VMAs above the limit and below it? To avoid that you'd need to implement some kind of a 'VMA flag that has an arbitrary characteristic' or a 'limit' field, adjust all the 'can VMA merge' functions and write extensive testing and none of that is frankly acceptable.
* We have some 'weird' arches that might have problem with certain virtual address ranges or require arbitrary mappings at a certain address range that a limit might not be able to account for.
I'm absolutely opposed to a new MAP_ flag for this, but even if you implemented that, it implies a lot of complexity.
It implies even more complexity if you implement something per-process except if it were a fixed limit.
And if you implement a fixed limit, it's hard to see that it'll be acceptable to everybody, and I suspect we'd still run into some possible weirdness.
So again, I'm struggling to see how this concept can be justified in any form.
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
I'm still confused as to why, if an mmap flag is desired, and thus programs are having to be heavily modified and controlled to be able to do this, you can't just do an mmap() with PROT_NONE early, at a hinted address that sits below the required limit, and then mprotect() or mmap() over it?
Your feature is a major adjustment to mmap(); it needs to be pretty significantly justified, especially if taking up a new flag.
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to a 47-bit address space when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format :) this will not be the case with 5-level page tables though...
I might be misunderstanding but I am not talking about pointer masking or canonical addresses here. I am referring to the pattern of:
1. Getting an address from mmap()
2. Writing data into bits assumed to be unused in the address
3. Using the data stored in the address
4. Clearing the data from the address and sign extending
5. Dereferencing the now sign-extended address to conform to canonical addresses
I am just talking about steps 1 and 2 here -- getting an address from mmap() that only uses bits that will allow your application to not break. How canonicalization happens is a separate conversation that can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. While LAM for x86 is only capable of masking addresses to 48 or 57 bits, Ssnpm for riscv allows an arbitrary number of bits to be masked out. A design goal here is to be able to support all of the pointer masking flavors, and not just x86.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
I raised this last time and you didn't seem to address it so to be more blunt:
I don't understand why this needs to be an mmap() flag. From this it seems the whole process needs allocations to be below a certain limit.
Yeah making it per-process does seem logical, as it would help with pointer masking.
That _could_ be achieved through a 'personality' or similar (though a personality is on/off, rather than allowing configuration so maybe something else would be needed).
From what you're saying 57-bit is all you really need right? So maybe ADDR_LIMIT_57BIT?
Addresses will always be limited to 57 bits on riscv and x86 (but not necessarily on other architectures). A flag like that would have no impact, I do not understand what you are suggesting. This patch is to have a configurable number of bits be restricted.
If anything, a personality that was ADDR_LIMIT_48BIT would be the closest to what I am trying to achieve, since the issue is that applications fail to work when the address space is greater than 48 bits.
I don't see how you're going to actually enforce this in a process either via an mmap flag, as a library might decide not to use it, so you'd need to control the allocator, the thread library implementation, and everything that might allocate.
It is reasonable to change the implementation to be per-process but that is not the current proposal.
This flag was designed for applications which already directly manage all of their addresses like OpenJDK and Go.
This flag implementation was an attempt to make this feature as minimally invasive as possible, to reduce maintenance burden and implementation complexity.
Liam also raised various points about VMA particulars that I'm not sure are addressed either.
I just find it hard to believe that everything will fit together.
I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm just not.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
I'm still very unconvinced and feel the bar needs to be high for making changes like this that carry maintainership burden.
I may be naive but what is the burden here? It's two lines of code to check MAP_BELOW_HINT and restrict the address. There are the additional flags for hint and mmap_addr but those are also trivial to implement.
So for me, it's a no really as an overall concept.
Happy to be convinced otherwise, however... (I may be missing details or context that provide more justification).
Some more thoughts:
- If you absolutely must keep allocations below a certain limit, you'd probably need to actually associate this information with the VMA so the memory can't be mremap()'d somewhere invalid (you might not control all code so you can't guarantee this won't happen).
- Keeping a map limit associated with a VMA would be horrid and keeping VMAs as small as possible is a key aim, so that'd be a no go. VMA flags are in limited supply also.
Yes that does seem like it would be challenging.
- If we did implement a per-process thing, but it were arbitrary, we'd then have to handle all kinds of corner cases forever (this is UAPI, can't break it etc.) with crazy-low values, or determine a minimum that might vary by arch...
Throwing an error if the value is determined to be "too low" seems reasonable.
- If we did this we'd absolutely have to implement a check in the brk() implementation, which is a very very sensitive bit of code. And of course, in mmap() and mremap()... and any arch-specific code that might interface with this stuff (these functions are hooked).
- A fixed address limit would make more sense, but it seems difficult to know what would work for everybody, and again we'd have to deal with edge cases and having a permanent maintenance burden.
A fixed value is not ideal, since a single size probably would not be sufficient for every application. However, if necessary we could fix it to 48 bits since arm64 and x86 already do that, and that would still allow a generic way of defining this behavior.
- If you did have a map flag what about merging between VMAs above the limit and below it? To avoid that you'd need to implement some kind of a 'VMA flag that has an arbitrary characteristic' or a 'limit' field, adjust all the 'can VMA merge' functions and write extensive testing and none of that is frankly acceptable.
- We have some 'weird' arches that might have problem with certain virtual address ranges or require arbitrary mappings at a certain address range that a limit might not be able to account for.
I'm absolutely opposed to a new MAP_ flag for this, but even if you implemented that, it implies a lot of complexity.
It implies even more complexity if you implement something per-process except if it were a fixed limit.
And if you implement a fixed limit, it's hard to see that it'll be acceptable to everybody, and I suspect we'd still run into some possible weirdness.
So again, I'm struggling to see how this concept can be justified in any form.
The piece I am missing here is that this idea is already being used by x86 and arm64. They implicitly force all allocations to be below the 47-bit boundary if the hint address is below 47 bits. This flag is much less invasive because it is opt-in and will not impact any existing code. I am not familiar enough with all of the interactions spread throughout mm to know how these architectures have managed to ensure that this 48-bit limit is enforced across things like mremap() as well.
Are you against the idea that there should be a standard way for applications to consistently obtain addresses that have free bits, or are you just against this implementation? From your statement I assume you mean that every architecture should continue to have varying behavior and separate implementations for supporting larger address spaces.
- Charlie
On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote:
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
I'm still confused as to why, if an mmap flag is desired, and thus programs are having to be heavily modified and controlled to be able to do this, you can't just do an mmap() with PROT_NONE early, at a hinted address that sits below the required limit, and then mprotect() or mmap() over it?
Your feature is a major adjustment to mmap(); it needs to be pretty significantly justified, especially if taking up a new flag.
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to a 47-bit address space when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format :) this will not be the case with 5-level page tables though...
I might be misunderstanding but I am not talking about pointer masking or canonical addresses here. I am referring to the pattern of:
- Getting an address from mmap()
- Writing data into bits assumed to be unused in the address
- Using the data stored in the address
- Clearing the data from the address and sign extending
- Dereferencing the now sign-extended address to conform to canonical addresses
I am just talking about steps 1 and 2 here -- getting an address from mmap() that only uses bits that will allow your application to not break. How canonicalization happens is a separate conversation that can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. While LAM for x86 is only capable of masking addresses to 48 or 57 bits, Ssnpm for riscv allows an arbitrary number of bits to be masked out. A design goal here is to be able to support all of the pointer masking flavors, and not just x86.
Right, I get that. I was just saying that the implicit limitation on x86 is due to virtual addresses _having_ to be less than 48 bits, so that's why that is, right? Perhaps I'm mistaken?
Or is it such that x86 can provide a space for tagging for CPU technology that supports it (UAI perhaps?).
I agree with what Michal and others said about the decision to default to the reduced address space size and opt-in for higher bits. Your series doesn't do this...
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
I raised this last time and you didn't seem to address it so to be more blunt:
I don't understand why this needs to be an mmap() flag. From this it seems the whole process needs allocations to be below a certain limit.
Yeah making it per-process does seem logical, as it would help with pointer masking.
To me it's the only feasible way forward: you can't control all libraries, and a MAP_ flag continues to seem a strange way to implement this. I understand that your justification is that it is the _least invasive_ way of doing this, but as I've said below it's actually pretty invasive if you think about it; the current implementation seems to me to be insufficient without having VMA flags etc.
That _could_ be achieved through a 'personality' or similar (though a personality is on/off, rather than allowing configuration so maybe something else would be needed).
From what you're saying 57-bit is all you really need right? So maybe ADDR_LIMIT_57BIT?
Addresses will always be limited to 57 bits on riscv and x86 (but not necessarily on other architectures). A flag like that would have no impact, I do not understand what you are suggesting. This patch is to have a configurable number of bits be restricted.
I get that, but as I say below, I don't think a customisable limit is workable.
So I was trying to find a compromise that _might_ be more workable.
If anything, a personality that was ADDR_LIMIT_48BIT would be the closest to what I am trying to achieve. Since the issue is that applications fail to work when the address space is greater than 48 bits.
OK so this is at least some possible road forward given there is quite a bit of push-back to alternatives.
I don't see how you're going to actually enforce this in a process either via an mmap flag, as a library might decide not to use it, so you'd need to control the allocator, the thread library implementation, and everything that might allocate.
It is reasonable to change the implementation to be per-process but that is not the current proposal.
I mean maybe I wasn't direct enough - I oppose the current proposal as-is.
This flag was designed for applications which already directly manage all of their addresses like OpenJDK and Go.
This flag implementation was an attempt to make this feature as least invasive as possible to reduce maintainence burden and implementation complexity.
I realise, and as I said below, I don't think your implementation is correct in this form.
Also if you can control everything + for whatever reason can _absolutely know_ no program will use a FFI or a 3rd party library or whatever that mremap()'s, I don't see why you can't use mmap() in a creative way to solve this rather than adding maintenance burden.
A couple ideas:
1. mmap(high_address - domain_size - buffer, ..., PROT_NONE, MAP_FIXED, ...) a vast domain. You will almost certainly get the hint you want. Then mprotect() regions to PROT_READ | PROT_WRITE as you use (or even mmap() with MAP_FIXED_REPLACE over them), all will have high bits clear.
2. (suggested by Liam separately) mmap() with PROT_NONE addresses in the higher range, which prevents mmap() or any other means of allocating memory from allocating there. Acting as a 'huge guard page'.
Neither require any changes.
You kinda can't have it both ways - if you are absolutely controlling all allocations with no risk of a 3rd party library doing an allocation outside of this - then you can just use existing mechanics.
If you don't, then MAP_BELOW_HINT is insufficient.
Liam also raised various points about VMA particulars that I'm not sure are addressed either.
I just find it hard to believe that everything will fit together.
I'd _really_ need to be convinced that this MAP_ flag is justified, and I"m just not.
This flag will also allow seemless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
I'm still very unconvinced and feel the bar needs to be high for making changes like this that carry maintainership burden.
I may be naive but what is the burden here? It's two lines of code to check MAP_BELOW_HINT and restrict the address. There are the additional flags for hint and mmap_addr but those are also trivial to implement.
You're taking up a MAP_ flag (in short supply) which we have to maintain forever across all arches and have to respect a limited map range.
And everything in this realm has edge cases. I don't see how you can implement this correctly or usefully without a VMA flag, and see below for my concerns on that.
This is UAPI (and really UABI) so this is _forever_. The bar is high. To me this proposal does not hit that, and as you keep saying this isn't even what you want.
You want something per-process so I think the correct proposal is per-process.
A configurable per-process thing is horrible in itself, so I think the only workable proposal is a fixed personality.
So for me, it's a no really as an overall concept.
Happy to be convinced otherwise, however... (I may be missing details or context that provide more justification).
Some more thoughts:
- If you absolutely must keep allocations below a certain limit, you'd probably need to actually associate this information with the VMA so the memory can't be mremap()'d somewhere invalid (you might not control all code so you can't guarantee this won't happen).
- Keeping a map limit associated with a VMA would be horrid and keeping VMAs as small as possible is a key aim, so that'd be a no go. VMA flags are in limited supply also.
Yes that does seem like it would be challenging.
Right so to me this rules out the MAP_BELOW_HINT. And makes this implementation invalid.
- If we did implement a per-process thing, but it were arbitrary, we'd then have to handle all kinds of corner cases forever (this is UAPI, can't break it etc.) with crazy-low values, or determine a minimum that might vary by arch...
Throwing an error if the value is determined to be "too low" seems reasonable.
What's "too low"? This will vary by arch too right? Keep in mind this is 'forever'...
- If we did this we'd absolutely have to implement a check in the brk() implementation, which is a very very sensitive bit of code. And of course, in mmap() and mremap()... and any arch-specific code that might interface with this stuff (these functions are hooked).
- A fixed address limit would make more sense, but it seems difficult to know what would work for everybody, and again we'd have to deal with edge cases and having a permanent maintenance burden.
A fixed value is not ideal, since a single size probably would not be suffiecient for every application. However if necessary we could fix it to 48-bits since arm64 and x86 already do that, and that would still allow a generic way of defining this behavior.
This is more acceptable. It avoids pretty much all of the rest of the issues.
- If you did have a map flag what about merging between VMAs above the limit and below it? To avoid that you'd need to implement some kind of a 'VMA flag that has an arbitrary characteristic' or a 'limit' field, adjust all the 'can VMA merge' functions and write extensive testing and none of that is frankly acceptable.
- We have some 'weird' arches that might have problem with certain virtual address ranges or require arbitrary mappings at a certain address range that a limit might not be able to account for.
I'm absolutely opposed to a new MAP_ flag for this, but even if you implemented that, it implies a lot of complexity.
It implies even more complexity if you implement something per-process except if it were a fixed limit.
And if you implement a fixed limit, it's hard to see that it'll be acceptable to everybody, and I suspect we'd still run into some possible weirdness.
So again, I'm struggling to see how this concept can be justified in any form.
The piece I am missing here is that this idea is already being used by x86 and arm64. They implicitly force all allocations to be below the 47-bit boundary if the hint address is below 47 bits. This flag is much less invasive because it is opt-in and will not impact any existing code. I am not familiar enough with all of the interactions spread throughout mm to know how these architectures have managed to ensure that this 48-bit limit is enforced across things like mremap() as well.
I just wrote a bunch above about this and did in the original email.
The 48-bit limit is much more workable and is across-the-board so it's easy to implement. It's the variable thing or map flag thing that's the problem.
Are you against the idea that there should be a standard way for applications to consistently obtain address that have free bits, or are you just against this implementation? From your statement I assume you mean that every architecture should continue to have varying behavior and separate implementations for supporting larger address spaces.
I'm against this implementation, or one with a variable limit.
An ADDR_LIMIT_48BIT personality I think is the most workable form of this. Others may have opinions on this also, but it avoids pretty much all of the aforementioned issues and is the least invasive.
- Charlie
Sorry to push back so much on your series, your efforts are appreciated, I am simply trying to head off issues in the future especially those that end up exposed to userland! :)
On Fri, Aug 30, 2024 at 10:52:01AM +0100, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 03:16:53PM GMT, Charlie Jenkins wrote:
On Thu, Aug 29, 2024 at 10:54:25AM +0100, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 09:42:22AM GMT, Lorenzo Stoakes wrote:
On Thu, Aug 29, 2024 at 12:15:57AM GMT, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap(). Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
I'm still confused as to why, if an mmap flag is desired, and thus programs are having to be heavily modified and controlled to be able to do this, why you can't just do an mmap() with PROT_NONE early, around a hinted address that, sits below the required limit, and then mprotect() or mmap() over it?
Your feature is a major adjustment to mmap(), it needs to be pretty significantly justified, especially if taking up a new flag.
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if attempted to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format :) this will not be the case for 5-page table levels though...
I might be misunderstanding but I am not talking about pointer masking or canonical addresses here. I am referring to the pattern of:
- Getting an address from mmap()
- Writing data into bits assumed to be unused in the address
- Using the data stored in the address
- Clearing the data from the address and sign extending
- Dereferencing the now sign-extended address to conform to canonical addresses
I am just talking about steps 1 and 2 here -- getting an address from mmap() that only uses bits that will allow your application to not break. How canonicalization happens is a separate conversation that can be handled by LAM for x86, TBI for arm64, or Ssnpm for riscv. While LAM for x86 is only capable of masking addresses to 48 or 57 bits, Ssnpm for riscv allows an arbitrary number of bits to be masked out. A design goal here is to be able to support all of the pointer masking flavors, and not just x86.
Right I get that, I was just saying that the implicit limitation in x86 is due to virtual addresses _having_ to be less than 48 bits. So that's why that is right? I mean perhaps I'm mistaken?
Or is it such that x86 can provide a space for tagging for CPU technology that supports it (UAI perhaps?).
I agree with what Michal and others said about the decision to default to the reduced address space size and opt-in for higher bits. Your series doesn't do this...
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
I raised this last time and you didn't seem to address it so to be more blunt:
I don't understand why this needs to be an mmap() flag. From this it seems the whole process needs allocations to be below a certain limit.
Yeah making it per-process does seem logical, as it would help with pointer masking.
To me it's the only feasible way forward, you can't control all libraries, a map flag continues to seem a strange way to implement this, and I understand that your justification is that it is the _least invasive_ way of doing this, but as I've said below, it's actually pretty invasive if you think about it, the current implementation seems to me to be insufficient without having VMA flags etc.
That _could_ be achieved through a 'personality' or similar (though a personality is on/off, rather than allowing configuration so maybe something else would be needed).
From what you're saying 57-bit is all you really need right? So maybe ADDR_LIMIT_57BIT?
Addresses will always be limited to 57 bits on riscv and x86 (but not necessarily on other architectures). A flag like that would have no impact, I do not understand what you are suggesting. This patch is to have a configurable number of bits be restricted.
I get that, but as I say below, I don't think a customisable limit is workable.
So I was trying to find a compromise that _might_ be more workable.
If anything, a personality that was ADDR_LIMIT_48BIT would be the closest to what I am trying to achieve. Since the issue is that applications fail to work when the address space is greater than 48 bits.
OK so this is at least some possible road forward given there is quite a bit of push-back to alternatives.
I don't see how you're going to actually enforce this in a process either via an mmap flag, as a library might decide not to use it, so you'd need to control the allocator, the thread library implementation, and everything that might allocate.
It is reasonable to change the implementation to be per-process but that is not the current proposal.
I mean maybe I wasn't direct enough - I oppose the current proposal as-is.
This flag was designed for applications which already directly manage all of their addresses like OpenJDK and Go.
This flag implementation was an attempt to make this feature as minimally invasive as possible, to reduce maintenance burden and implementation complexity.
I realise, and as I said below, I don't think your implementation is correct in this form.
Also if you can control everything + for whatever reason can _absolutely know_ no program will use a FFI or a 3rd party library or whatever that mremap()'s, I don't see why you can't use mmap() in a creative way to solve this rather than adding maintenance burden.
A couple ideas:
1. mmap(high_address - domain_size - buffer, ..., PROT_NONE, MAP_FIXED, ...) a vast domain. You will almost certainly get the hint you want. Then mprotect() regions to PROT_READ | PROT_WRITE as you use (or even mmap() with MAP_FIXED_REPLACE over them), all will have high bits clear.
2. (suggested by Liam separately) mmap() with PROT_NONE addresses in the higher range, which prevents mmap() or any other means of allocating memory from allocating there. Acting as a 'huge guard page'.
Neither require any changes.
You kinda can't have it both ways - if you are absolutely controlling all allocations with no risk of a 3rd party library doing an allocation outside of this - then you can just use existing mechanics.
If you don't, then MAP_BELOW_HINT is insufficient.
Liam also raised various points about VMA particulars that I'm not sure are addressed either.
I just find it hard to believe that everything will fit together.
I'd _really_ need to be convinced that this MAP_ flag is justified, and I'm just not.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
I'm still very unconvinced and feel the bar needs to be high for making changes like this that carry maintainership burden.
I may be naive but what is the burden here? It's two lines of code to check MAP_BELOW_HINT and restrict the address. There are the additional flags for hint and mmap_addr but those are also trivial to implement.
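For reference, the check being described might look roughly like this in the generic gap-search path. This is a non-compilable sketch: `hint` and `mmap_flags` are the fields the series proposes adding to struct vm_unmapped_area_info, and the exact clamping semantics are illustrative, not the actual patch:

```c
/* Sketch only -- kernel context, not standalone-compilable. */
unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
{
	if (info->mmap_flags & MAP_BELOW_HINT)
		/* Clamp the search window so no candidate exceeds the hint. */
		info->high_limit = min(info->high_limit, info->hint);

	/* ... existing top-down / bottom-up gap search ... */
}
```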
You're taking up a MAP_ flag (in short supply) which we have to maintain forever across all arches and have to respect a limited map range.
And everything in this realm has edge cases. I don't see how you can implement this correctly or usefully without a VMA flag, and see below for my concerns on that.
This is UAPI (and really UABI) so this is _forever_. The bar is high. To me this proposal does not hit that, and as you keep saying this isn't even what you want.
You want something per-process so I think the correct proposal is per-process.
A configurable per-process thing is horrible in itself, so I think the only workable proposal is a fixed personality.
So for me, it's a no really as an overall concept.
Happy to be convinced otherwise, however... (I may be missing details or context that provide more justification).
Some more thoughts:
- If you absolutely must keep allocations below a certain limit, you'd probably need to actually associate this information with the VMA so the memory can't be mremap()'d somewhere invalid (you might not control all code so you can't guarantee this won't happen).
- Keeping a map limit associated with a VMA would be horrid and keeping VMAs as small as possible is a key aim, so that'd be a no go. VMA flags are in limited supply also.
Yes that does seem like it would be challenging.
Right so to me this rules out the MAP_BELOW_HINT. And makes this implementation invalid.
- If we did implement a per-process thing, but it were arbitrary, we'd then have to handle all kinds of corner cases forever (this is UAPI, can't break it etc.) with crazy-low values, or determine a minimum that might vary by arch...
Throwing an error if the value is determined to be "too low" seems reasonable.
What's "too low"? This will vary by arch too right? Keep in mind this is 'forever'...
- If we did this we'd absolutely have to implement a check in the brk() implementation, which is a very very sensitive bit of code. And of course, in mmap() and mremap()... and any arch-specific code that might interface with this stuff (these functions are hooked).
- A fixed address limit would make more sense, but it seems difficult to know what would work for everybody, and again we'd have to deal with edge cases and having a permanent maintenance burden.
A fixed value is not ideal, since a single size probably would not be sufficient for every application. However, if necessary we could fix it to 48 bits, since arm64 and x86 already do that, and that would still allow a generic way of defining this behavior.
This is more acceptable. It avoids pretty much all of the rest of the issues.
- If you did have a map flag what about merging between VMAs above the limit and below it? To avoid that you'd need to implement some kind of a 'VMA flag that has an arbitrary characteristic' or a 'limit' field, adjust all the 'can VMA merge' functions and write extensive testing and none of that is frankly acceptable.
- We have some 'weird' arches that might have problems with certain virtual address ranges or require arbitrary mappings at a certain address range that a limit might not be able to account for.
I'm absolutely opposed to a new MAP_ flag for this, but even if you implemented that, it implies a lot of complexity.
It implies even more complexity if you implement something per-process except if it were a fixed limit.
And if you implement a fixed limit, it's hard to see that it'll be acceptable to everybody, and I suspect we'd still run into some possible weirdness.
So again, I'm struggling to see how this concept can be justified in any form.
The piece I am missing here is that this idea is already being used by x86 and arm64. They implicitly force all allocations to be below the 47-bit boundary if the hint address is below 47 bits. This flag is much less invasive because it is opt-in and will not impact any existing code. I am not familiar enough with all of the interactions spread throughout mm to know how these architectures have managed to ensure that this 48-bit limit is enforced across things like mremap() as well.
I just wrote a bunch above about this and did in the original email.
The 48-bit limit is much more workable and is across-the-board so it's easy to implement. It's the variable thing or map flag thing that's the problem.
Are you against the idea that there should be a standard way for applications to consistently obtain addresses that have free bits, or are you just against this implementation? From your statement I assume you mean that every architecture should continue to have varying behavior and separate implementations for supporting larger address spaces.
I'm against this implementation, or one with a variable limit.
An ADDR_LIMIT_48BIT personality I think is the most workable form of this. Others may have opinions on this also, but it avoids pretty much all of the aforementioned issues and is the least invasive.
- Charlie
Sorry to push back so much on your series, your efforts are appreciated, I am simply trying to head off issues in the future especially those that end up exposed to userland! :)
No I really appreciate it, thank you for your input! Having a variable limit is not necessary. The motivation is to create a UABI that would be flexible into the future where an application may want more (or less) bits reserved than what ADDR_LIMIT_48BIT would provide. However, current applications are happy to work with 48-bit address spaces so having the 48-bit personality is sufficient.
There will still need to be some management in the userland software to ensure that the personality has been set before any allocations, but that is unavoidable with any solution.
- Charlie
On 8/29/24 01:42, Lorenzo Stoakes wrote:
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format 🙂 this will not be the case for 5-page table levels though...
By "physically available bits" are you referring to the bits that can be used as a part of the virtual address? "Physically" may not have been the best choice of words. ;)
There's a canonical hole in 4-level paging and 5-level paging on x86. The 5-level canonical hole is just smaller.
Also, I should probably say that the >47-bit mmap() access hint was more of a crutch than something that we wanted to make ABI forever. We knew that high addresses might break some apps and we hoped that the list of things it would break would go down over time so that we could eventually just let mmap() access the whole address space by default.
That optimism may have been misplaced.
On Fri, Aug 30, 2024 at 08:03:25AM GMT, Dave Hansen wrote:
On 8/29/24 01:42, Lorenzo Stoakes wrote:
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits.
You mean x86 _has_ to limit to physically available bits in a canonical format 🙂 this will not be the case for 5-page table levels though...
By "physically available bits" are you referring to the bits that can be used as a part of the virtual address? "Physically" may not have been the best choice of words. ;)
There's a canonical hole in 4-level paging and 5-level paging on x86. The 5-level canonical hole is just smaller.
Yeah sorry this is what I meant!
Also, I should probably say that the >47-bit mmap() access hint was more of a crutch than something that we wanted to make ABI forever. We knew that high addresses might break some apps and we hoped that the list of things it would break would go down over time so that we could eventually just let mmap() access the whole address space by default.
That optimism may have been misplaced.
Interesting, thanks. This speaks again I think to it being unwise to rely on these things.
I do think the only workable form of this series is a fixed personality-based mapping limit.
Such a large recipient list and no linux-api. CC'd, please include it on future postings.
On 8/29/24 09:15, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap(). Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if attempted to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
I doubt it's desirable to have different behavior depending on architecture. Also you could say it's a small class of applications that need more than 47 bits.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1] Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all changes to architecture-specific code by factoring the logic into generic functions, except for the flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
Charlie Jenkins (4):
  mm: Add MAP_BELOW_HINT
  mm: Add hint and mmap_flags to struct vm_unmapped_area_info
  mm: Support MAP_BELOW_HINT in vm_unmapped_area()
  selftests/mm: Create MAP_BELOW_HINT test
 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
On Thu, 29 Aug 2024 02:02:34 PDT (-0700), vbabka@suse.cz wrote:
Such a large recipient list and no linux-api. CC'd, please include it on future postings.
On 8/29/24 09:15, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap(). Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if attempted to run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt-in to this behavior rather than opt-out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
I doubt it's desirable to have different behavior depending on architecture. Also you could say it's a small class of applications that need more than 47 bits.
We're sort of stuck with the architecture-dependent behavior here: for the first few years RISC-V only had 39-bit VAs, so the de facto uABI ended up being that userspace can ignore way more bits. While 48 bits might be enough for everyone, 39 doesn't seem to be -- or at least IIRC when we tried restricting the default to that, we broke stuff. There are also some other wrinkles like arbitrary bit boundaries in pointer masking and vendor-specific paging formats, but at some point we just end up down a rabbit hole of insanity there...
FWIW, I think that userspace depending on just tossing some VA bits because some kernels happened to never allocate from them is just broken, but it seems like other ports worked around the 48->57 bit transition and we're trying to do something similar for 39->48 (and that works with 48->57, as we'll have to deal with that eventually).
So that's basically how we ended up with this sort of thing: trying to do something similar without a flag broke userspace because we were trying to jam too much into the hints. I couldn't really figure out a way to satisfy all the userspace constraints by just implicitly retrofitting behavior based on the hints, so we figured having an explicit flag to control the behavior would be the sanest way to go.
That said: I'm not opposed to just saying "depending on 39-bit VAs is broken" and just forcing people to fix it.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that it does not have to be handled individually by each architecture.
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
To: Arnd Bergmann <arnd@arndb.de>
To: Richard Henderson <richard.henderson@linaro.org>
To: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
To: Matt Turner <mattst88@gmail.com>
To: Vineet Gupta <vgupta@kernel.org>
To: Russell King <linux@armlinux.org.uk>
To: Guo Ren <guoren@kernel.org>
To: Huacai Chen <chenhuacai@kernel.org>
To: WANG Xuerui <kernel@xen0n.name>
To: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
To: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
To: Helge Deller <deller@gmx.de>
To: Michael Ellerman <mpe@ellerman.id.au>
To: Nicholas Piggin <npiggin@gmail.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Naveen N Rao <naveen@kernel.org>
To: Alexander Gordeev <agordeev@linux.ibm.com>
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Heiko Carstens <hca@linux.ibm.com>
To: Vasily Gorbik <gor@linux.ibm.com>
To: Christian Borntraeger <borntraeger@linux.ibm.com>
To: Sven Schnelle <svens@linux.ibm.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Rich Felker <dalias@libc.org>
To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
To: David S. Miller <davem@davemloft.net>
To: Andreas Larsson <andreas@gaisler.com>
To: Thomas Gleixner <tglx@linutronix.de>
To: Ingo Molnar <mingo@redhat.com>
To: Borislav Petkov <bp@alien8.de>
To: Dave Hansen <dave.hansen@linux.intel.com>
To: x86@kernel.org
To: H. Peter Anvin <hpa@zytor.com>
To: Andy Lutomirski <luto@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
To: Muchun Song <muchun.song@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
Charlie Jenkins (4):
  mm: Add MAP_BELOW_HINT
  mm: Add hint and mmap_flags to struct vm_unmapped_area_info
  mm: Support MAP_BELOW_HINT in vm_unmapped_area()
  selftests/mm: Create MAP_BELOW_HINT test
 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37 change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK, an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it mostly working [2].
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses that contain a hint address of less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out, as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
This argument looks broken to me.
The "small class of applications" is going to be broken unless they got patched to use your new mmap() flag. You are asking for bugs.
Consider the case when you write, compile and validate a piece of software on a machine that has a <=47-bit VA. The binary gets shipped to customers. Later, a customer gets a shiny new machine that supports a larger address space and your previously working software is broken. Such binaries might exist today.
It is a bad idea to use a >47-bit VA by default. Most software got tested on x86 with a 47-bit VA.
We can consider more options to opt in to the wider address space, like a personality or prctl() handle. But opt-out is a no-go from what I see.
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
...
This argument looks broken to me.
The "small class of applications" is going to be broken unless they got patched to use your new mmap() flag. You are asking for bugs.
Consider the case when you write, compile and validate a piece of software on a machine that has a <=47-bit VA. The binary gets shipped to customers. Later, a customer gets a shiny new machine that supports a larger address space and your previously working software is broken. Such binaries might exist today.
It is a bad idea to use a >47-bit VA by default. Most software got tested on x86 with a 47-bit VA.
We can consider more options to opt in to the wider address space, like a personality or prctl() handle. But opt-out is a no-go from what I see.
-- Kiryl Shutsemau / Kirill A. Shutemov
riscv is in an interesting state in this regard because the software ecosystem is much less mature than on other architectures. The existing riscv hardware supports either 38- or 47-bit userspace VAs, but a lot of people test on QEMU, which defaults to 56 bits. As a result, a lot of code is tested with the larger address space. Applications that don't work on the larger address space, like OpenJDK, currently throw an error and exit.
Since riscv does not currently have the address space default to 47 bits, some applications just don't work on 56 bits. We could change the kernel so that these applications start working without the need for them to change their code, but that seems like the kernel is overstepping and fixing binaries rather than providing users tools to fix the binaries themselves.
This mmap flag was an attempt to provide a tool for these applications that work on the existing 47 bit VA hardware to also work on different hardware that supports a 56 bit VA space. After feedback, it looks like a better solution than the mmap flag is to use the personality syscall to set a process wide restriction to 47 bits instead, which matches the 32 bit flag that already exists.
- Charlie
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
Some applications rely on placing data in free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK, an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it mostly working [2].
I also saw libmozjs crashing with 57-bit address space on x86.
These applications work on x86 because x86 does an implicit 47-bit restriction of mmap() addresses that contain a hint address of less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out, as is done on other architectures. This is desirable because only a small class of applications do pointer masking.
You reiterate the argument about "small class of applications". But it makes no sense to me.
With the full address space by default, this small class of applications is going to be *broken* unless they handle the RISC-V case specifically.
On the other hand, if you limit the VA to 128TiB by default (like many architectures do[1]) everything would work without intervention. And if an app needs a wider address space it would get it with the hint opt-in, because that is required on x86-64 anyway. Again, no RISC-V-specific code.
I see no upside with your approach. Just worse user experience.
[1] See va_high_addr_switch test case in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tool...
On 09/09/2024 10:46, Kirill A. Shutemov wrote:
...
You reiterate the argument about "small class of applications". But it makes no sense to me.
Sorry to chime in late on this - I had been considering implementing something like MAP_BELOW_HINT and found this thread.
While the examples of applications that want to use high VA bits and get bitten by future upgrades are not very persuasive, it's worth pointing out that there are a variety of somewhat horrid hacks out there to work around this feature not existing.
E.g. from my brief research into other code:
* Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
* PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
* FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
* pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
* MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
* LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
Thanks, Steve
[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap...
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb0...
[5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=2...
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc...
* Steven Price steven.price@arm.com [241021 09:23]:
...
Sorry to chime in late on this - I had been considering implementing something like MAP_BELOW_HINT and found this thread.
While the examples of applications that want to use high VA bits and get bitten by future upgrades are not very persuasive, it's worth pointing out that there are a variety of somewhat horrid hacks out there to work around this feature not existing.
E.g. from my brief research into other code:
Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?
pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.
The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.
I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating the flexibility exists to implement solutions in different ways using todays API.
I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64.
...
[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap...
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb0...
[5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=2...
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc...
...
Thanks, Liam
Hi Liam,
On 21/10/2024 20:48, Liam R. Howlett wrote:
- Steven Price steven.price@arm.com [241021 09:23]:
...
Sorry to chime in late on this - I had been considering implementing something like MAP_BELOW_HINT and found this thread.
While the examples of applications that want to use high VA bits and get bitten by future upgrades are not very persuasive, it's worth pointing out that there are a variety of somewhat horrid hacks out there to work around this feature not existing.
E.g. from my brief research into other code:
Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?
I'm not entirely sure what point you are making here. Yes an LD_PRELOAD approach could be used instead of a personality type as a 'hack' to preallocate the upper address space. The obvious disadvantage is that you can't (easily) layer LD_PRELOAD so it won't work in the general case.
pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.
My issue with the existing 'solutions' is that they all seem to be fragile:
* Using /proc/self/maps is inherently racy if there could be any other code running in the process at the same time.
* Attempting to map the upper part of the address space only works if done early enough - once an allocation arrives there, there's very little you can robustly do (because the stray allocation might be freed).
* LuaJIT's probing mechanism is probably robust, but it's inefficient - LuaJIT has a fallback of linear probing, followed by no hint (ASLR), followed by pseudo-random probing. I don't know the history of the code but it looks like it's probably been tweaked to try to avoid performance issues.
The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.
Obviously if you actually exhaust the lower limit then any MAP_BELOW_HINT API would also fail - there's really not much that can be done in that case.
I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating the flexibility exists to implement solutions in different ways using todays API.
My intention is to show that today's API doesn't provide a robust way of doing this - although I'm quite happy if you can point me at a robust way with the current API. As I mentioned, my goal is to be able to map memory in a (multithreaded) library with a (ideally configurable) number of VA bits. I don't particularly want to restrict the whole process, just specific allocations.
I had typed up a series similar to this one as a MAP_BELOW flag would fit my use-case well.
I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64.
The riscv issue is a different issue to the one I'm trying to solve. I agree MAP_BELOW_HINT isn't a great fix for that because we already have differences between amd64 and arm64 and obviously no software currently out there uses this new flag.
However, if we had introduced this flag in the past (e.g. if MAP_32BIT had been implemented more generically, across architectures and with a hint value, like this new flag) then we probably wouldn't be in this situation. Applications that want to restrict the VA space would be able to opt-in and be portable across architectures.
Another potential option is a mmap3() which actually allows the caller to place constraints on the VA space (e.g. minimum, maximum and alignment). There's plenty of code out there that has to over-allocate and munmap() the unneeded part for alignment reasons. But I don't have a specific need for that, and I'm guessing you wouldn't be in favour.
Thanks, Steve
...
[1] https://sources.debian.org/src/box64/0.3.0+dfsg-1/src/custommem.c/
[2] https://sources.debian.org/src/php8.2/8.2.24-1/ext/opcache/shared_alloc_mmap...
[3] https://github.com/FEX-Emu/FEX/blob/main/FEXCore/Source/Utils/Allocator.cpp
[4] https://github.com/FEX-Emu/FEX/commit/df2f1ad074e5cdfb19a0bd4639b7604f777fb0...
[5] https://sources.debian.org/src/pmdk/1.13.1-1.1/src/common/mmap_posix.c/?hl=2...
[6] https://sources.debian.org/src/mit-scheme/12.1-3/src/microcode/ux.c/#L826
[7] https://sources.debian.org/src/luajit/2.1.0+openresty20240815-1/src/lj_alloc...
...
Thanks, Liam
* Steven Price steven.price@arm.com [241023 05:31]:
Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit addresses [3][4]
Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?
I'm not entirely sure what point you are making here. Yes an LD_PRELOAD approach could be used instead of a personality type as a 'hack' to preallocate the upper address space. The obvious disadvantage is that you can't (easily) layer LD_PRELOAD so it won't work in the general case.
My point is that riscv could work around the limited number of applications that requires this. It's not really viable for you.
pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
Are they pretty? No. Are they common? No. I'm not sure it's worth plumbing new MM code in for these users.
My issue with the existing 'solutions' is that they all seem to be fragile:
- Using /proc/self/maps is inherently racy if there could be any other code running in the process at the same time.
Yes, it is not thread safe. Parsing text is also undesirable.
- Attempting to map the upper part of the address space only works if done early enough - once an allocation arrives there, there's very little you can robustly do (because the stray allocation might be freed).
- LuaJIT's probing mechanism is probably robust, but it's inefficient - LuaJIT has a fallback of linear probing, followed by no hint (ASLR), followed by pseudo-random probing. I don't know the history of the code but it looks like it's probably been tweaked to try to avoid performance issues.
The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.
Obviously if you actually exhaust the lower limit then any MAP_BELOW_HINT API would also fail - there's really not much that can be done in that case.
Today we reverse the search, so you end up at a higher address (bottom-up vs top-down) - although the direction is arch dependent.
If the allocation is too high/low then you could detect, free, and handle the failure.
I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating the flexibility exists to implement solutions in different ways using todays API.
My intention is to show that today's API doesn't provide a robust way of doing this. Although I'm quite happy if you can point me at a robust way with the current API. As I mentioned my goal is to be able to map memory in a (multithreaded) library with a (ideally configurable) number of VA bits. I don't particularly want to restrict the whole process, just specific allocations.
If you don't need to restrict everything, won't the hint work for your usecase? I must be missing something from your requirements.
I had typed up a series similar to this one as a MAP_BELOW flag would fit my use-case well.
I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64.
The riscv issue is a different issue to the one I'm trying to solve. I agree MAP_BELOW_HINT isn't a great fix for that because we already have differences between amd64 and arm64 and obviously no software currently out there uses this new flag.
However, if we had introduced this flag in the past (e.g. if MAP_32BIT had been implemented more generically, across architectures and with a hint value, like this new flag) then we probably wouldn't be in this situation. Applications that want to restrict the VA space would be able to opt-in and be portable across architectures.
I don't think that's true. Some of the applications want all of the allocations below a certain threshold and by the time they are adding flags to allocations, it's too late. What you are looking for is a counterpart to mmap_min_addr, but for higher addresses? This would have to be set before any of the allocations occur for a specific binary (ie: existing libraries need to be below that threshold too), I think?
Another potential option is a mmap3() which actually allows the caller to place constraints on the VA space (e.g. minimum, maximum and alignment). There's plenty of code out there that has to over-allocate and munmap() the unneeded part for alignment reasons. But I don't have a specific need for that, and I'm guessing you wouldn't be in favour.
You'd probably want control of the direction of the search too.
I think mmap3() would be difficult to have accepted as well.
...
Thanks, Liam
On 23/10/2024 19:10, Liam R. Howlett wrote:
- Steven Price steven.price@arm.com [241023 05:31]:
Box64 seems to have a custom allocator based on reading /proc/self/maps to allocate a block of VA space with a low enough address [1]
PHP has code reading /proc/self/maps - I think this is to find a segment which is close enough to the text segment [2]
FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48-bit addresses [3][4]
Can't the limited number of applications that need to restrict the upper bound use an LD_PRELOAD compatible library to do this?
I'm not entirely sure what point you are making here. Yes an LD_PRELOAD approach could be used instead of a personality type as a 'hack' to preallocate the upper address space. The obvious disadvantage is that you can't (easily) layer LD_PRELOAD so it won't work in the general case.
My point is that riscv could work around the limited number of applications that require this. It's not really viable for you.
Ah ok - thanks for the clarification.
pmdk has some funky code to find the lowest address that meets certain requirements - this does look like an ASLR alternative and probably couldn't directly use MAP_BELOW_HINT, although maybe this suggests we need a mechanism to map without a VA-range? [5]
MIT-Scheme parses /proc/self/maps to find the lowest mapping within a range [6]
LuaJIT uses an approach to 'probe' to find a suitable low address for allocation [7]
Although I did not take a deep dive into each example above, there are some very odd things being done; we will never cover all the use cases with an exact API match. What we have today can be made to work for these users, as they have figured out ways to do it.
Are they pretty? no. Are they common? no. I'm not sure it's worth plumbing in new MM code in for these users.
My issue with the existing 'solutions' is that they all seem to be fragile:
- Using /proc/self/maps is inherently racy if there could be any other
code running in the process at the same time.
Yes, it is not thread safe. Parsing text is also undesirable.
- Attempting to map the upper part of the address space only works if
done early enough - once an allocation arrives there, there's very little you can robustly do (because the stray allocation might be freed).
- LuaJIT's probing mechanism is probably robust, but it's inefficient -
LuaJIT has a fallback of linear probing, followed by no hint (ASLR), followed by pseudo-random probing. I don't know the history of the code but it looks like it's probably been tweaked to try to avoid performance issues.
The biggest benefit I see of MAP_BELOW_HINT is that it would allow a library to get low addresses without causing any problems for the rest of the application. The use case I'm looking at is in a library and therefore a personality mode wouldn't be appropriate (because I don't want to affect the rest of the application). Reading /proc/self/maps is also problematic because other threads could be allocating/freeing at the same time.
As long as you don't exhaust the lower limit you are trying to allocate within - which is exactly the issue riscv is hitting.
Obviously if you actually exhaust the lower limit then any MAP_BELOW_HINT API would also fail - there's really not much that can be done in that case.
Today we reverse the search, so you end up in the higher address (bottom-up vs top-down) - although the direction is arch dependent.
If the allocation is too high/low then you could detect, free, and handle the failure.
Agreed, that's fine.
I understand that you are providing examples to prove that this is needed, but I feel like you are better demonstrating the flexibility exists to implement solutions in different ways using today's API.
My intention is to show that today's API doesn't provide a robust way of doing this. Although I'm quite happy if you can point me at a robust way with the current API. As I mentioned my goal is to be able to map memory in a (multithreaded) library with a (ideally configurable) number of VA bits. I don't particularly want to restrict the whole process, just specific allocations.
If you don't need to restrict everything, won't the hint work for your usecase? I must be missing something from your requirements.
The hint only works if the hint address is actually free. Otherwise mmap() falls back to as if the hint address wasn't specified.
E.g.

    for (int i = 0; i < 2; i++) {
        void *addr = mmap((void *)(1UL << 32), PAGE_SIZE, PROT_NONE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        printf("%p\n", addr);
    }

Prints something like:

    0x100000000
    0x7f20d21e0000
The hint is ignored for the second mmap() because there's already a VMA at the hint address.
So the question is how to generate a hint value that is (or has a high likelihood of being) empty. AFAICT this is the LuaJIT approach, but LuaJIT picks random values in the hope of getting a free address (and then works linearly up for subsequent allocations). Which doesn't meet my idea of "robust".
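FWIW, a probing loop can at least be made race-free with MAP_FIXED_NOREPLACE (Linux 4.17+), which fails with EEXIST instead of falling back to an arbitrary address. A sketch of that idea (not LuaJIT's actual code):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Probe pseudo-random page-aligned addresses below `limit` until one is
 * free.  MAP_FIXED_NOREPLACE (Linux 4.17+) fails with EEXIST instead of
 * silently relocating the mapping, so a success really is below the
 * limit even if other threads are mapping concurrently. */
void *probe_below(uintptr_t limit, size_t len, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        uintptr_t addr =
            ((uintptr_t)rand() % ((limit - len) >> 12)) << 12;
        if (addr < ((uintptr_t)1 << 20))
            continue;                  /* stay clear of mmap_min_addr */
        void *p = mmap((void *)addr, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                       -1, 0);
        if (p != MAP_FAILED) {
            if ((uintptr_t)p == addr)
                return p;
            munmap(p, len);            /* pre-4.17 kernel ignored the flag */
        }
        /* EEXIST or other failure: just try another address */
    }
    return MAP_FAILED;
}
```

It is race-free but still probabilistic, and it doesn't help an application that wants the kernel's own search to respect an upper bound.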
I had typed up a series similar to this one as a MAP_BELOW flag would fit my use-case well.
I think it would be best to use the existing methods and work around the issue that was created in riscv while future changes could mirror amd64 and arm64.
The riscv issue is a different issue to the one I'm trying to solve. I agree MAP_BELOW_HINT isn't a great fix for that because we already have differences between amd64 and arm64 and obviously no software currently out there uses this new flag.
However, if we had introduced this flag in the past (e.g. if MAP_32BIT had been implemented more generically, across architectures and with a hint value, like this new flag) then we probably wouldn't be in this situation. Applications that want to restrict the VA space would be able to opt-in and be portable across architectures.
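For comparison, MAP_32BIT as it exists today illustrates the problem: x86-64 only, and hard-wired to the low 2GiB with no way to ask for a different width:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* MAP_32BIT (x86-64 Linux only) forces the mapping into the low 2GiB.
 * There is no way to request a different bit-width and no equivalent
 * flag on other architectures - the inflexibility a generic, hint-based
 * flag would address. */
void *map_low_2g(size_t len)
{
#ifdef MAP_32BIT
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
#else
    return MAP_FAILED;   /* flag not available on this architecture */
#endif
}
```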
I don't think that's true. Some of the applications want all of the allocations below a certain threshold and by the time they are adding flags to allocations, it's too late. What you are looking for is a counterpart to mmap_min_addr, but for higher addresses? This would have to be set before any of the allocations occur for a specific binary (ie: existing libraries need to be below that threshold too), I think?
Well that's not what *I* am looking for. A mmap_max_addr might be useful for others for the purpose of restricting all allocations.
I think there are roughly three classes of application:
1. Applications which do nothing special with pointers. This is most applications and they could benefit from any future expansions to the VA size without any modification. E.g. if a full 64-bit VA space were somehow available they could deal with it today (without recompilation).
2. Applications which need VA addresses to meet certain requirements. They might be emulating another architecture (e.g. FEX) and want pointers that can be exposed to the emulation. They might be aware of restrictions in JIT code (e.g. PHP). Or they might want to store pointers in 'weird' ways which involve fewer bits - AFAICT that's the LuaJIT situation. These applications are usually well aware that they are doing something "unusual" and would likely use a Linux API if it existed.
3. Applications which abuse the top bits of a VA because they've read the architecture documentation and they "know" that the VA space is limited.
Class 3 would benefit from mmap_max_addr - either because the architecture has been extended (although that's been worked around by requiring the hint value to allocate into the top address space) or because they get ported to another architecture (which I believe is the riscv issue). There is some argument that these applications are buggy, but "we don't break userspace", so we deal with them in the kernel until they get ported, and then ideally the bugs are fixed.
Class 1 is the applications we know and love, they don't need anything special.
Class 2 is the case I care about. The application knows it wants special addresses, and in the cases I've detailed there has been significant code written to try to achieve this. But the kernel isn't currently providing a good mechanism to do this.
Another potential option is a mmap3() which actually allows the caller to place constraints on the VA space (e.g. minimum, maximum and alignment). There's plenty of code out there that has to over-allocate and munmap() the unneeded part for alignment reasons. But I don't have a specific need for that, and I'm guessing you wouldn't be in favour.
You'd probably want control of the direction of the search too.
Very true, and one of the reasons I don't want to do a mmap3() is that I'm pretty sure I'd miss something.
I think mmap3() would be difficult to have accepted as well.
And that's the other major reason ;)
Thanks,
Steve
...
Thanks, Liam
linux-kselftest-mirror@lists.linaro.org