On Thu, 29 Aug 2024 02:02:34 PDT (-0700), vbabka@suse.cz wrote:
Such a large recipient list and no linux-api. CC'd, please include it on future postings.
On 8/29/24 09:15, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to fit within the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to 47-bit addresses whenever the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), a flag would allow users to opt in to this behavior rather than opt out as is done on other architectures. This is desirable because it is a small class of applications that do pointer masking.
I doubt it's desirable to have different behavior depending on architecture. Also you could say it's a small class of applications that need more than 47 bits.
We're sort of stuck with the architecture-dependent behavior here: for the first few years RISC-V only had 39-bit VAs, so the de facto uABI ended up being that userspace can ignore way more bits. While 48 bits might be enough for everyone, 39 doesn't seem to be -- or at least IIRC when we tried restricting the default to that, we broke stuff. There are also some other wrinkles like arbitrary bit boundaries in pointer masking and vendor-specific paging formats, but at some point we just end up down a rabbit hole of insanity there...
FWIW, I think that userspace depending on just tossing some VA bits because some kernels happened to never allocate from them is just broken, but it seems like other ports worked around the 48->57 bit transition and we're trying to do something similar for 39->48 (and that works with 49->57, as we'll have to deal with that eventually).
So that's basically how we ended up with this sort of thing: trying to do something similar without a flag broke userspace because we were trying to jam too much into the hints. I couldn't really figure out a way to satisfy all the userspace constraints by just implicitly retrofitting behavior based on the hints, so we figured having an explicit flag to control the behavior would be the sanest way to go.
That said: I'm not opposed to just saying "depending on 39-bit VAs is broken" and just forcing people to fix it.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that it does not have to be handled individually by each architecture.
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
To: Arnd Bergmann arnd@arndb.de
To: Richard Henderson richard.henderson@linaro.org
To: Ivan Kokshaysky ink@jurassic.park.msu.ru
To: Matt Turner mattst88@gmail.com
To: Vineet Gupta vgupta@kernel.org
To: Russell King linux@armlinux.org.uk
To: Guo Ren guoren@kernel.org
To: Huacai Chen chenhuacai@kernel.org
To: WANG Xuerui kernel@xen0n.name
To: Thomas Bogendoerfer tsbogend@alpha.franken.de
To: James E.J. Bottomley James.Bottomley@HansenPartnership.com
To: Helge Deller deller@gmx.de
To: Michael Ellerman mpe@ellerman.id.au
To: Nicholas Piggin npiggin@gmail.com
To: Christophe Leroy christophe.leroy@csgroup.eu
To: Naveen N Rao naveen@kernel.org
To: Alexander Gordeev agordeev@linux.ibm.com
To: Gerald Schaefer gerald.schaefer@linux.ibm.com
To: Heiko Carstens hca@linux.ibm.com
To: Vasily Gorbik gor@linux.ibm.com
To: Christian Borntraeger borntraeger@linux.ibm.com
To: Sven Schnelle svens@linux.ibm.com
To: Yoshinori Sato ysato@users.sourceforge.jp
To: Rich Felker dalias@libc.org
To: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de
To: David S. Miller davem@davemloft.net
To: Andreas Larsson andreas@gaisler.com
To: Thomas Gleixner tglx@linutronix.de
To: Ingo Molnar mingo@redhat.com
To: Borislav Petkov bp@alien8.de
To: Dave Hansen dave.hansen@linux.intel.com
To: x86@kernel.org
To: H. Peter Anvin hpa@zytor.com
To: Andy Lutomirski luto@kernel.org
To: Peter Zijlstra peterz@infradead.org
To: Muchun Song muchun.song@linux.dev
To: Andrew Morton akpm@linux-foundation.org
To: Liam R. Howlett Liam.Howlett@oracle.com
To: Vlastimil Babka vbabka@suse.cz
To: Lorenzo Stoakes lorenzo.stoakes@oracle.com
To: Shuah Khan shuah@kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all changes to architecture-specific code and factored the logic into generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
Charlie Jenkins (4):
      mm: Add MAP_BELOW_HINT
      mm: Add hint and mmap_flags to struct vm_unmapped_area_info
      mm: Support MAP_BELOW_HINT in vm_unmapped_area()
      selftests/mm: Create MAP_BELOW_HINT test

 arch/alpha/kernel/osf_sys.c                  |  2 ++
 arch/arc/mm/mmap.c                           |  3 +++
 arch/arm/mm/mmap.c                           |  7 ++++++
 arch/csky/abiv1/mmap.c                       |  3 +++
 arch/loongarch/mm/mmap.c                     |  3 +++
 arch/mips/mm/mmap.c                          |  3 +++
 arch/parisc/kernel/sys_parisc.c              |  3 +++
 arch/powerpc/mm/book3s64/slice.c             |  7 ++++++
 arch/s390/mm/hugetlbpage.c                   |  4 ++++
 arch/s390/mm/mmap.c                          |  6 ++++++
 arch/sh/mm/mmap.c                            |  6 ++++++
 arch/sparc/kernel/sys_sparc_32.c             |  3 +++
 arch/sparc/kernel/sys_sparc_64.c             |  6 ++++++
 arch/sparc/mm/hugetlbpage.c                  |  4 ++++
 arch/x86/kernel/sys_x86_64.c                 |  6 ++++++
 arch/x86/mm/hugetlbpage.c                    |  4 ++++
 fs/hugetlbfs/inode.c                         |  4 ++++
 include/linux/mm.h                           |  2 ++
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/mmap.c                                    |  9 ++++++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 tools/testing/selftests/mm/Makefile          |  1 +
 tools/testing/selftests/mm/map_below_hint.c  | 32 ++++++++++++++++++++++++++++
 23 files changed, 120 insertions(+)

base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55