Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to 47-bit addresses whenever the hint address is less than 48 bits.
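To make the "free bits" idea concrete, here is a minimal sketch of pointer tagging, assuming every user address fits in the low 47 bits. The helper names (tag_pointer, untag_pointer, pointer_tag) and the choice of a 16-bit tag in bits 48..63 are illustrative assumptions, not code taken from OpenJDK or Go.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: user addresses use at most 47 bits, so bits 47..63 are free. */
#define VA_BITS   47
#define ADDR_MASK ((UINT64_C(1) << VA_BITS) - 1)

/* Pack a 16-bit tag into bits 48..63 of an address (hypothetical helper). */
static inline uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
	/* Only valid when the address really stays below the 47-bit boundary. */
	assert((addr & ~ADDR_MASK) == 0);
	return addr | ((uint64_t)tag << 48);
}

/* Recover the original address by masking off everything above bit 46. */
static inline uint64_t untag_pointer(uint64_t tagged)
{
	return tagged & ADDR_MASK;
}

/* Extract the tag stored in the upper 16 bits. */
static inline uint16_t pointer_tag(uint64_t tagged)
{
	return (uint16_t)(tagged >> 48);
}
```

If mmap() ever returned an address at or above the 47-bit boundary, the assertion in tag_pointer() would fire, which is the failure mode this series is trying to let applications rule out.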
Instead of implicitly restricting the address space on riscv (or any current/future architecture), provide a flag to the personality syscall that can be used to ensure an application works in any arbitrary VA space. A similar feature has already been implemented by the personality syscall in ADDR_LIMIT_32BIT.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
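As a rough sketch of how userspace might opt in, assuming the flag keeps the value proposed in this series: the macro PROPOSED_ADDR_LIMIT_47BIT and both helper names below are hypothetical, and the flag is not present in released uapi headers. Only personality(2) itself and its 0xffffffff query convention are existing interfaces.

```c
#include <assert.h>
#include <sys/personality.h>

/* Assumption: value copied from the proposed patch, not from a released header. */
#define PROPOSED_ADDR_LIMIT_47BIT 0x10000000UL

/* Pure helper: add the proposed flag while preserving all other persona bits. */
static unsigned long with_addr_limit_47bit(unsigned long persona)
{
	return persona | PROPOSED_ADDR_LIMIT_47BIT;
}

/* Query the current persona, then set it again with the proposed flag added.
 * Returns 0 on success and -1 on failure. On a kernel without this series
 * the extra bit would be stored but would have no effect. */
static int request_47bit_limit(void)
{
	int cur = personality(0xffffffffUL);	/* 0xffffffff queries without changing */

	if (cur == -1)
		return -1;
	if (personality(with_addr_limit_47bit((unsigned int)cur)) == -1)
		return -1;
	return 0;
}
```

Because the persona is a bitmask, the helper ORs the new bit into the current value instead of overwriting it, so an already-set flag such as ADDR_NO_RANDOMIZE is kept.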
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
To: Arnd Bergmann arnd@arndb.de To: Richard Henderson richard.henderson@linaro.org To: Ivan Kokshaysky ink@jurassic.park.msu.ru To: Matt Turner mattst88@gmail.com To: Vineet Gupta vgupta@kernel.org To: Russell King linux@armlinux.org.uk To: Guo Ren guoren@kernel.org To: Huacai Chen chenhuacai@kernel.org To: WANG Xuerui kernel@xen0n.name To: Thomas Bogendoerfer tsbogend@alpha.franken.de To: James E.J. Bottomley James.Bottomley@HansenPartnership.com To: Helge Deller deller@gmx.de To: Michael Ellerman mpe@ellerman.id.au To: Nicholas Piggin npiggin@gmail.com To: Christophe Leroy christophe.leroy@csgroup.eu To: Naveen N Rao naveen@kernel.org To: Alexander Gordeev agordeev@linux.ibm.com To: Gerald Schaefer gerald.schaefer@linux.ibm.com To: Heiko Carstens hca@linux.ibm.com To: Vasily Gorbik gor@linux.ibm.com To: Christian Borntraeger borntraeger@linux.ibm.com To: Sven Schnelle svens@linux.ibm.com To: Yoshinori Sato ysato@users.sourceforge.jp To: Rich Felker dalias@libc.org To: John Paul Adrian Glaubitz glaubitz@physik.fu-berlin.de To: David S. Miller davem@davemloft.net To: Andreas Larsson andreas@gaisler.com To: Thomas Gleixner tglx@linutronix.de To: Ingo Molnar mingo@redhat.com To: Borislav Petkov bp@alien8.de To: Dave Hansen dave.hansen@linux.intel.com To: x86@kernel.org To: H. Peter Anvin hpa@zytor.com To: Andy Lutomirski luto@kernel.org To: Peter Zijlstra peterz@infradead.org To: Muchun Song muchun.song@linux.dev To: Andrew Morton akpm@linux-foundation.org To: Liam R. Howlett Liam.Howlett@oracle.com To: Vlastimil Babka vbabka@suse.cz To: Lorenzo Stoakes lorenzo.stoakes@oracle.com To: Shuah Khan shuah@kernel.org To: Christoph Hellwig hch@infradead.org To: Michal Hocko mhocko@suse.com To: "Kirill A. 
Shutemov" kirill@shutemov.name To: Chris Torek chris.torek@gmail.com Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-snps-arc@lists.infradead.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-csky@vger.kernel.org Cc: loongarch@lists.linux.dev Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-kselftest@vger.kernel.org Cc: linux-abi-devel@lists.sourceforge.net Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...

Changes in v3:
- Use a personality flag instead of an mmap flag
- Link to v2: https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9eae0...
---
Charlie Jenkins (2):
      mm: Add personality flag to limit address to 47 bits
      selftests/mm: Create ADDR_LIMIT_47BIT test

 include/uapi/linux/personality.h                   |  1 +
 mm/mmap.c                                          |  3 ++
 tools/testing/selftests/mm/.gitignore              |  1 +
 tools/testing/selftests/mm/Makefile                |  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++++++++++++++++++++++
 5 files changed, 40 insertions(+)
---
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
---
 include/uapi/linux/personality.h | 1 +
 mm/mmap.c                        | 3 +++
 2 files changed, 4 insertions(+)
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..cd3b8c154d9b 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@ enum {
 	WHOLE_SECONDS =		0x2000000,
 	STICKY_TIMEOUTS =	0x4000000,
 	ADDR_LIMIT_3GB =	0x8000000,
+	ADDR_LIMIT_47BIT =	0x10000000,
 };

 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..a5c7544853e5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1766,6 +1766,9 @@ unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info)
 {
 	unsigned long addr;

+	if (current->personality & ADDR_LIMIT_47BIT)
+		info->high_limit = MIN(info->high_limit, BIT(47) - 1);
+
 	if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
 		addr = unmapped_area_topdown(info);
 	else
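The effect of the three added lines can be sketched as a small userspace model. This is not kernel code: MODEL_ADDR_LIMIT_47BIT, BIT47_LIMIT and clamp_high_limit() are hypothetical names that mirror the MIN() clamp in the hunk above, with the flag value taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: same bit value as ADDR_LIMIT_47BIT in the proposed patch. */
#define MODEL_ADDR_LIMIT_47BIT UINT64_C(0x10000000)
/* Highest address allowed under the flag, i.e. BIT(47) - 1. */
#define BIT47_LIMIT ((UINT64_C(1) << 47) - 1)

/* Model of the clamp: with the flag set, the search ceiling becomes
 * min(high_limit, BIT(47) - 1); without it, the ceiling is unchanged. */
static uint64_t clamp_high_limit(uint64_t high_limit, uint64_t persona)
{
	if ((persona & MODEL_ADDR_LIMIT_47BIT) && high_limit > BIT47_LIMIT)
		return BIT47_LIMIT;
	return high_limit;
}
```

The clamp only ever lowers the ceiling, so a search window that is already below the 47-bit boundary is left as it is.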
Charlie Jenkins charlie@rivosinc.com writes:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 +++ 2 files changed, 4 insertions(+)
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I wonder if ADDR_LIMIT_128T would be clearer?
Have you looked at writing an update for the personality(2) man page? :)
cheers
On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote:
Charlie Jenkins charlie@rivosinc.com writes:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 +++ 2 files changed, 4 insertions(+)
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
Have you looked at writing an update for the personality(2) man page? :)
I will write an update to the man page if this patch is approved!
cheers
- Charlie
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000, };
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
128T is 128 Terabytes, which is the maximum size achievable with a 47-bit address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabyte limit.
Christophe
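The arithmetic behind the 128T figure can be checked with a short sketch. The helper name va_space_tib() is hypothetical, and "Terabytes" is read here as binary units (1 TiB = 2^40 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Size, in TiB (2^40 bytes), of an address space that uses `bits` address bits. */
static uint64_t va_space_tib(unsigned int bits)
{
	assert(bits >= 40 && bits < 64);	/* keep both shifts well defined */
	return (UINT64_C(1) << bits) >> 40;	/* 2^bits / 2^40 */
}
```

With bits = 47 this gives 2^47 / 2^40 = 2^7 = 128, which is where the 128T in the suggested name comes from.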
Hi Christophe,
On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy christophe.leroy@csgroup.eu wrote:
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000, };
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
128T is 128 Terabytes, which is the maximum size achievable with a 47-bit address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabyte limit.
Hence ADDR_LIMIT_128TB?
Gr{oetje,eeting}s,
Geert
Geert Uytterhoeven geert@linux-m68k.org writes:
Hi Christophe,
On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy christophe.leroy@csgroup.eu wrote:
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000, };
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
128T is 128 Terabytes, which is the maximum size achievable with a 47-bit address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabyte limit.
Hence ADDR_LIMIT_128TB?
Yes it should be 128TB. Typo by me.
cheers
On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
Geert Uytterhoeven geert@linux-m68k.org writes:
Hi Christophe,
On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy christophe.leroy@csgroup.eu wrote:
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000, };
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
128T is 128 Terabytes, which is the maximum size achievable with a 47-bit address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabyte limit.
Hence ADDR_LIMIT_128TB?
Yes it should be 128TB. Typo by me.
cheers
47BIT was selected because the use case for this flag is applications that want to store data in the upper bits of a virtual address. In this case, the size of the virtual address space is irrelevant; what matters is only the number of bits that are being used, and hence the number of bits that are free.
- Charlie
Charlie Jenkins charlie@rivosinc.com writes:
On Wed, Sep 11, 2024 at 11:38:55PM +1000, Michael Ellerman wrote:
Geert Uytterhoeven geert@linux-m68k.org writes:
Hi Christophe,
On Tue, Sep 10, 2024 at 11:21 AM Christophe Leroy christophe.leroy@csgroup.eu wrote:
> diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h > index 49796b7756af..cd3b8c154d9b 100644 > --- a/include/uapi/linux/personality.h > +++ b/include/uapi/linux/personality.h > @@ -22,6 +22,7 @@ enum { > WHOLE_SECONDS = 0x2000000, > STICKY_TIMEOUTS = 0x4000000, > ADDR_LIMIT_3GB = 0x8000000, > + ADDR_LIMIT_47BIT = 0x10000000, > };
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
128T is 128 Terabytes, which is the maximum size achievable with a 47-bit address. That naming would be more consistent with the ADDR_LIMIT_3GB just above, which means a 3 Gigabyte limit.
Hence ADDR_LIMIT_128TB?
Yes it should be 128TB. Typo by me.
47BIT was selected because the use case for this flag is applications that want to store data in the upper bits of a virtual address. In this case, the size of the virtual address space is irrelevant; what matters is only the number of bits that are being used, and hence the number of bits that are free.
Yeah I understand that's how you came to the problem.
But for the user API I think using the size of the address space is clearer, easier to explain, and matches the existing ADDR_LIMIT_3GB.
cheers
Charlie Jenkins charlie@rivosinc.com writes:
On Fri, Sep 06, 2024 at 04:59:40PM +1000, Michael Ellerman wrote:
Charlie Jenkins charlie@rivosinc.com writes:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
include/uapi/linux/personality.h | 1 + mm/mmap.c | 3 +++ 2 files changed, 4 insertions(+)
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h index 49796b7756af..cd3b8c154d9b 100644 --- a/include/uapi/linux/personality.h +++ b/include/uapi/linux/personality.h @@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I wonder if ADDR_LIMIT_128T would be clearer?
I don't follow, what does 128T represent?
Sorry, as Christophe explained it's 128 Terabytes, which is the actual value of the address limit.
I think expressing it as the address value is probably more widely understood, and would also match ADDR_LIMIT_3GB.
Have you looked at writing an update for the personality(2) man page? :)
I will write an update to the man page if this patch is approved!
Yeah fair enough.
My (poorly expressed) point was that trying to describe the flag for the man page might highlight that using the 47BIT name requires more explanation.
cheers
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously.
@@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit.
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
For some reason, it appears that the arch_get_mmap_end() logic on RISC-V defaults to the maximum address space for the 'addr==0' case, which is inconsistent with the other architectures, so we should probably fix that part first, possibly moving more of that logic into a shared implementation.
Arnd
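For readers unfamiliar with the hint-based behaviour mentioned above, it can be modelled roughly as follows. This is a simplified userspace sketch in the style of the arm64 arch_get_mmap_end() macro. The function name model_mmap_end() and its parameters are hypothetical, and the window and task sizes used with it are illustrative values, not any architecture's real constants.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a hint-based mmap ceiling: a hint above the default window opts
 * in to the full task size, while any other hint (including 0) stays within
 * the default window. */
static uint64_t model_mmap_end(uint64_t hint, uint64_t default_window,
			       uint64_t task_size)
{
	return hint > default_window ? task_size : default_window;
}
```

In this model a call with hint 0 never searches above the default window. The RISC-V behaviour described above differs in that the 'addr==0' case already uses the maximum address space.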
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously.
The thread on the original proposals attests to this being rather a fraught topic, and I think the weight of opinion was more so in favour of opt-in rather than opt-out.
@@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit.
Since I'm the one who suggested it, I feel I should offer some kind of vague defence here :)
We shouldn't let perfect be the enemy of the good. This is a relatively straightforward means of achieving the aim (assuming your concern about arch_get_mmap_end() below isn't a blocker) which has the least impact on existing code.
Of course we can end up in absurdities where we start doing ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an egregious maintenance burden and is entirely opt-in so has things going for it.
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
How does ADDR_LIMIT_3GB presently interact with that?
For some reason, it appears that the arch_get_mmap_end() logic on RISC-V defaults to the maximum address space for the 'addr==0' case, which is inconsistent with the other architectures, so we should probably fix that part first, possibly moving more of that logic into a shared implementation.
Arnd
On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously.
The thread on the original proposals attests to this being rather a fraught topic, and I think the weight of opinion was more so in favour of opt-in rather than opt-out.
You mean opt-in to using the larger addresses like we do on arm64 and powerpc, while "opt-out" means a limit as Charlie suggested?
@@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit.
Since I'm the one who suggested it, I feel I should offer some kind of vague defence here :)
We shouldn't let perfect be the enemy of the good. This is a relatively straightforward means of achieving the aim (assuming your concern about arch_get_mmap_end() below isn't a blocker) which has the least impact on existing code.
Of course we can end up in absurdities where we start doing ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an egregious maintenance burden and is entirely opt-in so has things going for it.
I'm more confused now, I think most importantly we should try to handle this consistently across all architectures. The proposed implementation seems to completely block addresses above BIT(47) even for applications that opt in by calling mmap(BIT(47), ...), which seems to break the existing applications.
If we want this flag for RISC-V and also keep the behavior of defaulting to >BIT(47) addresses for mmap(0, ...) how about changing arch_get_mmap_end() to return the limit based on ADDR_LIMIT_47BIT and then make this default to enabled on arm64 and powerpc but disabled on riscv?
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
How does ADDR_LIMIT_3GB presently interact with that?
That is x86 specific and only relevant to compat tasks, limiting them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which on arm32 is always set in practice to allow 32-bit addressing as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are 32-bit only).
Arnd
(Sorry, having issues with my IPv6 setup that duplicated the original email...)
On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously.
The thread on the original proposals attests to this being rather a fraught topic, and I think the weight of opinion was more so in favour of opt-in rather than opt-out.
You mean opt-in to using the larger addresses like we do on arm64 and powerpc, while "opt-out" means a limit as Charlie suggested?
I guess I'm not using brilliant terminology here haha!
To clarify - the weight of opinion was for a situation where the address space is limited, except if you set a hint above that (you could call that opt-out or opt-in depending which way you look at it, so yeah ok very unclear sorry!).
It was against the MAP_ flag and also I think a _flexible_ per-process limit is also questionable as you might end up setting a limit which breaks something else, and this starts getting messy quick.
To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and practical suggestion.
@@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit.
Since I'm the one who suggested it, I feel I should offer some kind of vague defence here :)
We shouldn't let perfect be the enemy of the good. This is a relatively straightforward means of achieving the aim (assuming your concern about arch_get_mmap_end() below isn't a blocker) which has the least impact on existing code.
Of course we can end up in absurdities where we start doing ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an egregious maintenance burden and is entirely opt-in so has things going for it.
I'm more confused now, I think most importantly we should try to handle this consistently across all architectures. The proposed implementation seems to completely block addresses above BIT(47) even for applications that opt in by calling mmap(BIT(47), ...), which seems to break the existing applications.
Hm, I thought the commit message suggested the hint overrides it still?
The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! :)
So you really wouldn't want this if you are using tagged pointers, you'd want to be sure literally nothing touches the higher bits.
If we want this flag for RISC-V and also keep the behavior of defaulting to >BIT(47) addresses for mmap(0, ...) how about changing arch_get_mmap_end() to return the limit based on ADDR_LIMIT_47BIT and then make this default to enabled on arm64 and powerpc but disabled on riscv?
But you wouldn't necessarily want all processes to be so restricted, I think this is what Charlie's trying to avoid :)
On the other hand - I'm not sure there are many processes on any arch that'd want the higher mappings.
So that'd push us again towards risc v just limiting to 48-bits and only mapping above this if a hint is provided like x86-64 does (and as you mentioned via irc - it seems risc v is an outlier in that DEFAULT_MAP_WINDOW == TASK_SIZE).
This would be more consistent vs. other arches.
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
How does ADDR_LIMIT_3GB presently interact with that?
That is x86 specific and only relevant to compat tasks, limiting them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which on arm32 is always set in practice to allow 32-bit addressing as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are 32-bit only).
OK, I understand what it's for, I missed it was arch-specific bit, urgh.
I'd say this limit should be min of the arch-specific limit vs. the 48-bit limit. If you have a 36-bit address space obviously it'd be rather unwise to try to provide 48 bit addresses..
Arnd
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
(Sorry, having issues with my IPv6 setup that duplicated the original email...)
On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 08:14, Lorenzo Stoakes wrote:
On Fri, Sep 06, 2024 at 07:17:44AM GMT, Arnd Bergmann wrote:
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously.
The thread on the original proposals attests to this being rather a fraught topic, and I think the weight of opinion was more so in favour of opt-in rather than opt-out.
You mean opt-in to using the larger addresses like we do on arm64 and powerpc, while "opt-out" means a limit as Charlie suggested?
I guess I'm not using brilliant terminology here haha!
To clarify - the weight of opinion was for a situation where the address space is limited, except if you set a hint above that (you could call that opt-out or opt-in depending which way you look at it, so yeah ok very unclear sorry!).
It was against the MAP_ flag and also I think a _flexible_ per-process limit is also questionable as you might end up setting a limit which breaks something else, and this starts getting messy quick.
To be clear, the ADDR_LIMIT_47BIT suggestion is absolutely a compromise and practical suggestion.
@@ -22,6 +22,7 @@ enum { WHOLE_SECONDS = 0x2000000, STICKY_TIMEOUTS = 0x4000000, ADDR_LIMIT_3GB = 0x8000000,
+ ADDR_LIMIT_47BIT = 0x10000000,
};
I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit.
Since I'm the one who suggested it, I feel I should offer some kind of vague defence here :)
We shouldn't let perfect be the enemy of the good. This is a relatively straightforward means of achieving the aim (assuming your concern about arch_get_mmap_end() below isn't a blocker) which has the least impact on existing code.
Of course we can end up in absurdities where we start doing ADDR_LIMIT_xxBIT... but again - it's simple, shouldn't represent an egregious maintenance burden and is entirely opt-in so has things going for it.
I'm more confused now, I think most importantly we should try to handle this consistently across all architectures. The proposed implementation seems to completely block addresses above BIT(47) even for applications that opt in by calling mmap(BIT(47), ...), which seems to break the existing applications.
Hm, I thought the commit message suggested the hint overrides it still?
The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! :)
So you really wouldn't want this if you are using tagged pointers, you'd want to be sure literally nothing touches the higher bits.
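To make the tagging concern concrete, the following sketch packs a 16-bit tag into the bits above bit 46 of a 64-bit address. It is an illustration only: the 47-bit split and all function names are assumptions, and real runtimes such as Go or OpenJDK use their own layouts.

```c
#include <stdint.h>

/* Assumption: addresses fit in 47 bits, as ADDR_LIMIT_47BIT would ensure. */
#define VA_BITS_LIMIT 47
#define ADDR_MASK ((UINT64_C(1) << VA_BITS_LIMIT) - 1)

/* 1 if the address uses no bits at or above bit 47. */
int fits_below_limit(uint64_t addr)
{
	return (addr & ~ADDR_MASK) == 0;
}

/* Store the tag in bits 47..62; any address bits there are discarded. */
uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
	return (addr & ADDR_MASK) | ((uint64_t)tag << VA_BITS_LIMIT);
}

/* Recover the address part of a tagged value. */
uint64_t untag_pointer(uint64_t tagged)
{
	return tagged & ADDR_MASK;
}

/* Recover the tag part of a tagged value. */
uint16_t pointer_tag(uint64_t tagged)
{
	return (uint16_t)(tagged >> VA_BITS_LIMIT);
}
```

The round trip only works when fits_below_limit() holds for the original address. If the kernel hands out a mapping above the limit, untag_pointer() silently returns a different address, which is the failure the discussion is trying to rule out.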
Various architectures handle the hint address differently, but it appears that the only case across any architecture where an address above 47 bits will be returned is if the application had a hint address with a value greater than 47 bits and was using the MAP_FIXED flag. MAP_FIXED bypasses all other checks so I was assuming that it would be logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, then the intent is for no hint address to cause a value greater than 47 bits to be returned.
This does have the issue that if MAP_FIXED is used then an address can be returned above 47-bits, but if an application does not want addresses above 47-bits then they shouldn't ask for a fixed address above that range.
If we want this flag for RISC-V and also keep the behavior of defaulting to >BIT(47) addresses for mmap(0, ...) how about changing arch_get_mmap_end() to return the limit based on ADDR_LIMIT_47BIT and then make this default to enabled on arm64 and powerpc but disabled on riscv?
But you wouldn't necessarily want all processes to be so restricted, I think this is what Charlie's trying to avoid :)
On the other hand - I'm not sure there are many processes on any arch that'd want the higher mappings.
So that'd push us again towards RISC-V just limiting to 48-bits and only mapping above this if a hint is provided like x86-64 does (and as you mentioned via irc - it seems RISC-V is an outlier in that DEFAULT_MAP_WINDOW == TASK_SIZE).
This would be more consistent vs. other arches.
Yes riscv is an outlier here. The reason I am pushing for something like a flag to restrict the address space rather than setting it to be the default is it seems like if applications are relying on upper bits to be free, then they should be explicitly asking the kernel to keep them free rather than assuming them to be free.
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
How does ADDR_LIMIT_3GB presently interact with that?
That is x86 specific and only relevant to compat tasks, limiting them to 3 instead of 4 GB. There is also ADDR_LIMIT_32BIT, which on arm32 is always set in practice to allow 32-bit addressing as opposed to ARMv2 style 26-bit addressing (IIRC ARMv3 supported both 26-bit and 32-bit addressing, while ARMv4 through ARMv7 are 32-bit only).
OK, I understand what it's for, I missed it was arch-specific bit, urgh.
I'd say this limit should be min of the arch-specific limit vs. the 48-bit limit. If you have a 36-bit address space obviously it'd be rather unwise to try to provide 48 bit addresses..
In this patch I set the high limit to be the minimum of the provided high limit and 47 bits so I think that should cover this case?
- Charlie
Arnd
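The clamping rule described just above ("the minimum of the provided high limit and 47 bits") can be modelled in a few lines. This is a standalone model under stated assumptions, not the code from the patch: the function name and its boolean parameter are hypothetical, and the MAP_FIXED bypass is left out because fixed mappings do not go through the search at all.

```c
#include <stdint.h>

#define LIMIT_47BIT (UINT64_C(1) << 47)

/*
 * Hypothetical model: with the flag set, the search ceiling becomes
 * min(arch_high_limit, 2^47); without it, the arch limit is unchanged.
 */
uint64_t effective_high_limit(uint64_t arch_high_limit, int limit_47bit_set)
{
	if (limit_47bit_set && arch_high_limit > LIMIT_47BIT)
		return LIMIT_47BIT;
	return arch_high_limit;
}
```

Taking the minimum covers the small-address-space case raised earlier: an architecture with a 36-bit space keeps its own, lower ceiling even when the flag is set.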
On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! :)
So you really wouldn't want this if you are using tagged pointers, you'd want to be sure literally nothing touches the higher bits.
My understanding was that the purpose of the existing design is to allow applications to ask for a high address without having to resort to the complexity of MAP_FIXED.
In particular, I'm sure there is precedent for applications that want both tagged pointers (for most mappings) and untagged pointers (for large mappings). With a per-mm_struct or per-task_struct setting you can't do that.
Various architectures handle the hint address differently, but it appears that the only case across any architecture where an address above 47 bits will be returned is if the application had a hint address with a value greater than 47 bits and was using the MAP_FIXED flag. MAP_FIXED bypasses all other checks so I was assuming that it would be logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, then the intent is for no hint address to cause a value greater than 47 bits to be returned.
I don't think the MAP_FIXED case is that interesting here because it has to work in both fixed and non-fixed mappings.
This would be more consistent vs. other arches.
Yes riscv is an outlier here. The reason I am pushing for something like a flag to restrict the address space rather than setting it to be the default is it seems like if applications are relying on upper bits to be free, then they should be explicitly asking the kernel to keep them free rather than assuming them to be free.
Let's see what the other architectures do and then come up with a way that fixes the pointer tagging case first on those that are broken. We can see if there needs to be an extra flag after that. Here is what I found:
- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns higher 52-bit addresses when either a hint is passed or CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns 52 bit address when an addr hint is passed
- riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses it for allocating the stack below, ignoring it for normal mappings
- s390 has no DEFAULT_MAP_WINDOW but tried to allocate in the current number of pgtable levels and only upgrades to the next level (31, 42, 53, 64 bits) if a hint is passed or the current level is exhausted.
- loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA space of 47 bits (16K pages, 3 levels), but can support a 55 bit space (64K pages, 3 levels).
- sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space. It may allocate both positive and negative addresses in there. (?)
- mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and at most 48, 41 or 39 address bits, respectively.
I would suggest these changes:
- make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64 and ppc64, leave it at 47
- add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits based on page size), sparc (48 bits) and s390 (unsure if 42, 53, 47 or 48 bits)
- leave the rest unchanged.
Arnd
On Tue, Sep 10, 2024 at 09:13:33AM +0000, Arnd Bergmann wrote:
On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! :)
So you really wouldn't want this if you are using tagged pointers, you'd want to be sure literally nothing touches the higher bits.
My understanding was that the purpose of the existing design is to allow applications to ask for a high address without having to resort to the complexity of MAP_FIXED.
In particular, I'm sure there is precedent for applications that want both tagged pointers (for most mappings) and untagged pointers (for large mappings). With a per-mm_struct or per-task_struct setting you can't do that.
Various architectures handle the hint address differently, but it appears that the only case across any architecture where an address above 47 bits will be returned is if the application had a hint address with a value greater than 47 bits and was using the MAP_FIXED flag. MAP_FIXED bypasses all other checks so I was assuming that it would be logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, then the intent is for no hint address to cause a value greater than 47 bits to be returned.
I don't think the MAP_FIXED case is that interesting here because it has to work in both fixed and non-fixed mappings.
This would be more consistent vs. other arches.
Yes riscv is an outlier here. The reason I am pushing for something like a flag to restrict the address space rather than setting it to be the default is it seems like if applications are relying on upper bits to be free, then they should be explicitly asking the kernel to keep them free rather than assuming them to be free.
Let's see what the other architectures do and then come up with a way that fixes the pointer tagging case first on those that are broken. We can see if there needs to be an extra flag after that. Here is what I found:
- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns higher 52-bit addresses when either a hint is passed or CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns 52 bit address when an addr hint is passed
- riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses it for allocating the stack below, ignoring it for normal mappings
- s390 has no DEFAULT_MAP_WINDOW but tried to allocate in the current number of pgtable levels and only upgrades to the next level (31, 42, 53, 64 bits) if a hint is passed or the current level is exhausted.
- loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA space of 47 bits (16K pages, 3 levels), but can support a 55 bit space (64K pages, 3 levels).
- sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space. It may allocate both positive and negative addresses in there. (?)
- mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and at most 48, 41 or 39 address bits, respectively.
I would suggest these changes:
make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64 and ppc64, leave it at 47
add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits based on page size), sparc (48 bits) and s390 (unsure if 42, 53, 47 or 48 bits)
leave the rest unchanged.
Arnd
Changing all architectures to have a standardized DEFAULT_MAP_WINDOW mostly solves the problem. However, I am concerned that it is fragile for applications to rely on a default like this. Having the personality bit flag is supposed to provide an intuitive ABI for users that guarantees that they will not accidentally request memory outside of the boundary that they specified.
Also you bring up that the DEFAULT_MAP_WINDOW would not be able to be standardized across architectures, so we still have the problem that this default behavior will be different across architectures which I am trying to solve.
- Charlie
"Arnd Bergmann" arnd@arndb.de writes:
On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! :)
...
Let's see what the other architectures do and then come up with a way that fixes the pointer tagging case first on those that are broken. We can see if there needs to be an extra flag after that. Here is what I found:
- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns higher 52-bit addresses when either a hint is passed or CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns 52 bit address when an addr hint is passed
It's 46 or 47 depending on PAGE_SIZE (4K or 64K):
$ git grep "define DEFAULT_MAP_WINDOW_USER64" arch/powerpc/include/asm/task_size_64.h
arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_128TB
arch/powerpc/include/asm/task_size_64.h:#define DEFAULT_MAP_WINDOW_USER64 TASK_SIZE_64TB
cheers
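As a quick check of the arithmetic behind "46 or 47": 128 TiB is 2^7 * 2^40 = 2^47 bytes and 64 TiB is 2^46 bytes. The small helper below (a hypothetical name, not kernel code) computes the bit count for a power-of-two size.

```c
#include <stdint.h>

/* Number of address bits needed to span `size` bytes (power-of-two size). */
unsigned va_bits_for_size(uint64_t size)
{
	unsigned bits = 0;

	while (size > 1) {
		size >>= 1;
		bits++;
	}
	return bits;
}
```

Calling va_bits_for_size(128ULL << 40) and va_bits_for_size(64ULL << 40) gives the two window sizes named in the grep output.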
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
On Thu, Sep 5, 2024, at 21:15, Charlie Jenkins wrote:
Create a personality flag ADDR_LIMIT_47BIT to support applications that wish to transition from running in environments that support at most 47-bit VAs to environments that support larger VAs. This personality can be set to cause all allocations to be below the 47-bit boundary. Using MAP_FIXED with mmap() will bypass this restriction.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
I think having an architecture-independent mechanism to limit the size of the 64-bit address space is useful in general, and we've discussed the same thing for arm64 in the past, though we have not actually reached an agreement on the ABI previously.
@@ -22,6 +22,7 @@ enum {
 	WHOLE_SECONDS = 0x2000000,
 	STICKY_TIMEOUTS = 0x4000000,
 	ADDR_LIMIT_3GB = 0x8000000,
+	ADDR_LIMIT_47BIT = 0x10000000,
 };
I'm a bit worried about having this done specifically in the personality flag bits, as they are rather limited. We obviously don't want to add many more such flags when there could be a way to just set the default limit.
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
For some reason, it appears that the arch_get_mmap_end() logic on RISC-V defaults to the maximum address space for the 'addr==0' case, which is inconsistent with the other architectures, so we should probably fix that part first, possibly moving more of that logic into a shared implementation.
Arnd
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64), e.g. something like the patch below (untested, probably slightly wrong but should illustrate my point).
Arnd
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 8702b8721a27..de9863be1efd 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -20,17 +20,8 @@
  * mmap_end < addr, being mmap_end the top of that address space.
  * See Documentation/arch/riscv/vm-layout.rst for more details.
  */
-#define arch_get_mmap_end(addr, len, flags) \
-({ \
-	unsigned long mmap_end; \
-	typeof(addr) _addr = (addr); \
-	if ((_addr) == 0 || is_compat_task() || \
-	    ((_addr + len) > BIT(VA_BITS - 1))) \
-		mmap_end = STACK_TOP_MAX; \
-	else \
-		mmap_end = (_addr + len); \
-	mmap_end; \
-})
+#define arch_get_mmap_end(addr, len, flags) \
+	(((addr) > DEFAULT_MAP_WINDOW) ? TASK_SIZE : DEFAULT_MAP_WINDOW)
 
 #define arch_get_mmap_base(addr, base) \
 ({ \
@@ -47,7 +38,7 @@
 })
 
 #ifdef CONFIG_64BIT
-#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
+#define DEFAULT_MAP_WINDOW (is_compat_task() ? (UL(1) << (MMAP_VA_BITS - 1)) : TASK_SIZE_32)
 #define STACK_TOP_MAX TASK_SIZE_64
 #else
 #define DEFAULT_MAP_WINDOW TASK_SIZE
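To see what the two versions of arch_get_mmap_end() in the patch above do differently, here is a userspace model of both. The constants are assumptions chosen for illustration (an sv57-like TASK_SIZE of 2^56 and a 47-bit DEFAULT_MAP_WINDOW), the compat-task branch is omitted, and the removed macro's STACK_TOP_MAX and BIT(VA_BITS - 1) are treated as the same value; none of this is kernel code.

```c
#include <stdint.h>

/* Assumed values for illustration only. */
#define MODEL_TASK_SIZE          (UINT64_C(1) << 56)
#define MODEL_DEFAULT_MAP_WINDOW (UINT64_C(1) << 47)

/* Model of the macro removed by the patch (64-bit, non-compat case). */
uint64_t mmap_end_removed(uint64_t addr, uint64_t len)
{
	if (addr == 0 || addr + len > MODEL_TASK_SIZE)
		return MODEL_TASK_SIZE;
	return addr + len;
}

/* Model of the one-line replacement proposed in the patch. */
uint64_t mmap_end_proposed(uint64_t addr)
{
	return addr > MODEL_DEFAULT_MAP_WINDOW ? MODEL_TASK_SIZE
					       : MODEL_DEFAULT_MAP_WINDOW;
}
```

With no hint (addr == 0), the removed logic allows the whole address space while the proposed one stops at the default window. A hint above the window opens up the full space, which matches the x86, arm64 and powerpc behaviour described in the thread.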
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
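The opt-in via mmap() mentioned above amounts to passing a hint address above the default window. The sketch below shows the shape of such a request from userspace. The helper names are hypothetical, a 64-bit system is assumed, and whether a high address actually comes back depends on the architecture, the kernel configuration and the hardware.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Ask for an anonymous mapping near `hint`. Without MAP_FIXED the hint is
 * advisory, so the kernel may place the mapping elsewhere.
 */
void *map_near(uint64_t hint, size_t len)
{
	return mmap((void *)(uintptr_t)hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* 1 if the address has any bit set at position `bits` or above (bits < 64). */
int uses_bits_above(const void *p, unsigned bits)
{
	return ((uint64_t)(uintptr_t)p >> bits) != 0;
}
```

A caller that wants the larger space might try map_near(UINT64_C(1) << 48, len) and then check the result with uses_bits_above(p, 48), rather than assuming where the mapping landed.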
* Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 do?
Thanks, Liam
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
* Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 do?
Thanks, Liam
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The other issue I have with this is that if there is not a hint address specified to be greater than 47 bits on x86, then mmap() may return an address that is greater than 47-bits. The documentation in Documentation/arch/x86/x86_64/5level-paging.rst says:
"If hint address set above 47-bit, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 47-bit window."
arm64 on the other hand defines this as only being able to opt-into the 52-bit VA space with the hint address, and my understanding is that mmap() will not fall back to the 52-bit address space. Please correct me if I am wrong. From Documentation/arch/arm64/memory.rst:
"To maintain compatibility with software that relies on the ARMv8.0 VA space maximum size of 48-bits, the kernel will, by default, return virtual addresses to userspace from a 48-bit range.
"Software can "opt-in" to receiving VAs from a 52-bit space by specifying an mmap hint parameter that is larger than 48-bit."
This is an inconsistency I am trying to solve with this personality flag.
- Charlie
On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The other issue I have with this is that if there is not a hint address specified to be greater than 47 bits on x86, then mmap() may return an address that is greater than 47-bits. The documentation in Documentation/arch/x86/x86_64/5level-paging.rst says:
"If hint address set above 47-bit, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 47-bit window."
This is also in the commit message of b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace"), which introduced it. However, I don't actually see the fallback to the full address space, instead the actual behavior seems to be the same as arm64.
Am I missing something in the x86 implementation, or do we just need to update the documentation?
Arnd
On Wed, Sep 11, 2024 at 07:25:08AM +0000, Arnd Bergmann wrote:
On Wed, Sep 11, 2024, at 00:45, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The other issue I have with this is that if there is not a hint address specified to be greater than 47 bits on x86, then mmap() may return an address that is greater than 47-bits. The documentation in Documentation/arch/x86/x86_64/5level-paging.rst says:
"If hint address set above 47-bit, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 47-bit window."
This is also in the commit message of b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace"), which introduced it. However, I don't actually see the fallback to the full address space, instead the actual behavior seems to be the same as arm64.
Am I missing something in the x86 implementation, or do we just need to update the documentation?
Arnd
Yeah I guess it is incorrect documentation then? It seems more reasonable to me to have a hint address fall back onto the larger address space because otherwise the "hint" address can cause allocations to fail even if there is space above the 47-bit limit. This is another reason I wanted to avoid having this default behavior on riscv, to not have this abuse of the hint address.
- Charlie
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
* Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 do?
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even OpenJDK but some JavaScript JIT. We could have taught those programs about a new flag but since we couldn't tell how many are out there, it was safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
* Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
It's also unclear to me how we want this flag to interact with the existing logic in arch_get_mmap_end(), which attempts to limit the default mapping to a 47-bit address space already.
To optimize RISC-V progress, I recommend:
Step 1: Approve the patch.
Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
Step 3: Wait approximately several iterations for Go & OpenJDK
Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 do?
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even OpenJDK but some JavaScript JIT. We could have taught those programs about a new flag but since we couldn't tell how many are out there, it was safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
- Charlie
-- Catalin
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.
A slightly better option is to leave the default 47-bit at the kernel ABI level and have the libc/dynamic loader issue the prctl(). You can control the default with environment variables if needed.
We do something similar in glibc for arm64 MTE. When MTE is enabled, the top byte of an allocated pointer contains the tag that must not be corrupted. We left the decision to the C library via the glibc.mem.tagging tunable (Android has something similar via the app manifest). An app can change the default if it wants but if you run with old glibc or no environment variable to say otherwise, the default would be safe. Distros can set the environment to be the maximum range by default if they know the apps included have been upgraded and tested.
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.
It's not a regression, the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them?
A slightly better option is to leave the default 47-bit at the kernel ABI level and have the libc/dynamic loader issue the prctl(). You can control the default with environment variables if needed.
Having glibc set the 47-bit requirement could make it slightly easier for applications since they would only have to set the environment variable. After the kernel interface is approved I can look into supporting that.
- Charlie
We do something similar in glibc for arm64 MTE. When MTE is enabled, the top byte of an allocated pointer contains the tag that must not be corrupted. We left the decision to the C library via the glibc.mem.tagging tunable (Android has something similar via the app manifest). An app can change the default if it wants but if you run with old glibc or no environment variable to say otherwise, the default would be safe. Distros can set the environment to be the maximum range by default if they know the apps included have been upgraded and tested.
-- Catalin
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.
It's not a regression, the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them?
That's a valid point but it depends on the application and how much you want to spend updating user-space. OpenJDK is fine, if you need a JIT you'll have to add support for that architecture anyway. But others are arch-agnostic, you just recompile to your target. It's not an ABI problem, more of an API one.
The x86 case (and powerpc/arm64) was different, the 47-bit worked for a long time before expanding it. So it made a lot of sense to keep the same default.
Anyway, the prctl() can go both ways, either expanding or limiting the default address space. So I'd be fine with such interface.
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
[...]
Anyway, the prctl() can go both ways, either expanding or limiting the default address space. So I'd be fine with such interface.
Ah, I just realised (while reading Lorenzo's reply) that we can't really restrict the space via a prctl() as we have the main thread stack already allocated by the kernel before the user code starts. You may need to limit this stack as well, not just the later heap allocations (anonymous mmap()).
On Fri, Sep 13, 2024 at 11:08:23AM +0100, Catalin Marinas wrote:
On Thu, Sep 12, 2024 at 02:15:59PM -0700, Charlie Jenkins wrote:
On Thu, Sep 12, 2024 at 11:53:49AM +0100, Catalin Marinas wrote:
On Wed, Sep 11, 2024 at 11:18:12PM -0700, Charlie Jenkins wrote:
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Yes, this would be ideal. Unfortunately those applications don't know they need to set a flag in order to work.
It's not a regression, the applications never worked (on platforms that do not have this default). The 47-bit default would allow applications that didn't work to start working at the cost of a non-ideal ABI. That doesn't seem like a reasonable tradeoff to me. If applications want to run on new hardware that has different requirements, shouldn't they be required to update rather than expect the kernel will solve their problems for them?
That's a valid point but it depends on the application and how much you want to spend updating user-space. OpenJDK is fine, if you need a JIT you'll have to add support for that architecture anyway. But others are arch-agnostic, you just recompile to your target. It's not an ABI problem, more of an API one.
The arch-agnosticism is my hope with this personality flag: it can be added to arch-agnostic userspace code and allow the application to work everywhere, but it does have the downside of requiring that change to user-space code.
The x86 case (and powerpc/arm64) was different, the 47-bit worked for a long time before expanding it. So it made a lot of sense to keep the same default.
Yes it is very reasonable that this solution was selected for those architectures since the support for higher address spaces evolved in the manner that it did!
- Charlie
Anyway, the prctl() can go both ways, either expanding or limiting the default address space. So I'd be fine with such interface.
-- Catalin
On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
- Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
>> It's also unclear to me how we want this flag to interact with
>> the existing logic in arch_get_mmap_end(), which attempts to
>> limit the default mapping to a 47-bit address space already.
>
> To optimize RISC-V progress, I recommend:
>
> Step 1: Approve the patch.
> Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> Step 3: Wait approximately several iterations for Go & OpenJDK
> Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
I really want to first see a plausible explanation about why RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 does?
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).
I am very much not in favour of a prctl(), it would require us to add state limiting the address space and the timing of it becomes critical. Then we have the same issue we do with the other proposals as to - what happens if this is too low?
What is 'too low' varies by architecture, and for 32-bit architectures could get quite... problematic.
And again, what is the RoI here - we are introducing maintenance burden and edge cases vs. the x86 solution in order to... accommodate things that need more than 128 TiB of address space? A problem that does not appear to exist in reality?
I suggested the personality approach as the least impactful compromise way of this series working, but I think after what Arnd has said (and please forgive me if I've missed further discussion, I have been dipping in and out of this!) - adapting RISC-V to the approach we take elsewhere seems the most sensible solution to me.
This remains something we can revisit in future if this turns out to be egregious.
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Perfect is the enemy of the good :) and an idealised solution may not end up being something everybody can agree on.
- Charlie
-- Catalin
On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
- Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
> >> It's also unclear to me how we want this flag to interact with
> >> the existing logic in arch_get_mmap_end(), which attempts to
> >> limit the default mapping to a 47-bit address space already.
> >
> > To optimize RISC-V progress, I recommend:
> >
> > Step 1: Approve the patch.
> > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > Step 3: Wait approximately several iterations for Go & OpenJDK
> > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
> I really want to first see a plausible explanation about why
> RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> like all the other major architectures (x86, arm64, powerpc64),
FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default configuration. We end up with a 47-bit with 16K pages but for a different reason that has to do with LPA2 support (I doubt we need this for the user mapping but we need to untangle some of the macros there; that's for a separate discussion).
That said, we haven't encountered any user space problems with a 48-bit DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar approach (47 or 48 bit default limit). Better to have some ABI consistency between architectures. One can still ask for addresses above this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 does?
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).
I am very much not in favour of a prctl(), it would require us to add state limiting the address space and the timing of it becomes critical. Then we have the same issue we do with the other proposals as to - what happens if this is too low?
What is 'too low' varies by architecture, and for 32-bit architectures could get quite... problematic.
And again, what is the RoI here - we are introducing maintenance burden and edge cases vs. the x86 solution in order to... accommodate things that need more than 128 TiB of address space? A problem that does not appear to exist in reality?
I suggested the personality approach as the least impactful compromise way of this series working, but I think after what Arnd has said (and please forgive me if I've missed further discussion, I have been dipping in and out of this!) - adapting RISC-V to the approach we take elsewhere seems the most sensible solution to me.
This remains something we can revisit in future if this turns out to be egregious.
I appreciate Arnd's comments, but I do not think that making 47-bit the default is the best solution for riscv. On riscv, support for 48-bit address spaces was merged in 5.17 and support for 57-bit address spaces was merged in 5.18 without changing the default addresses provided by mmap(). It could be argued that this was a mistake; however, since hardware with larger address spaces did not exist at the time, it wasn't an issue. The applications that existed at the time that relied on the smaller address spaces have not been able to move to larger address spaces. Making a 47-bit user-space address space the default solves the problem, but that is not arch agnostic, and can't be because of the differences in page table sizes across architectures, which is the other part of the problem I am trying to solve.
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Perfect is the enemy of the good :) and an idealised solution may not end up being something everybody can agree on.
Yes you are totally right! Although this is not my ideal solution, it sufficiently accomplishes the goal so I think it is reasonable to implement this as a personality flag.
- Charlie
-- Catalin
On Fri, 13 Sep 2024 14:04:06 PDT (-0700), Charlie Jenkins wrote:
On Fri, Sep 13, 2024 at 08:41:34AM +0100, Lorenzo Stoakes wrote:
On Wed, Sep 11, 2024 at 11:18:12PM GMT, Charlie Jenkins wrote:
On Wed, Sep 11, 2024 at 07:21:27PM +0100, Catalin Marinas wrote:
On Tue, Sep 10, 2024 at 05:45:07PM -0700, Charlie Jenkins wrote:
On Tue, Sep 10, 2024 at 03:08:14PM -0400, Liam R. Howlett wrote:
- Catalin Marinas catalin.marinas@arm.com [240906 07:44]:
> On Fri, Sep 06, 2024 at 09:55:42AM +0000, Arnd Bergmann wrote:
> > On Fri, Sep 6, 2024, at 09:14, Guo Ren wrote:
> > > On Fri, Sep 6, 2024 at 3:18 PM Arnd Bergmann arnd@arndb.de wrote:
> > >> It's also unclear to me how we want this flag to interact with
> > >> the existing logic in arch_get_mmap_end(), which attempts to
> > >> limit the default mapping to a 47-bit address space already.
> > >
> > > To optimize RISC-V progress, I recommend:
> > >
> > > Step 1: Approve the patch.
> > > Step 2: Update Go and OpenJDK's RISC-V backend to utilize it.
> > > Step 3: Wait approximately several iterations for Go & OpenJDK
> > > Step 4: Remove the 47-bit constraint in arch_get_mmap_end()
Point 4 is an ABI change. What guarantees that there isn't still software out there that relies on the old behaviour?
Yeah I don't think it would be desirable to remove the 47 bit constraint in architectures that already have it.
> > I really want to first see a plausible explanation about why
> > RISC-V can't just implement this using a 47-bit DEFAULT_MAP_WINDOW
> > like all the other major architectures (x86, arm64, powerpc64),
>
> FWIW arm64 actually limits DEFAULT_MAP_WINDOW to 48-bit in the default
> configuration. We end up with a 47-bit with 16K pages but for a
> different reason that has to do with LPA2 support (I doubt we need this
> for the user mapping but we need to untangle some of the macros there;
> that's for a separate discussion).
>
> That said, we haven't encountered any user space problems with a 48-bit
> DEFAULT_MAP_WINDOW. So I also think RISC-V should follow a similar
> approach (47 or 48 bit default limit). Better to have some ABI
> consistency between architectures. One can still ask for addresses above
> this default limit via mmap().
I think that is best as well.
Can we please just do what x86 and arm64 does?
I responded to Arnd in the other thread, but I am still not convinced that the solution that x86 and arm64 have selected is the best solution. The solution of defaulting to 47 bits does allow applications the ability to get addresses that are below 47 bits. However, due to differences across architectures it doesn't seem possible to have all architectures default to the same value. Additionally, this flag will be able to help users avoid potential bugs where a hint address is passed that causes upper bits of a VA to be used.
The reason we added this limit on arm64 is that we noticed programs using the top 8 bits of a 64-bit pointer for additional information. IIRC, it wasn't even openJDK but some JavaScript JIT. We could have taught those programs of a new flag but since we couldn't tell how many are out there, it was the safest to default to a smaller limit and opt in to the higher one. Such opt-in is via mmap() but if you prefer a prctl() flag, that's fine by me as well (though I think this should be opt-in to higher addresses rather than opt-out of the higher addresses).
The mmap() flag was used in previous versions but was decided against because this feature is more useful if it is process-wide. A personality() flag was chosen instead of a prctl() flag because there existed other flags in personality() that were similar. I am tempted to use prctl() however because then we could have an additional arg to select the exact number of bits that should be reserved (rather than being fixed at 47 bits).
I am very much not in favour of a prctl(), it would require us to add state limiting the address space and the timing of it becomes critical. Then we have the same issue we do with the other proposals as to - what happens if this is too low?
What is 'too low' varies by architecture, and for 32-bit architectures could get quite... problematic.
And again, what is the RoI here - we are introducing maintenance burden and edge cases vs. the x86 solution in order to... accommodate things that need more than 128 TiB of address space? A problem that does not appear to exist in reality?
I suggested the personality approach as the least impactful compromise way of this series working, but I think after what Arnd has said (and please forgive me if I've missed further discussion, I have been dipping in and out of this!) - adapting RISC-V to the approach we take elsewhere seems the most sensible solution to me.
There's one wrinkle here: RISC-V started out with 39-bit VAs by default, and we've had at least one report of userspace breaking when moving to 48-bit addresses. That was just address sanitizer, so maybe nobody cares, but we're still pretty early in the transition to 48-bit systems (most of the HW is still 39-bit) so it's not clear if that's going to be the only bug.
So we're sort of in our own world of backwards compatibility here. 39-bit vs 48-bit is just an arbitrary number, but "38 bits are enough for userspace" doesn't seem as sane as "47 bits are enough for userspace". Maybe the right answer here is to just say the 38-bit userspace is broken and that it's a Linux-ism that 64-bit systems have 47-bit user addresses by default.
This remains something we can revisit in future if this turns out to be egregious.
I appreciate Arnd's comments, but I do not think that making 47-bit the default is the best solution for riscv. On riscv, support for 48-bit address spaces was merged in 5.17 and support for 57-bit address spaces was merged in 5.18 without changing the default addresses provided by mmap(). It could be argued that this was a mistake; however, since hardware with larger address spaces did not exist at the time, it wasn't an issue. The applications that existed at the time that relied on the smaller address spaces have not been able to move to larger address spaces. Making a 47-bit user-space address space the default solves the problem, but that is not arch agnostic, and can't be because of the differences in page table sizes across architectures, which is the other part of the problem I am trying to solve.
Opting-in to the higher address space is reasonable. However, it is not my preference, because the purpose of this flag is to ensure that allocations do not exceed 47-bits, so it is a clearer ABI to have the applications that want this guarantee to be the ones setting the flag, rather than the applications that want the higher bits setting the flag.
Perfect is the enemy of the good :) and an idealised solution may not end up being something everybody can agree on.
Yes you are totally right! Although this is not my ideal solution, it sufficiently accomplishes the goal so I think it is reasonable to implement this as a personality flag.
- Charlie
-- Catalin
Add a selftest for the ADDR_LIMIT_47BIT personality flag that mmaps until it runs out of space and ensures no addresses are allocated above 47 bits.
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
---
 tools/testing/selftests/mm/.gitignore              |  1 +
 tools/testing/selftests/mm/Makefile                |  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++++++++++++++++++++++
 3 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index da030b43e43b..918ef05e180d 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -32,6 +32,7 @@ mlock-random-test
 virtual_address_range
 gup_test
 va_128TBswitch
+map_47bit_personality
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cfad627e8d94..2e95fd545409 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -50,6 +50,7 @@ TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += hugepage-vmemmap
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
+TEST_GEN_FILES += map_47bit_personality
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
diff --git a/tools/testing/selftests/mm/map_47bit_personality.c b/tools/testing/selftests/mm/map_47bit_personality.c
new file mode 100644
index 000000000000..453412990c21
--- /dev/null
+++ b/tools/testing/selftests/mm/map_47bit_personality.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test the ADDR_LIMIT_47BIT personality flag.
+ */
+#include <sys/syscall.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include "../kselftest.h"
+
+#define LENGTH (100000000)
+
+#define ADDR_LIMIT_47BIT 0x10000000
+#define BIT47 1UL << 47
+
+/*
+ * Map memory with ADDR_LIMIT_47BIT until no memory left. Ensure that all returned
+ * addresses are below 47 bits.
+ */
+int main(int argc, char **argv)
+{
+	void *addr;
+
+	syscall(__NR_personality, ADDR_LIMIT_47BIT);
+
+	do {
+		addr = mmap(0, LENGTH, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	} while (addr != MAP_FAILED && (unsigned long)addr < BIT47);
+
+	if (errno == ENOMEM)
+		ksft_test_result_pass("ADDR_LIMIT_47BIT works\n");
+	else
+		ksft_test_result_fail("mmap returned address above 47 bits with ADDR_LIMIT_47BIT with addr: %p and err: %s\n",
+				      addr, strerror(errno));
+}
On Fri, Sep 6, 2024 at 5:16 AM Charlie Jenkins charlie@rivosinc.com wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. On the riscv port of OpenJDK an error is thrown if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() addresses to 47 bits when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), provide a flag to the personality syscall that can be used to ensure an application works in any arbitrary VA space. A similar feature has already been implemented by the personality syscall in ADDR_LIMIT_32BIT.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
Acked-by: Guo Ren guoren@kernel.org
Sv57's pain finds its cure in this antidote.
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1] Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
Changes in v3:
- Use a personality flag instead of an mmap flag
- Link to v2: https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9eae0...
Charlie Jenkins (2):
      mm: Add personality flag to limit address to 47 bits
      selftests/mm: Create ADDR_LIMIT_47BIT test
 include/uapi/linux/personality.h                   |  1 +
 mm/mmap.c                                          |  3 ++
 tools/testing/selftests/mm/.gitignore              |  1 +
 tools/testing/selftests/mm/Makefile                |  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++++++++++++++++++++++
 5 files changed, 40 insertions(+)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
--
- Charlie
Hi Charlie,
On Thu, 2024-09-05 at 14:15 -0700, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to 47-bit addresses when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), provide a flag to the personality syscall that can be used to ensure an application works in any arbitrary VA space. A similar feature has already been implemented by the personality syscall in ADDR_LIMIT_32BIT.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
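For illustration only, here is a userspace sketch of how an application might opt in to the proposed limit. ADDR_LIMIT_47BIT is the flag this series proposes; it is not in released uapi headers, so the value defined below is a placeholder assumption, and fits_in_bits() and request_47bit_limit() are hypothetical helper names. On a kernel without the patch the personality() call will likely succeed but leave mmap() unrestricted.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/personality.h>

/* ASSUMPTION: placeholder value for the flag proposed in this series.
 * It is not part of any released kernel header. */
#ifndef ADDR_LIMIT_47BIT
#define ADDR_LIMIT_47BIT 0x10000000
#endif

static_assert(sizeof(uintptr_t) == 8, "sketch assumes 64-bit pointers");

/* Hypothetical helper: returns 1 if addr is representable in `bits` bits. */
static inline int fits_in_bits(const void *addr, unsigned int bits)
{
	assert(bits < 64);	/* shifting by the full width is undefined */
	return ((uintptr_t)addr >> bits) == 0;
}

/* Hypothetical helper: OR the proposed flag into the current persona.
 * Returns 0 on success, -1 if either personality() call fails. */
static inline int request_47bit_limit(void)
{
	/* 0xffffffff queries the current persona without changing it. */
	int cur = personality(0xffffffff);

	if (cur == -1)
		return -1;
	return personality((unsigned long)cur | ADDR_LIMIT_47BIT) == -1 ? -1 : 0;
}
```

A caller would invoke request_47bit_limit() early, before its first large mmap(), and could then use fits_in_bits(ptr, 47) on a returned mapping as a sanity check rather than assuming the limit took effect.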
Link: https://github.com/openjdk/jdk/blob/f080b4bb8a75284db1b6037f8c00ef3b1ef1add1... [1]
Link: https://github.com/golang/go/blob/9e8ea567c838574a0f14538c0bbbd83c3215aa55/s... [2]
Signed-off-by: Charlie Jenkins charlie@rivosinc.com
Changes in v2:
- Added much greater detail to cover letter
- Removed all code that touched architecture specific code and was able to factor this out into all generic functions, except for flags that needed to be added to vm_unmapped_area_info
- Made this an RFC since I have only tested it on riscv and x86
- Link to v1: https://lore.kernel.org/r/20240827-patches-below_hint_mmap-v1-0-46ff2eb9022d...
Changes in v3:
- Use a personality flag instead of an mmap flag
- Link to v2: https://lore.kernel.org/r/20240829-patches-below_hint_mmap-v2-0-638a28d9eae0...
Charlie Jenkins (2):
      mm: Add personality flag to limit address to 47 bits
      selftests/mm: Create ADDR_LIMIT_47BIT test
 include/uapi/linux/personality.h                   |  1 +
 mm/mmap.c                                          |  3 ++
 tools/testing/selftests/mm/.gitignore              |  1 +
 tools/testing/selftests/mm/Makefile                |  1 +
 tools/testing/selftests/mm/map_47bit_personality.c | 34 ++++++++++++++++++++++
 5 files changed, 40 insertions(+)
base-commit: 5be63fc19fcaa4c236b307420483578a56986a37
change-id: 20240827-patches-below_hint_mmap-b13d79ae1c55
Wow, this issue has been plaguing SPARC users for years already, as the architecture uses a 52-bit virtual address space and JavaScript engines such as the ones in Firefox and WebKit have been crashing ever since.
I should definitely give this series a try and see if that fixes Javascript crashes on SPARC.
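To make the failure mode concrete, here is a minimal sketch of the kind of pointer tagging such engines rely on. It is not taken from any engine's source: ADDR_BITS, tag_pointer(), untag_pointer() and pointer_tag() are hypothetical names, and the 47-bit width is an assumption matching the limit discussed in this series. An address that needs more than 47 bits, such as one from a 52-bit address space, cannot be packed and is rejected.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: the application expects user addresses to fit in 47 bits. */
#define ADDR_BITS 47u
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

static_assert(sizeof(uintptr_t) == 8, "sketch assumes 64-bit pointers");

/* Pack a 16-bit tag above the address bits.
 * Returns 1 on success, 0 if the address does not fit in ADDR_BITS. */
static inline int tag_pointer(const void *p, uint16_t tag, uint64_t *out)
{
	uint64_t addr = (uint64_t)(uintptr_t)p;

	if (addr & ~ADDR_MASK)
		return 0;	/* high bits in use: tagging would corrupt it */
	*out = ((uint64_t)tag << ADDR_BITS) | addr;
	return 1;
}

/* Recover the original address by masking off the tag bits. */
static inline void *untag_pointer(uint64_t word)
{
	return (void *)(uintptr_t)(word & ADDR_MASK);
}

/* Recover the tag stored above the address bits. */
static inline uint16_t pointer_tag(uint64_t word)
{
	return (uint16_t)(word >> ADDR_BITS);
}
```

Code that skips the range check in tag_pointer() silently overwrites the high address bits, which is one plausible way a mapping above the expected limit turns into a crash.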
Thanks a lot for addressing this nasty long-standing problem!
Adrian
On September 5, 2024 at 10:15 PM, Charlie Jenkins wrote:
Some applications rely on placing data in the free bits of addresses allocated by mmap. Various architectures (e.g. x86, arm64, powerpc) restrict the address returned by mmap to be less than the 48-bit address space, unless the hint address uses more than 47 bits (the 48th bit is reserved for the kernel address space).
The riscv architecture needs a way to similarly restrict the virtual address space. The riscv port of OpenJDK throws an error if it is run on the 57-bit address space, called sv57 [1]. golang has a comment that sv57 support is not complete, but there are some workarounds to get it to mostly work [2].
These applications work on x86 because x86 implicitly restricts mmap() to 47-bit addresses when the hint address is less than 48 bits.
Instead of implicitly restricting the address space on riscv (or any current/future architecture), provide a flag to the personality syscall that can be used to ensure an application works in any arbitrary VA space. A similar feature has already been implemented by the personality syscall in ADDR_LIMIT_32BIT.
This flag will also allow seamless compatibility between all architectures, so applications like Go and OpenJDK that use bits in a virtual address can request the exact number of bits they need in a generic way. The flag can be checked inside of vm_unmapped_area() so that this flag does not have to be handled individually by each architecture.
Tested-by: Jiaxun Yang jiaxun.yang@flygoat.com
Tested on a MIPS system with a 48-bit VA; this fixed pointer tagging in mozjs!
Thanks!
[...]