The previous patch series[1] changed the mmap behavior so that the hint address is treated as the upper bound of the mmap address range. The motivation was that some user space software may assume a 48-bit address space and use the higher bits to encode information, which may collide with the larger virtual addresses mmap may return. However, to make sv48 the default, we don't need to turn the mmap hint address into an upper bound of the address range, especially when this behavior only shows up on RISC-V. This behavior also breaks user space software that assumes mmap will try to create the mapping at the hint address if possible. As the mmap manpage says:
If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there.
Unfortunately, what the mmap manpage says has not been true on RISC-V since kernel v6.6.
Other ISAs with virtual address spaces larger than 48 bits, like x86, arm64, and powerpc, do not have this special mmap behavior for the hint address. They all just use a 48-bit / 47-bit virtual address space by default, and if user space software wants a larger virtual address space, it only needs to specify a hint address larger than 48-bit / 47-bit.
Thus, this patch series keeps the change that makes mmap use sv48 by default but does not treat the hint address as the upper bound of the mmap address range. After this patch, the behavior of mmap will align with the existing behavior on other ISAs with virtual address spaces larger than 48 bits, like x86, arm64, and powerpc. User space software will no longer need to be rewritten to fit this RISC-V-only mmap behavior.
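As an illustration of the opt-in described above, here is a minimal user-space sketch. The helper name `map_anon` and the choice of `1 << 48` as the high hint are assumptions made for this example, not part of the series. On kernels or hardware without a larger address space, the high hint is simply not honoured and the kernel picks another address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

_Static_assert(sizeof(void *) == 8, "this sketch assumes a 64-bit target");

/* Illustrative hint above the 47-bit default window (an assumption). */
#define HIGH_HINT ((void *)((uintptr_t)1 << 48))

/*
 * Map `len` bytes of anonymous memory.
 * want_high == 0: no hint, so the kernel stays in the default window.
 * want_high != 0: pass a hint above the window to opt in to the full
 *                 address space (without MAP_FIXED, so it is only a hint).
 * Returns MAP_FAILED on error, exactly like mmap().
 */
static void *map_anon(size_t len, int want_high)
{
	assert(len > 0);
	return mmap(want_high ? HIGH_HINT : NULL, len,
		    PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

A caller that needs tagged pointers would always use `map_anon(len, 0)`; only an allocator that is aware of wide addresses would pass a non-zero `want_high`.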
My concern is that the change of mmap behavior on the hint address has been in the upstream kernel since v6.6, and it might be hard to revert even though it has already caused regressions in some user space software. Reverting it will also be harder than introducing it in v6.6 was, because mmap not creating the mapping at the hint address is very common, especially on a machine without sv57 / sv48. However, if some user space software has already adopted this special mmap behavior on RISC-V, we should not return an mmap address larger than the hint if the address is larger than BIT(38). My opinion is that reverting this change in the next kernel release might be a good choice: only a few hardware platforms support sv57 / sv48 now, and these changes have no impact on sv39 systems.
Moreover, the previous patch series said it makes sv48 the default, in its cover letter, the kernel documentation, and the MMAP_VA_BITS definition. However, the arch_get_mmap_end and arch_get_mmap_base macros still use sv39 by default, which confuses me. This patch series still uses sv48 by default, including in arch_get_mmap_end and arch_get_mmap_base.
Changes in v2:
- correct arch_get_mmap_end and arch_get_mmap_base
- Add description in documentation about mmap behavior on kernel v6.6-6.7.
- Improve commit message and cover letter
- Rebase to newest riscv/for-next branch
- Link to v1: https://lore.kernel.org/linux-riscv/tencent_F3B3B5AB1C9D704763CA423E1A41F8BE...
[1]. https://lore.kernel.org/linux-riscv/20230809232218.849726-1-charlie@rivosinc...
Yangyu Chen (3):
  RISC-V: mm: do not treat hint addr on mmap as the upper bound to search
  RISC-V: mm: only test mmap without hint
  Documentation: riscv: correct sv57 kernel behavior

 Documentation/arch/riscv/vm-layout.rst        | 54 ++++++++++++-------
 arch/riscv/include/asm/processor.h            | 38 +++----------
 .../selftests/riscv/mm/mmap_bottomup.c        | 12 -----
 .../testing/selftests/riscv/mm/mmap_default.c | 12 -----
 tools/testing/selftests/riscv/mm/mmap_test.h  | 30 -----------
 5 files changed, 41 insertions(+), 105 deletions(-)
This patch reverts the change to the meaning of the addr parameter of the mmap syscall made by commit add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") from patch[1], which treats the hint addr as the upper bound of the mmap return address. Some userspace software assumes that mmap will attempt to create the mapping at the hint address if possible even without MAP_FIXED set. With that commit, such software always takes its fallback path because the return address is not the same as the hint, which may lead to some performance overhead. Other ISAs like x86, arm64, and powerpc, which have more than 48 userspace virtual address bits, face the same issue that userspace software may use the MSBs beyond 48 bits to store information. Still, these ISAs did not change the meaning of the hint address; they only limit the address space to 48 bits when the hint address does not go beyond the default map window.
Thus, this patch makes the behavior of the mmap syscall on RISC-V sv57-capable systems align with x86, arm64, and powerpc by only limiting the address space to DEFAULT_MAP_WINDOW, which is defined as not larger than 47-bit. If a user program wants to use the sv57 address space, it can call mmap with a hint address larger than BIT(47), as is already documented for x86 and arm64. This code is copied from the powerpc kernel source.
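The "fallback path" mentioned above usually looks something like the following in user space. This is only a sketch of the general pattern; `map_near` and its `exact` out-parameter are hypothetical names, not code taken from any particular project.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask for a mapping at `hint` without MAP_FIXED.
 * On success, *exact is 1 if the kernel placed the mapping exactly at
 * the hint and 0 if it chose some other address, in which case the
 * caller has to take its (typically slower) fallback path.
 */
static void *map_near(void *hint, size_t len, int *exact)
{
	void *p;

	assert(exact != NULL);
	p = mmap(hint, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return p;

	*exact = (p == hint);
	return p;
}
```

When the hint is treated as an upper bound, `*exact` can never become 1 for hints above the bound, so callers written this way take the fallback path on every call.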
[1]. https://lore.kernel.org/r/20230809232218.849726-2-charlie@rivosinc.com
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
---
 arch/riscv/include/asm/processor.h | 38 ++++++------------------------
 1 file changed, 7 insertions(+), 31 deletions(-)
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index a8509cc31ab2..bc604669f18e 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -18,37 +18,13 @@
 #define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
 #define STACK_TOP_MAX TASK_SIZE
 
-#define arch_get_mmap_end(addr, len, flags) \
-({ \
-	unsigned long mmap_end; \
-	typeof(addr) _addr = (addr); \
-	if ((_addr) == 0 || (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())) \
-		mmap_end = STACK_TOP_MAX; \
-	else if ((_addr) >= VA_USER_SV57) \
-		mmap_end = STACK_TOP_MAX; \
-	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
-		mmap_end = VA_USER_SV48; \
-	else \
-		mmap_end = VA_USER_SV39; \
-	mmap_end; \
-})
-
-#define arch_get_mmap_base(addr, base) \
-({ \
-	unsigned long mmap_base; \
-	typeof(addr) _addr = (addr); \
-	typeof(base) _base = (base); \
-	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
-	if ((_addr) == 0 || (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())) \
-		mmap_base = (_base); \
-	else if (((_addr) >= VA_USER_SV57) && (VA_BITS >= VA_BITS_SV57)) \
-		mmap_base = VA_USER_SV57 - rnd_gap; \
-	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
-		mmap_base = VA_USER_SV48 - rnd_gap; \
-	else \
-		mmap_base = VA_USER_SV39 - rnd_gap; \
-	mmap_base; \
-})
+#define arch_get_mmap_end(addr, len, flags) \
+	(((addr) > DEFAULT_MAP_WINDOW) || \
+	 (((flags) & MAP_FIXED) && ((addr) + (len) > DEFAULT_MAP_WINDOW)) ? TASK_SIZE : \
+	 DEFAULT_MAP_WINDOW)
+
+#define arch_get_mmap_base(addr, base) \
+	(((addr) > DEFAULT_MAP_WINDOW) ? (base) + TASK_SIZE - DEFAULT_MAP_WINDOW : (base))
 
 #else
 #define DEFAULT_MAP_WINDOW TASK_SIZE
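For readers who want to check what the two new macros evaluate to, here is a user-space model written as plain functions. The `MODEL_*` constants are assumptions chosen for an Sv57 system (a 47-bit default window and a 56-bit TASK_SIZE); in the kernel these values come from MMAP_VA_BITS and TASK_SIZE, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>	/* MAP_FIXED */

/* Assumed values for an Sv57 system; illustrative only. */
#define MODEL_DEFAULT_MAP_WINDOW (UINT64_C(1) << 47)
#define MODEL_TASK_SIZE          (UINT64_C(1) << 56)

/* Model of the new arch_get_mmap_end(): upper limit of the search. */
static uint64_t model_mmap_end(uint64_t addr, uint64_t len, int flags)
{
	/* A hint above the window opts in to the full address space. */
	if (addr > MODEL_DEFAULT_MAP_WINDOW)
		return MODEL_TASK_SIZE;
	/* So does a MAP_FIXED request that crosses the window. */
	if ((flags & MAP_FIXED) && addr + len > MODEL_DEFAULT_MAP_WINDOW)
		return MODEL_TASK_SIZE;
	return MODEL_DEFAULT_MAP_WINDOW;
}

/* Model of the new arch_get_mmap_base(): shift the base for high hints. */
static uint64_t model_mmap_base(uint64_t addr, uint64_t base)
{
	/* The unshifted base is expected to lie inside the default window. */
	assert(base <= MODEL_DEFAULT_MAP_WINDOW);
	if (addr > MODEL_DEFAULT_MAP_WINDOW)
		return base + MODEL_TASK_SIZE - MODEL_DEFAULT_MAP_WINDOW;
	return base;
}
```

In this model, a hint of `1 << 40` keeps the search limit at the 47-bit window, while a hint of `1 << 48` raises it to the modelled TASK_SIZE.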
The original test from the previous patchset[1] assumes the hint address on mmap is treated as the upper bound of the return address. As this series reverts that special behavior, the test is updated to reflect the change.
[1]. https://lore.kernel.org/linux-riscv/20230809232218.849726-1-charlie@rivosinc...
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
---
 .../selftests/riscv/mm/mmap_bottomup.c        | 12 --------
 .../testing/selftests/riscv/mm/mmap_default.c | 12 --------
 tools/testing/selftests/riscv/mm/mmap_test.h  | 30 -------------------
 3 files changed, 54 deletions(-)
diff --git a/tools/testing/selftests/riscv/mm/mmap_bottomup.c b/tools/testing/selftests/riscv/mm/mmap_bottomup.c
index 1757d19ca89b..1ba703d3f552 100644
--- a/tools/testing/selftests/riscv/mm/mmap_bottomup.c
+++ b/tools/testing/selftests/riscv/mm/mmap_bottomup.c
@@ -15,20 +15,8 @@ TEST(infinite_rlimit)
 	do_mmaps(&mmap_addresses);
 
 	EXPECT_NE(MAP_FAILED, mmap_addresses.no_hint);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_37_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_38_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_46_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_47_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_55_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_56_addr);
 
 	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.no_hint);
-	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_37_addr);
-	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_38_addr);
-	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_46_addr);
-	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_47_addr);
-	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_55_addr);
-	EXPECT_GT(1UL << 56, (unsigned long)mmap_addresses.on_56_addr);
 #endif
 }
diff --git a/tools/testing/selftests/riscv/mm/mmap_default.c b/tools/testing/selftests/riscv/mm/mmap_default.c
index c63c60b9397e..f1ac860dcf04 100644
--- a/tools/testing/selftests/riscv/mm/mmap_default.c
+++ b/tools/testing/selftests/riscv/mm/mmap_default.c
@@ -15,20 +15,8 @@ TEST(default_rlimit)
 	do_mmaps(&mmap_addresses);
 
 	EXPECT_NE(MAP_FAILED, mmap_addresses.no_hint);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_37_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_38_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_46_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_47_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_55_addr);
-	EXPECT_NE(MAP_FAILED, mmap_addresses.on_56_addr);
 
 	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.no_hint);
-	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_37_addr);
-	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_38_addr);
-	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_46_addr);
-	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_47_addr);
-	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_55_addr);
-	EXPECT_GT(1UL << 56, (unsigned long)mmap_addresses.on_56_addr);
 #endif
 }
diff --git a/tools/testing/selftests/riscv/mm/mmap_test.h b/tools/testing/selftests/riscv/mm/mmap_test.h
index 2e0db9c5be6c..d2271426288f 100644
--- a/tools/testing/selftests/riscv/mm/mmap_test.h
+++ b/tools/testing/selftests/riscv/mm/mmap_test.h
@@ -10,47 +10,17 @@
 
 struct addresses {
 	int *no_hint;
-	int *on_37_addr;
-	int *on_38_addr;
-	int *on_46_addr;
-	int *on_47_addr;
-	int *on_55_addr;
-	int *on_56_addr;
 };
 
 // Only works on 64 bit
 #if __riscv_xlen == 64
 static inline void do_mmaps(struct addresses *mmap_addresses)
 {
-	/*
-	 * Place all of the hint addresses on the boundaries of mmap
-	 * sv39, sv48, sv57
-	 * User addresses end at 1<<38, 1<<47, 1<<56 respectively
-	 */
-	void *on_37_bits = (void *)(1UL << 37);
-	void *on_38_bits = (void *)(1UL << 38);
-	void *on_46_bits = (void *)(1UL << 46);
-	void *on_47_bits = (void *)(1UL << 47);
-	void *on_55_bits = (void *)(1UL << 55);
-	void *on_56_bits = (void *)(1UL << 56);
-
 	int prot = PROT_READ | PROT_WRITE;
 	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
 
 	mmap_addresses->no_hint =
 		mmap(NULL, 5 * sizeof(int), prot, flags, 0, 0);
-	mmap_addresses->on_37_addr =
-		mmap(on_37_bits, 5 * sizeof(int), prot, flags, 0, 0);
-	mmap_addresses->on_38_addr =
-		mmap(on_38_bits, 5 * sizeof(int), prot, flags, 0, 0);
-	mmap_addresses->on_46_addr =
-		mmap(on_46_bits, 5 * sizeof(int), prot, flags, 0, 0);
-	mmap_addresses->on_47_addr =
-		mmap(on_47_bits, 5 * sizeof(int), prot, flags, 0, 0);
-	mmap_addresses->on_55_addr =
-		mmap(on_55_bits, 5 * sizeof(int), prot, flags, 0, 0);
-	mmap_addresses->on_56_addr =
-		mmap(on_56_bits, 5 * sizeof(int), prot, flags, 0, 0);
 }
 #endif /* __riscv_xlen == 64 */
The original documentation from the previous patchset[1] treated the hint address on mmap as the upper bound. Since we have already removed this behavior, this document should be updated. Most of the content is copied from the corresponding feature in x86_64 with some modifications to align with the current kernel's behavior on RISC-V.
[1]. https://lore.kernel.org/linux-riscv/20230809232218.849726-1-charlie@rivosinc...
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
---
 Documentation/arch/riscv/vm-layout.rst | 54 ++++++++++++++++----------
 1 file changed, 34 insertions(+), 20 deletions(-)
diff --git a/Documentation/arch/riscv/vm-layout.rst b/Documentation/arch/riscv/vm-layout.rst
index 69ff6da1dbf8..9d84362b9f91 100644
--- a/Documentation/arch/riscv/vm-layout.rst
+++ b/Documentation/arch/riscv/vm-layout.rst
@@ -135,23 +135,37 @@ RISC-V Linux Kernel SV57
    __________________|____________|__________________|_________|____________________________________________________________
 
 
-Userspace VAs
---------------------
-To maintain compatibility with software that relies on the VA space with a
-maximum of 48 bits the kernel will, by default, return virtual addresses to
-userspace from a 48-bit range (sv48). This default behavior is achieved by
-passing 0 into the hint address parameter of mmap. On CPUs with an address space
-smaller than sv48, the CPU maximum supported address space will be the default.
-
-Software can "opt-in" to receiving VAs from another VA space by providing
-a hint address to mmap. A hint address passed to mmap will cause the largest
-address space that fits entirely into the hint to be used, unless there is no
-space left in the address space. If there is no space available in the requested
-address space, an address in the next smallest available address space will be
-returned.
-
-For example, in order to obtain 48-bit VA space, a hint address greater than
-:code:`1 << 47` must be provided. Note that this is 47 due to sv48 userspace
-ending at :code:`1 << 47` and the addresses beyond this are reserved for the
-kernel. Similarly, to obtain 57-bit VA space addresses, a hint address greater
-than or equal to :code:`1 << 56` must be provided.
+User-space and large virtual address space
+==========================================
+On RISC-V, Sv57 paging enables 56-bit userspace virtual address space.
+Not all user space is ready to handle wide addresses. It's known that
+at least some JIT compilers use higher bits in pointers to encode their
+information. It collides with valid pointers with Sv57 paging and leads
+to crashes.
+
+To mitigate this, we are not going to allocate virtual address space
+above 47-bit by default. And on kernel v6.6-v6.7, that is 38-bit by
+default.
+
+But userspace can ask for allocation from full address space by
+specifying hint address (with or without MAP_FIXED) above 47-bits, or
+hint address + size above 47-bits with MAP_FIXED.
+
+If hint address set above 47-bit, but MAP_FIXED is not specified, we try
+to look for unmapped area by specified address. If it's already
+occupied, we look for unmapped area in *full* address space, rather than
+from 47-bit window.
+
+A high hint address would only affect the allocation in question, but not
+any future mmap()s.
+
+Specifying high hint address without MAP_FIXED on older kernel or on
+machine without Sv57 paging support is safe. On kernel v6.6-v6.7, the
+hint will be treated as the upper bound of the address space to search,
+but this was removed in the future version of kernels. On kernel older
+than v6.6 or on machine without Sv57 paging support, the kernel will
+fall back to allocation from the supported address space.
+
+This approach helps to easily make application's memory allocator aware
+about large address space without manually tracking allocated virtual
+address space.
This patch has not been reviewed for more than a month. There is another patch that makes the same fix in a different way, and it has not been reviewed either. I'm here to briefly compare the choices so that the maintainers can understand the issues and the solutions. I think it's time to make a decision before the next Linux LTS v6.9, as a number of sv48 chips will be released this year.
Issues:
Since commit add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") from patch [1], userspace software cannot create a virtual memory mapping at the hint address using mmap without MAP_FIXED set if the address is larger than (1<<38) on sv48- or sv57-capable CPUs.
This is because, since that commit, the hint address is treated as the upper bound for creating the mapping when the hint address is larger than (1<<38).
Existing regression for userspace software since that commit:
- box64 [2]
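To make the issue concrete, here is a small user-space model of the arch_get_mmap_end logic that commit introduced, following the macro removed by my patch (the CONFIG_COMPAT branch is left out). The function name `old_mmap_end` and its `va_bits` parameter are illustrative, and STACK_TOP_MAX is assumed to be `1 << (va_bits - 1)`.

```c
#include <assert.h>
#include <stdint.h>

/* User address space ends at 1<<38, 1<<47 and 1<<56 respectively. */
#define VA_USER_SV39 (UINT64_C(1) << 38)
#define VA_USER_SV48 (UINT64_C(1) << 47)
#define VA_USER_SV57 (UINT64_C(1) << 56)

/*
 * Model of the v6.6 arch_get_mmap_end(): the upper limit of the search
 * for a given hint `addr` on a CPU running with `va_bits` (39, 48 or 57).
 */
static uint64_t old_mmap_end(uint64_t addr, unsigned int va_bits)
{
	/* Assumption: STACK_TOP_MAX equals the top of user space. */
	uint64_t stack_top_max = UINT64_C(1) << (va_bits - 1);

	assert(va_bits == 39 || va_bits == 48 || va_bits == 57);

	if (addr == 0)
		return stack_top_max;
	if (addr >= VA_USER_SV57)
		return stack_top_max;
	if (addr >= VA_USER_SV48 && va_bits >= 48)
		return VA_USER_SV48;
	return VA_USER_SV39;
}
```

With a hint of `1 << 40` on an sv48 CPU, the model returns `1 << 38`, which is below the hint itself, so the mapping can never be placed at the hint.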
Some choices are:
1. Do not change it
Con:
This behavior is not the same as on x86, arm64, and powerpc when handling address spaces larger than 48-bit. On x86, arm64, and powerpc, if the hint address is larger than 48-bit, mmap does not limit the upper bound of the search.
Also, these ISAs limit mmap to 48-bit by default. However, RISC-V currently uses sv39 by default, which does not match the documentation and the commit message.
2. Use my patch
which limits the upper bound of mmap to 47-bit by default; if the hint address is larger than (1<<47), there is no limit.
Pros: Let the behavior of mmap align with x86, arm64, powerpc
Cons: A new regression for software that assumes mmap will not return an address larger than the hint address if the hint address is larger than (1<<38) as it has been documented on RISC-V since v6.6. However, there is no change in the widespread sv39 systems we use now.
3. Use Charlie's patch [3]
which adjusts the upper bound to hint address + size.
Pros: Still has upper-bound limit using hint address but allows userspace to create mapping on the hint address without MAP_FIXED set.
Cons: That patch will introduce a new regression even on sv39 systems when calling mmap with the same hint address more than once if the hint address is less than round-gap.
4. Some new idea that is not on the mailing list yet
Hope this issue can be fixed before the Linux v6.9 release.
Thanks, Yangyu Chen
[1] https://lore.kernel.org/linux-riscv/20230809232218.849726-2-charlie@rivosinc...
[2] https://github.com/ptitSeb/box64/commit/5b700cb6e6f397d2074c49659f7f9915f4a3...
[3] https://lore.kernel.org/linux-riscv/20240130-use_mmap_hint_address-v3-0-8a65...
On Thu, 29 Feb 2024 04:10:03 PST (-0800), cyy@cyyself.name wrote:
This patch has not been reviewed for more than a month. There is another patch that did the same fix but in another way and still has not been reviewed like this. I'm here to do a comparison of some choices briefly to let the maintainer understand the issues and the solutions. I think it's time to make a decision before the next Linux LTS v6.9. As a number of sv48 chips will be released this year.
Issues:
Since commit add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") from patch [1], userspace software cannot create virtual address memory mapping on the hint address if the address larger than (1<<38) on sv48, sv57 capable CPU using mmap without MAP_FIXED set.
This is because since that commit, the hint address is treated as the upper bound to create the mapping when the hint address is larger than (1<<38).
Existing regression for userspace software since that commit:
- box64 [2]
Is this the same regression as before? IIUC the real issue there is that userspace wasn't passing MAP_FIXED and expecting a fixed address to be mapped. That's just a bug in userspace.
Is there any software that uses mmap() in a legal way that the flags patch caused a regression in? If that's the case then we'll need to figure out what it's doing so we can avoid the regression.
The only thing I can think of are realloc-type schemes, where rounding the hint address down would result in performance problems. I don't know of anything like that specifically, but I think Charlie's patch would fix it.
Some choices are:
- Do not change it
Con:
This behavior is not the same as x86, arm64, and powerpc when treating memory address space larger than 48-bit. On x86, arm64, and powerpc, if the hint address is larger than 48-bit, mmap will not limit the upper bound to use.
Also, these ISAs limit the mmap to 48-bit by default. However, RISC-V currently uses sv39 by default, which is not the same as the document and commit message.
IIUC arm64/amd64 started with 48-bit-capable hardware and kernels, and thus the only regression was when moving to the larger VA spaces. We started with sv39-based VA space,
- Use my patch
which limits the upper bound of mmap to 47-bit by default, if the hint address is larger than (1<<47), then no limit.
Pros: Let the behavior of mmap align with x86, arm64, powerpc
Cons: A new regression for software that assumes mmap will not return an address larger than the hint address if the hint address is larger than (1<<38) as it has been documented on RISC-V since v6.6. However, there is no change in the widespread sv39 systems we use now.
The OpenJDK and Go people have at least talked about using the interface as it is currently defined. I'm trying to chase down some of the folks around here who understand that stuff, but it might take a bit...
- Use Charlie's patch [3]
which adjusts the upper bound to hint address + size.
IMO we can call that compatible with the docs. There's sort of a grey area in "A hint address passed to mmap will cause the largest address space that fits entirely into the hint to be used" as to how that hint address is used, but I think interpreting it as the base address is sane and we can just update the docs.
This also should fix the realloc-type cases I can think of, though those are sort of theoretical right now.
Pros: Still has upper-bound limit using hint address but allows userspace to create mapping on the hint address without MAP_FIXED set.
Cons: That patch will introduce a new regression even for the sv39 system when creating mmap with the same hint address more than one time if the hint address is less than round-gap.
I'm not quite sure what you're trying to say there. If users are passing a hint that's already allocated then they're not going to get that address allocated, so as long as we give them something else we're OK.
We might want to take more advantage of the clause in the docs that allows larger addresses to be allocated under memory pressure to avoid too many allocation failures, but that applies to any of these schemes.
- Some new ideas currently are not on the mailing list
Hope this issue can be fixed before the Linux v6.9 release.
Thanks, Yangyu Chen
[1] https://lore.kernel.org/linux-riscv/20230809232218.849726-2-charlie@rivosinc... [2] https://github.com/ptitSeb/box64/commit/5b700cb6e6f397d2074c49659f7f9915f4a3... [3] https://lore.kernel.org/linux-riscv/20240130-use_mmap_hint_address-v3-0-8a65...
On 2024/3/1 03:21, Palmer Dabbelt wrote:
On Thu, 29 Feb 2024 04:10:03 PST (-0800), cyy@cyyself.name wrote:
This patch has not been reviewed for more than a month. There is another patch that did the same fix but in another way and still has not been reviewed like this. I'm here to do a comparison of some choices briefly to let the maintainer understand the issues and the solutions. I think it's time to make a decision before the next Linux LTS v6.9. As a number of sv48 chips will be released this year.
Issues:
Since commit add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") from patch [1], userspace software cannot create virtual address memory mapping on the hint address if the address larger than (1<<38) on sv48, sv57 capable CPU using mmap without MAP_FIXED set.
This is because since that commit, the hint address is treated as the upper bound to create the mapping when the hint address is larger than (1<<38).
Existing regression for userspace software since that commit:
- box64 [2]
Is this the same regression as before? IIUC the real issue there is that userspace wasn't passing MAP_FIXED and expecting a fixed address to be mapped. That's just a bug in userspace.
Is there any software that uses mmap() in a legal way that the flags patch caused a regression in? If that's the case then we'll need to figure out what it's doing so we can avoid the regression.
The only thing I can think of are realloc-type schemes, where rounding the hint address down would result in performance problems. I don't know of anything like that specifically, but I think Charlie's patch would fix it.
Yes. For legal uses of mmap, the regression only affects the performance of userspace software, not its functionality.
Some choices are:
- Do not change it
Con:
This behavior is not the same as x86, arm64, and powerpc when treating memory address space larger than 48-bit. On x86, arm64, and powerpc, if the hint address is larger than 48-bit, mmap will not limit the upper bound to use.
Also, these ISAs limit the mmap to 48-bit by default. However, RISC-V currently uses sv39 by default, which is not the same as the document and commit message.
IIUC arm64/amd64 started with 48-bit-capable hardware and kernels, and thus the only regression was when moving to the larger VA spaces. We started with sv39-based VA space,
My point is that the documentation and the commit message say sv48 is used by default, while the code in the kernel uses sv39 by default. The reasons for using sv48 by default were discussed previously in the review of that patch. [4]
Either way, the documentation or the code can simply be fixed if we decide not to change the behavior.
Another concern: if we can't make this decision in time for v6.9, we don't want bad things to happen, as a large number of sv48 machines might appear this year and may run the next LTS kernel, v6.9. Shall we change the code in the kernel to use sv48 by default right now?
[4] https://lore.kernel.org/linux-riscv/ZJzgi8RyqG3Mjt0R@ghost/
- Use my patch
which limits the upper bound of mmap to 47-bit by default, if the hint address is larger than (1<<47), then no limit.
Pros: Let the behavior of mmap align with x86, arm64, powerpc
Cons: A new regression for software that assumes mmap will not return an address larger than the hint address if the hint address is larger than (1<<38) as it has been documented on RISC-V since v6.6. However, there is no change in the widespread sv39 systems we use now.
The OpenJDK and Go people have at least talked about using the interface as it is currently defined. I'm trying to chase down some of the folks around here who understand that stuff, but it might take a bit...
Roger that.
- Use Charlie's patch [3]
which adjusts the upper bound to hint address + size.
IMO we can call that compatible with the docs. There's sort of a grey area in "A hint address passed to mmap will cause the largest address space that fits entirely into the hint to be used" as to how that hint address is used, but I think interpreting it as the base address is sane and we can just update the docs.
This also should fix the realloc-type cases I can think of, though those are sort of theoretical right now.
Pros: Still has upper-bound limit using hint address but allows userspace to create mapping on the hint address without MAP_FIXED set.
Cons: That patch will introduce a new regression even for the sv39 system when creating mmap with the same hint address more than one time if the hint address is less than round-gap.
I'm not quite sure what you're trying to say there. If users are passing a hint that's already allocated then they're not going to get that address allocated, so as long as we give them something else we're OK.
In this case, mmap returns MAP_FAILED the second time, whereas on arm64 and x86 it picks another address in the 48-bit space. After reviewing the code, I think it is not easy to make Charlie's patch search for another area to create the mapping without changes outside of arch/riscv.
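A minimal reproducer for this case could look like the sketch below. The helper name `map_twice_at` is hypothetical and the hint value is up to the caller. On x86 and arm64 both calls are expected to succeed, with the second mapping placed somewhere other than the hint.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Call mmap() twice with the same hint and without MAP_FIXED.
 * The results are stored in out[0] and out[1]; the return value is the
 * number of calls that succeeded (2 when the kernel falls back to
 * another address for the second, already occupied, hint).
 */
static int map_twice_at(void *hint, size_t len, void *out[2])
{
	int ok = 0;

	assert(len > 0);
	for (int i = 0; i < 2; i++) {
		out[i] = mmap(hint, len, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (out[i] != MAP_FAILED)
			ok++;
	}
	return ok;
}
```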
We might want to take more advantage of the clause in the docs that allows larger addresses to be allocated under memory pressure to avoid too many allocation failures, but that applies to any of these schemes.
Indeed. I have thought about this for a while, especially since the OpenJDK and Go people have at least talked about using the interface. If it is not used yet, my idea is to port Charlie's patch to linux-mm, not only for RISC-V, and to pick a flag like MAP_UPPERBOUND to enable it. We would then change the mmap behavior on RISC-V to align with x86, arm64, and powerpc. That way all ISAs can take advantage of Charlie's idea and all ISAs treat mmap in the same way, which makes userspace developers happy as they don't need to care about ISA-specific behavior.
- Some new ideas currently are not on the mailing list
Hope this issue can be fixed before the Linux v6.9 release.
Thanks, Yangyu Chen
[1] https://lore.kernel.org/linux-riscv/20230809232218.849726-2-charlie@rivosinc... [2] https://github.com/ptitSeb/box64/commit/5b700cb6e6f397d2074c49659f7f9915f4a3... [3] https://lore.kernel.org/linux-riscv/20240130-use_mmap_hint_address-v3-0-8a65...
On Fri, Mar 01, 2024 at 04:54:05AM +0800, Yangyu Chen wrote:
Another concern is that if we can't make this decision in time to catch up with v6.9 we don't want some bad things to happen as a large number of sv48 machines might appear this year and they may run on the next v6.9 LTS kernel, Shall we change the code in the kernel to use sv48 by default right now?
Just pointing out that v6.9 is highly unlikely to be the next LTS kernel. Depending on whether or not Linus delays some releases, it'll most likely be either v6.11 or v6.12.