Make sv48 the default address space for mmap as some applications currently depend on this assumption. Users can now select a desired address space using a non-zero hint address to mmap. Previously, requesting the default address space from mmap by passing zero as the hint address would result in using the largest address space possible. Some applications depend on empty bits in the virtual address space, like Go and Java, so this patch provides more flexibility for application developers.
-Charlie
---
v10:
- Move pgtable.h definitions into a no __ASSEMBLY__ region to resolve
  compilation conflicts (pointed out by Conor)
- Will now compile with allmodconfig

v9:
- Raise the mmap_end default to STACK_TOP_MAX to allow the address space to
  grow beyond the default of sv48 on sv57 machines as suggested by Alexandre
- Some of the mmap macros had unnecessary conditionals that I have removed

v8:
- Fix RV32 and the RV32 compat mode of RV64 (suggested by Conor)
- Extract out addr and base from the mmap macros (suggested by Alexandre)

v7:
- Changing RLIMIT_STACK inside of an executing program does not trigger
  arch_pick_mmap_layout(), so rewrite tests to change RLIMIT_STACK from a
  script before executing tests. RLIMIT_STACK of infinity forces bottomup
  mmap allocation.
- Make arch_get_mmap_base macro more readable by extracting out the rnd
  calculation.
- Use MMAP_MIN_VA_BITS in TASK_UNMAPPED_BASE to support case when mmap
  attempts to allocate address smaller than DEFAULT_MAP_WINDOW.
- Fix incorrect wording in documentation.

v6:
- Rebase onto the correct base

v5:
- Minor wording change in documentation
- Change some parentheses in arch_get_mmap_ macros
- Added case for addr==0 in arch_get_mmap_ because without this, programs
  would crash if RLIMIT_STACK was modified before executing the program.
  This was tested using the libhugetlbfs tests.

v4:
- Split testcases/document patch into test cases, in-code documentation, and
  formal documentation patches
- Modified the mmap_base macro to be more legible and better represent
  memory layout
- Fixed documentation to better reflect the implementation
- Renamed DEFAULT_VA_BITS to MMAP_VA_BITS
- Added additional test case for rlimit changes
---
Charlie Jenkins (4):
  RISC-V: mm: Restrict address space for sv39,sv48,sv57
  RISC-V: mm: Add tests for RISC-V mm
  RISC-V: mm: Update pgtable comment documentation
  RISC-V: mm: Document mmap changes

 Documentation/riscv/vm-layout.rst             | 22 +++++++
 arch/riscv/include/asm/elf.h                  |  2 +-
 arch/riscv/include/asm/pgtable.h              | 33 ++++++++--
 arch/riscv/include/asm/processor.h            | 52 +++++++++++++--
 tools/testing/selftests/riscv/Makefile        |  2 +-
 tools/testing/selftests/riscv/mm/.gitignore   |  2 +
 tools/testing/selftests/riscv/mm/Makefile     | 15 +++++
 .../riscv/mm/testcases/mmap_bottomup.c        | 35 ++++++++++
 .../riscv/mm/testcases/mmap_default.c         | 35 ++++++++++
 .../selftests/riscv/mm/testcases/mmap_test.h  | 64 +++++++++++++++++++
 .../selftests/riscv/mm/testcases/run_mmap.sh  | 12 ++++
 11 files changed, 261 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
 create mode 100644 tools/testing/selftests/riscv/mm/Makefile
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_default.c
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_test.h
 create mode 100755 tools/testing/selftests/riscv/mm/testcases/run_mmap.sh
Make sv48 the default address space for mmap as some applications currently depend on this assumption. A hint address passed to mmap will cause the largest address space that fits entirely into the hint to be used. If the hint is less than or equal to 1<<38, an sv39 address will be used. An exception is that if the hint address is 0, then a sv48 address will be used. After an address space is completely full, the next smallest address space will be used.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 arch/riscv/include/asm/elf.h       |  2 +-
 arch/riscv/include/asm/pgtable.h   | 25 ++++++++++++--
 arch/riscv/include/asm/processor.h | 52 ++++++++++++++++++++++++++----
 3 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/include/asm/elf.h b/arch/riscv/include/asm/elf.h
index c24280774caf..5d3368d5585c 100644
--- a/arch/riscv/include/asm/elf.h
+++ b/arch/riscv/include/asm/elf.h
@@ -49,7 +49,7 @@ extern bool compat_elf_check_arch(Elf32_Ehdr *hdr);
  * the loader. We need to make sure that it is out of the way of the program
  * that it will "exec", and that there is sufficient room for the brk.
  */
-#define ELF_ET_DYN_BASE ((TASK_SIZE / 3) * 2)
+#define ELF_ET_DYN_BASE ((DEFAULT_MAP_WINDOW / 3) * 2)
 
 #ifdef CONFIG_64BIT
 #ifdef CONFIG_COMPAT
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 75970ee2bda2..bb0b9ac7b581 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -62,11 +62,16 @@
  * struct pages to map half the virtual address space. Then
  * position vmemmap directly below the VMALLOC region.
  */
+#define VA_BITS_SV32 32
 #ifdef CONFIG_64BIT
+#define VA_BITS_SV39 39
+#define VA_BITS_SV48 48
+#define VA_BITS_SV57 57
+
 #define VA_BITS (pgtable_l5_enabled ? \
-	57 : (pgtable_l4_enabled ? 48 : 39))
+	VA_BITS_SV57 : (pgtable_l4_enabled ? VA_BITS_SV48 : VA_BITS_SV39))
 #else
-#define VA_BITS 32
+#define VA_BITS VA_BITS_SV32
 #endif
 
 #define VMEMMAP_SHIFT \
@@ -111,11 +116,27 @@
 #include <asm/page.h>
 #include <asm/tlbflush.h>
 #include <linux/mm_types.h>
+#include <asm/compat.h>
 
 #define __page_val_to_pfn(_val) (((_val) & _PAGE_PFN_MASK) >> _PAGE_PFN_SHIFT)
 
 #ifdef CONFIG_64BIT
 #include <asm/pgtable-64.h>
+
+#define VA_USER_SV39 (UL(1) << (VA_BITS_SV39 - 1))
+#define VA_USER_SV48 (UL(1) << (VA_BITS_SV48 - 1))
+#define VA_USER_SV57 (UL(1) << (VA_BITS_SV57 - 1))
+
+#ifdef CONFIG_COMPAT
+#define MMAP_VA_BITS_64 ((VA_BITS >= VA_BITS_SV48) ? VA_BITS_SV48 : VA_BITS)
+#define MMAP_MIN_VA_BITS_64 (VA_BITS_SV39)
+#define MMAP_VA_BITS (is_compat_task() ? VA_BITS_SV32 : MMAP_VA_BITS_64)
+#define MMAP_MIN_VA_BITS (is_compat_task() ? VA_BITS_SV32 : MMAP_MIN_VA_BITS_64)
+#else
+#define MMAP_VA_BITS ((VA_BITS >= VA_BITS_SV48) ? VA_BITS_SV48 : VA_BITS)
+#define MMAP_MIN_VA_BITS (VA_BITS_SV39)
+#endif /* CONFIG_COMPAT */
+
 #else
 #include <asm/pgtable-32.h>
 #endif /* CONFIG_64BIT */
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index c950a8d9edef..3e23e1786d05 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -13,19 +13,59 @@
 
 #include <asm/ptrace.h>
 
+#ifdef CONFIG_64BIT
+#define DEFAULT_MAP_WINDOW (UL(1) << (MMAP_VA_BITS - 1))
+#define STACK_TOP_MAX TASK_SIZE_64
+
+#define arch_get_mmap_end(addr, len, flags) \
+({ \
+	unsigned long mmap_end; \
+	typeof(addr) _addr = (addr); \
+	if ((_addr) == 0 || (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())) \
+		mmap_end = STACK_TOP_MAX; \
+	else if ((_addr) >= VA_USER_SV57) \
+		mmap_end = STACK_TOP_MAX; \
+	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
+		mmap_end = VA_USER_SV48; \
+	else \
+		mmap_end = VA_USER_SV39; \
+	mmap_end; \
+})
+
+#define arch_get_mmap_base(addr, base) \
+({ \
+	unsigned long mmap_base; \
+	typeof(addr) _addr = (addr); \
+	typeof(base) _base = (base); \
+	unsigned long rnd_gap = DEFAULT_MAP_WINDOW - (_base); \
+	if ((_addr) == 0 || (IS_ENABLED(CONFIG_COMPAT) && is_compat_task())) \
+		mmap_base = (_base); \
+	else if (((_addr) >= VA_USER_SV57) && (VA_BITS >= VA_BITS_SV57)) \
+		mmap_base = VA_USER_SV57 - rnd_gap; \
+	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
+		mmap_base = VA_USER_SV48 - rnd_gap; \
+	else \
+		mmap_base = VA_USER_SV39 - rnd_gap; \
+	mmap_base; \
+})
+
+#else
+#define DEFAULT_MAP_WINDOW TASK_SIZE
+#define STACK_TOP_MAX TASK_SIZE
+#endif
+#define STACK_ALIGN 16
+
+#define STACK_TOP DEFAULT_MAP_WINDOW
+
 /*
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE PAGE_ALIGN(TASK_SIZE / 3)
-
-#define STACK_TOP TASK_SIZE
 #ifdef CONFIG_64BIT
-#define STACK_TOP_MAX TASK_SIZE_64
+#define TASK_UNMAPPED_BASE PAGE_ALIGN((UL(1) << MMAP_MIN_VA_BITS) / 3)
 #else
-#define STACK_TOP_MAX TASK_SIZE
+#define TASK_UNMAPPED_BASE PAGE_ALIGN(TASK_SIZE / 3)
 #endif
-#define STACK_ALIGN 16
 
 #ifndef __ASSEMBLY__
Add tests that enforce mmap hint address behavior. mmap should default to sv48. mmap will provide an address at the highest address space that can fit into the hint address, unless the hint address is less than sv39 and not 0, then it will return a sv39 address.
These tests are split into two files: mmap_default.c and mmap_bottomup.c because a new process must be exec'd in order to change the mmap layout. The run_mmap.sh script sets the stack to be unlimited for the mmap_bottomup.c test which triggers a bottomup layout.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
---
 tools/testing/selftests/riscv/Makefile        |  2 +-
 tools/testing/selftests/riscv/mm/.gitignore   |  2 +
 tools/testing/selftests/riscv/mm/Makefile     | 15 +++++
 .../riscv/mm/testcases/mmap_bottomup.c        | 35 ++++++++++
 .../riscv/mm/testcases/mmap_default.c         | 35 ++++++++++
 .../selftests/riscv/mm/testcases/mmap_test.h  | 64 +++++++++++++++++++
 .../selftests/riscv/mm/testcases/run_mmap.sh  | 12 ++++
 7 files changed, 164 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/riscv/mm/.gitignore
 create mode 100644 tools/testing/selftests/riscv/mm/Makefile
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_default.c
 create mode 100644 tools/testing/selftests/riscv/mm/testcases/mmap_test.h
 create mode 100755 tools/testing/selftests/riscv/mm/testcases/run_mmap.sh

diff --git a/tools/testing/selftests/riscv/Makefile b/tools/testing/selftests/riscv/Makefile
index f4b3d5c9af5b..4a9ff515a3a0 100644
--- a/tools/testing/selftests/riscv/Makefile
+++ b/tools/testing/selftests/riscv/Makefile
@@ -5,7 +5,7 @@
 ARCH ?= $(shell uname -m 2>/dev/null || echo not)
 
 ifneq (,$(filter $(ARCH),riscv))
-RISCV_SUBTARGETS ?= hwprobe vector
+RISCV_SUBTARGETS ?= hwprobe vector mm
 else
 RISCV_SUBTARGETS :=
 endif
diff --git a/tools/testing/selftests/riscv/mm/.gitignore b/tools/testing/selftests/riscv/mm/.gitignore
new file mode 100644
index 000000000000..5c2c57cb950c
--- /dev/null
+++ b/tools/testing/selftests/riscv/mm/.gitignore
@@ -0,0 +1,2 @@
+mmap_bottomup
+mmap_default
diff --git a/tools/testing/selftests/riscv/mm/Makefile b/tools/testing/selftests/riscv/mm/Makefile
new file mode 100644
index 000000000000..11e0f0568923
--- /dev/null
+++ b/tools/testing/selftests/riscv/mm/Makefile
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2021 ARM Limited
+# Originally tools/testing/arm64/abi/Makefile
+
+# Additional include paths needed by kselftest.h and local headers
+CFLAGS += -D_GNU_SOURCE -std=gnu99 -I.
+
+TEST_GEN_FILES := testcases/mmap_default testcases/mmap_bottomup
+
+TEST_PROGS := testcases/run_mmap.sh
+
+include ../../lib.mk
+
+$(OUTPUT)/mm: testcases/mmap_default.c testcases/mmap_bottomup.c testcases/mmap_tests.h
+	$(CC) -o$@ $(CFLAGS) $(LDFLAGS) $^
diff --git a/tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c b/tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
new file mode 100644
index 000000000000..b29379f7e478
--- /dev/null
+++ b/tools/testing/selftests/riscv/mm/testcases/mmap_bottomup.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <sys/mman.h>
+#include <testcases/mmap_test.h>
+
+#include "../../kselftest_harness.h"
+
+TEST(infinite_rlimit)
+{
+// Only works on 64 bit
+#if __riscv_xlen == 64
+	struct addresses mmap_addresses;
+
+	EXPECT_EQ(BOTTOM_UP, memory_layout());
+
+	do_mmaps(&mmap_addresses);
+
+	EXPECT_NE(MAP_FAILED, mmap_addresses.no_hint);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_37_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_38_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_46_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_47_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_55_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_56_addr);
+
+	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.no_hint);
+	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_37_addr);
+	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_38_addr);
+	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_46_addr);
+	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_47_addr);
+	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_55_addr);
+	EXPECT_GT(1UL << 56, (unsigned long)mmap_addresses.on_56_addr);
+#endif
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/riscv/mm/testcases/mmap_default.c b/tools/testing/selftests/riscv/mm/testcases/mmap_default.c
new file mode 100644
index 000000000000..d1accb91b726
--- /dev/null
+++ b/tools/testing/selftests/riscv/mm/testcases/mmap_default.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <sys/mman.h>
+#include <testcases/mmap_test.h>
+
+#include "../../kselftest_harness.h"
+
+TEST(default_rlimit)
+{
+// Only works on 64 bit
+#if __riscv_xlen == 64
+	struct addresses mmap_addresses;
+
+	EXPECT_EQ(TOP_DOWN, memory_layout());
+
+	do_mmaps(&mmap_addresses);
+
+	EXPECT_NE(MAP_FAILED, mmap_addresses.no_hint);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_37_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_38_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_46_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_47_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_55_addr);
+	EXPECT_NE(MAP_FAILED, mmap_addresses.on_56_addr);
+
+	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.no_hint);
+	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_37_addr);
+	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_38_addr);
+	EXPECT_GT(1UL << 38, (unsigned long)mmap_addresses.on_46_addr);
+	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_47_addr);
+	EXPECT_GT(1UL << 47, (unsigned long)mmap_addresses.on_55_addr);
+	EXPECT_GT(1UL << 56, (unsigned long)mmap_addresses.on_56_addr);
+#endif
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/riscv/mm/testcases/mmap_test.h b/tools/testing/selftests/riscv/mm/testcases/mmap_test.h
new file mode 100644
index 000000000000..9b8434f62f57
--- /dev/null
+++ b/tools/testing/selftests/riscv/mm/testcases/mmap_test.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _TESTCASES_MMAP_TEST_H
+#define _TESTCASES_MMAP_TEST_H
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <stddef.h>
+
+#define TOP_DOWN 0
+#define BOTTOM_UP 1
+
+struct addresses {
+	int *no_hint;
+	int *on_37_addr;
+	int *on_38_addr;
+	int *on_46_addr;
+	int *on_47_addr;
+	int *on_55_addr;
+	int *on_56_addr;
+};
+
+static inline void do_mmaps(struct addresses *mmap_addresses)
+{
+	/*
+	 * Place all of the hint addresses on the boundaries of mmap
+	 * sv39, sv48, sv57
+	 * User addresses end at 1<<38, 1<<47, 1<<56 respectively
+	 */
+	void *on_37_bits = (void *)(1UL << 37);
+	void *on_38_bits = (void *)(1UL << 38);
+	void *on_46_bits = (void *)(1UL << 46);
+	void *on_47_bits = (void *)(1UL << 47);
+	void *on_55_bits = (void *)(1UL << 55);
+	void *on_56_bits = (void *)(1UL << 56);
+
+	int prot = PROT_READ | PROT_WRITE;
+	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	mmap_addresses->no_hint =
+		mmap(NULL, 5 * sizeof(int), prot, flags, 0, 0);
+	mmap_addresses->on_37_addr =
+		mmap(on_37_bits, 5 * sizeof(int), prot, flags, 0, 0);
+	mmap_addresses->on_38_addr =
+		mmap(on_38_bits, 5 * sizeof(int), prot, flags, 0, 0);
+	mmap_addresses->on_46_addr =
+		mmap(on_46_bits, 5 * sizeof(int), prot, flags, 0, 0);
+	mmap_addresses->on_47_addr =
+		mmap(on_47_bits, 5 * sizeof(int), prot, flags, 0, 0);
+	mmap_addresses->on_55_addr =
+		mmap(on_55_bits, 5 * sizeof(int), prot, flags, 0, 0);
+	mmap_addresses->on_56_addr =
+		mmap(on_56_bits, 5 * sizeof(int), prot, flags, 0, 0);
+}
+
+static inline int memory_layout(void)
+{
+	int prot = PROT_READ | PROT_WRITE;
+	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
+
+	void *value1 = mmap(NULL, sizeof(int), prot, flags, 0, 0);
+	void *value2 = mmap(NULL, sizeof(int), prot, flags, 0, 0);
+
+	return value2 > value1;
+}
+#endif /* _TESTCASES_MMAP_TEST_H */
diff --git a/tools/testing/selftests/riscv/mm/testcases/run_mmap.sh b/tools/testing/selftests/riscv/mm/testcases/run_mmap.sh
new file mode 100755
index 000000000000..ca5ad7c48bad
--- /dev/null
+++ b/tools/testing/selftests/riscv/mm/testcases/run_mmap.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+original_stack_limit=$(ulimit -s)
+
+./mmap_default
+
+# Force mmap_bottomup to be ran with bottomup memory due to
+# the unlimited stack
+ulimit -s unlimited
+./mmap_bottomup
+ulimit -s $original_stack_limit
sv57 is supported in the kernel so pgtable.h should reflect that.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/pgtable.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index bb0b9ac7b581..2c5f6c8edc8a 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -851,14 +851,16 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
  * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
  * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
  * Task size is:
- * - 0x9fc00000 (~2.5GB) for RV32.
- * - 0x4000000000 ( 256GB) for RV64 using SV39 mmu
- * - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
+ * - 0x9fc00000		(~2.5GB) for RV32.
+ * - 0x4000000000	( 256GB) for RV64 using SV39 mmu
+ * - 0x800000000000	( 128TB) for RV64 using SV48 mmu
+ * - 0x100000000000000	(  64PB) for RV64 using SV57 mmu
  *
  * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
  * Instruction Set Manual Volume II: Privileged Architecture" states that
  * "load and store effective addresses, which are 64bits, must have bits
  * 63–48 all equal to bit 47, or else a page-fault exception will occur."
+ * Similarly for SV57, bits 63–57 must be equal to bit 56.
  */
 #ifdef CONFIG_64BIT
 #define TASK_SIZE_64 (PGDIR_SIZE * PTRS_PER_PGD / 2)
The behavior of mmap is modified with this patch series, so explain the changes to the mmap hint address behavior.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 Documentation/riscv/vm-layout.rst | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/Documentation/riscv/vm-layout.rst b/Documentation/riscv/vm-layout.rst
index 5462c84f4723..69ff6da1dbf8 100644
--- a/Documentation/riscv/vm-layout.rst
+++ b/Documentation/riscv/vm-layout.rst
@@ -133,3 +133,25 @@ RISC-V Linux Kernel SV57
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | modules, BPF
    ffffffff80000000 |   -2    GB | ffffffffffffffff |    2 GB | kernel
   __________________|____________|__________________|_________|____________________________________________________________
+
+
+Userspace VAs
+--------------------
+To maintain compatibility with software that relies on the VA space with a
+maximum of 48 bits the kernel will, by default, return virtual addresses to
+userspace from a 48-bit range (sv48). This default behavior is achieved by
+passing 0 into the hint address parameter of mmap. On CPUs with an address space
+smaller than sv48, the CPU maximum supported address space will be the default.
+
+Software can "opt-in" to receiving VAs from another VA space by providing
+a hint address to mmap. A hint address passed to mmap will cause the largest
+address space that fits entirely into the hint to be used, unless there is no
+space left in the address space. If there is no space available in the requested
+address space, an address in the next smallest available address space will be
+returned.
+
+For example, in order to obtain 48-bit VA space, a hint address greater than
+:code:`1 << 47` must be provided. Note that this is 47 due to sv48 userspace
+ending at :code:`1 << 47` and the addresses beyond this are reserved for the
+kernel. Similarly, to obtain 57-bit VA space addresses, a hint address greater
+than or equal to :code:`1 << 56` must be provided.
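As an illustration of the opt-in described in the documentation patch above, here is a minimal userspace sketch. The helper name mmap_va_hint() is hypothetical and not part of the series; it only assumes the standard mmap() interface and simply passes the end of the requested userspace range (1 << (va_bits - 1)) as a non-fixed hint. What the kernel does with that hint on a given machine is not guaranteed by this code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch series): request an anonymous
 * mapping and tell the kernel that addresses of up to `va_bits` bits are
 * acceptable, by passing 1 << (va_bits - 1) as a non-fixed hint.
 * va_bits == 0 passes no hint, which selects the default range
 * (sv48 on RISC-V with this series). Returns MAP_FAILED on error.
 */
static void *mmap_va_hint(unsigned int va_bits, size_t len)
{
	void *hint = NULL;

	assert(len > 0);
	if (va_bits > 1 && va_bits <= 64)
		hint = (void *)(uintptr_t)(UINT64_C(1) << (va_bits - 1));

	/* No MAP_FIXED: the address is only a hint, never a requirement. */
	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

For example, mmap_va_hint(57, 4096) asks for an address that may come from the sv57 range, while mmap_va_hint(0, 4096) keeps the default.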
Hello:
This series was applied to riscv/linux.git (for-next) by Palmer Dabbelt palmer@rivosinc.com:
On Wed, 9 Aug 2023 16:22:00 -0700 you wrote:
Make sv48 the default address space for mmap as some applications currently depend on this assumption. Users can now select a desired address space using a non-zero hint address to mmap. Previously, requesting the default address space from mmap by passing zero as the hint address would result in using the largest address space possible. Some applications depend on empty bits in the virtual address space, like Go and Java, so this patch provides more flexibility for application developers.
[...]
Here is the summary with links:
  - [v10,1/4] RISC-V: mm: Restrict address space for sv39,sv48,sv57
    https://git.kernel.org/riscv/c/add2cc6b6515
  - [v10,2/4] RISC-V: mm: Add tests for RISC-V mm
    https://git.kernel.org/riscv/c/4d0c04eac0c2
  - [v10,3/4] RISC-V: mm: Update pgtable comment documentation
    https://git.kernel.org/riscv/c/26eee2bfc477
  - [v10,4/4] RISC-V: mm: Document mmap changes
    https://git.kernel.org/riscv/c/7998abe69d3c
You are awesome, thank you!
Hi, Charlie
Although this patchset has been merged, I still have some questions about it, because it breaks regular mmap if the address is >= (1<<38) on sv48 / sv57 capable systems like qemu. For example, if a userspace program wants to mmap an anonymous page at addr=(1<<45) on an sv48 capable system, it will fail and the kernel will map it at another sv39 address since it does not meet the requirement to use sv48 as you wrote:

	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
		mmap_end = VA_USER_SV48; \
	else \
		mmap_end = VA_USER_SV39; \

Then, how can a userspace program create a mmap with a hint if the address >= (1<<38) after your patch without MAP_FIXED? The only way to do this is to pass a hint >= (1<<47) to the mmap syscall; the kernel will then return a random address in the sv48 address space, but the hint address gets lost. I think this violates the principle of the mmap syscall, as the kernel should take the hint and attempt to create the mapping there.

I don't think patching in this way is right. However, if we only revert this patch, programs relying on mmap to return addresses with effective bits <= 48 will still have an issue, and it might extend to other ISAs if they implement a larger virtual address space like RISC-V sv57. A better way to solve this might be adding a MAP_48BIT flag to mmap, like MAP_32BIT, which was introduced decades ago.
Thanks, Yangyu Chen
On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
> Hi, Charlie
>
> Although this patchset has been merged, I still have some questions about it, because it breaks regular mmap if the address is >= (1<<38) on sv48 / sv57 capable systems like qemu. For example, if a userspace program wants to mmap an anonymous page at addr=(1<<45) on an sv48 capable system, it will fail and the kernel will map it at another sv39 address since it does
Thank you for raising this concern. To make sure I am understanding correctly, you are passing a hint address of (1<<45) and expecting mmap to return 1<<45 and if it returns a different address you are describing mmap as failing? If you want an address that is in the sv48 space you can pass in an address that is greater than 1<<47.
> not meet the requirement to use sv48 as you wrote:
>
>	else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
>		mmap_end = VA_USER_SV48; \
>	else \
>		mmap_end = VA_USER_SV39; \
>
> Then, how can a userspace program create a mmap with a hint if the address >= (1<<38) after your patch without MAP_FIXED? The only way to do this is to pass a hint >= (1<<47) to the mmap syscall; the kernel will then return a random address in the sv48 address space, but the hint address gets lost. I think this
In order to force mmap to return the address provided you must use MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The hint address on riscv is used to mean "don't give me an address that uses more bits than this". This behavior is not unique to riscv; arm64 and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h there is the following code:

#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
					base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
					base)
arm64/powerpc are only concerned with a single boundary so the code is simpler.
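For comparison, the riscv selection can be modelled in userspace as a small pure function. This is only a sketch of the 64-bit, non-compat branch of arch_get_mmap_end() from patch 1/4: the name model_mmap_end() and the hw_va_bits parameter (standing in for the kernel's VA_BITS) are made up for illustration, and STACK_TOP_MAX is approximated as 1 << (hw_va_bits - 1).

```c
#include <assert.h>
#include <stdint.h>

/* End of the userspace half of each address space, as in the patch. */
#define VA_USER_SV39 (UINT64_C(1) << 38)
#define VA_USER_SV48 (UINT64_C(1) << 47)
#define VA_USER_SV57 (UINT64_C(1) << 56)

/*
 * Userspace model (assumption: not kernel code) of the upper search
 * limit chosen for a given hint. hw_va_bits plays the role of VA_BITS
 * and must be 39, 48 or 57.
 */
static uint64_t model_mmap_end(uint64_t hint, unsigned int hw_va_bits)
{
	uint64_t stack_top_max;

	assert(hw_va_bits == 39 || hw_va_bits == 48 || hw_va_bits == 57);
	/* Simplification: treat STACK_TOP_MAX as the top of userspace. */
	stack_top_max = UINT64_C(1) << (hw_va_bits - 1);

	if (hint == 0)			/* no hint: limit is STACK_TOP_MAX */
		return stack_top_max;
	if (hint >= VA_USER_SV57)	/* hint covers all of sv57 */
		return stack_top_max;
	if (hint >= VA_USER_SV48 && hw_va_bits >= 48)
		return VA_USER_SV48;	/* hint covers all of sv48 */
	return VA_USER_SV39;		/* everything else: sv39 */
}
```

With this model, a hint of 1 << 45 on an sv48 machine yields VA_USER_SV39, which is the case raised in the report, while a hint of 1 << 47 yields VA_USER_SV48. Note that for a zero hint only the end of the range is raised; in the patch the mmap base is left unchanged, which is what keeps the default inside sv48.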
> violates the principle of the mmap syscall, as the kernel should take the hint and attempt to create the mapping there.
Although the man page for mmap does say "on Linux, the kernel will pick a nearby page boundary" it is still a hint address so there is no strict requirement (and the precedent has already been set by arm64/powerpc).
> I don't think patching in this way is right. However, if we only revert this patch, programs relying on mmap to return addresses with effective bits <= 48 will still have an issue, and it might extend to other ISAs if they implement a larger virtual address space like RISC-V sv57. A better way to solve this might be adding a MAP_48BIT flag to mmap, like MAP_32BIT, which was introduced decades ago.
>
> Thanks, Yangyu Chen
- Charlie
Thanks for your reply.
On 1/20/24 09:34, Charlie Jenkins wrote:
As you say, this code in arm64/powerpc does not have the issue I raised. For example, if the addr here is (1<<50) on arm64, arch_get_mmap_base will return base+TASK_SIZE-DEFAULT_MAP_WINDOW, which is (1<<vabits_actual). And this behavior on arm64/powerpc/x86 does not break anything, since a larger address space is used whenever the hint address is > DEFAULT_MAP_WINDOW. The corresponding behavior on RISC-V would be to use the Sv57 address space if the hint address > BIT(47), and to use Sv48 when the hint address > BIT(38), if we want Sv39 by default.

However, your patch needs the address >= BIT(47) rather than BIT(38) to use Sv48, and the address >= BIT(56) to use Sv57, thus breaking existing userspace software that creates mappings at a hint address without MAP_FIXED set.
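To make the difference between the two threshold schemes concrete, here is a small sketch. Both functions are illustrative models with made-up names: merged_va_bits() follows my reading of the merged macros (the hint must reach the end of a range to select it), and proposed_va_bits() follows my reading of the alternative described above (the hint only needs to exceed the end of the next smaller range, with Sv39 as the default). Neither is kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define SV39_USER_END (UINT64_C(1) << 38)
#define SV48_USER_END (UINT64_C(1) << 47)
#define SV57_USER_END (UINT64_C(1) << 56)

/* Never select more bits than the hardware (39, 48 or 57) provides. */
static unsigned int cap_bits(unsigned int want, unsigned int hw_va_bits)
{
	return want < hw_va_bits ? want : hw_va_bits;
}

/* Model of the merged behaviour: the hint acts as an upper bound. */
static unsigned int merged_va_bits(uint64_t hint, unsigned int hw_va_bits)
{
	assert(hw_va_bits == 39 || hw_va_bits == 48 || hw_va_bits == 57);
	if (hint == 0)
		return cap_bits(48, hw_va_bits);	/* default: sv48 */
	if (hint >= SV57_USER_END)
		return cap_bits(57, hw_va_bits);
	if (hint >= SV48_USER_END)
		return cap_bits(48, hw_va_bits);
	return 39;
}

/* Model of the proposed behaviour (assumed Sv39 default). */
static unsigned int proposed_va_bits(uint64_t hint, unsigned int hw_va_bits)
{
	assert(hw_va_bits == 39 || hw_va_bits == 48 || hw_va_bits == 57);
	if (hint > SV48_USER_END)
		return cap_bits(57, hw_va_bits);
	if (hint > SV39_USER_END)
		return cap_bits(48, hw_va_bits);
	return 39;
}
```

For a hint of 1 << 45 on an sv48 machine, merged_va_bits() gives 39 while proposed_va_bits() gives 48, so only the second scheme leaves the hinted address itself inside the selected range.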
Yeah. There is no strict requirement. But currently x86/arm64/powerpc work well in this situation. The hint address on these ISAs is not used as the upper bound for allocating the address. However, on RISC-V, you treat it as the upper bound.
On Sat, Jan 20, 2024 at 02:13:14PM +0800, Yangyu Chen wrote:
Code that needs mmap to provide a specific address must use MAP_FIXED. On riscv, it was decided that the address returned from mmap cannot be greater than the hint address. This is currently implemented by using the largest address space that can fit into the hint address. It may be possible that this range can be extended to use all of the addresses that are less than or equal to the hint address.
From reading the code, even on arm64, if you pass an address that is greater than DEFAULT_MAP_WINDOW it is not guaranteed that mmap will return an address that is greater than DEFAULT_MAP_WINDOW. It may still provide an address that is less than DEFAULT_MAP_WINDOW if it fails to find an address above. It seems like this would also break your use case.
violate the principle of mmap syscall as kernel should take the hint and attempt to create the mapping there.
Although the man page for mmap does say "on Linux, the kernel will pick a nearby page boundary" it is still a hint address so there is no strict requirement (and the precedent has already been set by arm64/powerpc).
Yeah. There is no strict requirement. But currently x86/arm64/powerpc works in this situation well. The hint address on these ISAs is not used as the upper bound to allocating the address. However, on RISC-V, you treat this as the upper bound.
I don't think patching it this way is right. However, if we only revert this patch, programs that rely on mmap returning addresses with <= 48 effective bits will still be an issue, and the problem might spread to other ISAs if they implement larger virtual address spaces like RISC-V sv57. A better way to solve this might be to add a MAP_48BIT flag to mmap, just as MAP_32BIT has existed for decades.
Thanks, Yangyu Chen
- Charlie
On 1/20/24 14:49, Charlie Jenkins wrote:
On Sat, Jan 20, 2024 at 02:13:14PM +0800, Yangyu Chen wrote:
Thanks for your reply.
On 1/20/24 09:34, Charlie Jenkins wrote:
On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
Hi, Charlie
Although this patchset has been merged, I still have some questions about it, because it breaks regular mmap if address >= 38 bits on sv48 / sv57 capable systems like qemu. For example, if a userspace program wants to mmap an anonymous page at addr=(1<<45) on an sv48 capable system, it will fail and the kernel will map it at another sv39 address since it does
Thank you for raising this concern. To make sure I am understanding correctly, you are passing a hint address of (1<<45) and expecting mmap to return 1<<45 and if it returns a different address you are describing mmap as failing? If you want an address that is in the sv48 space you can pass in an address that is greater than 1<<47.
not meet the requirement to use sv48 as you wrote:
    else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
        mmap_end = VA_USER_SV48; \
    else \
        mmap_end = VA_USER_SV39; \
Then, How can a userspace program create a mmap with a hint if the address
>= (1<<38) after your patch without MAP_FIXED? The only way to do this is
to pass a hint >= (1<<47) on mmap syscall then kernel will return a random address in sv48 address space but the hint address gets lost. I think this
In order to force mmap to return the address provided you must use MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The hint address on riscv is used to mean "don't give me an address that uses more bits than this". This behavior is not unique to riscv, arm64 and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h there is the following code:
#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
                                        base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
                                        base)
arm64/powerpc are only concerned with a single boundary so the code is simpler.
As you say, this code in arm64/powerpc does not hit the issue I am describing. For example, if addr here is (1<<50) on arm64, arch_get_mmap_base will return base+TASK_SIZE-DEFAULT_MAP_WINDOW, which is (1<<vabits_actual). This behavior on arm64/powerpc/x86 does not break anything, since a larger address space is used whenever the hint address is above DEFAULT_MAP_WINDOW. The corresponding behavior on RISC-V would be to use the Sv57 address space if the hint address > BIT(47), and Sv48 if the hint address > BIT(38), if we want Sv39 by default.
However, your patch requires the address to be >= BIT(47) rather than BIT(38) to use Sv48, and >= BIT(56) to use Sv57, thus breaking existing userspace software that creates mappings at the hint address without MAP_FIXED set.
Code that needs mmap to provide a specific address must use MAP_FIXED. On riscv, it was decided that the address returned from mmap cannot be greater than the hint address. This is currently implemented by using the largest address space that can fit into the hint address. It may be possible that this range can be extended to use all of the addresses that are less than or equal to the hint address.
So this decision might be wrong. It requires some userspace software to modify its mmap flags to accommodate it. For example, a binary-translation JIT compiler that has already probed that the platform supports Sv48 may then want to create a mapping at an address given as the mmap hint, so that it matches the foreign binary's native address, while also providing a fallback path with performance overhead. Your patch here will always make userspace software use a fallback path with performance overhead until the userspace software changes its syscall to use MAP_FIXED. But this is not required on x86, arm64, or powerpc.
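To make the pattern under discussion concrete, here is a minimal userspace sketch of "try the hint, otherwise fall back". The helper name map_with_hint and its honored out-parameter are hypothetical, invented for illustration; only the mmap call itself is a real interface. It is not taken from any actual JIT.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: request an anonymous mapping at `hint` without
 * MAP_FIXED. Returns the mapped address, or MAP_FAILED on error.
 * *honored is set to 1 only if the kernel placed the mapping exactly
 * at the hint, and to 0 otherwise.
 */
static void *map_with_hint(void *hint, size_t len, int *honored)
{
	void *p;

	assert(honored != NULL);

	p = mmap(hint, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	*honored = (p != MAP_FAILED && p == hint);
	return p;
}
```

A caller would take its fast path when honored is 1 and its slower fallback path otherwise. Whether a given hint is honored depends on the architecture and the kernel version, which is the point being argued here.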
From reading the code, even on arm64, if you pass an address that is greater than DEFAULT_MAP_WINDOW it is not guaranteed that mmap will return an address that is greater than DEFAULT_MAP_WINDOW. It may still provide an address that is less than DEFAULT_MAP_WINDOW if it fails to find an address above. It seems like this would also break your use case.
Yeah. As I said before, this patch will always make userspace software use a fallback path, and this only happens on RISC-V. Making sv48 the default is right, but the RISC-V implementation of it, which changes the hint address behavior, might be wrong. x86, arm64, and powerpc already use a 48-bit address space by default without changing the meaning of the mmap hint address.
violate the principle of mmap syscall as kernel should take the hint and attempt to create the mapping there.
Although the man page for mmap does say "on Linux, the kernel will pick a nearby page boundary" it is still a hint address so there is no strict requirement (and the precedent has already been set by arm64/powerpc).
Yeah. There is no strict requirement. But currently x86/arm64/powerpc works in this situation well. The hint address on these ISAs is not used as the upper bound to allocating the address. However, on RISC-V, you treat this as the upper bound.
I don't think patching it this way is right. However, if we only revert this patch, programs that rely on mmap returning addresses with <= 48 effective bits will still be an issue, and the problem might spread to other ISAs if they implement larger virtual address spaces like RISC-V sv57. A better way to solve this might be to add a MAP_48BIT flag to mmap, just as MAP_32BIT has existed for decades.
Thanks, Yangyu Chen
- Charlie
On Sat, Jan 20, 2024 at 03:09:51PM +0800, Yangyu Chen wrote:
On 1/20/24 14:49, Charlie Jenkins wrote:
On Sat, Jan 20, 2024 at 02:13:14PM +0800, Yangyu Chen wrote:
Thanks for your reply.
On 1/20/24 09:34, Charlie Jenkins wrote:
On Sun, Jan 14, 2024 at 01:26:57AM +0800, Yangyu Chen wrote:
Hi, Charlie
Although this patchset has been merged, I still have some questions about it, because it breaks regular mmap if address >= 38 bits on sv48 / sv57 capable systems like qemu. For example, if a userspace program wants to mmap an anonymous page at addr=(1<<45) on an sv48 capable system, it will fail and the kernel will map it at another sv39 address since it does
Thank you for raising this concern. To make sure I am understanding correctly, you are passing a hint address of (1<<45) and expecting mmap to return 1<<45 and if it returns a different address you are describing mmap as failing? If you want an address that is in the sv48 space you can pass in an address that is greater than 1<<47.
not meet the requirement to use sv48 as you wrote:
    else if ((((_addr) >= VA_USER_SV48)) && (VA_BITS >= VA_BITS_SV48)) \
        mmap_end = VA_USER_SV48; \
    else \
        mmap_end = VA_USER_SV39; \
Then, How can a userspace program create a mmap with a hint if the address
>= (1<<38) after your patch without MAP_FIXED? The only way to do this is
to pass a hint >= (1<<47) on mmap syscall then kernel will return a random address in sv48 address space but the hint address gets lost. I think this
In order to force mmap to return the address provided you must use MAP_FIXED. Otherwise, the address is a "hint" and has no guarantees. The hint address on riscv is used to mean "don't give me an address that uses more bits than this". This behavior is not unique to riscv, arm64 and powerpc use a similar scheme. In arch/arm64/include/asm/processor.h there is the following code:
#define arch_get_mmap_base(addr, base) ((addr > DEFAULT_MAP_WINDOW) ? \
                                        base + TASK_SIZE - DEFAULT_MAP_WINDOW :\
                                        base)
arm64/powerpc are only concerned with a single boundary so the code is simpler.
As you say, this code in arm64/powerpc does not hit the issue I am describing. For example, if addr here is (1<<50) on arm64, arch_get_mmap_base will return base+TASK_SIZE-DEFAULT_MAP_WINDOW, which is (1<<vabits_actual). This behavior on arm64/powerpc/x86 does not break anything, since a larger address space is used whenever the hint address is above DEFAULT_MAP_WINDOW. The corresponding behavior on RISC-V would be to use the Sv57 address space if the hint address > BIT(47), and Sv48 if the hint address > BIT(38), if we want Sv39 by default.
However, your patch requires the address to be >= BIT(47) rather than BIT(38) to use Sv48, and >= BIT(56) to use Sv57, thus breaking existing userspace software that creates mappings at the hint address without MAP_FIXED set.
Code that needs mmap to provide a specific address must use MAP_FIXED. On riscv, it was decided that the address returned from mmap cannot be greater than the hint address. This is currently implemented by using the largest address space that can fit into the hint address. It may be possible that this range can be extended to use all of the addresses that are less than or equal to the hint address.
So this decision might be wrong. It requires some userspace software to modify its mmap flags to accommodate it. For example, a binary-translation JIT compiler that has already probed that the platform supports Sv48 may then want to create a mapping at an address given as the mmap hint, so that it matches the foreign binary's native address, while also providing a fallback path with performance overhead. Your patch here will always make userspace software use
I do not follow. This mechanism allows a program to always know how many bits will be available in the virtual address provided by mmap, regardless of the size of the underlying virtual address space.
The phrasing "align with foreign binary native address" seems like the program requires a specific address, which is never guaranteed by mmap without MAP_FIXED. If the program is relying on mmap to provide the address without MAP_FIXED, the program is relying on behavior that cannot be expected to remain constant across Linux releases.
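As a rough illustration of the upper-bound rule described in this thread, here is a small userspace model. The function model_mmap_end and the MODEL_* constants are hypothetical names, and the Sv57 branch is an assumption based on the BIT(56) threshold mentioned above. It models the quoted macro fragment for non-zero hints only and is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limits, following the BIT(38)/BIT(47)/BIT(56) values in this thread. */
#define MODEL_VA_USER_SV39 (UINT64_C(1) << 38)
#define MODEL_VA_USER_SV48 (UINT64_C(1) << 47)
#define MODEL_VA_USER_SV57 (UINT64_C(1) << 56)

/*
 * Hypothetical model: for a non-zero hint, return the upper bound that
 * mmap would search below. va_bits is the hardware's address width
 * (39, 48 or 57). The zero-hint default is deliberately not modelled.
 */
static uint64_t model_mmap_end(uint64_t hint, int va_bits)
{
	assert(hint != 0);
	assert(va_bits == 39 || va_bits == 48 || va_bits == 57);

	if (hint >= MODEL_VA_USER_SV57 && va_bits >= 57)
		return MODEL_VA_USER_SV57;
	if (hint >= MODEL_VA_USER_SV48 && va_bits >= 48)
		return MODEL_VA_USER_SV48;
	return MODEL_VA_USER_SV39;
}
```

Under this model, a hint of (1<<45) on sv48 hardware gives a bound of (1<<38), which matches the example raised at the start of the thread. A hint of (1<<47) gives a bound of (1<<47).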
a fallback path with performance overhead until the userspace software changes its syscall to use MAP_FIXED. But this is not required on x86, arm64, or powerpc.
From reading the code, even on arm64, if you pass an address that is greater than DEFAULT_MAP_WINDOW it is not guaranteed that mmap will return an address that is greater than DEFAULT_MAP_WINDOW. It may still provide an address that is less than DEFAULT_MAP_WINDOW if it fails to find an address above. It seems like this would also break your use case.
Yeah. As I said before, this patch will always make userspace software use a fallback path, and this only happens on RISC-V. Making sv48 the default is right, but the RISC-V implementation of it, which changes the hint address behavior, might be wrong. x86, arm64, and powerpc already use a 48-bit address space by default without changing the meaning of the mmap hint address.
violate the principle of mmap syscall as kernel should take the hint and attempt to create the mapping there.
Although the man page for mmap does say "on Linux, the kernel will pick a nearby page boundary" it is still a hint address so there is no strict requirement (and the precedent has already been set by arm64/powerpc).
Yeah. There is no strict requirement. But currently x86/arm64/powerpc works in this situation well. The hint address on these ISAs is not used as the upper bound to allocating the address. However, on RISC-V, you treat this as the upper bound.
I don't think patching it this way is right. However, if we only revert this patch, programs that rely on mmap returning addresses with <= 48 effective bits will still be an issue, and the problem might spread to other ISAs if they implement larger virtual address spaces like RISC-V sv57. A better way to solve this might be to add a MAP_48BIT flag to mmap, just as MAP_32BIT has existed for decades.
Thanks, Yangyu Chen
- Charlie