On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
>> The intent is to optionally be able to run a process that keeps higher bits free for tagging and to be sure no memory mapping in the process will clobber these (correct me if I'm wrong Charlie! :)
>> So you really wouldn't want this if you are using tagged pointers, you'd want to be sure literally nothing touches the higher bits.
My understanding was that the purpose of the existing design is to allow applications to ask for a high address without having to resort to the complexity of MAP_FIXED.
In particular, I'm sure there is precedent for applications that want both tagged pointers (for most mappings) and untagged pointers (for large mappings). With a per-mm_struct or per-task_struct setting you can't do that.
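
To make the tagged vs. untagged distinction concrete, here is a minimal userspace sketch. It assumes a 47-bit address window and a 16-bit tag stored directly above it; ADDR_BITS and all helper names are made up for illustration and do not come from any kernel or library API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: usable addresses fit in 47 bits, the tag sits above them. */
#define ADDR_BITS 47
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* True if addr leaves the upper bits free, i.e. it can carry a tag. */
static inline bool fits_in_window(uint64_t addr)
{
    return (addr & ~ADDR_MASK) == 0;
}

/* Store a 16-bit tag in bits 47..62. Only valid for in-window addresses. */
static inline uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
    assert(fits_in_window(addr));
    return addr | ((uint64_t)tag << ADDR_BITS);
}

/* Strip the tag again to recover the plain address. */
static inline uint64_t untag_pointer(uint64_t tagged)
{
    return tagged & ADDR_MASK;
}

/* Read back the tag that tag_pointer() stored. */
static inline uint16_t pointer_tag(uint64_t tagged)
{
    return (uint16_t)(tagged >> ADDR_BITS);
}
```

An application mixing both kinds of pointers would check fits_in_window() per mapping: ordinary mappings get tagged, while a large mapping that was deliberately placed above the window stays untagged. That per-mapping decision is what a per-process setting cannot express.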
> Various architectures handle the hint address differently, but it appears that the only case across any architecture where an address above 47 bits will be returned is if the application had a hint address with a value greater than 47 bits and was using the MAP_FIXED flag. MAP_FIXED bypasses all other checks so I was assuming that it would be logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set, then the intent is for no hint address to cause a value greater than 47 bits to be returned.
I don't think the MAP_FIXED case is that interesting here because it has to work in both fixed and non-fixed mappings.
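
For anyone who wants to check the non-fixed behaviour on a given machine, a rough probe along these lines should do. It is only a sketch: probe_hint() and addr_bits() are hypothetical helpers, it assumes a 64-bit Linux userspace, and the result depends on the architecture and kernel configuration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Number of significant bits in addr (0 for addr == 0). */
static unsigned addr_bits(uint64_t addr)
{
    unsigned bits = 0;

    while (addr) {
        bits++;
        addr >>= 1;
    }
    return bits;
}

/*
 * Map one anonymous page with the given hint and without MAP_FIXED,
 * store the address the kernel picked in *out, then unmap the page.
 * Returns 0 on success and -1 on failure.
 */
static int probe_hint(uint64_t hint, uint64_t *out)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    assert(out != NULL);
    if (page <= 0)
        return -1;

    p = mmap((void *)(uintptr_t)hint, (size_t)page,
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *out = (uint64_t)(uintptr_t)p;
    munmap(p, (size_t)page);
    return 0;
}
```

Calling probe_hint() once with a hint of 0 and once with a hint such as 1ULL << 48, then comparing addr_bits() of the two results, shows whether the hint alone is enough to get an address above 47 bits on that system.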
>> This would be more consistent vs. other arches.
> Yes riscv is an outlier here. The reason I am pushing for something like a flag to restrict the address space rather than setting it to be the default is it seems like if applications are relying on upper bits to be free, then they should be explicitly asking the kernel to keep them free rather than assuming them to be free.
Let's see what the other architectures do and then come up with a way that fixes the pointer tagging case first on those that are broken. We can see if there needs to be an extra flag after that. Here is what I found:
- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns higher 52-bit addresses when either a hint is passed or CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns 52-bit addresses when an addr hint is passed
- riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses it for allocating the stack below, ignoring it for normal mappings
- s390 has no DEFAULT_MAP_WINDOW but tries to allocate in the current number of pgtable levels and only upgrades to the next level (31, 42, 53, 64 bits) if a hint is passed or the current level is exhausted.
- loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA space of 47 bits (16K pages, 3 levels), but can support a 55 bit space (64K pages, 3 levels).
- sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space. It may allocate both positive and negative addresses in there. (?)
- mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and at most 48, 41 or 39 address bits, respectively.
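
The common pattern on x86_64, arm64 and ppc64 in the list above can be modelled in a few lines. This is a simplified userspace model, not kernel code: the MODEL_* constants and function names are hypothetical, the 47/57-bit values are picked to resemble the x86_64 case, and the real implementations differ in detail (for example in how the length is taken into account).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values: 47-bit default window, 57-bit full address space. */
#define MODEL_DEFAULT_MAP_WINDOW (UINT64_C(1) << 47)
#define MODEL_TASK_SIZE          (UINT64_C(1) << 57)

/*
 * Upper bound of the search range: the full address space is only
 * opened up when the caller passes a hint above the default window.
 */
static uint64_t model_mmap_limit(uint64_t hint)
{
    return hint > MODEL_DEFAULT_MAP_WINDOW ? MODEL_TASK_SIZE
                                           : MODEL_DEFAULT_MAP_WINDOW;
}

/* Would [addr, addr + len) be an acceptable result for this hint? */
static bool model_range_ok(uint64_t hint, uint64_t addr, uint64_t len)
{
    uint64_t limit = model_mmap_limit(hint);

    assert(len > 0);
    return len <= limit && addr <= limit - len;
}
```

With this model, a process that never passes a high hint never sees an address at or above MODEL_DEFAULT_MAP_WINDOW, so the bits above the window stay free for tagging without any extra flag.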
I would suggest these changes:
- make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64 and ppc64, leave it at 47
- add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits based on page size), sparc (48 bits) and s390 (unsure if 42, 53, 47 or 48 bits)
- leave the rest unchanged.
Arnd