This is a summary of discussions we had on IRC between kernel and toolchain engineers regarding support for JITs and 52-bit virtual address space (mostly in the context of LuaJIT, but this concerns other JITs too).
The summary is that we need to consider ways of reducing the size of VA for a given process or container on a Linux system.
The high-level problem is that JITs tend to use the upper bits of addresses to encode various pieces of data, and the number of available bits is shrinking as VA sizes increase. With the usual 42-bit VA (which is what most JITs assume) they have 22 bits in which to encode performance-critical data. With a 48-bit VA (e.g., the ThunderX world) things start to get complicated, and JITs need to be non-trivially patched at the source level to keep working with fewer bits available for their performance-critical storage. With the upcoming 52-bit VA things might get dire enough for some JITs to declare such configurations unsupported.
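For illustration only, a minimal sketch (not taken from any particular JIT) of this kind of pointer packing, assuming a 42-bit VA:

#include <stdint.h>
#include <assert.h>

/* Illustrative only: with a 42-bit user VA, 22 bits are free at the top of
 * a 64-bit word. A 48- or 52-bit VA shrinks TAG_BITS to 16 or 12 and the
 * whole layout has to be reworked. */
#define VA_BITS   42
#define TAG_BITS  (64 - VA_BITS)
#define PTR_MASK  ((1ULL << VA_BITS) - 1)

static inline uint64_t box_encode(void *p, uint64_t tag)
{
    uint64_t u = (uint64_t)(uintptr_t)p;
    assert((u & ~PTR_MASK) == 0);          /* pointer must fit in VA_BITS */
    assert(tag < (1ULL << TAG_BITS));
    return (tag << VA_BITS) | u;
}

static inline void *box_ptr(uint64_t w)    { return (void *)(uintptr_t)(w & PTR_MASK); }
static inline uint64_t box_tag(uint64_t w) { return w >> VA_BITS; }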
On the other hand, most JITs are not expected to require terabytes of RAM or a huge VA for their applications. Most JIT applications will happily live in a 42-bit world with the mere 4 terabytes of address space that it provides. Therefore, what JITs need in the modern world is a way to make mmap() return addresses below a certain threshold, and error out with ENOMEM when the "lower" memory is exhausted. This is very similar to the ADDR_LIMIT_32BIT personality, but extended to common VA sizes on 64-bit systems: 39-bit, 42-bit, 48-bit, 52-bit, etc.
Since we do not want to penalize the whole system (by using an artificially small VA), it would be best to have a way to enable a VA limit on a per-process basis (similar to the ADDR_LIMIT_32BIT personality). If that's not possible -- then on a per-container / cgroup basis. If that's not possible -- then at the system level (similar to vm.mmap_min_addr, but from the other end).
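As a point of reference, the 32-bit limit can already be requested per process through personality(2); a sketch of how an extended variant might be used (ADDR_LIMIT_32BIT is real, anything wider is the extension being asked for here):

#include <sys/personality.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    /* Exists today: cap this process's address space at 32 bits. A
     * hypothetical ADDR_LIMIT_42BIT (etc.) would be used the same way. */
    if (personality(ADDR_LIMIT_32BIT) == -1)
        perror("personality");

    void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        perror("mmap");     /* ENOMEM once the limited range is exhausted */
    else
        printf("allocated at %p\n", p);
    return 0;
}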
Dear kernel people, what can be done to address the JITs' need to reduce the effective VA size?
-- Maxim Kuvyrkov www.linaro.org
Thanks for the summary, now it all makes much more sense.
One simple (from the kernel's perspective, not from the JIT's) approach might be to always use MAP_FIXED whenever an allocation is made for memory that needs these special pointers, and then manage the available address space explicitly. Would that work, or do you require everything, including the binary itself, to be below the address limit?
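A rough sketch of what that explicit management could look like, assuming the JIT reserves one low region up front and carves it up itself (the sizes, the hint and the 42-bit limit are arbitrary):

#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#define JIT_REGION_SIZE  (1UL << 30)              /* arbitrary 1 GiB arena */
#define JIT_REGION_HINT  ((void *)(1UL << 40))    /* well below 2^42 */

static char *jit_base, *jit_next;

static int jit_region_init(void)
{
    /* Reserve address space only; the hint is not guaranteed, so check
     * that the kernel actually placed us below the limit. */
    void *p = mmap(JIT_REGION_HINT, JIT_REGION_SIZE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED || (uintptr_t)p + JIT_REGION_SIZE > (1UL << 42))
        return -1;
    jit_base = jit_next = p;
    return 0;
}

static void *jit_alloc(size_t size)
{
    size = (size + 4095) & ~(size_t)4095;
    if ((size_t)(jit_base + JIT_REGION_SIZE - jit_next) < size)
        return NULL;
    /* MAP_FIXED is safe here because it only ever replaces our own
     * PROT_NONE reservation, never glibc's mappings. */
    void *p = mmap(jit_next, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    jit_next += size;
    return p;
}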
Regarding which memory sizes are needed, my impression from your explanation is that a single personality flag (e.g. ADDR_LIMIT_42BIT) would be sufficient for the use case, and you don't actually need to tie this to the architecture-provided virtual addressing limits at all. If it's only one such flag, we can probably find a way to fit it into the personality flags, though ironically we are actually running out of bits in there as well.
Arnd
On 28 April 2016 at 14:17, Arnd Bergmann arnd@arndb.de wrote:
One simple (from the kernel's perspective, not from the JIT's) approach might be to always use MAP_FIXED whenever an allocation is made for memory that needs these special pointers, and then manage the available address space explicitly. Would that work, or do you require everything, including the binary itself, to be below the address limit?
The trouble IME with this idea is that in practice you're linking with glibc, which means glibc is managing (and using) the address space, not the JIT. So MAP_FIXED is pretty awkward to use.
thanks -- PMM
Hi,
One can find holes in the VA space by examining /proc/self/maps, so suitable addresses for MAP_FIXED can be deduced.
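A rough sketch of that approach, with no attempt at robustness (and note the inherent race: glibc or another thread can grab the gap before the MAP_FIXED call):

#include <stdio.h>

/* Scan /proc/self/maps for the first gap of at least 'size' bytes that
 * ends below 'limit'; the returned address can then be passed to mmap
 * with MAP_FIXED. Returns 0 if no suitable gap was found. */
static unsigned long find_gap(unsigned long limit, unsigned long size)
{
    FILE *f = fopen("/proc/self/maps", "r");
    unsigned long start, end, prev_end = 0x10000;  /* stay above mmap_min_addr */
    unsigned long found = 0;
    char line[512];

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;
        if (start >= prev_end + size && prev_end + size <= limit) {
            found = prev_end;
            break;
        }
        if (end > prev_end)
            prev_end = end;
    }
    fclose(f);
    return found;
}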
The other problem is, as Arnd alluded to, when a JIT'ed object then needs to refer to something allocated outside the JIT. This could be remedied by another level of indirection/trampoline.
Taking two steps back though, I would view VA space squeezing as a stop-gap before removing tags from the upper bits of a pointer altogether (tagging the bottom bits, by controlling alignment, is perfectly safe). The larger the VA space, the more scope mechanisms such as Address Space Layout Randomisation have to improve security.
Cheers, -- Steve
FWIW: OpenJDK assumes a 48-bit virtual address space. There is no inherent reason for this other than that we do
movz/movk/movk
to form an address. It is relatively trivial to change this to
movz/movk/movk/movk
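For readers unfamiliar with the sequence: each movk patches in one 16-bit chunk, so three instructions cover 48 bits and four cover the full 64. An illustrative C model of the difference:

#include <stdint.h>

/* Model of movz/movk address materialisation: each instruction supplies
 * one 16-bit chunk of the final value. */
static uint64_t materialize48(uint16_t c0, uint16_t c1, uint16_t c2)
{
    return (uint64_t)c0 | ((uint64_t)c1 << 16) | ((uint64_t)c2 << 32);
}

static uint64_t materialize64(uint16_t c0, uint16_t c1, uint16_t c2, uint16_t c3)
{
    return materialize48(c0, c1, c2) | ((uint64_t)c3 << 48);
}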
All the best, Ed.
On 28 April 2016 at 14:00, Maxim Kuvyrkov maxim.kuvyrkov@linaro.org wrote:
This is a summary of discussions we had on IRC between kernel and toolchain engineers regarding support for JITs and 52-bit virtual address space (mostly in the context of LuaJIT, but this concerns other JITs too).
The summary is that we need to consider ways of reducing the size of VA for a given process or container on a Linux system.
I do not think this issue is inherent to all JIT implementations, but rather to LuaJIT with its NaN-tagging scheme [1], which packs different types of objects into an 8-byte word. It works well on x86_64, which limits the user VA to 47 bits, but things get messy with large VA support. LuaJIT works around this issue by changing its internal block allocator [2] to basically limit mmap allocations to 47 bits: it retries mmap with random hint addresses until an allocation returns an address within 47 bits. It is far from the ideal solution and it might break in some scenarios (fragmented or exhausted VA space).
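Roughly the shape of that workaround (an illustrative sketch, not the actual LuaJIT code):

#include <sys/mman.h>
#include <stdlib.h>
#include <stdint.h>

/* Retry anonymous mmap at random hint addresses until the kernel returns
 * something that fits below 2^47; give up after a fixed number of tries. */
static void *alloc_below_47bit(size_t size)
{
    for (int attempt = 0; attempt < 64; attempt++) {
        uintptr_t hint = ((uintptr_t)rand() << 16) & ((1ULL << 47) - 1);
        void *p = mmap((void *)hint, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            continue;
        if ((uintptr_t)p + size <= (1ULL << 47))
            return p;               /* landed in the usable range */
        munmap(p, size);            /* too high; throw it back and retry */
    }
    return NULL;
}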
Another project that shows some limitation with different VA sizes is the LLVM sanitizers: for each VA size they must use a different scheme to map application addresses directly to shadow memory. They work with 39- and 42-bit VAs, but with some tradeoffs: either the total shadow memory is limited to a lower bound (ASan sets it to the 39-bit maximum), or there is a performance cost in address translation (MSan and TSan) from checking the VA size and applying the correct transformation.
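The translation in question is essentially a shift-and-add; a simplified illustration (the offsets are placeholders, not the real sanitizer layouts) of why one known VA size keeps it cheap and a run-time choice makes it dearer:

#include <stdint.h>

#define SHADOW_OFFSET_39  0x0000001000000000UL    /* placeholder */
#define SHADOW_OFFSET_42  0x0000008000000000UL    /* placeholder */

/* One known VA size: a fixed shift-and-add the compiler can fold into
 * every instrumented access. */
static uintptr_t shadow_fixed(uintptr_t addr)
{
    return (addr >> 3) + SHADOW_OFFSET_39;
}

static int va_bits;                               /* detected at start-up */

/* Several possible VA sizes: an extra check (or indirect load of the
 * offset) on every instrumented access. */
static uintptr_t shadow_variable(uintptr_t addr)
{
    return (addr >> 3) + (va_bits == 39 ? SHADOW_OFFSET_39
                                        : SHADOW_OFFSET_42);
}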
I see that adding a personality flag could work, but it has the problem of consuming another flag and limiting the scheme to a narrow set of VA sizes (I do not think we could add two flags, for 39 and 42). I still think that limiting it via cgroups is a better strategy, and it might also help with testing on the userland side (by using 48-bit kernels and setting the VA to 39 and 42 bits).
[1] http://lua-users.org/lists/lua-l/2009-11/msg00089.html
[2] https://github.com/LuaJIT/LuaJIT/commit/0c6fdc1039a3a4450d366fba7af4b29de73f...
+++ Adhemerval Zanella [2016-04-28 12:07 -0300]:
I do not think this issue is inherent to all JIT implementations, but rather to LuaJIT with its NaN-tagging scheme [1], which packs different types of objects into an 8-byte word.
Other JITs use the same or similar schemes (Mozilla's IonMonkey is one, AIUI). Not sure how many others do this, or how many JITs do it a different way, but it is certainly wider than just LuaJIT.
Wookey
For info, Google's V8 JavaScript engine uses the bottom bit to tag pointers. One problem with this mechanism, though, is that the pointers can only be used directly with unscaled-offset memory access instructions (LDUR/STUR). So in particular, no LDP/STP.
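To spell out why: when the tag is folded into the load's displacement the effective offsets become odd, and only the unscaled-immediate forms accept that. An illustrative sketch (the struct and offsets are invented):

#include <stdint.h>
#include <stddef.h>

/* Heap pointers carry tag 1 in bit 0. Keeping the tag and folding it into
 * the displacement means a field at offset 8 is loaded from tagged + 7;
 * odd offsets only fit LDUR/STUR on AArch64, never LDP/STP or the scaled
 * LDR/STR immediate forms. */
#define HEAP_TAG 1

struct object { uint64_t header; uint64_t field; };

static inline uint64_t tag_ptr(struct object *p)
{
    return (uint64_t)(uintptr_t)p | HEAP_TAG;
}

static uint64_t load_field(uint64_t tagged)
{
    char *raw = (char *)(uintptr_t)tagged;
    /* compiles to roughly: ldur x0, [x0, #7] */
    return *(uint64_t *)(raw + offsetof(struct object, field) - HEAP_TAG);
}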
+Andy, Cyrill, Dmitry who have been discussing variable TASK_SIZE on x86 on linux-mm
http://marc.info/?l=linux-mm&m=146290118818484&w=2
I was working on an (AArch64-specific) auxiliary vector entry to export TASK_SIZE to userspace at exec time. The goal was to allow for more elegant, robust, and efficient replacements for the following changes:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dfaafbaaa291
https://github.com/xemul/criu/commit/c0c0546c31e6df4932669f4740197bb830a24c8...
However, based on the above discussion, it appears that some sort of prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be preferable for AArch64. (And perhaps additional justifications for the new calls would influence the x86 decisions.) What do folks think?
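To make that concrete, a sketch of how such interfaces might look from userspace; every constant here (AT_TASK_SIZE, PR_GET_TASK_SIZE, PR_SET_TASK_SIZE and their values) is hypothetical and does not exist today:

#include <sys/auxv.h>
#include <sys/prctl.h>
#include <stdio.h>

#define AT_TASK_SIZE      0x1000   /* hypothetical auxv tag */
#define PR_GET_TASK_SIZE  0x1001   /* hypothetical prctl options */
#define PR_SET_TASK_SIZE  0x1002

int main(void)
{
    /* Auxiliary vector variant: fixed at exec time, read-only. */
    unsigned long ts = getauxval(AT_TASK_SIZE);
    printf("task size from auxv: %#lx\n", ts);

    /* prctl variant: query, then shrink the usable VA for this process. */
    long cur = prctl(PR_GET_TASK_SIZE, 0, 0, 0, 0);
    printf("task size from prctl: %#lx\n", (unsigned long)cur);
    prctl(PR_SET_TASK_SIZE, 1UL << 42, 0, 0, 0);
    return 0;
}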
Thanks, Cov