On 06/22/2016 01:17 PM, Cyrill Gorcunov wrote:
> On Wed, Jun 22, 2016 at 12:56:56PM -0700, Dave Hansen wrote:
>>
>> Yeah, cgroups don't make a lot of sense.
>>
>> On x86, the 48-bit virtual address is even hard-coded in the ABI[1]. So
>> we can't change *any* program's layout without either breaking the ABI
>> or having it opt in.
>>
>> But, we're also lucky to have had only one VA layout since day one.
>>
>> 1. www.x86-64.org/documentation/abi.pdf - “... Therefore, conforming
>> processes may only use addresses from 0x00000000 00000000 to 0x00007fff
>> ffffffff .”
>
> Yes, but no one forces you to write conforming programs ;)
> After all, as long as the hw allows you to run with VA bits
> above 48 it's fine; all the side effects of breaking the ABI
> are up to the program author (IIRC x86 allows up to 52 bits
> at the hw level, don't have the specs at hand).
My point was that you can't restrict the vaddr space without breaking
the ABI, because apps expect to be able to use addresses up to
0x00007fffffffffff. You also can't extend the vaddr space, because apps
can *also* expect that there are no valid vaddrs past 0x00007fffffffffff.
So, whatever happens here, at least on x86, we can't do anything to the
vaddr space without it being an opt-in for *each* *app*.
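A quick way to see this from userspace -- a minimal sketch, assuming a
stock x86-64 kernel with the 48-bit (47-bit user) layout: the kernel
treats a high mmap() hint as merely advisory and hands back something
below the boundary.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* Hint above the 47-bit user boundary; without an opt-in
         * the kernel ignores it and picks a low address itself. */
        void *hint = (void *)(1UL << 48);
        void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        printf("got %p (%s the 47-bit boundary)\n", p,
               (unsigned long)p < (1UL << 47) ? "below" : "above");
        munmap(p, 4096);
        return 0;
}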
On 06/22/2016 12:20 PM, Andy Lutomirski wrote:
>>> As an example, a 32-bit x86 program really could have something mapped
>>> above the 32-bit boundary. It just wouldn't be useful, but the kernel
>>> should still understand that it's *user* memory.
>>>
>>> So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
>>
>> +1. Also it might be (not sure though, just guessing) suitable to do such
>> a thing via the memory cgroup controller, instead of carrying this limit
>> per process (or task structure/vma or mm).
> I think we'll want this per mm. After all, a high-VA-limit-aware bash
> should be able to run high-VA-unaware programs without fiddling with
> cgroups.
Yeah, cgroups don't make a lot of sense.
On x86, the 48-bit virtual address is even hard-coded in the ABI[1]. So
we can't change *any* program's layout without either breaking the ABI
or having it opt in.
But, we're also lucky to have had only one VA layout since day one.
1. www.x86-64.org/documentation/abi.pdf - “... Therefore, conforming
processes may only use addresses from 0x00000000 00000000 to 0x00007fff
ffffffff .”
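To make the PR_SET_MMAP_LIMIT idea above concrete, a call sequence
could look like the sketch below. To be clear: this prctl pair exists
in no kernel, and the option values are made up for illustration.

#include <stdio.h>
#include <sys/prctl.h>

/* Hypothetical constants -- not in any kernel header. */
#define PR_SET_MMAP_LIMIT 48
#define PR_GET_MMAP_LIMIT 49

int main(void)
{
        /* Ask that future mmap()s only return addresses below
         * 2^42; the limit would live in the mm and so be
         * inherited by fork()/exec() children. */
        if (prctl(PR_SET_MMAP_LIMIT, 1UL << 42, 0, 0, 0) == -1)
                perror("PR_SET_MMAP_LIMIT");

        printf("limit: %#lx\n",
               (unsigned long)prctl(PR_GET_MMAP_LIMIT, 0, 0, 0, 0));
        return 0;
}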
This is a summary of IRC discussions between kernel and toolchain engineers regarding support for JITs and 52-bit virtual address space (mostly in the context of LuaJIT, but this concerns other JITs too).
The summary is that we need to consider ways of reducing the VA size for a given process or container on a Linux system.
The high-level problem is that JITs tend to use the upper bits of addresses to encode various pieces of data, and the number of available bits is shrinking as VA sizes grow. With the usual 42-bit VA (which is what most JITs assume) they have 22 bits in which to encode various performance-critical data. With a 48-bit VA (e.g., the ThunderX world) things start to get complicated, and JITs need non-trivial source-level patching to keep working with fewer bits available for their performance-critical storage. With the upcoming 52-bit VA, things may get dire enough for some JITs to declare such configurations unsupported.
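To make the bit pressure concrete, here is a minimal sketch of the
tagged-pointer trick in C, assuming a 42-bit VA; the field layout is
illustrative, not any particular JIT's actual encoding.

#include <stdint.h>
#include <stdio.h>

#define VA_BITS 42
#define VA_MASK ((UINT64_C(1) << VA_BITS) - 1)

/* Pack a pointer plus (64 - VA_BITS) = 22 bits of metadata into
 * one 64-bit word. The tag shrinks to 16 bits on a 48-bit VA and
 * to 12 bits on a 52-bit VA, which is the squeeze described above. */
static uint64_t box(void *p, uint32_t tag)
{
        /* Only sound if the allocator guarantees addresses below
         * 2^VA_BITS -- exactly what a VA limit would provide. */
        return ((uint64_t)tag << VA_BITS) | ((uintptr_t)p & VA_MASK);
}

static void *unbox_ptr(uint64_t v)
{
        return (void *)(uintptr_t)(v & VA_MASK);
}

static uint32_t unbox_tag(uint64_t v)
{
        return (uint32_t)(v >> VA_BITS);
}

int main(void)
{
        int x = 1;
        uint64_t v = box(&x, 0x2a);

        printf("ptr %p tag %#x\n", unbox_ptr(v), unbox_tag(v));
        return 0;
}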
On the other hand, most JITs are not expected to require terabytes of RAM or a huge VA for their applications. Most JIT applications will happily live in a 42-bit world with the mere 4 terabytes of address space that it provides. Therefore, what JITs need in the modern world is a way to make mmap() return addresses below a certain threshold, and error out with ENOMEM when "lower" memory is exhausted. This is very similar to the ADDR_LIMIT_32BIT personality, but extended to common VA sizes on 64-bit systems: 39-bit, 42-bit, 48-bit, 52-bit, etc.
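In the absence of such a knob, a JIT today has to approximate it by
probing with hint addresses and backing out, roughly as in this sketch
(mmap_below is a made-up helper name, and the probing strategy is just
one plausible one):

#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>

/* Try to allocate 'len' bytes below (1 << bits). The hint passed
 * to mmap() is only advisory, so verify the placement and undo it
 * if the kernel put us too high; the caller sees ENOMEM, the same
 * semantics a real per-process VA limit would give directly. */
static void *mmap_below(size_t len, unsigned int bits)
{
        uintptr_t limit = (uintptr_t)1 << bits;
        uintptr_t hint;

        for (hint = limit / 2; hint >= (1UL << 20); hint /= 2) {
                void *p = mmap((void *)hint, len,
                               PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        continue;
                if ((uintptr_t)p + len <= limit)
                        return p;       /* landed low enough */
                munmap(p, len);         /* too high, try a lower hint */
        }
        errno = ENOMEM;
        return MAP_FAILED;
}

int main(void)
{
        void *p = mmap_below(1 << 20, 42);
        return p == MAP_FAILED;
}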
Since we do not want to penalize the whole system (by using an artificially small VA), it would be best to have a way to enable a VA limit on a per-process basis (similar to the ADDR_LIMIT_32BIT personality). If that's not possible -- then on a per-container / cgroup basis. If that's not possible -- then at the system level (similar to vm.mmap_min_addr, but from the other end).
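For reference, the existing per-process precedent looks like this: a
small wrapper that flips ADDR_LIMIT_32BIT (from <sys/personality.h>)
and execs the real program. Whether a given architecture actually
honors the flag varies, so treat this as a sketch of the opt-in model
rather than a working limit everywhere.

#include <stdio.h>
#include <sys/personality.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s prog [args...]\n", argv[0]);
                return 2;
        }

        /* personality(0xffffffff) queries the current persona;
         * OR-ing in ADDR_LIMIT_32BIT opts this process -- and
         * everything it execs -- into a 32-bit address limit. */
        if (personality(personality(0xffffffff) | ADDR_LIMIT_32BIT) == -1) {
                perror("personality");
                return 1;
        }

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}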
Dear kernel people, what can be done to address the JITs' need to reduce the effective VA size?
--
Maxim Kuvyrkov
www.linaro.org