On 03.04.25 08:07, Dave Airlie wrote:
On Tue, 1 Apr 2025 at 21:03, Christian König <christian.koenig@amd.com> wrote:
On 31.03.25 22:43, Dave Airlie wrote:
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard <mripard@kernel.org> wrote:
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Yes. The main rationale being that it doesn't always make sense to register against memcg: a lot of devices are going to allocate from dedicated chunks of memory that are either carved out from the main memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the right tool.
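For reference, the driver-side flow with the dmem controller looks roughly like this; a sketch following include/linux/cgroup_dmem.h, with error handling trimmed and the region name made up:

#include <linux/cgroup_dmem.h>
#include <linux/err.h>

/* Sketch: register a VRAM region with the dmem controller and charge
 * allocations against the calling task's cgroup. memcg is not involved
 * at any point. The region name "drm/card0/vram" is just an example.
 */
static struct dmem_cgroup_region *vram_region;

static int my_driver_init(u64 vram_size)
{
        vram_region = dmem_cgroup_register_region(vram_size,
                                                  "drm/%s/vram", "card0");
        return PTR_ERR_OR_ZERO(vram_region);
}

static int my_bo_charge(u64 size, struct dmem_cgroup_pool_state **pool)
{
        struct dmem_cgroup_pool_state *limit_pool;
        int ret;

        /* Fails when the allocating cgroup would exceed its dmem.max
         * for this region; undone later with dmem_cgroup_uncharge().
         */
        ret = dmem_cgroup_try_charge(vram_region, size, pool, &limit_pool);
        if (ret)
                return ret;

        /* ... actual VRAM allocation ... */
        return 0;
}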
While I agree with that, if a user can cause a device driver to allocate memory that memcg would normally account, then we have to interface with memcg to account that memory.
This assumes that memcg should be in control of device-driver-allocated memory, which in some cases is intentionally not done.
E.g. a server application which allocates buffers on behalf of its clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
Yes, we definitely need the ability to transfer an allocation between cgroups for this case.
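Purely as illustration; no such helper exists in the dmem controller today, and every name below is made up:

/* HYPOTHETICAL: the dmem controller currently has no charge-transfer
 * operation; this function and the bo fields are invented to
 * illustrate the idea.
 */
int dmem_cgroup_transfer_charge(struct dmem_cgroup_pool_state **pool,
                                struct cgroup *target, u64 size);

/* In a compositor/server export path: once the buffer is handed to the
 * client, move the charge so it counts against the client's dmem.max
 * instead of the server's.
 */
static int export_bo_to_client(struct my_bo *bo, struct cgroup *client_cg)
{
        int ret;

        ret = dmem_cgroup_transfer_charge(&bo->dmem_pool, client_cg,
                                          bo->size);
        if (ret)        /* client over its limit: keep the charge, fail */
                return ret;

        return 0;
}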
The bigger issue is that you break UAPI for things which are "older" (X server, older Android approaches, etc.). Fixing all of this is certainly possible, but most likely simply not worth it.
There are simpler approaches to handle all of this, I think, but see below for further thoughts on that topic.
That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
The pathological case would be a single application wanting to use 90% of RAM for device allocations, freeing it all, then using 90% of RAM for normal usage. Creating a policy that allows this with dmem and memcg is difficult: if you allow 90% on each limit independently, the user can easily OOM the system by holding both at once (e.g. on a 32 GiB machine, 90% dmem plus 90% memcg adds up to roughly 57 GiB of simultaneous charges).
Yeah, completely agree.
That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them, see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
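For reference, the amdgpu escape hatch is a plain GEM creation flag; a minimal userspace sketch, assuming an open render node fd and with error handling trimmed:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/amdgpu_drm.h>

/* Sketch: a GTT buffer that opts out of the per-device GTT limit via
 * AMDGPU_GEM_CREATE_PREEMPTIBLE (include/uapi/drm/amdgpu_drm.h).
 */
static int alloc_preemptible_bo(int fd, uint64_t size, uint32_t *handle)
{
        union drm_amdgpu_gem_create args;

        memset(&args, 0, sizeof(args));
        args.in.bo_size = size;
        args.in.alignment = 4096;
        args.in.domains = AMDGPU_GEM_DOMAIN_GTT;
        args.in.domain_flags = AMDGPU_GEM_CREATE_PREEMPTIBLE;

        if (ioctl(fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args))
                return -1;

        *handle = args.out.handle;
        return 0;
}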
Another problem: when eviction happens, when and to which process do we account things? For example, process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
If we account it to process B, then process A can fail because of process B's memcg limit. That creates a situation which is absolutely not traceable for a system administrator.
But process A never asked for system memory in the first place, so we can't account the memory to it either; otherwise we'd make the process responsible for things it didn't do.
There are good arguments for all solutions, and there are a couple of blockers which rule out one solution or another for certain use cases. To summarize, I think the whole situation is a complete mess.
Maybe there isn't a single solution and we need to make it somehow configurable?
My feeling is that we can't solve the VRAM eviction problem super effectively, but it's also probably not going to be a major common case. I don't think we should double-account memcg/dmem just in case we have to evict all of a user's dmem at some point. Maybe if there was some kind of soft memcg limit we could add as an accounted-but-not-enforced overhead, it might be useful to track evictions. But yes, we can't have A allocating memory causing B to fall over because we evict memory into B's memcg space and B fails to allocate the next time it tries, nor can we have A fail in that case.
+1 yeah, exactly my thinking as well.
For the UMA GPU case, where there is no device memory or eviction problem, perhaps add a configurable option to just say: account all allocations done by this process in memcg. Yes, you can work around it with allocation servers or whatever, but at least the behaviour for well-behaved things is somewhat defined.
We can have that as a workaround, but I think we should approach that differently.
With upcoming CXL, even coherent device memory is exposed to the core OS as NUMA memory, just with a higher latency.
So in both the CXL and the UMA case it actually doesn't make sense to allocate the memory through the driver interfaces any more. With AMDGPU, for example, we are just replicating mbind()/madvise() within the driver.
Instead what the DRM subsystem should aim for is to allocate memory using the normal core OS functionality and then import it into the driver.
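In other words, placement becomes a plain NUMA policy question; a rough userspace sketch, where the device/CXL node id is an assumption:

#include <numaif.h>             /* mbind(); link with -lnuma */
#include <sys/mman.h>

/* Sketch: bind anonymous memory to the NUMA node backing the CXL/UMA
 * device memory. Node 1 is an example id; error handling trimmed.
 */
static void *alloc_on_device_node(size_t size)
{
        unsigned long nodemask = 1UL << 1;      /* example: node 1 */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        mbind(p, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
        return p;
}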
AMD, NVidia and Intel have had HMM working for quite a while now, but it has some limitations, especially on the performance side.
So for AMDGPU we are currently evaluating udmabuf as an alternative. It seems to work fine with different NUMA nodes, is perfectly memcg accounted, and gives you a DMA-buf which can be imported everywhere.
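A minimal userspace sketch of that flow, memfd plus /dev/udmabuf (uapi from include/uapi/linux/udmabuf.h), with error handling trimmed:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Sketch: allocate through the core OS (memfd, fully memcg accounted)
 * and wrap it into a DMA-buf for driver import. size must be
 * page-aligned.
 */
static int dmabuf_from_memfd(size_t size)
{
        struct udmabuf_create create;
        int memfd, devfd, dmabuf;

        memfd = memfd_create("gpu-buffer", MFD_ALLOW_SEALING);
        ftruncate(memfd, size);
        /* udmabuf requires the memfd to be sealed against shrinking */
        fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

        devfd = open("/dev/udmabuf", O_RDWR);
        memset(&create, 0, sizeof(create));
        create.memfd = memfd;
        create.offset = 0;
        create.size = size;
        dmabuf = ioctl(devfd, UDMABUF_CREATE, &create);

        close(devfd);
        close(memfd);
        /* import dmabuf via e.g. DRM_IOCTL_PRIME_FD_TO_HANDLE */
        return dmabuf;
}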
The only show stopper might be the allocation performance, but even if that turns out to be a problem I think the ongoing folio work will properly resolve it.
With that in mind, I think for the CXL/UMA use case we should use dmem to limit driver-allocated memory to just a few megabytes for legacy things, and let the vast majority of memory allocations go through the normal core OS channels instead.
Regards, Christian.
Dave.