This RFC builds on T.J. Mercier's earlier series [1], which added a memory.stat counter for exported dma-bufs and a binder-backed mechanism to transfer charges between cgroups.
The first commit is taken almost verbatim from TJ's series: it introduces MEMCG_DMABUF as a dedicated per-cgroup stat, so that the total exported dma-buf footprint is visible both system-wide (via the root cgroup) and per-application (via per-process cgroups). This avoids the overhead of DMABUF_SYSFS_STATS and integrates naturally into the existing cgroup memory hierarchy.
The rest of the series departs from TJ's approach. While the first commit introduces the memcg stat infrastructure for dmabufs, the export-time charging it introduces in dma_buf_export() is then superseded: we charge at dma_heap_ioctl_allocate() time, using a new charge_pid_fd field in struct dma_heap_allocation_data. The allocator opens a pidfd for its client (e.g., from binder's sender_pid), passes it to the ioctl, and the kernel charges the buffer directly to the client's cgroup at allocation time, so no transfer step is needed.
This decouples the accounting path from binder entirely: any allocator that knows its client's PID can use the pidfd mechanism, regardless of the IPC transport in use.
The cross-cgroup charging capability requires access control. Patches #3 and #4 add a generic LSM hook (security_dma_heap_alloc) and an SELinux implementation based on a new dma_heap object class with a charge_to permission, so policy authors can express which domains are allowed to charge memory to another domain's cgroup.
The last patch adds selftests that exercise the new charge_pid_fd field.
We are sending it as an RFC to spark broader discussion. It may or may not be the right path forward, and we welcome feedback on the trade-offs.
Collision note: Eric Chanudet's series [2] adds __GFP_ACCOUNT to system_heap page allocations as an opt-in module parameter. That approach charges pages to the allocator's own kmem, which overlaps with MEMCG_DMABUF. This series explicitly removes __GFP_ACCOUNT from system heap allocations and routes all accounting through the MEMCG_DMABUF path to avoid double-counting.
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com...
[2] https://lore.kernel.org/r/20260113-dmabuf-heap-system-memcg-v2-0-e85722cc2f2...
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
Albert Esteve (4):
      dma-heap: charge dma-buf memory via explicit memcg
      security: dma-heap: Add dma_heap_alloc LSM hook
      selinux: Restrict cross-cgroup dma-heap charging
      selftests/dmabuf-heaps: Add dma-buf memcg accounting tests

T.J. Mercier (1):
      memcg: Track exported dma-buffers

 Documentation/admin-guide/cgroup-v2.rst            |   5 +
 drivers/dma-buf/dma-buf.c                          |   7 +
 drivers/dma-buf/dma-heap.c                         |  54 +++++-
 drivers/dma-buf/heaps/system_heap.c                |   2 -
 include/linux/dma-buf.h                            |   4 +
 include/linux/lsm_hook_defs.h                      |   1 +
 include/linux/memcontrol.h                         |  37 ++++
 include/linux/security.h                           |   7 +
 include/uapi/linux/dma-heap.h                      |   6 +
 mm/memcontrol.c                                    |  19 ++
 security/security.c                                |  16 ++
 security/selinux/hooks.c                           |   7 +
 security/selinux/include/classmap.h                |   1 +
 tools/testing/selftests/cgroup/Makefile            |   2 +-
 tools/testing/selftests/cgroup/test_memcontrol.c   | 143 +++++++++++++-
 tools/testing/selftests/dmabuf-heaps/config        |   1 +
 tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
 tools/testing/selftests/dmabuf-heaps/vmtest.sh     | 205 +++++++++++++++++++++
 18 files changed, 633 insertions(+), 10 deletions(-)
---
base-commit: 74fe02ce122a6103f207d29fafc8b3a53de6abaf
change-id: 20260508-v2_20230123_tjmercier_google_com-f44fcfb16530
Best regards,
From: "T.J. Mercier" <tjmercier@google.com>
When a buffer is exported to userspace, use memcg to attribute the buffer to the allocating cgroup until all buffer references are released.
Unlike the dmabuf sysfs stats implementation, this memcg accounting avoids contention over the kernfs_rwsem incurred when creating or removing nodes.
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  4 ++++
 drivers/dma-buf/dma-buf.c               | 13 ++++++++++++
 include/linux/dma-buf.h                 |  4 ++++
 include/linux/memcontrol.h              | 37 +++++++++++++++++++++++++++++++++
 mm/memcontrol.c                         | 19 +++++++++++++++++
 5 files changed, 77 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed995..8bdbc2e866430 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1635,6 +1635,10 @@ The following nested keys are defined.
 		Amount of memory used for storing in-kernel data
 		structures.

+	  dmabuf (npn)
+		Amount of memory used for exported DMA buffers allocated by the cgroup.
+		Stays with the allocating cgroup regardless of how the buffer is shared.
+
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 71f37544a5c61..ce02377f48908 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -14,6 +14,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/dma-buf.h>
+#include <linux/memcontrol.h>
 #include <linux/dma-fence.h>
 #include <linux/dma-fence-unwrap.h>
 #include <linux/anon_inodes.h>
@@ -180,6 +181,9 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);

+	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+	mem_cgroup_put(dmabuf->memcg);
+
 	dmabuf->ops->release(dmabuf);

 	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
@@ -760,6 +764,13 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}

+	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
+				      GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_memcg;
+	}
+
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -770,6 +781,8 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)

 	return dmabuf;

+err_memcg:
+	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d1203da56fc5f..d9f1ccb51c60e 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -27,6 +27,7 @@ struct device;
 struct dma_buf;
 struct dma_buf_attachment;
+struct mem_cgroup;

 /**
  * struct dma_buf_ops - operations possible on struct dma_buf
@@ -429,6 +430,9 @@ struct dma_buf {

 		__poll_t active;
 	} cb_in, cb_out;
+
+	/** @memcg: the cgroup to which this buffer is currently attributed */
+	struct mem_cgroup *memcg;
 };

 /**
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b4..10068a833ad9e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum memcg_stat_item {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMABUF,
 	MEMCG_NR_STAT,
 };

@@ -649,6 +650,24 @@ int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 				  gfp_t gfp, swp_entry_t entry);

+/**
+ * mem_cgroup_charge_dmabuf - Charge dma-buf memory to a cgroup and update stat counter
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if it doesn't.
+ */
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask);
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+					    gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return true;
+	return __mem_cgroup_charge_dmabuf(memcg, nr_pages, gfp_mask);
+}
+
 void __mem_cgroup_uncharge(struct folio *folio);

 /**
@@ -664,6 +683,14 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
 	__mem_cgroup_uncharge(folio);
 }

+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	if (mem_cgroup_disabled())
+		return;
+	__mem_cgroup_uncharge_dmabuf(memcg, nr_pages);
+}
+
 void __mem_cgroup_uncharge_folios(struct folio_batch *folios);
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
@@ -1142,10 +1169,20 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
 	return 0;
 }

+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+					    gfp_t gfp_mask)
+{
+	return true;
+}
+
 static inline void mem_cgroup_uncharge(struct folio *folio)
 {
 }

+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+}
+
 static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d4668..15cee13d3ccd6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -433,6 +433,7 @@ static const unsigned int memcg_stat_items[] = {
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
+	MEMCG_DMABUF,
 };

 #define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
@@ -1580,6 +1581,7 @@ static const struct memory_stat memory_stats[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	{ "hugetlb", NR_HUGETLB },
 #endif
+	{ "dmabuf", MEMCG_DMABUF },

 	/* The memory events */
 	{ "workingset_refault_anon", WORKINGSET_REFAULT_ANON },
@@ -5399,6 +5401,23 @@ void mem_cgroup_flush_workqueue(void)
 	flush_workqueue(memcg_wq);
 }

+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask)
+{
+	if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
+		mod_memcg_state(memcg, MEMCG_DMABUF, nr_pages);
+		return true;
+	}
+
+	return false;
+}
+
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	mod_memcg_state(memcg, MEMCG_DMABUF, -nr_pages);
+	if (!mem_cgroup_is_root(memcg))
+		refill_stock(memcg, nr_pages);
+}
+
 static int __init cgroup_memory(char *s)
 {
 	char *token;
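For readers who want to reason about the charge/uncharge pairing without a kernel tree at hand, the accounting in the patch above can be approximated by a small userspace model. This is only a sketch: struct toy_memcg, the toy_* helpers and TOY_PAGE_SIZE are hypothetical names, and the model ignores reclaim, the hierarchy, and per-CPU stock caching, all of which the real try_charge() path may involve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct mem_cgroup. */
struct toy_memcg {
	unsigned long limit_pages;  /* models the cgroup limit, in pages */
	unsigned long usage_pages;  /* models all pages charged so far */
	unsigned long dmabuf_pages; /* models the MEMCG_DMABUF stat */
};

/* Round a byte size up to whole pages, like PAGE_ALIGN(size) / PAGE_SIZE. */
static unsigned long toy_size_to_pages(size_t size)
{
	return (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* The charge succeeds only if it fits; the stat moves with the charge. */
static bool toy_charge_dmabuf(struct toy_memcg *cg, unsigned long nr_pages)
{
	if (nr_pages > cg->limit_pages - cg->usage_pages)
		return false;
	cg->usage_pages += nr_pages;
	cg->dmabuf_pages += nr_pages;
	return true;
}

/* Undo a previous successful charge of the same size. */
static void toy_uncharge_dmabuf(struct toy_memcg *cg, unsigned long nr_pages)
{
	assert(cg->dmabuf_pages >= nr_pages);
	cg->usage_pages -= nr_pages;
	cg->dmabuf_pages -= nr_pages;
}
```

The point of the model is the symmetry: the release path must uncharge exactly the page count that the export path charged, which is why both sides in the diff derive it from the same PAGE_ALIGN(dmabuf->size) expression.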
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
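The selection of the cgroup to charge, as described in the paragraph above, can be summarised in a few lines. This is a userspace sketch of my reading of the commit message and diff, not kernel code: struct toy_cgroup and toy_pick_charge_target() are hypothetical names, and the NULL result for "no pidfd, mem_accounting disabled" reflects the fact that the diff only charges when a memcg was resolved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; only identity matters here. */
struct toy_cgroup {
	const char *name;
};

/*
 * Pick the cgroup that pays for an allocation.
 *   client:         cgroup resolved from charge_pid_fd, or NULL if unset
 *   allocator:      cgroup of the process calling the ioctl
 *   mem_accounting: value of the mem_accounting module parameter
 * Returns NULL when the buffer is left unaccounted.
 */
static struct toy_cgroup *toy_pick_charge_target(struct toy_cgroup *client,
						 struct toy_cgroup *allocator,
						 int mem_accounting)
{
	assert(allocator != NULL);

	if (client)
		return client;    /* explicit pidfd wins */
	if (mem_accounting)
		return allocator; /* fall back to the caller */
	return NULL;              /* no accounting requested */
}
```

Note that an explicit pidfd takes precedence regardless of mem_accounting, so the cross-cgroup path works even when default accounting is off.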
Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), so remove it and route all accounting through the single MEMCG_DMABUF path.
Usage examples:
1. Central allocator charging to a client at allocation time. The allocator knows the client's PID (e.g., from binder's sender_pid) and uses pidfd to attribute the charge:
    pid_t client_pid = txn->sender_pid;
    int pidfd = pidfd_open(client_pid, 0);

    struct dma_heap_allocation_data alloc = {
        .len = buffer_size,
        .fd_flags = O_RDWR | O_CLOEXEC,
        .charge_pid_fd = pidfd,
    };
    ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
    close(pidfd);
    /* alloc.fd is now charged to client's cgroup */
2. Default allocation (no pidfd, mem_accounting=1). When charge_pid_fd is not set and the mem_accounting module parameter is enabled, the buffer is charged to the allocator's own cgroup:
    struct dma_heap_allocation_data alloc = {
        .len = buffer_size,
        .fd_flags = O_RDWR | O_CLOEXEC,
    };
    ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
    /* charged to current process's cgroup */
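The first usage example omits error handling. A slightly fuller userspace sketch might look like the following. Several things here are assumptions: struct sketch_heap_alloc is a local mirror of the extended struct (installed uapi headers will not have charge_pid_fd unless this series is applied), SKETCH_HEAP_IOCTL_ALLOC is built from the 'H' magic and command number 0 used by the dma-heap uapi header, and the fallback syscall number 434 for pidfd_open holds on most but not all architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/ioctl.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number */
#endif

/* Local mirror of struct dma_heap_allocation_data as extended by the patch. */
struct sketch_heap_alloc {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;
	uint32_t padding; /* reserved, must be zero */
};

#define SKETCH_HEAP_IOCTL_ALLOC _IOWR('H', 0x0, struct sketch_heap_alloc)

/* Fill the request; everything not set explicitly stays zero. */
static void sketch_fill_alloc(struct sketch_heap_alloc *a, uint64_t len, int pidfd)
{
	assert(pidfd >= 0);
	memset(a, 0, sizeof(*a));
	a->len = len;
	a->fd_flags = O_RDWR | O_CLOEXEC;
	/* The patch treats 0 as "no pidfd", so a pidfd that is fd 0 is ignored. */
	a->charge_pid_fd = (uint32_t)pidfd;
}

/* Allocate len bytes from heap_fd, charged to client_pid's cgroup.
 * Returns the new dma-buf fd, or a negative errno value on failure. */
static int sketch_alloc_for_client(int heap_fd, pid_t client_pid, uint64_t len)
{
	struct sketch_heap_alloc a;
	int pidfd, ret;

	pidfd = (int)syscall(SYS_pidfd_open, client_pid, 0);
	if (pidfd < 0)
		return -errno;

	sketch_fill_alloc(&a, len, pidfd);
	if (ioctl(heap_fd, SKETCH_HEAP_IOCTL_ALLOC, &a) < 0)
		ret = -errno; /* save errno before close() can clobber it */
	else
		ret = (int)a.fd;

	close(pidfd); /* the charge does not depend on the pidfd staying open */
	return ret;
}
```

A caller would pass an open /dev/dma_heap/<name> descriptor as heap_fd and the PID obtained from its IPC layer as client_pid.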
Current limitations:
- Single-owner model: a dma-buf carries one memcg charge regardless of how many processes share it. This means only the first owner (and exporter) of the shared buffer bears the charge.
- Only memcg accounting is supported. While this makes sense for system heap buffers, other heaps (e.g., CMA heaps) will also require selective charging to the dmem controller.
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8bdbc2e866430..824d269531eb1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1636,8 +1636,9 @@ The following nested keys are defined.
 		structures.

 	  dmabuf (npn)
-		Amount of memory used for exported DMA buffers allocated by the cgroup.
-		Stays with the allocating cgroup regardless of how the buffer is shared.
+		Amount of memory used for exported DMA buffers allocated by or on
+		behalf of the cgroup. Stays with the allocating cgroup regardless
+		of how the buffer is shared.

 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ce02377f48908..23fb758b78297 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);

-	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
-	mem_cgroup_put(dmabuf->memcg);
+	if (dmabuf->memcg) {
+		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
+					   PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+		mem_cgroup_put(dmabuf->memcg);
+	}

 	dmabuf->ops->release(dmabuf);

@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}

-	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
-				      GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto err_memcg;
-	}
-
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)

 	return dmabuf;

-err_memcg:
-	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ac5f8685a6494..ff6e259afcdc0 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,13 +7,17 @@
  */

 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");

 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;

 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);

+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;

 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;

+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *		handle to the allocated dma-buf
  * @fd_flags:	file descriptor flags used when allocating
  * @heap_flags:	flags passed to heap
+ * @charge_pid_fd: optional pidfd of the process whose cgroup should be
+ *		charged for this allocation; 0 means charge the calling
+ *		process's cgroup
+ * @__padding:	reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };

 #define DMA_HEAP_IOC_MAGIC 'H'
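Since the patch grows a uapi struct, it may help to check the layout explicitly. The sketch below mirrors the old and new struct with fixed-width types (struct layout_old and struct layout_new are hypothetical names) and verifies that the new fields land after the existing 24 bytes without implicit padding. The helper also models what an old binary's request looks like once the missing tail is zero-filled, which is my understanding of how the existing dma-heap ioctl dispatcher handles shorter user structs; treat that as an assumption to verify against the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout before the patch. */
struct layout_old {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
};

/* Layout after the patch. */
struct layout_new {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;
	uint32_t padding;
};

static_assert(sizeof(struct layout_old) == 24, "old struct is 24 bytes");
static_assert(sizeof(struct layout_new) == 32, "new struct is 32 bytes");
static_assert(offsetof(struct layout_new, charge_pid_fd) == 24,
	      "charge_pid_fd directly follows heap_flags");
static_assert(offsetof(struct layout_new, padding) == 28,
	      "explicit padding keeps the size a multiple of 8");

/* Copy an old-layout request into the new layout, zero-filling the tail. */
static struct layout_new layout_zero_extend(const struct layout_old *old)
{
	struct layout_new n;

	memset(&n, 0, sizeof(n));
	memcpy(&n, old, sizeof(*old));
	return n;
}
```

Under that assumption, a request from an unmodified binary reads as charge_pid_fd == 0, i.e. "no pidfd", so existing callers keep their current behaviour.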
On 5/12/26 11:10, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route all accounting through a single MEMCG_DMABUF path.
Usage examples:
Central allocator charging to a client at allocation time. The allocator knows the client's PID (e.g., from binder's sender_pid) and uses pidfd to attribute the charge:
    pid_t client_pid = txn->sender_pid;
    int pidfd = pidfd_open(client_pid, 0);

    struct dma_heap_allocation_data alloc = {
        .len = buffer_size,
        .fd_flags = O_RDWR | O_CLOEXEC,
        .charge_pid_fd = pidfd,
    };
    ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
    close(pidfd);
    /* alloc.fd is now charged to client's cgroup */
Default allocation (no pidfd, mem_accounting=1). When charge_pid_fd is not set and the mem_accounting module parameter is enabled, the buffer is charged to the allocator's own cgroup:
    struct dma_heap_allocation_data alloc = {
        .len = buffer_size,
        .fd_flags = O_RDWR | O_CLOEXEC,
    };
    ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
    /* charged to current process's cgroup */
Current limitations:
- Single-owner model: a dma-buf carries one memcg charge regardless of how many processes share it. Means only the first owner (and exporter) of the shared buffer bears the charge.
- Only memcg accounting supported. While this makes sense for system heap buffers, other heaps (e.g., CMA heaps) will require selectively charging also for the dmem controller.
Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.

I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive etc...

Essentially the problem boils down to two limitations:
1) A piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups.
2) When memory references in the form of file descriptors are passed between applications, we have no way of changing the accounting to a different cgroup.

The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers etc...

On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side I have seen yet another attempt by a driver to do some kind of special driver-specific accounting to solve this just a few weeks ago. And to be honest, such single-driver island approaches have a tendency to break more often than they work correctly.
Regards, Christian.
On Tue, May 12, 2026 at 3:14 AM Christian König christian.koenig@amd.com wrote:
Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
Yeah I think this might work. I know of 3 cases, and it trivially solves the first two. The third requires some work on our end to extend our userspace interfaces to include the pidfd but it seems doable. I'm checking with our graphics folks.
1) Direct allocation from user (e.g. app -> allocation ioctl on /dev/dma_heap/foo)
No changes required to userspace. mem_accounting=1 charges the app.

2) Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate -> gralloc)
gralloc has the caller's pid as described in the commit message. Open a pidfd and pass it in the dma_heap_allocation_data.

3) Double hop remote allocation (e.g. app -> dequeueBuffer -> SurfaceFlinger -> gralloc)
In this case gralloc knows SurfaceFlinger's pid, but not the app's. So we need to add the app's pidfd to the SurfaceFlinger -> gralloc interface, or transfer the memcg charge from SurfaceFlinger to the app after the allocation. It'd be nice to avoid the charge transfer option entirely, but if we need it that doesn't seem so bad in this case because it's a bulk charge for the entire dmabuf rather than per-page. So the exporter doesn't need to get involved (we wouldn't need a new dma_buf_op) and we wouldn't have to worry about looping and locking for each page.
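To make the "bulk charge" argument in case 3 concrete, here is a toy userspace model of moving one dma-buf's charge between cgroups in a single step. Nothing like this exists in the posted series; struct xfer_memcg, struct xfer_dmabuf and xfer_dmabuf_charge() are hypothetical names, and the model ignores locking, reclaim and the cgroup hierarchy. It charges the destination first so that a failure leaves the original owner untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: a limit and a usage counter. */
struct xfer_memcg {
	unsigned long limit_pages;
	unsigned long usage_pages;
};

/* Hypothetical stand-in for a dma-buf carrying a single charge. */
struct xfer_dmabuf {
	unsigned long nr_pages;
	struct xfer_memcg *memcg; /* current owner, or NULL if uncharged */
};

/* Move the whole buffer's charge to 'to' in one operation.
 * Returns false, changing nothing, if 'to' has no room. */
static bool xfer_dmabuf_charge(struct xfer_dmabuf *buf, struct xfer_memcg *to)
{
	struct xfer_memcg *from = buf->memcg;

	if (from == to)
		return true;
	if (buf->nr_pages > to->limit_pages - to->usage_pages)
		return false;

	to->usage_pages += buf->nr_pages; /* charge destination first */
	if (from) {
		assert(from->usage_pages >= buf->nr_pages);
		from->usage_pages -= buf->nr_pages; /* then release the source */
	}
	buf->memcg = to;
	return true;
}
```

Because the buffer records one owner and one page count, the transfer is two counter updates and a pointer swap, with no per-page iteration and no exporter callback.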
I'm just not sure if this is future-proof and will work for all use cases, e.g. cloud gaming, native context for automotive etc...

Essentially the problem boils down to two limitations:

- A piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups.
Yup, memcg already has this problem with pagecache and shmem.
- When memory references in the form of file descriptors are passed between applications, we have no way of changing the accounting to a different cgroup.

The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers etc...
On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps.
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
On the GPU side, just a few weeks ago I saw yet another attempt at a driver doing some kind of special driver-specific accounting to solve this. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
Regards, Christian.
Signed-off-by: Albert Esteve aesteve@redhat.com
 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8bdbc2e866430..824d269531eb1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1636,8 +1636,9 @@ The following nested keys are defined.
 		structures.
 
 	  dmabuf (npn)
-		Amount of memory used for exported DMA buffers allocated by the cgroup.
-		Stays with the allocating cgroup regardless of how the buffer is shared.
+		Amount of memory used for exported DMA buffers allocated by or on
+		behalf of the cgroup. Stays with the allocating cgroup regardless
+		of how the buffer is shared.
 
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ce02377f48908..23fb758b78297 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
 
-	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
-	mem_cgroup_put(dmabuf->memcg);
+	if (dmabuf->memcg) {
+		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
+					   PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+		mem_cgroup_put(dmabuf->memcg);
+	}
 
 	dmabuf->ops->release(dmabuf);
 
@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}
 
-	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
-				      GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto err_memcg;
-	}
-
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	return dmabuf;
 
-err_memcg:
-	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ac5f8685a6494..ff6e259afcdc0 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,13 +7,17 @@
  */
 
 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
 
 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;
 
 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);
 
+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;
 
 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;
 
+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;
 
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *			handle to the allocated dma-buf
  * @fd_flags:		file descriptor flags used when allocating
  * @heap_flags:		flags passed to heap
+ * @charge_pid_fd:	optional pidfd of the process whose cgroup should be
+ *			charged for this allocation; 0 means charge the calling
+ *			process's cgroup
+ * @__padding:		reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };
 
 #define DMA_HEAP_IOC_MAGIC		'H'
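As I read the quoted dma_heap_buffer_alloc() hunk, the choice of which cgroup gets charged reduces to a small decision: an explicit charge_pid_fd wins, otherwise mem_accounting selects the caller, otherwise nothing is charged. The following is a userspace model of that decision only, not kernel code; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum charge_target {
	CHARGE_NONE,		/* buffer is not charged to any memcg */
	CHARGE_CALLER,		/* charged to the calling process's cgroup */
	CHARGE_PIDFD_TARGET,	/* charged to the cgroup of the pidfd's task */
};

/*
 * Model of the selection logic: a non-zero charge_pid_fd takes priority
 * and applies even when mem_accounting is disabled.
 */
static enum charge_target pick_charge_target(uint32_t charge_pid_fd,
					     bool mem_accounting)
{
	if (charge_pid_fd)
		return CHARGE_PIDFD_TARGET;
	if (mem_accounting)
		return CHARGE_CALLER;
	return CHARGE_NONE;
}

/*
 * Number of pages accounted for a buffer of the given size; this mirrors
 * the PAGE_ALIGN(size) / PAGE_SIZE expression used on the uncharge side.
 */
static uint64_t pages_to_charge(uint64_t size, uint64_t page_size)
{
	assert(page_size != 0);
	return (size + page_size - 1) / page_size;
}
```

One thing the model makes visible: with charge_pid_fd set, the buffer is charged regardless of the mem_accounting module parameter.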
On Tue, May 12, 2026 at 8:53 PM T.J. Mercier tjmercier@google.com wrote:
On Tue, May 12, 2026 at 3:14 AM Christian König christian.koenig@amd.com wrote:
On 5/12/26 11:10, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route all accounting through a single MEMCG_DMABUF path.
Usage examples:
Central allocator charging to a client at allocation time. The allocator knows the client's PID (e.g., from binder's sender_pid) and uses pidfd to attribute the charge:
pid_t client_pid = txn->sender_pid; int pidfd = pidfd_open(client_pid, 0);
struct dma_heap_allocation_data alloc = { .len = buffer_size, .fd_flags = O_RDWR | O_CLOEXEC, .charge_pid_fd = pidfd, }; ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc); close(pidfd); /* alloc.fd is now charged to client's cgroup */
Default allocation (no pidfd, mem_accounting=1). When charge_pid_fd is not set and the mem_accounting module parameter is enabled, the buffer is charged to the allocator's own cgroup:
struct dma_heap_allocation_data alloc = { .len = buffer_size, .fd_flags = O_RDWR | O_CLOEXEC, }; ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc); /* charged to current process's cgroup */
Current limitations:
- Single-owner model: a dma-buf carries one memcg charge regardless of how many processes share it. This means only the first owner (and exporter) of the shared buffer bears the charge.
- Only memcg accounting supported. While this makes sense for system heap buffers, other heaps (e.g., CMA heaps) will require selectively charging also for the dmem controller.
Well, that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
[...]
On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps.
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary. udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
BR, Albert.
On Wed, May 13, 2026 at 4:39 AM Albert Esteve aesteve@redhat.com wrote:
On Tue, May 12, 2026 at 8:53 PM T.J. Mercier tjmercier@google.com wrote:
On Tue, May 12, 2026 at 3:14 AM Christian König christian.koenig@amd.com wrote:
On 5/12/26 11:10, Albert Esteve wrote:
[...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
BR, Albert.
On Thu, May 14, 2026 at 12:35 AM T.J. Mercier tjmercier@google.com wrote: [...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
sounds like we still have a long way to go to correctly account for various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in dma_buf_export(), so I guess it covers all dma-buf types except dma_heap, but the problem is that it has no remote charging support at all?
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
Thanks Barry
On 5/16/26 11:19, Barry Song wrote:
On Thu, May 14, 2026 at 12:35 AM T.J. Mercier tjmercier@google.com wrote: [...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
sounds like we still have a long way to go to correctly account for various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in dma_buf_export(), so I guess it covers all dma-buf types except dma_heap, but the problem is that it has no remote charging support at all?
No, just the other way around.
DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special, so memcg always applies.
dma_buf_export(), on the other hand, handles tons of different use cases, ranging from buffers accounted to dmem, through special resources which aren't even memory, all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.
The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
Regards, Christian.
Thanks Barry
On Mon, May 18, 2026 at 9:34 AM Christian König christian.koenig@amd.com wrote:
On 5/16/26 11:19, Barry Song wrote:
On Thu, May 14, 2026 at 12:35 AM T.J. Mercier tjmercier@google.com wrote: [...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
sounds like we still have a long way to go to correctly account for various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in dma_buf_export(), so I guess it covers all dma-buf types except dma_heap, but the problem is that it has no remote charging support at all?
No, just the other way around
DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.
I removed a draft adding an ioctl for charge transfer from the series before sending because I wanted to focus on the charge_pid_fd approach and keep things simple, deferring the recharge path to a follow-up depending on feedback.
The main difference between my removed draft and what you're describing, iiuc, is scope and layer: my draft was an explicit ioctl on the dma-buf fd that the consumer calls to claim the charge (see below), while you seem to be suggesting a more general kernel-internal function that could work across buffer types and cgroup controllers, so not necessarily userspace-initiated? A kernel-internal function will need a way to identify the target process, which sounds similar to the binder-backed approach from TJ [1]. For everything else, the receiver still needs to declare itself, which the ioctl accomplishes.
```
/* When an app imports a daemon-allocated buffer, it can transfer the charge to itself: */
int buf_fd = receive_dmabuf_from_daemon();
ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE);
/* charge now attributed to app's cgroup */
```
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com...
The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
The main reasons we moved away from TJ's transfer-based approach toward `charge_pid_fd` are: avoid the transient charge window on the daemon's cgroup; and to decouple from Binder, allowing any allocator to use it.
Technically, both approaches could coexist, though. Of the three scenarios TJ described:
- Scenario 2 is directly addressed by charge_pid_fd approach without any transient charge on the daemon at the cost of one extra field in the heap ioctl uAPI struct.
- Scenario 3 can be handled by the charge transfer function without changes to SurfaceFlinger. The app or dequeueBuffer claims the charge for itself or the app, respectively (depending on whether we include a pid_fd field in the transfer ioctl). It also covers non-heap exporters. The con in both variants is the transient charge window on the daemon.
Both approaches shift the responsibility for correct charging attribution to userspace: first, `charge_pid_fd` on the allocator's side, and the transfer charge on the consumer's side.
Deciding on one, the other or both depends on how much we value avoiding transient attribution, and how much we need a non-heap generic solution. With the XFER_CHARGE we can cover both. Thus, the `charge_pid_fd` approach in this RFC can be seen as a performance/strictness optimisation, eliminating transient charges to the daemon at the cost of a permanent uAPI addition to the heap ioctl struct, but not strictly required for correctness. On the other hand, if we agree on the end goal of migrating other exporters to use dma-buf heaps, and scenario 3 is addressed by adding the app's pid_fd to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient approach despite the uAPI change.
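The trade-off between the two approaches can be illustrated with a toy userspace model. This is only a sketch and not kernel code: `struct cg`, `alloc_charge_direct()` and `alloc_then_transfer()` are hypothetical names, and the model assumes a transfer charges the client before uncharging the daemon. It tracks the peak charge on each cgroup so the transient window on the daemon becomes visible.

```c
#include <stddef.h>

/* Toy model of a cgroup's memory counter; not kernel code. */
struct cg {
	size_t usage;	/* bytes currently charged */
	size_t peak;	/* highest usage ever observed */
};

static void cg_charge(struct cg *cg, size_t bytes)
{
	cg->usage += bytes;
	if (cg->usage > cg->peak)
		cg->peak = cg->usage;
}

static void cg_uncharge(struct cg *cg, size_t bytes)
{
	cg->usage -= bytes;
}

/* charge_pid_fd model: the client is charged at allocation time. */
static void alloc_charge_direct(struct cg *client, size_t bytes)
{
	cg_charge(client, bytes);
}

/* Transfer model: the daemon is charged first, then the charge moves. */
static void alloc_then_transfer(struct cg *daemon, struct cg *client,
				size_t bytes)
{
	cg_charge(daemon, bytes);	/* transient charge window opens */
	cg_charge(client, bytes);
	cg_uncharge(daemon, bytes);	/* window closes */
}
```

In this model both paths leave the client with the same final charge; they differ only in the daemon's recorded peak, which is the transient attribution discussed above.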
Regards, Christian.
Thanks Barry
On 5/18/26 14:06, Albert Esteve wrote:
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.
I removed a draft adding an ioctl for charge transfer from the series before sending because I wanted to focus on the charge_pid_fd approach and keep things simple, deferring the recharge path to a follow-up depending on feedback.
The main difference between my removed draft and what you're describing, iiuc, is scope and layer: my draft was an explicit ioctl on the dma-buf fd that the consumer calls to claim the charge (see below), while you seem to be suggesting a more general kernel-internal function that could work across buffer types and cgroup controllers, so not necessarily userspace-initiated? A kernel-internal function will need a way to identify the target process, which sounds similar to the binder-backed approach from TJ [1]. For everything else, the receiver still needs to declare itself, which the ioctl accomplishes.
    /* When an app imports a daemon-allocated buffer, it can transfer the charge to itself: */
    int buf_fd = receive_dmabuf_from_daemon();
    ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE);
    /* charge now attributed to app's cgroup */
Well that thinking goes into the right direction, but the requirements are still not completely covered as far as I can see.
Let me explain below a bit more.
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com...
The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
The main reasons we moved away from TJ's transfer-based approach toward `charge_pid_fd` are: avoid the transient charge window on the daemon's cgroup; and to decouple from Binder, allowing any allocator to use it.
Yeah those concerns are completely correct.
The application should not volunteer by saying 'Charge that buffer to me.'; rather, the daemon should say 'force-charge that buffer to this application and tell me when the application is over its limit'.
Technically, both approaches could coexist, though. Of the three scenarios TJ described:
- Scenario 2 is directly addressed by charge_pid_fd approach without
any transient charge on the daemon at the cost of one extra field in the heap ioctl uAPI struct.
Yeah, extending the uAPI to pass in the pid at allocation time is not much of a problem, but you also need to modify the whole stack above it, and that is a bit trickier.
- Scenario 3 can be handled by the charge transfer function without
changes to SurfaceFlinger. The app or dequeueBuffer claims the charge for itself or the app, respectively (depending on whether we include a pid_fd field in the transfer ioctl). It also covers non-heap exporters. The con in both variants is the transient charge window on the daemon.
It should be trivial for the daemon to charge the buffer to an application before handing it out.
Both approaches shift the responsibility for correct charging attribution to userspace: first, `charge_pid_fd` on the allocator's side, and the transfer charge on the consumer's side.
Yeah that's why I said it would be better if we do that without any uAPI change, but with all the uAPI we have to transfer file descriptors (dup(), fork(), passing FDs over sockets etc...) it could be really tricky to implement that.
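To illustrate why inferring the owner from descriptor movement is hard, here is a minimal sketch using only standard POSIX calls. `same_file()` is a hypothetical helper that compares the inode behind two descriptors. After dup() (and likewise after fork() or SCM_RIGHTS passing), several descriptors, possibly in different processes, refer to the same object, so holding a descriptor does not by itself say who should be charged.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both descriptors refer to the same inode. */
static bool same_file(int fd_a, int fd_b)
{
	struct stat a, b;

	if (fstat(fd_a, &a) < 0 || fstat(fd_b, &b) < 0)
		return false;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For example, a descriptor returned by dup() compares equal to the original under this helper, while descriptors for two unrelated pipes do not.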
Deciding on one, the other or both depends on how much we value avoiding transient attribution, and how much we need a non-heap generic solution. With the XFER_CHARGE we can cover both. Thus, the `charge_pid_fd` approach in this RFC can be seen as a performance/strictness optimisation, eliminating transient charges to the daemon at the cost of a permanent uAPI addition to the heap ioctl struct, but not strictly required for correctness.
Well all we need is a uAPI which says charge this buffer (file descriptor) to that cgroup (pidfd).
With this at hand we should be able to handle all use cases at the same time.
On the other hand, if we agree on the end goal of migrating other exporters to use dma-buf heaps
That won't work. DMA-buf heaps is actually only a rather small and Android-specific use case.
We have tons of other interfaces to allocate DMA-bufs which need to stay around because of HW restrictions and we do need a solution for them as well.
Regards, Christian.
, and scenario 3 is addressed by adding the app's pid_fd to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient approach despite the uAPI change.
Regards, Christian.
Thanks Barry
On Tue, May 19, 2026 at 9:53 AM Christian König christian.koenig@amd.com wrote:
On 5/18/26 14:06, Albert Esteve wrote:
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.
I removed a draft adding an ioctl for charge transfer from the series before sending because I wanted to focus on the charge_pid_fd approach and keep things simple, deferring the recharge path to a follow-up depending on feedback.
The main difference between my removed draft and what you're describing, iiuc, is scope and layer: my draft was an explicit ioctl on the dma-buf fd that the consumer calls to claim the charge (see below), while you seem to be suggesting a more general kernel-internal function that could work across buffer types and cgroup controllers, so not necessarily userspace-initiated? A kernel-internal function will need a way to identify the target process, which sounds similar to the binder-backed approach from TJ [1]. For everything else, the receiver still needs to declare itself, which the ioctl accomplishes.
    /* When an app imports a daemon-allocated buffer, it can transfer the charge to itself: */
    int buf_fd = receive_dmabuf_from_daemon();
    ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE);
    /* charge now attributed to app's cgroup */
Well that thinking goes into the right direction, but the requirements are still not completely covered as far as I can see.
Let me explain below a bit more.
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com...
The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
The main reasons we moved away from TJ's transfer-based approach toward `charge_pid_fd` are: avoid the transient charge window on the daemon's cgroup; and to decouple from Binder, allowing any allocator to use it.
Yeah those concerns are completely correct.
The application should not volunteer by saying 'Charge that buffer to me.'; rather, the daemon should say 'force-charge that buffer to this application and tell me when the application is over its limit'.
Technically, both approaches could coexist, though. Of the three scenarios TJ described:
- Scenario 2 is directly addressed by charge_pid_fd approach without
any transient charge on the daemon at the cost of one extra field in the heap ioctl uAPI struct.
Yeah, extending the uAPI to pass in the pid at allocation time is not much of a problem, but you also need to modify the whole stack above it, and that is a bit trickier.
- Scenario 3 can be handled by the charge transfer function without
changes to SurfaceFlinger. The app or dequeueBuffer claims the charge for itself or the app, respectively (depending on whether we include a pid_fd field in the transfer ioctl). It also covers non-heap exporters. The con in both variants is the transient charge window on the daemon.
It should be trivial for the daemon to charge the buffer to an application before handing it out.
Yeah, true.
Both approaches shift the responsibility for correct charging attribution to userspace: first, `charge_pid_fd` on the allocator's side, and the transfer charge on the consumer's side.
Yeah that's why I said it would be better if we do that without any uAPI change, but with all the uAPI we have to transfer file descriptors (dup(), fork(), passing FDs over sockets etc...) it could be really tricky to implement that.
Deciding on one, the other or both depends on how much we value avoiding transient attribution, and how much we need a non-heap generic solution. With the XFER_CHARGE we can cover both. Thus, the `charge_pid_fd` approach in this RFC can be seen as a performance/strictness optimisation, eliminating transient charges to the daemon at the cost of a permanent uAPI addition to the heap ioctl struct, but not strictly required for correctness.
Well all we need is a uAPI which says charge this buffer (file descriptor) to that cgroup (pidfd).
So you favor having only the XFER_CHARGE variant. That is fine with me. If that is fine for others also that could be the way forward. If we extend it to accept either a pidfd or a cgroup fd (as commented previously), we can cover all dma-buf use cases with a single primitive:
```
ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE, charge_fd);
```
With the daemon invoking this ioctl before handing out the buf_fd.
This should cover most use cases? Except for the memfd case, which requires a separate mechanism. That would be follow-up work.
With this at hand we should be able to handle all use cases at the same time.
On the other hand, if we agree on the end goal of migrating other exporters to use dma-buf heaps
That won't work. DMA-buf heaps is actually only a rather small and Android-specific use case.
We have tons of other interfaces to allocate DMA-bufs which need to stay around because of HW restrictions and we do need a solution for them as well.
Regards, Christian.
, and scenario 3 is addressed by adding the app's pid_fd to SurfaceFlinger, then `charge_pid_fd` alone is a coherent/sufficient approach despite the uAPI change.
Regards, Christian.
Thanks Barry
Hi Christian,
On Tue, May 19, 2026 at 09:53:19AM +0200, Christian König wrote:
On 5/18/26 14:06, Albert Esteve wrote:
udmabufs are already memcg-charged, so adding a separate MEMCG_DMABUF would double count. Are there any other exporters you had in mind that would benefit from this approach?
Well, apart from DMA-buf, memfd_create() is one of the things which has broken our neck a couple of times in the past.
But thinking more about it: instead of making this DMA-buf heaps specific, what if we have a general cgroups function which allows changing the accounting of a buffer referenced by a file descriptor to a different process?
That would cover not only the DMA-buf heaps use case, but also all other DMA-bufs with dmem and whatever we come up with in the future as well.
I removed a draft adding an ioctl for charge transfer from the series before sending because I wanted to focus on the charge_pid_fd approach and keep things simple, deferring the recharge path to a follow-up depending on feedback.
The main difference between my removed draft and what you're describing, iiuc, is scope and layer: my draft was an explicit ioctl on the dma-buf fd that the consumer calls to claim the charge (see below), while you seem to be suggesting a more general kernel-internal function that could work across buffer types and cgroup controllers, so not necessarily userspace-initiated? A kernel-internal function will need a way to identify the target process, which sounds similar to the binder-backed approach from TJ [1]. For everything else, the receiver still needs to declare itself, which the ioctl accomplishes.
    /* When an app imports a daemon-allocated buffer, it can transfer the charge to itself: */
    int buf_fd = receive_dmabuf_from_daemon();
    ioctl(buf_fd, DMA_BUF_IOCTL_XFER_CHARGE);
    /* charge now attributed to app's cgroup */
Well that thinking goes into the right direction, but the requirements are still not completely covered as far as I can see.
Let me explain below a bit more.
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.com...
The only drawback I can see is that DMA-buf heap allocations would be temporarily accounted to the memory allocation daemon, but I don't think that this would be a problem.
The main reasons we moved away from TJ's transfer-based approach toward `charge_pid_fd` are: avoid the transient charge window on the daemon's cgroup; and to decouple from Binder, allowing any allocator to use it.
Yeah those concerns are completely correct.
The application should not volunteer by saying 'Charge that buffer to me.'; rather, the daemon should say 'force-charge that buffer to this application and tell me when the application is over its limit'.
I would agree, but with a caveat: how do we want to deal with malicious applications here? The application should have expressed that it's okay for it to be charged by a different process, otherwise it becomes trivial for a malicious app to create arbitrary charges against another application in the system and DoS it.
But then, that means that an application could arbitrarily charge the daemon as well if it doesn't opt-in but asks for allocations.
So maybe we should have an opt-in for the caller, and a way for the daemon to check if the caller has indeed opted in before performing the allocation (and the charge transfer)?
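The opt-in idea can be sketched as a toy userspace model. Everything here is hypothetical and for illustration only: `struct task_cg`, the `accepts_remote_charge` flag and `alloc_for_client()` do not correspond to any existing kernel interface. The model assumes the daemon refuses the allocation outright when the client has not opted in, so neither the client nor the daemon is charged without consent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a task's cgroup state; all names here are hypothetical. */
struct task_cg {
	size_t usage;			/* bytes charged to this cgroup */
	size_t limit;			/* hard limit in bytes */
	bool accepts_remote_charge;	/* set by the task itself (opt-in) */
};

/*
 * Daemon-side allocation on behalf of a client. Returns 0 on success,
 * -1 if the client has not opted in or would exceed its limit.
 * Nothing is ever charged to the daemon in this model.
 */
static int alloc_for_client(struct task_cg *client, size_t bytes)
{
	if (!client->accepts_remote_charge)
		return -1;	/* no consent: refuse instead of paying ourselves */
	if (bytes > client->limit - client->usage)
		return -1;	/* client would go over its limit */
	client->usage += bytes;
	return 0;
}
```

How the opt-in would actually be expressed and queried (a cgroup attribute, an LSM policy decision, or something else) is left open here, as it is in the discussion.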
Technically, both approaches could coexist, though. Of the three scenarios TJ described:
- Scenario 2 is directly addressed by charge_pid_fd approach without
any transient charge on the daemon at the cost of one extra field in the heap ioctl uAPI struct.
Yeah, extending the uAPI to pass in the pid at allocation time is not much of a problem, but you also need to modify the whole stack above it, and that is a bit trickier.
- Scenario 3 can be handled by the charge transfer function without
changes to SurfaceFlinger. The app or dequeueBuffer claims the charge for itself or the app, respectively (depending on whether we include a pid_fd field in the transfer ioctl). It also covers non-heap exporters. The con in both variants is the transient charge window on the daemon.
It should be trivial for the daemon to charge the buffer to an application before handing it out.
Both approaches shift the responsibility for correct charging attribution to userspace: first, `charge_pid_fd` on the allocator's side, and the transfer charge on the consumer's side.
Yeah that's why I said it would be better if we do that without any uAPI change, but with all the uAPI we have to transfer file descriptors (dup(), fork(), passing FDs over sockets etc...) it could be really tricky to implement that.
Deciding on one, the other or both depends on how much we value avoiding transient attribution, and how much we need a non-heap generic solution. With the XFER_CHARGE we can cover both. Thus, the `charge_pid_fd` approach in this RFC can be seen as a performance/strictness optimisation, eliminating transient charges to the daemon at the cost of a permanent uAPI addition to the heap ioctl struct, but not strictly required for correctness.
Well all we need is a uAPI which says charge this buffer (file descriptor) to that cgroup (pidfd).
With this at hand we should be able to handle all use cases at the same time.
On the other hand, if we agree on the end goal of migrating other exporters to use dma-buf heaps
That won't work. DMA-buf heaps is actually only a rather small and Android-specific use case.
I don't think that's true anymore. Heaps are used in lots of different use cases now in the embedded space, including in regular, generic components not specifically used for embedded systems.
Maxime
On Mon, May 18, 2026 at 3:34 PM Christian König christian.koenig@amd.com wrote:
On 5/16/26 11:19, Barry Song wrote:
On Thu, May 14, 2026 at 12:35 AM T.J. Mercier tjmercier@google.com wrote: [...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
sounds like we still have a long way to go to correctly account for various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in dma_buf_export(), so I guess it covers all dma-buf types except dma_heap, but the problem is that it has no remote charging support at all?
No, just the other way around
DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
Hi Christian,
Thanks very much for your explanation. So basically it seems that dma_buf_export() is not the proper place to charge, since it may end up mixing in non-system-memory accounting?
My question is also about the global view for both heap and non-heap cases. After reading the discussion, I’ve tried to summarize it—please let me know if my understanding is correct.
For dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a remote pidfd or similar information to indicate where the dma-buf should be charged, as in Albert's patchset.
For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that triggers the allocation. So we likely need other approaches. We could either move more drivers over to dma-heap, or introduce something like DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly declare a charge.
Best Regards Barry
On 5/19/26 01:00, Barry Song wrote:
On Mon, May 18, 2026 at 3:34 PM Christian König christian.koenig@amd.com wrote:
On 5/16/26 11:19, Barry Song wrote:
On Thu, May 14, 2026 at 12:35 AM T.J. Mercier tjmercier@google.com wrote: [...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
sounds like we still have a long way to go to correctly account for various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in dma_buf_export(), so I guess it covers all dma-buf types except dma_heap, but the problem is that it has no remote charging support at all?
No, just the other way around
DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
Hi Christian,
Thanks very much for your explanation. So basically it seems that dma_buf_export() is not the proper place to charge, since it may end up mixing in non-system-memory accounting?
Yes, exactly that.
My question is also about the global view for both heap and non-heap cases. After reading the discussion, I’ve tried to summarize it—please let me know if my understanding is correct.
For dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a remote pidfd or similar information to indicate where the dma-buf should be charged, as in Albert's patchset.
Well that's the current proposal, but I think we need to come up with something more general.
For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that triggers the allocation. So we likely need other approaches. We could either move more drivers over to dma-heap, or introduce something like DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly declare a charge.
Yeah but that's not only for DMA-buf, we need that for file descriptors returned by memfd_create() as well.
Regards, Christian.
Best Regards Barry
On Tue, May 19, 2026 at 12:10 AM Christian König christian.koenig@amd.com wrote:
On 5/19/26 01:00, Barry Song wrote:
On Mon, May 18, 2026 at 3:34 PM Christian König christian.koenig@amd.com wrote:
On 5/16/26 11:19, Barry Song wrote:
On Thu, May 14, 2026 at 12:35 AM T.J. Mercier tjmercier@google.com wrote: [...]
I have a question about this part. Albert I guess you are interested only in accounting dmabuf-heap allocations, or do you expect to add __GFP_ACCOUNT or mem_cgroup_charge_dmabuf calls to other non-dmabuf-heap exporters?
We're scoping this to dma-buf heaps for now. CMA heaps and the dmem controller are on the radar for follow-up/parallel work (there will be dragons and will surely need discussion). For DRM and V4L2 the long-term intent is migration to heaps, which would make direct accounting on those paths unnecessary.
Ah I see. GEM buffers exported to dmabufs are what I had in mind. I guess this would only leave the odd non-DRM driver with the need to add their own accounting calls, which I don't expect would be a big problem.
sounds like we still have a long way to go to correctly account for various v4l2, drm, GEM, CMA, etc. In patch 1, the charging is done in dma_buf_export(), so I guess it covers all dma-buf types except dma_heap, but the problem is that it has no remote charging support at all?
No, just the other way around
DMA-buf heaps can be handled here because we know that it is pure system memory and nothing special so memcg always applies.
dma_buf_export() on the other hand handles tons of different use cases, ranging from buffer accounted to dmem, over special resources which aren't even memory all the way to buffers which can migrate from dmem to memcg and back during their lifetime.
Hi Christian,
Thanks very much for your explanation. So basically it seems that dma_buf_export() is not the proper place to charge, since it may end up mixing in non-system-memory accounting?
Yes, exactly that.
My question is also about the global view for both heap and non-heap cases. After reading the discussion, I’ve tried to summarize it—please let me know if my understanding is correct.
For dma_heap, we have the ioctl DMA_HEAP_IOCTL_ALLOC, where users can pass a remote pidfd or similar information to indicate where the dma-buf should be charged, as in Albert's patchset.
Well that's the current proposal, but I think we need to come up with something more general.
For non-dma_heap dma-bufs, we don’t have an obvious userspace entry point that triggers the allocation. So we likely need other approaches. We could either move more drivers over to dma-heap, or introduce something like DMA_BUF_IOCTL_XFER_CHARGE, as you are discussing, to let userspace explicitly declare a charge.
Yeah but that's not only for DMA-buf, we need that for file descriptors returned by memfd_create() as well.
memfds get charged on fault, so an allocator shouldn't currently be charged just for creating the fd. Unlike system/CMA heap buffers, the shmem backing a memfd / udmabuf is LRU memory, and swapping the memcg owner of those pages is a more involved process which is not supported by memcg v2. There used to be some support in memcg v1, but it was removed. Commit e548ad4a7cbf ("mm: memcg: move charge migration code to memcontrol-v1.c") said, "It's a fairly large and complicated code which created a number of problems in the past." So I'm not sure how much appetite there would be to support it in v2 for this.
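The "charged on fault" behaviour can be observed indirectly from userspace. The sketch below uses only existing calls (memfd_create(), ftruncate(), mmap(), fstat()); `memfd_backed_blocks()` is a hypothetical helper name. It uses `st_blocks` as a visible proxy for page allocation and does not read the memcg charge itself: sizing the memfd should allocate nothing, while touching the mapping allocates the shmem pages.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a memfd of `len` bytes, optionally touch every byte, and report
 * how many 512-byte blocks are backed by pages. Returns 0 on success.
 */
static int memfd_backed_blocks(size_t len, int touch, long *blocks)
{
	struct stat st;
	int ret = -1;
	int fd = memfd_create("charge-demo", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out;
	if (touch) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		if (p == MAP_FAILED)
			goto out;
		memset(p, 0xff, len);	/* page faults allocate the shmem pages */
		munmap(p, len);
	}
	if (fstat(fd, &st) < 0)
		goto out;
	*blocks = (long)st.st_blocks;
	ret = 0;
out:
	close(fd);
	return ret;
}
```

Because the pages only come into existence in the context of whoever touches them, the process that creates the fd is not necessarily the one that ends up charged.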
On Wed, May 13, 2026 at 2:54 AM T.J. Mercier tjmercier@google.com wrote:
On Tue, May 12, 2026 at 3:14 AM Christian König christian.koenig@amd.com wrote:
On 5/12/26 11:10, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route all accounting through a single MEMCG_DMABUF path.
Usage examples:
Central allocator charging to a client at allocation time. The allocator knows the client's PID (e.g., from binder's sender_pid) and uses pidfd to attribute the charge:
    pid_t client_pid = txn->sender_pid;
    int pidfd = pidfd_open(client_pid, 0);

    struct dma_heap_allocation_data alloc = {
        .len = buffer_size,
        .fd_flags = O_RDWR | O_CLOEXEC,
        .charge_pid_fd = pidfd,
    };
    ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
    close(pidfd);
    /* alloc.fd is now charged to client's cgroup */
Default allocation (no pidfd, mem_accounting=1). When charge_pid_fd is not set and the mem_accounting module parameter is enabled, the buffer is charged to the allocator's own cgroup:
    struct dma_heap_allocation_data alloc = {
        .len = buffer_size,
        .fd_flags = O_RDWR | O_CLOEXEC,
    };
    ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
    /* charged to current process's cgroup */
Current limitations:
- Single-owner model: a dma-buf carries one memcg charge regardless of how many processes share it. This means only the first owner (and exporter) of the shared buffer bears the charge.
- Only memcg accounting is supported. While this makes sense for system heap buffers, other heaps (e.g., CMA heaps) will also require selective charging to the dmem controller.
Well that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
Yeah I think this might work. I know of 3 cases, and it trivially solves the first two. The third requires some work on our end to extend our userspace interfaces to include the pidfd but it seems doable. I'm checking with our graphics folks.
- Direct allocation from user (e.g. app -> allocation ioctl on /dev/dma_heap/foo)
  No changes required to userspace. mem_accounting=1 charges the app.
- Single hop remote allocation (e.g. app -> AHardwareBuffer_allocate -> gralloc)
  gralloc has the caller's pid as described in the commit message. Open a pidfd and pass it in the dma_heap_allocation_data.
- Double hop remote allocation (e.g. app -> dequeueBuffer -> SurfaceFlinger -> gralloc)
  In this case gralloc knows SurfaceFlinger's pid, but not the app's. So we need to add the app's pidfd to the SurfaceFlinger -> gralloc interface, or transfer the memcg charge from SurfaceFlinger to the app after the allocation. It'd be nice to avoid the charge transfer option entirely, but if we need it that doesn't seem so bad in this case because it's a bulk charge for the entire dmabuf rather than per-page. So the exporter doesn't need to get involved (we wouldn't need a new dma_buf_op) and we wouldn't have to worry about looping and locking for each page.
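To make the bulk-transfer point concrete, here is a toy userspace model in C. It is not kernel code: `toy_memcg`, `toy_charge()` and `toy_transfer_charge()` are hypothetical names invented for this illustration. The only idea taken from the discussion is that a dma-buf carries a single charge for its whole page count, so moving that charge is one charge plus one uncharge rather than a per-page loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Hypothetical stand-in for a memory cgroup: usage and limit, both in pages. */
struct toy_memcg {
	unsigned long usage;
	unsigned long limit;
};

/* Round a byte length up to whole pages, like PAGE_ALIGN(size) / PAGE_SIZE. */
static unsigned long toy_nr_pages(size_t len)
{
	return (len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Charge nr_pages to cg; fails without side effects if the limit would be exceeded. */
static bool toy_charge(struct toy_memcg *cg, unsigned long nr_pages)
{
	if (nr_pages > cg->limit - cg->usage)
		return false;
	cg->usage += nr_pages;
	return true;
}

/* Drop nr_pages from cg; the caller must have charged them earlier. */
static void toy_uncharge(struct toy_memcg *cg, unsigned long nr_pages)
{
	assert(cg->usage >= nr_pages);
	cg->usage -= nr_pages;
}

/*
 * Move the whole buffer's charge in one step. The destination is charged
 * first so that a failure leaves the source untouched.
 */
static bool toy_transfer_charge(struct toy_memcg *from, struct toy_memcg *to,
				size_t buf_len)
{
	unsigned long nr_pages = toy_nr_pages(buf_len);

	if (!toy_charge(to, nr_pages))
		return false;
	toy_uncharge(from, nr_pages);
	return true;
}
```

In this model a transfer either moves the full page count or changes nothing, which is the property that makes the bulk option simpler than per-page migration.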
Hi T.J.,
Your description of the three different cases sounds very interesting. It helps me understand how difficult it can be to correctly charge dma-buf in the current user scenarios.
I’m wondering where I can find Android userspace code that transfers the PID of RPC callers. Do we have any existing sample code in Android for this?
I'm just not sure if this is future proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.
Thanks Barry
On Sat, May 16, 2026 at 1:40 AM Barry Song baohua@kernel.org wrote:
[...]
I’m wondering where I can find Android userspace code that transfers the PID of RPC callers. Do we have any existing sample code in Android for this?
Hi Barry,
In Java android.os.Binder.getCallingPid() will provide it. Here
On Mon, May 18, 2026 at 2:12 PM T.J. Mercier tjmercier@google.com wrote:
[...]
I’m wondering where I can find Android userspace code that transfers the PID of RPC callers. Do we have any existing sample code in Android for this?
Hi Barry,
In Java android.os.Binder.getCallingPid() will provide it. Here
... let me try again
Here are some examples from the framework code:
https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&am...
In native code we have AIBinder_getCallingPid and android::IPCThreadState::self()->getCallingPid() (or android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)
https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=a...
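Once a service has the caller's PID from one of these calls, turning it into a pidfd could look roughly like the sketch below. `open_client_pidfd()` is a hypothetical helper name; it issues the `pidfd_open` syscall (Linux 5.3 and later) through `syscall()` in case the C library has no wrapper. Note that a PID obtained this way can in principle be recycled between the RPC and the `pidfd_open()` call if the caller exits in the meantime, so a real service would need to consider that race.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a pidfd for the RPC caller's PID.
 * Returns the new fd, or -1 with errno set (for example ESRCH if the
 * process no longer exists, ENOSYS if the syscall is unavailable).
 */
static int open_client_pidfd(pid_t client_pid)
{
#ifdef SYS_pidfd_open
	return (int)syscall(SYS_pidfd_open, client_pid, 0U);
#else
	(void)client_pid;
	errno = ENOSYS;
	return -1;
#endif
}
```

The returned descriptor is what the RFC expects in charge_pid_fd; it can be closed as soon as the allocation ioctl returns.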
On Tue, May 19, 2026 at 5:17 AM T.J. Mercier tjmercier@google.com wrote: [...]
Hi Barry,
In Java android.os.Binder.getCallingPid() will provide it. Here
... let me try again
Here are some examples from the framework code:
https://cs.android.com/search?q=getCallingPid%20f:ActivityManager&sq=&am...
In native code we have AIBinder_getCallingPid and android::IPCThreadState::self()->getCallingPid() (or android::hardware::IPCThreadState::self()->getCallingPid() for HIDL)
https://cs.android.com/search?q=getCallingPid%20l:cpp%20-f:prebuilt&ss=a...
Thanks very much, T.J. That is very helpful. I guess that would require user space to understand the RPC procedure, including single-hop and two-hop cases, and make the corresponding changes.
You pointed out the SurfaceFlinger cases, which are two hops. It seems that AI models are also using dma_heap, at least from what I have observed on MTK and Qualcomm phones. Likely, we need to understand those RPC relationships in userspace and make the corresponding changes. I assume AI models are a single-hop case?
Best Regards Barry
On Mon, May 18, 2026 at 3:19 PM Barry Song baohua@kernel.org wrote:
On Tue, May 19, 2026 at 5:17 AM T.J. Mercier tjmercier@google.com wrote: [...]
Thanks very much, T.J. That is very helpful. I guess that would require user space to understand the RPC procedure, including single-hop and two-hop cases, and make the corresponding changes.
Yes, this is solvable by having a policy in allocator services where the caller is implicitly charged, while also supporting cases where the RPC includes additional explicit information about who to charge. This needs security checks to prevent arbitrary remote charges at both the ioctl() level (SELinux charge_to from patch 4), and at the RPC level (not sure yet but maybe a private interface between system components and gralloc), so that only privileged components can initiate remote charges.
You pointed out the SurfaceFlinger cases, which are two hops. It seems that AI models are also using dma_heap, at least from what I have observed on MTK and Qualcomm phones. Likely, we need to understand those RPC relationships in userspace and make the corresponding changes. I assume AI models are a single-hop case?
It's currently a mix because AI model loading is largely controlled by vendor code right now. Some implementations use AHardwareBuffer_allocate, but that comes with unnecessary RPC overhead for the AI use case. So I think we should be trending towards direct allocations from dma-buf heaps because model loading time is important.
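The allocator-service policy described above (charge the caller implicitly, allow an explicit target only for privileged callers) can be sketched as a small decision function. Everything here is hypothetical: the `alloc_request` struct, the `caller_may_charge_remote` flag and `pick_charge_target()` are made up for illustration. In a real system the privilege decision would come from the RPC layer and would still be backed by the kernel-side charge_to check.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical view of one allocation RPC as seen by the allocator service. */
struct alloc_request {
	pid_t caller_pid;              /* PID reported by the IPC layer */
	pid_t explicit_charge_pid;     /* 0 if the RPC names no other target */
	bool caller_may_charge_remote; /* result of a service-level privilege check */
};

/*
 * Decide whose cgroup should be charged.
 * Returns 0 and stores the PID in *target, or -EPERM (leaving *target
 * untouched) if an unprivileged caller tries to redirect the charge.
 */
static int pick_charge_target(const struct alloc_request *req, pid_t *target)
{
	/* Default: the immediate caller is implicitly charged (single hop). */
	if (req->explicit_charge_pid == 0 ||
	    req->explicit_charge_pid == req->caller_pid) {
		*target = req->caller_pid;
		return 0;
	}

	/* Explicit remote charge (double hop): privileged callers only. */
	if (!req->caller_may_charge_remote)
		return -EPERM;

	*target = req->explicit_charge_pid;
	return 0;
}
```

With this shape, the direct and single-hop cases need no extra RPC fields, and only the double-hop path carries an explicit target.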
On Tue, May 12, 2026 at 12:14 PM Christian König christian.koenig@amd.com wrote:
On 5/12/26 11:10, Albert Esteve wrote:
[...]
Well that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
I'm just not sure if this is future proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.
Essentially the problem boils down to two limitations:
- a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups
- when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers.
Honestly, adding a hook to the fd-passing uAPI to manage charge transfers sounds like a promising solution requiring no uAPI changes. However, it still does not cover all paths, e.g., dup() or fork(). And shared memory sounds like a hard one to tackle, where deciding the best policy is more of a per-use-case thing and would probably require userspace configuration. All in all, charge_pid_fd covers a well-defined and immediately practical subset. The uAPI cost is small and the mechanism is explicit about what it does and doesn't solve. A general solution, if it ever converges, would likely supersede charge_pid_fd for most cases, which is a fine outcome if it solves the problem more completely.
Either way, if you have a specific approach in mind for solving any of the above limitations, I'd be happy to look into it further.
BR, Albert.
On the other hand it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side, just a few weeks ago, I saw yet another attempt by a driver to do some kind of driver-specific accounting to solve this. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
Regards, Christian.
Signed-off-by: Albert Esteve aesteve@redhat.com
---
 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8bdbc2e866430..824d269531eb1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1636,8 +1636,9 @@ The following nested keys are defined.
 		structures.
 
 	  dmabuf (npn)
-		Amount of memory used for exported DMA buffers allocated by the cgroup.
-		Stays with the allocating cgroup regardless of how the buffer is shared.
+		Amount of memory used for exported DMA buffers allocated by or on
+		behalf of the cgroup. Stays with the allocating cgroup regardless
+		of how the buffer is shared.
 
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ce02377f48908..23fb758b78297 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
 
-	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
-	mem_cgroup_put(dmabuf->memcg);
+	if (dmabuf->memcg) {
+		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
+					   PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+		mem_cgroup_put(dmabuf->memcg);
+	}
 
 	dmabuf->ops->release(dmabuf);
 
@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}
 
-	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
-				      GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto err_memcg;
-	}
-
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 
 	return dmabuf;
 
-err_memcg:
-	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ac5f8685a6494..ff6e259afcdc0 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,13 +7,17 @@
  */
 
 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
 
 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;
 
 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);
 
+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;
 
 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;
 
+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;
 
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *			handle to the allocated dma-buf
  * @fd_flags:		file descriptor flags used when allocating
  * @heap_flags:		flags passed to heap
+ * @charge_pid_fd:	optional pidfd of the process whose cgroup should be
+ *			charged for this allocation; 0 means charge the calling
+ *			process's cgroup
+ * @__padding:		reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };
 
 #define DMA_HEAP_IOC_MAGIC		'H'
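For readers checking the uAPI change, the extended structure can be mirrored in userspace to confirm its layout. The struct below is a local copy with the hypothetical name `dma_heap_allocation_data_rfc` (released headers do not carry `charge_pid_fd`); its field order follows the diff above, preceded by the existing `len` field. `init_alloc_request()` is likewise an illustrative helper, not part of the series. The sketch shows that the `__padding` member keeps the size a multiple of 8 bytes with no implicit padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the RFC layout; this is not the kernel header. */
struct dma_heap_allocation_data_rfc {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd; /* 0 means "charge the caller" in this RFC */
	uint32_t padding;       /* __padding in the patch; reserved, must be zero */
};

_Static_assert(sizeof(struct dma_heap_allocation_data_rfc) == 32,
	       "no implicit padding expected");
_Static_assert(offsetof(struct dma_heap_allocation_data_rfc, charge_pid_fd) == 24,
	       "new field follows heap_flags");

/*
 * Zero the whole request first so that fd and the reserved field are valid.
 * A non-positive charge_pidfd leaves charge_pid_fd at 0, which selects the
 * default "charge the caller" behaviour; fd 0 itself cannot name a target.
 */
static void init_alloc_request(struct dma_heap_allocation_data_rfc *req,
			       uint64_t len, uint32_t fd_flags, int charge_pidfd)
{
	memset(req, 0, sizeof(*req));
	req->len = len;
	req->fd_flags = fd_flags;
	if (charge_pidfd > 0)
		req->charge_pid_fd = (uint32_t)charge_pidfd;
}
```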
On Wed, May 13, 2026 at 5:41 AM Albert Esteve aesteve@redhat.com wrote:
On Tue, May 12, 2026 at 12:14 PM Christian König christian.koenig@amd.com wrote:
On 5/12/26 11:10, Albert Esteve wrote:
[...]
Well that doesn't look so bad; it at least seems to tackle the problem at hand for Android and some other embedded use cases.
I'm just not sure if this is future proof and will work for all use cases, e.g. cloud gaming, native context for automotive, etc.
Essentially the problem boils down to two limitations:
- a piece of memory can only be charged to one cgroup; the framework doesn't have a concept of charging shared memory to multiple groups
- when memory references in the form of file descriptors are passed between applications we have no way of changing the accounting to a different cgroup
The passing of the memory reference already has a well-defined uAPI, and if we could solve those two limitations we would not only solve the problem without introducing new uAPI (with potential new security risks) but also solve it for all other use cases that use file descriptors as well, e.g. memfd, accel and GPU drivers.
Honestly, adding a hook to the fd-passing uAPI to manage charge transfers sounds like a promising solution requiring no uAPI changes. However, it still does not cover all paths, e.g., dup() or fork(). And shared memory sounds like a hard one to tackle, where deciding the best policy is more of a per-use-case thing and would probably require userspace configuration.
I'm curious if anyone knows of a use case where FDs aren't involved at all? It's possible to fork() or clone() with only a dmabuf mapping and no FD. That sounds strange, and I'm not sure there's a real use case for transferring ownership with that approach, but I figured I'd at least pose the question.
All in all, charge_pid_fd covers a well-defined and immediately practical subset. The uAPI cost is small and the mechanism is explicit about what it does and doesn't solve. A general solution, if it ever converges, would likely supersede charge_pid_fd for most cases, which is a fine outcome if it solves the problem more completely.
Either way, if you have a specific approach in mind for solving any of the above limitations, I'd be happy to look into it further.
BR, Albert.
On the other hand, it is really nice to finally see this tackled for at least DMA-buf heaps. On the GPU side I saw yet another attempt at driver-specific accounting to solve this just a few weeks ago. And to be honest, such single-driver island approaches tend to break more often than they work correctly.
Regards, Christian.
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8bdbc2e866430..824d269531eb1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1636,8 +1636,9 @@ The following nested keys are defined.
 		structures.

 	  dmabuf (npn)
-		Amount of memory used for exported DMA buffers allocated by the cgroup.
-		Stays with the allocating cgroup regardless of how the buffer is shared.
+		Amount of memory used for exported DMA buffers allocated by or on
+		behalf of the cgroup. Stays with the allocating cgroup regardless
+		of how the buffer is shared.

 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ce02377f48908..23fb758b78297 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry)
 	 */
 	BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);

-	mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
-	mem_cgroup_put(dmabuf->memcg);
+	if (dmabuf->memcg) {
+		mem_cgroup_uncharge_dmabuf(dmabuf->memcg,
+					   PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+		mem_cgroup_put(dmabuf->memcg);
+	}

 	dmabuf->ops->release(dmabuf);

@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 		dmabuf->resv = resv;
 	}

-	dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
-	if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
-				      GFP_KERNEL)) {
-		ret = -ENOMEM;
-		goto err_memcg;
-	}
-
 	file->private_data = dmabuf;
 	file->f_path.dentry->d_fsdata = dmabuf;
 	dmabuf->file = file;
@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)

 	return dmabuf;

-err_memcg:
-	mem_cgroup_put(dmabuf->memcg);
 err_file:
 	fput(file);
 err_module:
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index ac5f8685a6494..ff6e259afcdc0 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,13 +7,17 @@
  */

 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");

 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;

 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);

+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;

 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;

+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);
+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *		handle to the allocated dma-buf
  * @fd_flags:	file descriptor flags used when allocating
  * @heap_flags:	flags passed to heap
+ * @charge_pid_fd: optional pidfd of the process whose cgroup should be
+ *		charged for this allocation; 0 means charge the calling
+ *		process's cgroup
+ * @__padding:	reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };

 #define DMA_HEAP_IOC_MAGIC		'H'
On Wed, May 13, 2026 at 6:39 PM T.J. Mercier <tjmercier@google.com> wrote:
On Wed, May 13, 2026 at 5:41 AM Albert Esteve <aesteve@redhat.com> wrote:
On Tue, May 12, 2026 at 12:14 PM Christian König <christian.koenig@amd.com> wrote:
On 5/12/26 11:10, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
I'm curious if anyone knows of a use case where FDs aren't involved at all? It's possible to fork() or clone() with only a dmabuf mapping and no FD. That sounds strange, and I'm not sure there's a real usecase for transferring ownership with that approach, but figured I'd at least pose the question.
Yeah, that's a good point. I don't really have a use case for fork() myself; I just thought of it as a possible gap or uncovered path.
On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to
Please be aware that pidfds come in two flavors: thread-group pidfds and thread-specific pidfds. Make sure that your API doesn't implicitly depend on this distinction not existing.
a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;

 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;

+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);

Will always get a thread-group leader pidfd and will fail if this is a thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to open a thread-specific pidfd.

+		if (IS_ERR(task))
+			return PTR_ERR(task);
+
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;
-- 2.53.0
On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to
Please be aware that pidfds come in two flavors: thread-group pidfds and thread-specific pidfds. Make sure that your API doesn't implicitly depend on this distinction not existing.
Hi Christian,
Memcg is not a controller that supports "thread mode", so all threads in a group should belong to the same memcg.
Would checking the flags returned by pidfd_get_pid() be the best way to explicitly check the pidfd type?
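For reference, the two flavors look like this from userspace. This is a minimal sketch, not part of the patch: rfc_pidfd_open(), open_process_pidfd() and open_thread_pidfd() are hypothetical helper names, the raw syscall is used because older libcs may lack a pidfd_open() wrapper, and the fallback defines are assumptions for toolchains whose headers predate these symbols. It does not answer which check the kernel side should perform.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumed fallback for old headers */
#endif
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL	/* value used by the kernel uapi header */
#endif

/* Thin wrapper around the raw syscall; returns -1 and sets errno on error. */
static int rfc_pidfd_open(pid_t pid, unsigned int flags)
{
	return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Thread-group pidfd: refers to the process as a whole. */
static int open_process_pidfd(pid_t tgid)
{
	return rfc_pidfd_open(tgid, 0);
}

/* Thread-specific pidfd: refers to one thread; needs PIDFD_THREAD support. */
static int open_thread_pidfd(pid_t tid)
{
	return rfc_pidfd_open(tid, PIDFD_THREAD);
}
```

The allocator example in the commit message uses pidfd_open(client_pid, 0), i.e. the thread-group flavor; a thread-specific pidfd would only reach the ioctl if the allocator opened it with PIDFD_THREAD.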
if (heap_allocation->charge_pid_fd) {task = pidfd_get_task(heap_allocation->charge_pid_fd, &pidfd_flags);Will always get a thread-group leader pidfd and will fail if this is a thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to open a thread-specific pidfd.
if (IS_ERR(task))return PTR_ERR(task);memcg = get_mem_cgroup_from_mm(task->mm);put_task_struct(task);}fd = dma_heap_buffer_alloc(heap, heap_allocation->len, heap_allocation->fd_flags,
heap_allocation->heap_flags);
heap_allocation->heap_flags,memcg);mem_cgroup_put(memcg); if (fd < 0) return fd;diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index 03c2b87cb1112..95d7688167b93 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size, if (max_order < orders[i]) continue; flags = order_flags[i];
if (mem_accounting)flags |= __GFP_ACCOUNT; page = alloc_pages(flags, orders[i]); if (!page) continue;diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h index a4cf716a49fa6..e02b0f8cbc6a1 100644 --- a/include/uapi/linux/dma-heap.h +++ b/include/uapi/linux/dma-heap.h @@ -29,6 +29,10 @@
handle to the allocated dma-buf- @fd_flags: file descriptor flags used when allocating
- @heap_flags: flags passed to heap
- @charge_pid_fd: optional pidfd of the process whose cgroup should be
charged for this allocation; 0 means charge the calling
process's cgroup*/
- @__padding: reserved, must be zero
- Provided by userspace as an argument to the ioctl
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data { __u32 fd; __u32 fd_flags; __u64 heap_flags;
__u32 charge_pid_fd;__u32 __padding;};
#define DMA_HEAP_IOC_MAGIC 'H'
-- 2.53.0
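As a side note on the uapi change in the last hunk, a small userspace sketch of the resulting layout may help reviewers. It mirrors the proposed struct under a hypothetical name (rfc_dma_heap_allocation_data) with fixed-width types, and assumes the fields are laid out exactly as in the diff; the RFC_/OLD_ macro names and rfc_ioctl_arg_size() are illustrative only and are not part of any header. Because _IOWR() encodes the argument size, the grown struct yields a different command value than the existing 24-byte one.

```c
#include <assert.h>		/* static_assert */
#include <stddef.h>
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOWR(), _IOC_SIZE() */

/* Layout of the ioctl argument before this patch, for comparison. */
struct old_dma_heap_allocation_data {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
};

/* Hypothetical mirror of the struct as proposed in the diff. */
struct rfc_dma_heap_allocation_data {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;	/* 0: charge the calling process's cgroup */
	uint32_t __padding;	/* reserved, must be zero */
};

/* 'H' is DMA_HEAP_IOC_MAGIC; 0x0 is the allocate command number. */
#define OLD_DMA_HEAP_IOCTL_ALLOC \
	_IOWR('H', 0x0, struct old_dma_heap_allocation_data)
#define RFC_DMA_HEAP_IOCTL_ALLOC \
	_IOWR('H', 0x0, struct rfc_dma_heap_allocation_data)

/* The two new 32-bit fields append 8 bytes without moving old fields. */
static_assert(sizeof(struct old_dma_heap_allocation_data) == 24,
	      "unexpected size for the pre-patch layout");
static_assert(sizeof(struct rfc_dma_heap_allocation_data) == 32,
	      "unexpected size for the proposed layout");
static_assert(offsetof(struct rfc_dma_heap_allocation_data,
		       charge_pid_fd) == 24,
	      "charge_pid_fd should start where the old struct ended");

/* Argument size encoded in the command value built from the new struct. */
static unsigned int rfc_ioctl_arg_size(void)
{
	return _IOC_SIZE(RFC_DMA_HEAP_IOCTL_ALLOC);
}
```

My understanding is that dma_heap_ioctl() dispatches on the command number and tolerates a size mismatch between user and kernel structs, so older binaries would keep working, but that is worth double-checking against the tree this series is based on.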
On 5/15/26 19:06, T.J. Mercier wrote:
On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
Add charge_pid_fd to struct dma_heap_allocation_data. When set to
Please be aware that pidfds come in two flavors:
thread-group pidfds and thread-specific pidfds. Make sure that your API doesn't implicitly depend on this distinction not existing.
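For readers less familiar with the two flavors mentioned here, a minimal userspace sketch follows. It assumes a kernel that implements pidfd_open() and, for the second flavor, the PIDFD_THREAD flag; the fallback definitions and the open_pidfd() helper are illustrative and not taken from the series.

```c
#define _GNU_SOURCE
#include <assert.h>		/* static_assert */
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: generic syscall number */
#endif
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL	/* value used by <linux/pidfd.h> */
#endif

/* Guards the assumption above in case a header defined it differently. */
static_assert(PIDFD_THREAD == O_EXCL, "PIDFD_THREAD is expected to be O_EXCL");

/*
 * Open a pidfd for id. With thread_specific == 0 the result is a
 * thread-group pidfd and id must name a process; otherwise PIDFD_THREAD
 * is passed and id may name any thread. Returns the fd, or -1 with errno
 * set (for example on kernels without the syscall or the flag).
 */
static int open_pidfd(pid_t id, int thread_specific)
{
	unsigned int flags = thread_specific ? PIDFD_THREAD : 0;

	return (int)syscall(SYS_pidfd_open, id, flags);
}
```

An allocator following the usage in this series would call open_pidfd(client_pid, 0), i.e. the thread-group flavor.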
Hi Christian,
Memcg is not a controller that supports "thread mode" so all threads in a group should belong to the same memcg.
BTW: Exactly that is the requirement automotive has with their native context use case.
The use case is that you have a daemon which has multiple threads, where each one is acting on behalf of some other process.
At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
Regards, Christian.
Checking the flags from pidfd_get_pid would be the best way for an explicit check of the pidfd type?
a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route all accounting through a single MEMCG_DMABUF path.
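The charging rule described in the two paragraphs above can be modelled in a few lines of userspace C. This is only a sketch of the decision logic, not kernel code: struct model_memcg stands in for struct mem_cgroup, the model_* names are made up for illustration, and a 4 KiB page size and a page-aligned length are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Stand-in for struct mem_cgroup; only tracks charged dma-buf pages. */
struct model_memcg {
	const char *name;
	unsigned long dmabuf_pages;
};

/*
 * Pick the cgroup to charge: an explicit target (resolved from
 * charge_pid_fd) wins; otherwise the caller is charged only when
 * mem_accounting is enabled; otherwise nothing is charged.
 */
static struct model_memcg *model_charge_target(struct model_memcg *charge_to,
					       struct model_memcg *caller,
					       bool mem_accounting)
{
	if (charge_to)
		return charge_to;
	return mem_accounting ? caller : NULL;
}

/* Charge len bytes to the selected cgroup and return it (or NULL). */
static struct model_memcg *model_alloc(struct model_memcg *charge_to,
				       struct model_memcg *caller,
				       bool mem_accounting, size_t len)
{
	struct model_memcg *memcg;

	/* The model only handles lengths that are already page-aligned. */
	assert(len % MODEL_PAGE_SIZE == 0);

	memcg = model_charge_target(charge_to, caller, mem_accounting);
	if (memcg)
		memcg->dmabuf_pages += len / MODEL_PAGE_SIZE;
	return memcg;
}
```

Note that in this model, as in the diff, an explicit target is charged even when mem_accounting is disabled; the module parameter only gates the fallback to the caller's cgroup.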
Usage examples:
Central allocator charging to a client at allocation time. The allocator knows the client's PID (e.g., from binder's sender_pid) and uses pidfd to attribute the charge:
	pid_t client_pid = txn->sender_pid;
	int pidfd = pidfd_open(client_pid, 0);

	struct dma_heap_allocation_data alloc = {
		.len = buffer_size,
		.fd_flags = O_RDWR | O_CLOEXEC,
		.charge_pid_fd = pidfd,
	};
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	close(pidfd);
	/* alloc.fd is now charged to client's cgroup */
Default allocation (no pidfd, mem_accounting=1). When charge_pid_fd is not set and the mem_accounting module parameter is enabled, the buffer is charged to the allocator's own cgroup:
	struct dma_heap_allocation_data alloc = {
		.len = buffer_size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	/* charged to current process's cgroup */
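The first usage example can be fleshed out into a helper with error handling, sketched below under a few assumptions: the struct is a local mirror of the proposed layout (released uapi headers do not have charge_pid_fd), pidfd_open() is invoked through syscall(), and the names rfc_dma_heap_allocation_data, RFC_DMA_HEAP_IOCTL_ALLOC and alloc_charged_to() are hypothetical. Since a charge_pid_fd of 0 means "not set", the helper also moves a pidfd that happens to land on descriptor 0.

```c
#define _GNU_SOURCE
#include <assert.h>		/* static_assert */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: generic syscall number */
#endif

/* Hypothetical mirror of the proposed uapi struct. */
struct rfc_dma_heap_allocation_data {
	uint64_t len;
	uint32_t fd;
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;
	uint32_t __padding;
};
static_assert(sizeof(struct rfc_dma_heap_allocation_data) == 32,
	      "mirror struct should match the proposed 32-byte layout");

#define RFC_DMA_HEAP_IOCTL_ALLOC \
	_IOWR('H', 0x0, struct rfc_dma_heap_allocation_data)

/*
 * Allocate len bytes from heap_fd and ask the kernel to charge them to
 * client_pid's cgroup. Returns the dma-buf fd, or -1 with errno set.
 */
static int alloc_charged_to(int heap_fd, pid_t client_pid, size_t len)
{
	struct rfc_dma_heap_allocation_data alloc;
	int pidfd, ret, saved;

	pidfd = (int)syscall(SYS_pidfd_open, client_pid, 0);
	if (pidfd < 0)
		return -1;

	/* charge_pid_fd == 0 means "not set", so move a pidfd off fd 0. */
	if (pidfd == 0) {
		int moved = fcntl(pidfd, F_DUPFD_CLOEXEC, 3);

		saved = errno;
		close(pidfd);
		errno = saved;
		if (moved < 0)
			return -1;
		pidfd = moved;
	}

	memset(&alloc, 0, sizeof(alloc));	/* also zeroes __padding */
	alloc.len = len;
	alloc.fd_flags = O_RDWR | O_CLOEXEC;
	alloc.charge_pid_fd = (uint32_t)pidfd;

	ret = ioctl(heap_fd, RFC_DMA_HEAP_IOCTL_ALLOC, &alloc);
	saved = errno;
	close(pidfd);	/* as in the example above, closed right after the ioctl */
	errno = saved;

	return ret < 0 ? -1 : (int)alloc.fd;
}
```

A caller would pass an open heap descriptor, for instance one obtained from /dev/dma_heap/system, and check the return value before using the buffer.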
Current limitations:
- Single-owner model: a dma-buf carries one memcg charge regardless of how many processes share it. This means only the first owner (and exporter) of the shared buffer bears the charge.
- Only memcg accounting is supported. While this makes sense for system heap buffers, other heaps (e.g., CMA heaps) will also require selective charging to the dmem controller.
Signed-off-by: Albert Esteve <aesteve@redhat.com>

 Documentation/admin-guide/cgroup-v2.rst |  5 ++--
 drivers/dma-buf/dma-buf.c               | 16 ++++---------
 drivers/dma-buf/dma-heap.c              | 42 ++++++++++++++++++++++++++++++---
 drivers/dma-buf/heaps/system_heap.c     |  2 --
 include/uapi/linux/dma-heap.h           |  6 +++++
 5 files changed, 53 insertions(+), 18 deletions(-)
On Mon, May 18, 2026 at 9:20 AM Christian König <christian.koenig@amd.com> wrote:
BTW: Exactly that is the requirement automotive has with their native context use case.
The use case is that you have a daemon which has multiple threads, where each one is acting on behalf of some other process.
At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
Hi Christian,
Thanks for sharing this automotive use case. If I understand correctly, the actual requirement is attributing dma-buf charges to the right client, not putting each daemon thread in a different cgroup? If so, the `charge_pid_fd` approach achieves this directly by passing the client's pidfd, without needing to add per-thread cgroup infrastructure.
On 5/18/26 14:50, Albert Esteve wrote:
Hi Christian,
Thanks for sharing this automotive use case. If I understand correctly, the actual requirement is attributing dma-buf charges to the right client, not putting each daemon thread in a different cgroup?
Nope, exactly that's the difference.
The thread acts as a filtering agent for both memory allocation and command submission on behalf of somebody else. The process on whose behalf the daemon acts can even be in a client VM, completely remote over some network, or even something like a microcontroller.
Everything the thread does regarding CPU time, GPU driver memory allocation, and resources like GPU processing and I/O time needs to be accounted to one client, which can be different for each thread of the process.
The only thing shared with the main process thread is CPU memory resources, e.g. malloc(), because that is basically just needed for housekeeping and is pretty much irrelevant for this kind of use case.
The problem is that you can't do that with cgroups at the moment, and unfortunately only the kernel has the information needed to do it.
So you end up defining tons of interfaces just to get the necessary information from the kernel into userspace, and then essentially duplicating in userspace the infrastructure that cgroups already provide in the kernel.
If so, the `charge_pid_fd` approach achieves this directly by passing the client's `pid_fd`, without needing to add per-thread cgroup infrastructure.
Well, it's already a massive improvement: we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
Doing that automatically for CPU and I/O time as well would just be nice to have.
Regards, Christian.
Regards, Christian.
Checking the flags from pidfd_get_pid would be the best way for an explicit check of the pidfd type?
a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route all accounting through a single MEMCG_DMABUF path.
Usage examples:
Central allocator charging to a client at allocation time. The allocator knows the client's PID (e.g., from binder's sender_pid) and uses pidfd to attribute the charge:
pid_t client_pid = txn->sender_pid; int pidfd = pidfd_open(client_pid, 0);
struct dma_heap_allocation_data alloc = { .len = buffer_size, .fd_flags = O_RDWR | O_CLOEXEC, .charge_pid_fd = pidfd, }; ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc); close(pidfd); /* alloc.fd is now charged to client's cgroup */
Default allocation (no pidfd, mem_accounting=1). When charge_pid_fd is not set and the mem_accounting module parameter is enabled, the buffer is charged to the allocator's own cgroup:
struct dma_heap_allocation_data alloc = { .len = buffer_size, .fd_flags = O_RDWR | O_CLOEXEC, }; ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc); /* charged to current process's cgroup */
Current limitations:
- Single-owner model: a dma-buf carries one memcg charge regardless of how many processes share it. Means only the first owner (and exporter) of the shared buffer bears the charge.
- Only memcg accounting supported. While this makes sense for system heap buffers, other heaps (e.g., CMA heaps) will require selectively charging also for the dmem controller.
Signed-off-by: Albert Esteve aesteve@redhat.com
Documentation/admin-guide/cgroup-v2.rst | 5 ++-- drivers/dma-buf/dma-buf.c | 16 ++++--------- drivers/dma-buf/dma-heap.c | 42 ++++++++++++++++++++++++++++++--- drivers/dma-buf/heaps/system_heap.c | 2 -- include/uapi/linux/dma-heap.h | 6 +++++ 5 files changed, 53 insertions(+), 18 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 8bdbc2e866430..824d269531eb1 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1636,8 +1636,9 @@ The following nested keys are defined. structures.
dmabuf (npn)
Amount of memory used for exported DMA buffers allocated by the cgroup.Stays with the allocating cgroup regardless of how the buffer is shared.
Amount of memory used for exported DMA buffers allocated by or onbehalf of the cgroup. Stays with the allocating cgroup regardlessof how the buffer is shared. workingset_refault_anon Number of refaults of previously evicted anonymous pages.diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index ce02377f48908..23fb758b78297 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -181,8 +181,11 @@ static void dma_buf_release(struct dentry *dentry) */ BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);mem_cgroup_put(dmabuf->memcg);
if (dmabuf->memcg) {mem_cgroup_uncharge_dmabuf(dmabuf->memcg,PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);mem_cgroup_put(dmabuf->memcg);} dmabuf->ops->release(dmabuf);@@ -764,13 +767,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) dmabuf->resv = resv; }
dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,GFP_KERNEL)) {ret = -ENOMEM;goto err_memcg;}file->private_data = dmabuf; file->f_path.dentry->d_fsdata = dmabuf; dmabuf->file = file;@@ -781,8 +777,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
return dmabuf;-err_memcg:
mem_cgroup_put(dmabuf->memcg);err_file: fput(file); err_module: diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c index ac5f8685a6494..ff6e259afcdc0 100644 --- a/drivers/dma-buf/dma-heap.c +++ b/drivers/dma-buf/dma-heap.c @@ -7,13 +7,17 @@ */
 #include <linux/cdev.h>
+#include <linux/cgroup.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-heap.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include <linux/err.h>
 #include <linux/export.h>
 #include <linux/list.h>
 #include <linux/nospec.h>
+#include <linux/pidfd.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
 #include <linux/xarray.h>
@@ -55,10 +59,12 @@ MODULE_PARM_DESC(mem_accounting,
 		 "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false).");
 
 static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
-				 u32 fd_flags,
-				 u64 heap_flags)
+				 u32 fd_flags, u64 heap_flags,
+				 struct mem_cgroup *charge_to)
 {
 	struct dma_buf *dmabuf;
+	unsigned int nr_pages;
+	struct mem_cgroup *memcg = charge_to;
 	int fd;
 
 	/*
@@ -73,6 +79,22 @@ static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
 	if (IS_ERR(dmabuf))
 		return PTR_ERR(dmabuf);
 
+	nr_pages = len / PAGE_SIZE;
+
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
+	if (memcg) {
+		if (!mem_cgroup_charge_dmabuf(memcg, nr_pages, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			dma_buf_put(dmabuf);
+			return -ENOMEM;
+		}
+		dmabuf->memcg = memcg;
+	}
+
 	fd = dma_buf_fd(dmabuf, fd_flags);
 	if (fd < 0) {
 		dma_buf_put(dmabuf);
@@ -102,6 +124,9 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 {
 	struct dma_heap_allocation_data *heap_allocation = data;
 	struct dma_heap *heap = file->private_data;
+	struct mem_cgroup *memcg = NULL;
+	struct task_struct *task;
+	unsigned int pidfd_flags;
 	int fd;
 
 	if (heap_allocation->fd)
@@ -113,9 +138,20 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data)
 	if (heap_allocation->heap_flags & ~DMA_HEAP_VALID_HEAP_FLAGS)
 		return -EINVAL;
 
+	if (heap_allocation->charge_pid_fd) {
+		task = pidfd_get_task(heap_allocation->charge_pid_fd,
+				      &pidfd_flags);

This will always expect a thread-group leader pidfd and will fail if it is given a thread-specific pidfd. pidfd_open(1234, PIDFD_THREAD) can be used to open a thread-specific pidfd.

+		if (IS_ERR(task))
+			return PTR_ERR(task);
+		memcg = get_mem_cgroup_from_mm(task->mm);
+		put_task_struct(task);
+	}
+
 	fd = dma_heap_buffer_alloc(heap, heap_allocation->len,
 				   heap_allocation->fd_flags,
-				   heap_allocation->heap_flags);
+				   heap_allocation->heap_flags,
+				   memcg);
+	mem_cgroup_put(memcg);
 	if (fd < 0)
 		return fd;
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index 03c2b87cb1112..95d7688167b93 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -385,8 +385,6 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 		flags = order_flags[i];
-		if (mem_accounting)
-			flags |= __GFP_ACCOUNT;
 		page = alloc_pages(flags, orders[i]);
 		if (!page)
 			continue;
diff --git a/include/uapi/linux/dma-heap.h b/include/uapi/linux/dma-heap.h
index a4cf716a49fa6..e02b0f8cbc6a1 100644
--- a/include/uapi/linux/dma-heap.h
+++ b/include/uapi/linux/dma-heap.h
@@ -29,6 +29,10 @@
  *			handle to the allocated dma-buf
  * @fd_flags:		file descriptor flags used when allocating
  * @heap_flags:		flags passed to heap
+ * @charge_pid_fd:	optional pidfd of the process whose cgroup should be
+ *			charged for this allocation; 0 means charge the calling
+ *			process's cgroup
+ * @__padding:		reserved, must be zero
  *
  * Provided by userspace as an argument to the ioctl
  */
@@ -37,6 +41,8 @@ struct dma_heap_allocation_data {
 	__u32 fd;
 	__u32 fd_flags;
 	__u64 heap_flags;
+	__u32 charge_pid_fd;
+	__u32 __padding;
 };
 
 #define DMA_HEAP_IOC_MAGIC		'H'
-- 
2.53.0
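As a reading aid for the uAPI change in this patch, the following userspace sketch shows how an allocator might fill the extended request before issuing the ioctl. The struct is re-declared locally under the hypothetical name `rfc_dma_heap_allocation_data`, because `charge_pid_fd` and `__padding` exist only in this RFC and not in released kernel headers; the helper `rfc_fill_alloc_request` is likewise an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Local mirror of struct dma_heap_allocation_data as proposed by this RFC.
 * Hypothetical name; the last two fields are not in released uapi headers.
 * The two new 32-bit fields grow the layout from 24 to 32 bytes.
 */
struct rfc_dma_heap_allocation_data {
	uint64_t len;
	uint32_t fd;		/* output: must be 0 on input */
	uint32_t fd_flags;
	uint64_t heap_flags;
	uint32_t charge_pid_fd;	/* 0: no remote charge requested */
	uint32_t __padding;	/* reserved, must be zero */
};

/*
 * Prepare a request. Because 0 means "no pidfd" in the proposed uAPI,
 * a pidfd that happens to be fd 0 cannot be expressed; this helper
 * treats any value <= 0 as "not set".
 */
static void rfc_fill_alloc_request(struct rfc_dma_heap_allocation_data *req,
				   uint64_t len, uint32_t fd_flags,
				   int client_pidfd)
{
	assert(req != NULL);

	/* Zero everything so fd, heap_flags and __padding start cleared. */
	memset(req, 0, sizeof(*req));
	req->len = len;
	req->fd_flags = fd_flags;
	req->charge_pid_fd = client_pidfd > 0 ? (uint32_t)client_pidfd : 0;
}
```

A caller would then pass the filled struct to DMA_HEAP_IOCTL_ALLOC on an open heap device; that step is omitted here because it needs a kernel carrying this series.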
On Mon, May 18, 2026 at 7:07 AM Christian König <christian.koenig@amd.com> wrote:
> On 5/18/26 14:50, Albert Esteve wrote:
>> On Mon, May 18, 2026 at 9:20 AM Christian König <christian.koenig@amd.com> wrote:
>>> On 5/15/26 19:06, T.J. Mercier wrote:
>>>> On Fri, May 15, 2026 at 6:53 AM Christian Brauner <brauner@kernel.org> wrote:
>>>>> On Tue, May 12, 2026 at 11:10:44AM +0200, Albert Esteve wrote:
>>>>>> On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
>>>>>>
>>>>>> Add charge_pid_fd to struct dma_heap_allocation_data. When set to
>>>>> Please be aware that pidfds come in two flavors: thread-group pidfds and thread-specific pidfds. Make sure that your API doesn't implicitly depend on this distinction not existing.
>>>> Hi Christian,
>>>> Memcg is not a controller that supports "thread mode" so all threads in a group should belong to the same memcg.
>>> BTW: Exactly that is the requirement automotive has with their native context use case.
>>> The use case is that you have a daemon with multiple threads, where each one is acting on behalf of some other process.
>>> At the moment we basically say they are simply not using cgroups for that use case, but it would be really nice if we could handle that as well.
>>> Summarizing the requirement of that use case: You need a different cgroup for each thread of a process.
>> Hi Christian,
>> Thanks for sharing this automotive use case. If I understand correctly, the actual requirement is attributing dma-buf charges to the right client, not putting each daemon thread in a different cgroup?
> Nope, exactly that's the difference.
> The thread acts as a filtering agent for both memory allocation and command submission for somebody else; the process on whose behalf the daemon acts can even be in a client VM, completely remote over some network, or even something like a microcontroller.
> Everything the thread does regarding CPU time, GPU driver memory allocation, as well as resources like GPU processing and I/O time etc. needs to be accounted to one client, which can be different for each thread of the process.
> The only thing which is shared with the main process thread is CPU memory resources, e.g. malloc(), because that is basically just needed for housekeeping and pretty much irrelevant for this kind of use case.
> The problem is that you can't do that with cgroups at the moment, but unfortunately only the kernel has the information you need to do this.
> So what you end up with is defining tons of interfaces just to get the necessary information from the kernel into userspace, and then essentially duplicating in userspace the same infrastructure cgroup provides in the kernel.
>> If so, the `charge_pid_fd` approach achieves this directly by passing the client's `pid_fd`, without needing to add per-thread cgroup infrastructure.
> Well, it's already a massive improvement: we could basically stop doing the whole duplication part for the GPU driver stack and just use cgroups for this part.
> Doing that automatically for CPU and I/O time would just be nice to have additionally.
> Regards, Christian.
Hopefully I'm following correctly here... So you are duplicating the GPU driver stack to achieve remote accounting on a per-thread basis? Does this mean for GPU allocations you currently have some GFP_ACCOUNT magic in your driver to attribute GPU memory to the correct remote client? So this series would close the gap for dma-buf allocations, but what about private GPU driver memory allocated on behalf of a client?
On 5/19/26 01:39, T.J. Mercier wrote:
> Hopefully I'm following correctly here... So you are duplicating the GPU driver stack to achieve remote accounting on a per-thread basis?
Not quite, we are duplicating in userspace the handling that cgroups provide in the kernel.
For this, memory usage information as well as execution times of the GPU kernel driver are exposed in fdinfo, for example.
> Does this mean for GPU allocations you currently have some GFP_ACCOUNT magic in your driver to attribute GPU memory to the correct remote client?
No, we just expose what the kernel driver has allocated for itself, e.g. page tables, buffers etc.
When userspace allocates something using memfd_create(), for example, we just ignore that.
> So this series would close the gap for dma-buf allocations, but what about private GPU driver memory allocated on behalf of a client?
Well, we would need a cgroup which isn't associated with any process, against which we could charge the GPU driver allocations.
But good point, charging against a pid wouldn't work in this use case.
Regards, Christian.
On Tue, May 19, 2026 at 9:20 AM Christian König <christian.koenig@amd.com> wrote:
> On 5/19/26 01:39, T.J. Mercier wrote:
>> Hopefully I'm following correctly here... So you are duplicating the GPU driver stack to achieve remote accounting on a per-thread basis?
> Not quite, we are duplicating in userspace the handling that cgroups provide in the kernel.
> For this, memory usage information as well as execution times of the GPU kernel driver are exposed in fdinfo, for example.
>> Does this mean for GPU allocations you currently have some GFP_ACCOUNT magic in your driver to attribute GPU memory to the correct remote client?
> No, we just expose what the kernel driver has allocated for itself, e.g. page tables, buffers etc.
> When userspace allocates something using memfd_create(), for example, we just ignore that.
>> So this series would close the gap for dma-buf allocations, but what about private GPU driver memory allocated on behalf of a client?
> Well, we would need a cgroup which isn't associated with any process, against which we could charge the GPU driver allocations.
I think I better understand your framing for this now. Thanks again for taking the time to explain.
I was looking for a way to pass a cgroup around to do the charge. I found that `struct cgroup *cgroup_get_from_fd(int fd)` already exists among the symbols cgroup provides, and it handles fds of cgroup directories.
So here's an idea...
Rename charge_pid_fd to charge_fd:
- If it is a pidfd (`!IS_ERR(pidfd_pid(fget(charge_fd)))`), then we do what we're already doing here.
- If it is a cgroup fd (`!IS_ERR(cgroup_get_from_fd(charge_fd))`), then we charge to that cgroup.
We could also add an ioctl for the generic fd path, similar to what we have for dma-buf heaps, or have a new flavour of memfd_create:
```
memfd_create2(name, flags, charge_fd);
```
The transfer ioctl could also be made generic to accept both pidfds and cgroup fds.
For this series we could move forward as is, and make the generic solution a follow-up series, knowing that the field can be reused for cgroup fds.
> But good point, charging against a pid wouldn't work in this use case.
> Regards, Christian.
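The charge_fd idea floated in this reply amounts to a dispatch on what kind of object the fd refers to. The toy model below illustrates only that dispatch in plain userspace C: the "fd table" is an ordinary array, and every name in it (`fd_kind`, `charge_via`, `resolve_charge_fd`) is hypothetical. It does not call, or claim to reproduce, pidfd_pid() or cgroup_get_from_fd().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Kind of object behind an fd in this toy model (hypothetical). */
enum fd_kind { FD_KIND_OTHER, FD_KIND_PIDFD, FD_KIND_CGROUP_DIR };

/* How the charge target would be resolved (hypothetical). */
enum charge_via { CHARGE_VIA_TASK, CHARGE_VIA_CGROUP };

/*
 * table[i] describes what fd i refers to. Returns 0 and sets *out on
 * success, -EBADF for an fd outside the table, and -EINVAL for an fd
 * that is neither a pidfd nor a cgroup directory.
 */
static int resolve_charge_fd(const enum fd_kind *table, size_t nfds,
			     int charge_fd, enum charge_via *out)
{
	assert(table != NULL && out != NULL);

	if (charge_fd < 0 || (size_t)charge_fd >= nfds)
		return -EBADF;

	switch (table[charge_fd]) {
	case FD_KIND_PIDFD:
		/* Existing behaviour: charge the memcg of the target task. */
		*out = CHARGE_VIA_TASK;
		return 0;
	case FD_KIND_CGROUP_DIR:
		/* Proposed extension: charge the cgroup directly. */
		*out = CHARGE_VIA_CGROUP;
		return 0;
	default:
		return -EINVAL;
	}
}
```

The point of the model is that a single field can carry either kind of fd without a separate flag, at the cost of the kernel having to probe the fd type.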
On Tue, May 19, 2026 at 12:19 AM Christian König <christian.koenig@amd.com> wrote:
> On 5/19/26 01:39, T.J. Mercier wrote:
>> Hopefully I'm following correctly here... So you are duplicating the GPU driver stack to achieve remote accounting on a per-thread basis?
> Not quite, we are duplicating in userspace the handling that cgroups provide in the kernel.
> For this, memory usage information as well as execution times of the GPU kernel driver are exposed in fdinfo, for example.
Oh I see, thanks.
>> Does this mean for GPU allocations you currently have some GFP_ACCOUNT magic in your driver to attribute GPU memory to the correct remote client?
> No, we just expose what the kernel driver has allocated for itself, e.g. page tables, buffers etc.
> When userspace allocates something using memfd_create(), for example, we just ignore that.
>> So this series would close the gap for dma-buf allocations, but what about private GPU driver memory allocated on behalf of a client?
> Well, we would need a cgroup which isn't associated with any process, against which we could charge the GPU driver allocations.
> But good point, charging against a pid wouldn't work in this use case.
It would be pretty low overhead to put a process doing while(1) pause(); in a separate cgroup for this purpose, but I guess an fd for the actual cgroup would be a little cleaner in this case.
> Regards, Christian.
On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> On embedded platforms a central process often allocates dma-buf memory on behalf of client applications. Without a way to attribute the charge to the requesting client's cgroup, the cost lands on the allocator, making per-cgroup memory limits ineffective for the actual consumers.
>
> Add charge_pid_fd to struct dma_heap_allocation_data. When set to a valid pidfd, DMA_HEAP_IOCTL_ALLOC resolves the target task's memcg and charges the buffer there via mem_cgroup_charge_dmabuf() inside dma_heap_buffer_alloc(). Without charge_pid_fd, and with the mem_accounting module parameter enabled, the buffer is charged to the allocator's own cgroup.
>
> Additionally, commit 3c227be90659 ("dma-buf: system_heap: account for system heap allocation in memcg") adds __GFP_ACCOUNT to system-heap page allocations. Keeping __GFP_ACCOUNT would charge the same pages twice (once to kmem, once to MEMCG_DMABUF), thus remove it and route all accounting through a single MEMCG_DMABUF path.
> [...]
> -		if (mem_accounting)
> -			flags |= __GFP_ACCOUNT;
Hi Albert,
would it be better to move this and its description to patch 1? It looks like patch 1 already introduces the double accounting changes, and patch 2 is mainly just supporting remote charging.
Also, mem_accounting is only used by system_heap.c; has this patchset also eliminated its need?
Thanks
Barry
On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
>> [...]
>> -		if (mem_accounting)
>> -			flags |= __GFP_ACCOUNT;
> Hi Albert,
> would it be better to move this and its description to patch 1? It looks like patch 1 already introduces the double accounting changes, and patch 2 is mainly just supporting remote charging.
Hi Barry,
Thanks for looking into this series! Yes, in my head I was trying to keep patch 1, which was taken from a previous, different series, and then diverge from it starting with patch 2. This would clarify the difference between the two. But I can see it just added some confusion (for example, patch 1 charges on dma_buf_export() and then it is moved to dma_heap_buffer_alloc() in patch 2). I will reorganize it better for the next version, including your suggestion.
> Also, mem_accounting is only used by system_heap.c; has this patchset also eliminated its need?
No, mem_accounting is still handled in this patch for the general case where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
+	if (memcg)
+		css_get(&memcg->css);
+	else if (mem_accounting)
+		memcg = get_mem_cgroup_from_mm(current->mm);
> Thanks
> Barry
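As a reading aid for the exchange above, the precedence Albert describes (an explicit remote target first, then the mem_accounting fallback to the caller, otherwise no charge) can be modelled in plain userspace C. This is only a sketch of the decision order in the quoted snippet: `pick_charge_target`, `charge_pages` and the enum are hypothetical names, not kernel symbols, and the page size is passed in as a parameter instead of using PAGE_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Outcome of the charge-target decision (hypothetical model). */
enum charge_target {
	CHARGE_NONE,	/* buffer is not charged to any memcg */
	CHARGE_CALLER,	/* charged to the allocating process's memcg */
	CHARGE_REMOTE,	/* charged to the memcg resolved from charge_pid_fd */
};

/*
 * Same ordering as the quoted hunk: a memcg resolved from
 * charge_pid_fd wins; otherwise the caller is charged only when the
 * mem_accounting module parameter is enabled.
 */
static enum charge_target pick_charge_target(bool have_remote_memcg,
					     bool mem_accounting)
{
	if (have_remote_memcg)
		return CHARGE_REMOTE;
	if (mem_accounting)
		return CHARGE_CALLER;
	return CHARGE_NONE;
}

/* Models nr_pages = len / PAGE_SIZE from the patch. */
static size_t charge_pages(size_t len, size_t page_size)
{
	assert(page_size != 0);
	return len / page_size;
}
```

Note that in this ordering a remote charge happens even when mem_accounting is false, which is the asymmetry Barry raises later in the thread.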
On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
> On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
>> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> Thanks for looking into this series! Yes, in my head I was trying to keep patch 1, which was taken from a previous, different series, and then diverge from it starting with patch 2. This would clarify the difference between the two. But I can see it just added some confusion (for example, patch 1 charges on dma_buf_export() and then it is moved to dma_heap_buffer_alloc() in patch 2). I will reorganize it better for the next version, including your suggestion.
Yep, I understand the situation now. I also understand that you were referring to T.J.'s patch, which caused some back-and-forth confusion for readers when reading patches 1 and 2.
>> Also, mem_accounting is only used by system_heap.c; has this patchset also eliminated its need?
> No, mem_accounting is still handled in this patch for the general case where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
> +	if (memcg)
> +		css_get(&memcg->css);
> +	else if (mem_accounting)
> +		memcg = get_mem_cgroup_from_mm(current->mm);
I see. What feels a bit odd to me is that mem_accounting could either be dropped (with unconditional charging), or it should cover both remote and local charge cases.
I don't have a strong opinion here; it just feels a bit strange, since its description is quite generic for memcg:
"Enable cgroup-based memory accounting for dma-buf heap allocations (default=false)."
Best Regards
Barry
On Mon, May 18, 2026 at 3:43 PM Barry Song <baohua@kernel.org> wrote:
> On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
>> On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
>>> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
>> Thanks for looking into this series! Yes, in my head I was trying to keep patch 1, which was taken from a previous, different series, and then diverge from it starting with patch 2. This would clarify the difference between the two. But I can see it just added some confusion (for example, patch 1 charges on dma_buf_export() and then it is moved to dma_heap_buffer_alloc() in patch 2). I will reorganize it better for the next version, including your suggestion.
> Yep, I understand the situation now. I also understand that you were referring to T.J.'s patch, which caused some back-and-forth confusion for readers when reading patches 1 and 2.
Albert, please don't feel obligated to keep my patch intact if integrating it into other patches simplifies the series.
>>> Also, mem_accounting is only used by system_heap.c; has this patchset also eliminated its need?
>> No, mem_accounting is still handled in this patch for the general case where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
>> +	if (memcg)
>> +		css_get(&memcg->css);
>> +	else if (mem_accounting)
>> +		memcg = get_mem_cgroup_from_mm(current->mm);
> I see. What feels a bit odd to me is that mem_accounting could either be dropped (with unconditional charging), or it should cover both remote and local charge cases.
> I don't have a strong opinion here; it just feels a bit strange, since its description is quite generic for memcg:
> "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false)."
> Best Regards
> Barry
On Tue, May 19, 2026 at 12:43 AM Barry Song <baohua@kernel.org> wrote:
> On Mon, May 18, 2026 at 8:16 PM Albert Esteve <aesteve@redhat.com> wrote:
>> On Sat, May 16, 2026 at 9:37 AM Barry Song <baohua@kernel.org> wrote:
>>> On Tue, May 12, 2026 at 5:18 PM Albert Esteve <aesteve@redhat.com> wrote:
> Yep, I understand the situation now. I also understand that you were referring to T.J.'s patch, which caused some back-and-forth confusion for readers when reading patches 1 and 2.
>>> Also, mem_accounting is only used by system_heap.c; has this patchset also eliminated its need?
>> No, mem_accounting is still handled in this patch for the general case where no `charge_pid_fd` is used. See dma_heap_buffer_alloc() code:
>> +	if (memcg)
>> +		css_get(&memcg->css);
>> +	else if (mem_accounting)
>> +		memcg = get_mem_cgroup_from_mm(current->mm);
> I see. What feels a bit odd to me is that mem_accounting could either be dropped (with unconditional charging), or it should cover both remote and local charge cases.
Good point. If I understand correctly, looking at patch [1] that introduced the flag, the shared buffer caveats mentioned there are not yet covered by this approach, so the flag should stay. I will make it consistent and cover both remote and local charge cases.
[1] https://lore.kernel.org/all/20260116-dmabuf-heap-system-memcg-v3-1-ecc6b62cc...
> I don't have a strong opinion here; it just feels a bit strange, since its description is quite generic for memcg:
> "Enable cgroup-based memory accounting for dma-buf heap allocations (default=false)."
> Best Regards
> Barry
DMA_HEAP_IOCTL_ALLOC accepts a charge_pid_fd field that, when set, causes the allocation to be charged to an arbitrary process's cgroup rather than the caller's.
Without an access-control point, any process that holds a handle to a dma-heap device node can charge unlimited memory to any other process's cgroup, potentially exhausting that cgroup's limit and triggering OOM kills independent of the victim's own activity or privileges.
Add security_dma_heap_alloc(), called in dma_heap_ioctl_allocate() when charge_pid_fd refers to another process. The hook receives the credentials of the allocating process (from) and the credentials of the process whose cgroup will be charged (to), giving security modules a controlled enforcement point for cross-cgroup dma-buf attribution policy.
When CONFIG_SECURITY is not set the hook compiles to an inline returning 0, adding no overhead to the fast path.
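The stub pattern can be sketched outside the kernel as follows. This is a userspace mock, not kernel code: struct cred is a stand-in, MOCK_CONFIG_SECURITY is a made-up macro playing the role of CONFIG_SECURITY, and charge_allowed() is a hypothetical caller that mirrors how the ioctl path propagates the hook's return value.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct cred; only used to make this compile. */
struct cred {
	int sid;
};

#ifdef MOCK_CONFIG_SECURITY
/* With the security framework built in, the hook is a real function. */
int security_dma_heap_alloc(const struct cred *from, const struct cred *to);
#else
/* Without it, the hook collapses to an inline stub that always allows. */
static inline int security_dma_heap_alloc(const struct cred *from,
					  const struct cred *to)
{
	(void)from;
	(void)to;
	return 0;
}
#endif

/*
 * Hypothetical call site: a non-zero return from the hook aborts the
 * allocation and is passed back to the caller unchanged.
 */
static int charge_allowed(const struct cred *from, const struct cred *to)
{
	int ret = security_dma_heap_alloc(from, to);

	if (ret)
		return ret;
	return 0;
}
```

Because the stub is a static inline returning a constant, the compiler can drop the call and the error branch entirely when the mock macro is undefined.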
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 drivers/dma-buf/dma-heap.c    | 12 +++++++++++-
 include/linux/lsm_hook_defs.h |  1 +
 include/linux/security.h      |  7 +++++++
 security/security.c           | 16 ++++++++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c index ff6e259afcdc0..e8ffb1031955e 100644 --- a/drivers/dma-buf/dma-heap.c +++ b/drivers/dma-buf/dma-heap.c @@ -18,6 +18,7 @@ #include <linux/list.h> #include <linux/nospec.h> #include <linux/pidfd.h> +#include <linux/security.h> #include <linux/syscalls.h> #include <linux/uaccess.h> #include <linux/xarray.h> @@ -122,12 +123,13 @@ static int dma_heap_open(struct inode *inode, struct file *file)
static long dma_heap_ioctl_allocate(struct file *file, void *data) { + const struct cred *tcred; struct dma_heap_allocation_data *heap_allocation = data; struct dma_heap *heap = file->private_data; struct mem_cgroup *memcg = NULL; struct task_struct *task; unsigned int pidfd_flags; - int fd; + int fd, ret;
if (heap_allocation->fd) return -EINVAL; @@ -143,6 +145,14 @@ static long dma_heap_ioctl_allocate(struct file *file, void *data) if (IS_ERR(task)) return PTR_ERR(task);
+ tcred = get_task_cred(task); + ret = security_dma_heap_alloc(current_cred(), tcred); + put_cred(tcred); + if (ret) { + put_task_struct(task); + return ret; + } + memcg = get_mem_cgroup_from_mm(task->mm); put_task_struct(task); } diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index 2b8dfb35caed3..6a91656f97e1e 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -43,6 +43,7 @@ LSM_HOOK(int, 0, capset, struct cred *new, const struct cred *old, const kernel_cap_t *permitted) LSM_HOOK(int, 0, capable, const struct cred *cred, struct user_namespace *ns, int cap, unsigned int opts) +LSM_HOOK(int, 0, dma_heap_alloc, const struct cred *from, const struct cred *to) LSM_HOOK(int, 0, quotactl, int cmds, int type, int id, const struct super_block *sb) LSM_HOOK(int, 0, quota_on, struct dentry *dentry) LSM_HOOK(int, 0, syslog, int type) diff --git a/include/linux/security.h b/include/linux/security.h index 41d7367cf4036..f1dad1eabe754 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -350,6 +350,7 @@ int security_capable(const struct cred *cred, struct user_namespace *ns, int cap, unsigned int opts); +int security_dma_heap_alloc(const struct cred *from, const struct cred *to); int security_quotactl(int cmds, int type, int id, const struct super_block *sb); int security_quota_on(struct dentry *dentry); int security_syslog(int type); @@ -701,6 +702,12 @@ static inline int security_capable(const struct cred *cred, return cap_capable(cred, ns, cap, opts); }
+static inline int security_dma_heap_alloc(const struct cred *from, + const struct cred *to) +{ + return 0; +} + static inline int security_quotactl(int cmds, int type, int id, const struct super_block *sb) { diff --git a/security/security.c b/security/security.c index 4e999f0236516..4adacef73c507 100644 --- a/security/security.c +++ b/security/security.c @@ -660,6 +660,22 @@ int security_capable(const struct cred *cred, return call_int_hook(capable, cred, ns, cap, opts); }
+/** + * security_dma_heap_alloc() - Check if cross-cgroup dma-heap charging is allowed + * @from: credentials of the allocating process + * @to: credentials of the process to charge + * + * Check whether the process with credentials @from is allowed to allocate + * dma-heap memory and charge it to the cgroup of the process with credentials + * @to. + * + * Return: Returns 0 if permission is granted. + */ +int security_dma_heap_alloc(const struct cred *from, const struct cred *to) +{ + return call_int_hook(dma_heap_alloc, from, to); +} + /** * security_quotactl() - Check if a quotactl() syscall is allowed for this fs * @cmds: commands
The security_dma_heap_alloc() hook allows security modules to control which processes may charge dma-buf allocations to another process's cgroup via the charge_pid_fd field of DMA_HEAP_IOCTL_ALLOC. Without a policy implementation, the hook is a no-op and the restriction is not enforced.
On SELinux-managed systems any domain with access to a dma-heap device node can therefore exhaust another cgroup's memory budget without restriction.
Implement selinux_dma_heap_alloc() using avc_has_perm() with a new dma_heap object class and a charge_to permission. Policy authors can then grant cross-cgroup charging selectively, for example:
allow allocator_app_t client_app_t:dma_heap charge_to;
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 security/selinux/hooks.c            | 7 +++++++
 security/selinux/include/classmap.h | 1 +
 2 files changed, 8 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 0f704380a8c81..ea1f410b9f619 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2189,6 +2189,12 @@ static int selinux_capable(const struct cred *cred, struct user_namespace *ns, return cred_has_capability(cred, cap, opts, ns == &init_user_ns); }
+static int selinux_dma_heap_alloc(const struct cred *from, const struct cred *to) +{ + return avc_has_perm(cred_sid(from), cred_sid(to), + SECCLASS_DMA_HEAP, DMA_HEAP__CHARGE_TO, NULL); +} + static int selinux_quotactl(int cmds, int type, int id, const struct super_block *sb) { const struct cred *cred = current_cred(); @@ -7541,6 +7547,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = { LSM_HOOK_INIT(capget, selinux_capget), LSM_HOOK_INIT(capset, selinux_capset), LSM_HOOK_INIT(capable, selinux_capable), + LSM_HOOK_INIT(dma_heap_alloc, selinux_dma_heap_alloc), LSM_HOOK_INIT(quotactl, selinux_quotactl), LSM_HOOK_INIT(quota_on, selinux_quota_on), LSM_HOOK_INIT(syslog, selinux_syslog), diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 90cb61b164256..d232f7808f6b8 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -181,6 +181,7 @@ const struct security_class_mapping secclass_map[] = { { "user_namespace", { "create", NULL } }, { "memfd_file", { COMMON_FILE_PERMS, "execute_no_trans", "entrypoint", NULL } }, + { "dma_heap", { "charge_to", NULL } }, /* last one */ { NULL, {} } };
On May 12, 2026 Albert Esteve aesteve@redhat.com wrote:
The security_dma_heap_alloc() hook allows security modules to control which processes may charge dma-buf allocations to another process's cgroup via the charge_pid_fd field of DMA_HEAP_IOCTL_ALLOC. Without a policy implementation, the hook is a no-op and the restriction is not enforced.
On SELinux-managed systems any domain with access to a dma-heap device node can therefore exhaust another cgroup's memory budget without restriction.
Implement selinux_dma_heap_alloc() using avc_has_perm() with a new dma_heap object class and a charge_to permission. Policy authors can then grant cross-cgroup charging selectively, for example:
allow allocator_app_t client_app_t:dma_heap charge_to;
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 security/selinux/hooks.c            | 7 +++++++
 security/selinux/include/classmap.h | 1 +
 2 files changed, 8 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 0f704380a8c81..ea1f410b9f619 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2189,6 +2189,12 @@ static int selinux_capable(const struct cred *cred, struct user_namespace *ns, return cred_has_capability(cred, cap, opts, ns == &init_user_ns); }
+static int selinux_dma_heap_alloc(const struct cred *from, const struct cred *to) +{ + return avc_has_perm(cred_sid(from), cred_sid(to), + SECCLASS_DMA_HEAP, DMA_HEAP__CHARGE_TO, NULL); +} + static int selinux_quotactl(int cmds, int type, int id, const struct super_block *sb) { const struct cred *cred = current_cred(); @@ -7541,6 +7547,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = { LSM_HOOK_INIT(capget, selinux_capget), LSM_HOOK_INIT(capset, selinux_capset), LSM_HOOK_INIT(capable, selinux_capable), + LSM_HOOK_INIT(dma_heap_alloc, selinux_dma_heap_alloc), LSM_HOOK_INIT(quotactl, selinux_quotactl), LSM_HOOK_INIT(quota_on, selinux_quota_on), LSM_HOOK_INIT(syslog, selinux_syslog), diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 90cb61b164256..d232f7808f6b8 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -181,6 +181,7 @@ const struct security_class_mapping secclass_map[] = { { "user_namespace", { "create", NULL } }, { "memfd_file", { COMMON_FILE_PERMS, "execute_no_trans", "entrypoint", NULL } }, + { "dma_heap", { "charge_to", NULL } }, /* last one */ { NULL, {} } };
While we have seen some one-off patches to add specific resource/cgroups controls in the past, much like this one, we've yet to see a patchset that provides a more comprehensive set of resource/cgroup access controls for SELinux.
I'm not opposed to a patch like this, but I would like to see it as part of a larger effort to introduce access controls across all of the existing cgroup control points where it makes sense. In other words, let's see a design for cgroup access controls so that we can ensure we have something that is meaningful and makes sense from a policy developer's perspective.
-- paul-moore.com
Add tests for the new charge_pid_fd field in struct dma_heap_allocation_data.
When the charge_pid_fd feature is absent (unpatched kernel), the probe in pidfd_alloc_supported() detects this and the tests are skipped gracefully.
Add vmtest.sh similar to other subsystem suites, to orchestrate building the selftests (optionally with a freshly compiled kernel) inside a virtme-ng VM, so the tests can be run without modifying the host system. Add a config fragment with required Kconfig symbols.
Also add test_memcg_dmabuf() to the existing test_memcontrol suite to verify end-to-end cross-cgroup accounting: a parent process opens a pidfd for a child in a separate cgroup, allocates a dma-buf via DMA_HEAP_IOCTL_ALLOC with that pidfd, and asserts that memory.stat dmabuf in the child's cgroup reflects the allocation. If the dmabuf key is missing (unpatched kernel) or /dev/dma_heap/system is absent, the test is skipped.
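The memory.stat check boils down to finding a "key value" line in a flat text file. A minimal, self-contained sketch of that lookup is shown below; stat_key_value() is a hypothetical helper written for this note and is not the cg_read_key_long() used by the selftest. As in the test, the key is passed with a trailing space ("dmabuf ") so that a longer key sharing the same prefix is not matched by accident.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Scan a memory.stat-style buffer ("key value\n" per line) and return the
 * value of the first line that starts with key, or -1 if no line matches.
 * Matching is anchored at the start of each line.
 */
static long stat_key_value(const char *stat, const char *key)
{
	size_t klen = strlen(key);
	const char *line = stat;

	while (line && *line) {
		if (!strncmp(line, key, klen))
			return strtol(line + klen, NULL, 10);

		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	return -1;
}
```

A return of -1 corresponds to the "key is missing" case, which the test treats as an unpatched kernel and skips.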
Assisted-by: Claude:claude-sonnet-4-6 Cursor
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 tools/testing/selftests/cgroup/Makefile            |   2 +-
 tools/testing/selftests/cgroup/test_memcontrol.c   | 143 +++++++++++++-
 tools/testing/selftests/dmabuf-heaps/config        |   1 +
 tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
 tools/testing/selftests/dmabuf-heaps/vmtest.sh     | 205 +++++++++++++++++++++
 5 files changed, 473 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index e01584c2189ac..9edfc9f1de5c4 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -1,5 +1,5 @@ # SPDX-License-Identifier: GPL-2.0 -CFLAGS += -Wall -pthread +CFLAGS += -Wall -pthread $(KHDR_INCLUDES)
all: ${HELPER_PROGS}
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c index b43da9bc20c49..b6a228407530f 100644 --- a/tools/testing/selftests/cgroup/test_memcontrol.c +++ b/tools/testing/selftests/cgroup/test_memcontrol.c @@ -19,9 +19,17 @@ #include <errno.h> #include <sys/mman.h>
+#include <linux/dma-heap.h> +#include <signal.h> +#include <sys/ioctl.h> + +#include "../pidfd/pidfd.h" #include "kselftest.h" #include "cgroup_util.h"
+#define DMA_HEAP_SYSTEM "/dev/dma_heap/system" +#define ONE_MEG (1024 * 1024) + #define MEMCG_SOCKSTAT_WAIT_RETRIES 30
static bool has_localevents; @@ -1762,6 +1770,125 @@ static int test_memcg_inotify_delete_dir(const char *root) return ret; }
+static int memcg_dmabuf_child(const char *cgroup, void *arg) +{ + pause(); + return 0; +} + +/* + * This test allocates a dma-buf via DMA_HEAP_IOCTL_ALLOC with a pidfd + * pointing to a child process in a separate cgroup, then checks that + * memory.stat[dmabuf] in the child's cgroup rises by the allocation size + * and returns to zero after the buffer fd is closed. + */ +static int test_memcg_dmabuf(const char *root) +{ + char *parent = NULL, *child_cg = NULL; + int ret = KSFT_FAIL; + int heap_fd = -1, dmabuf_fd = -1, pidfd = -1; + pid_t child_pid; + int child_status; + long dmabuf_stat; + struct dma_heap_allocation_data alloc = { + .len = ONE_MEG, + .fd_flags = O_RDWR | O_CLOEXEC, + }; + + if (access(DMA_HEAP_SYSTEM, R_OK | W_OK)) { + ret = KSFT_SKIP; + goto cleanup; + } + + parent = cg_name(root, "dmabuf_memcg_test"); + if (!parent) + goto cleanup; + + if (cg_create(parent)) + goto cleanup_parent; + + if (cg_write(parent, "cgroup.subtree_control", "+memory")) + goto cleanup_parent; + + child_cg = cg_name(parent, "child"); + if (!child_cg) + goto cleanup_parent; + + if (cg_create(child_cg)) + goto cleanup_parent; + + child_pid = cg_run_nowait(child_cg, memcg_dmabuf_child, NULL); + if (child_pid < 0) + goto cleanup_child; + + if (cg_wait_for_proc_count(child_cg, 1)) + goto cleanup_kill; + + pidfd = sys_pidfd_open(child_pid, 0); + if (pidfd < 0) { + ret = KSFT_SKIP; + goto cleanup_kill; + } + + heap_fd = open(DMA_HEAP_SYSTEM, O_RDWR); + if (heap_fd < 0) { + ret = KSFT_SKIP; + goto cleanup_pidfd; + } + + alloc.charge_pid_fd = (__u32)pidfd; + if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0) + goto cleanup_heap; + dmabuf_fd = (int)alloc.fd; + + dmabuf_stat = cg_read_key_long(child_cg, "memory.stat", "dmabuf "); + if (dmabuf_stat == -1) { + ret = KSFT_SKIP; + goto cleanup_dmabuf; + } + if (dmabuf_stat != ONE_MEG) + dmabuf_stat = cg_read_key_long_poll(child_cg, "memory.stat", + "dmabuf ", ONE_MEG, + 15, 200000); + if (dmabuf_stat != ONE_MEG) { + fprintf(stderr, 
"Expected dmabuf stat %d, got %ld\n", + ONE_MEG, dmabuf_stat); + goto cleanup_dmabuf; + } + + close(dmabuf_fd); + dmabuf_fd = -1; + + dmabuf_stat = cg_read_key_long_poll(child_cg, "memory.stat", + "dmabuf ", 0, 15, 200000); + if (dmabuf_stat != 0) { + fprintf(stderr, "Expected dmabuf stat 0 after close, got %ld\n", + dmabuf_stat); + goto cleanup_heap; + } + + ret = KSFT_PASS; + +cleanup_dmabuf: + if (dmabuf_fd >= 0) + close(dmabuf_fd); +cleanup_heap: + close(heap_fd); +cleanup_pidfd: + close(pidfd); +cleanup_kill: + kill(child_pid, SIGTERM); + waitpid(child_pid, &child_status, 0); +cleanup_child: + cg_destroy(child_cg); + free(child_cg); +cleanup_parent: + cg_destroy(parent); + free(parent); +cleanup: + return ret; +} + #define T(x) { x, #x } struct memcg_test { int (*fn)(const char *root); @@ -1783,16 +1910,26 @@ struct memcg_test { T(test_memcg_oom_group_score_events), T(test_memcg_inotify_delete_file), T(test_memcg_inotify_delete_dir), + T(test_memcg_dmabuf), }; #undef T
int main(int argc, char **argv) { char root[PATH_MAX]; - int i, proc_status; + int i, proc_status, plan; + const char *filter = NULL; + + if (argc > 1) + filter = argv[1]; + + plan = 0; + for (i = 0; i < ARRAY_SIZE(tests); i++) + if (!filter || !strcmp(tests[i].name, filter)) + plan++;
ksft_print_header(); - ksft_set_plan(ARRAY_SIZE(tests)); + ksft_set_plan(plan); if (cg_find_unified_root(root, sizeof(root), NULL)) ksft_exit_skip("cgroup v2 isn't mounted\n");
@@ -1818,6 +1955,8 @@ int main(int argc, char **argv) has_localevents = proc_status;
for (i = 0; i < ARRAY_SIZE(tests); i++) { + if (filter && strcmp(tests[i].name, filter)) + continue; switch (tests[i].fn(root)) { case KSFT_PASS: ksft_test_result_pass("%s\n", tests[i].name); diff --git a/tools/testing/selftests/dmabuf-heaps/config b/tools/testing/selftests/dmabuf-heaps/config index be091f1cdfa04..94c8f33b71a28 100644 --- a/tools/testing/selftests/dmabuf-heaps/config +++ b/tools/testing/selftests/dmabuf-heaps/config @@ -1,3 +1,4 @@ +CONFIG_MEMCG=y CONFIG_DMABUF_HEAPS=y CONFIG_DMABUF_HEAPS_SYSTEM=y CONFIG_DRM_VGEM=y diff --git a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c index fc9694fc4e89e..904332b17698a 100644 --- a/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c +++ b/tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c @@ -3,6 +3,7 @@ #include <dirent.h> #include <errno.h> #include <fcntl.h> +#include <signal.h> #include <stdio.h> #include <stdlib.h> #include <stdint.h> @@ -10,11 +11,14 @@ #include <unistd.h> #include <sys/ioctl.h> #include <sys/mman.h> +#include <sys/syscall.h> #include <sys/types.h> +#include <sys/wait.h>
#include <linux/dma-buf.h> #include <linux/dma-heap.h> #include <drm/drm.h> +#include "../pidfd/pidfd.h" #include "kselftest.h"
#define DEVPATH "/dev/dma_heap" @@ -320,6 +324,8 @@ static int dmabuf_heap_alloc_newer(int fd, size_t len, unsigned int flags, __u32 fd; __u32 fd_flags; __u64 heap_flags; + __u32 charge_pid_fd; + __u32 __padding; __u64 garbage1; __u64 garbage2; __u64 garbage3; @@ -328,6 +334,8 @@ static int dmabuf_heap_alloc_newer(int fd, size_t len, unsigned int flags, .fd = 0, .fd_flags = O_RDWR | O_CLOEXEC, .heap_flags = flags, + .charge_pid_fd = 0, + .__padding = 0, .garbage1 = 0xffffffff, .garbage2 = 0x88888888, .garbage3 = 0x11111111, @@ -390,6 +398,120 @@ static void test_alloc_errors(char *heap_name) close(heap_fd); }
+static int dmabuf_heap_alloc_pidfd(int fd, size_t len, unsigned int heap_flags, + unsigned int charge_pid_fd, int *dmabuf_fd) +{ + struct dma_heap_allocation_data data = { + .len = len, + .fd = 0, + .fd_flags = O_RDWR | O_CLOEXEC, + .heap_flags = heap_flags, + .charge_pid_fd = charge_pid_fd, + }; + int ret; + + if (!dmabuf_fd) + return -EINVAL; + + ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data); + if (ret < 0) + return ret; + *dmabuf_fd = (int)data.fd; + return ret; +} + +/* + * Probe whether the kernel honours charge_pid_fd in DMA_HEAP_IOCTL_ALLOC. + */ +static bool pidfd_alloc_supported(int heap_fd) +{ + int devnull_fd, dmabuf_fd = -1, ret; + + devnull_fd = open("/dev/null", O_RDONLY); + if (devnull_fd < 0) + return false; + + ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, devnull_fd, &dmabuf_fd); + if (dmabuf_fd >= 0) { + close(dmabuf_fd); + dmabuf_fd = -1; + } + close(devnull_fd); + return ret < 0; +} + +/* + * Test: allocate charging the calling process's own cgroup via a self pidfd. + */ +static void test_alloc_pidfd_self(char *heap_name) +{ + int heap_fd = -1, pidfd = -1, dmabuf_fd = -1, ret; + + heap_fd = dmabuf_heap_open(heap_name); + + if (!pidfd_alloc_supported(heap_fd)) { + ksft_test_result_skip("charge_pid_fd not supported by this kernel\n"); + goto out; + } + + pidfd = sys_pidfd_open(getpid(), 0); + if (pidfd < 0) { + ksft_test_result_skip("pidfd_open not available\n"); + goto out; + } + + ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, pidfd, &dmabuf_fd); + ksft_test_result(!ret, "Allocation with self pidfd %d\n", ret); + if (dmabuf_fd >= 0) + close(dmabuf_fd); + close(pidfd); +out: + close(heap_fd); +} + +/* + * Test: allocate charging a child process's cgroup via a child pidfd. 
+ */ +static void test_alloc_pidfd_child(char *heap_name) +{ + int heap_fd = -1, pidfd = -1, dmabuf_fd = -1; + pid_t child_pid; + int status, ret; + + heap_fd = dmabuf_heap_open(heap_name); + + if (!pidfd_alloc_supported(heap_fd)) { + ksft_test_result_skip("charge_pid_fd not supported by this kernel\n"); + goto out; + } + + child_pid = fork(); + if (child_pid == 0) { + pause(); + _exit(0); + } + if (child_pid < 0) + ksft_exit_fail_msg("fork failed: %s\n", strerror(errno)); + + pidfd = sys_pidfd_open(child_pid, 0); + if (pidfd < 0) { + kill(child_pid, SIGTERM); + waitpid(child_pid, &status, 0); + ksft_test_result_skip("pidfd_open for child failed\n"); + goto out; + } + + ret = dmabuf_heap_alloc_pidfd(heap_fd, ONE_MEG, 0, pidfd, &dmabuf_fd); + ksft_test_result(!ret, "Allocation with child pidfd %d\n", ret); + if (dmabuf_fd >= 0) + close(dmabuf_fd); + close(pidfd); + kill(child_pid, SIGTERM); + waitpid(child_pid, &status, 0); +out: + close(heap_fd); +} + static int numer_of_heaps(void) { DIR *d = opendir(DEVPATH); @@ -420,7 +542,7 @@ int main(void) return KSFT_SKIP; }
- ksft_set_plan(11 * numer_of_heaps()); + ksft_set_plan(13 * numer_of_heaps());
while ((dir = readdir(d))) { if (!strncmp(dir->d_name, ".", 2)) @@ -435,6 +557,8 @@ int main(void) test_alloc_zeroed(dir->d_name, ONE_MEG); test_alloc_compat(dir->d_name); test_alloc_errors(dir->d_name); + test_alloc_pidfd_self(dir->d_name); + test_alloc_pidfd_child(dir->d_name); } closedir(d);
diff --git a/tools/testing/selftests/dmabuf-heaps/vmtest.sh b/tools/testing/selftests/dmabuf-heaps/vmtest.sh new file mode 100755 index 0000000000000..6f1a878384127 --- /dev/null +++ b/tools/testing/selftests/dmabuf-heaps/vmtest.sh @@ -0,0 +1,205 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (c) 2026 Red Hat +# +# Dependencies: +# * virtme-ng +# * qemu (used by virtme-ng) + +readonly SCRIPT_DIR="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)" +readonly KERNEL_CHECKOUT=$(realpath "${SCRIPT_DIR}"/../../../../) +readonly CGROUP_DIR="${KERNEL_CHECKOUT}/tools/testing/selftests/cgroup" + +source "${SCRIPT_DIR}"/../kselftest/ktap_helpers.sh + +readonly DMABUF_HEAP_TEST="${SCRIPT_DIR}"/dmabuf-heap +readonly MEMCONTROL_TEST="${CGROUP_DIR}"/test_memcontrol +readonly TMP_DIR=$(mktemp -d /tmp/dmabuf-vmtest.XXXXXXXX) + +VERBOSE=false +BUILD=false +BUILD_HOST="" +BUILD_HOST_PODMAN_CONTAINER_NAME="" + +usage() { + echo + echo "$0 [OPTIONS]" + echo + echo "Options" + echo " -b: build the kernel from the current source tree and use it for the VM" + echo " -H: hostname for remote build host (used with -b)" + echo " -p: podman container name for remote build host (used with -b)" + echo " Example: -H beefyserver -p vng" + + echo " -v: enable verbose vng/qemu output" + echo + + exit 1 +} + +die() { + echo "$*" >&2 + exit "${KSFT_FAIL}" +} + +cleanup() { + rm -rf "${TMP_DIR}" +} + +check_deps() { + for dep in vng make; do + if [[ ! -x $(command -v "${dep}") ]]; then + echo -e "skip: dependency ${dep} not found!\n" + exit "${KSFT_SKIP}" + fi + done + + if [[ ! -x "${DMABUF_HEAP_TEST}" ]]; then + printf "skip: %s not found!" "${DMABUF_HEAP_TEST}" + printf " Please build the kselftest dmabuf-heaps target (or use -b).\n" + exit "${KSFT_SKIP}" + fi + + if [[ ! -x "${MEMCONTROL_TEST}" ]]; then + printf "skip: %s not found!" 
"${MEMCONTROL_TEST}" + printf " Please build the kselftest cgroup target (or use -b).\n" + exit "${KSFT_SKIP}" + fi +} + +check_vng() { + local tested_versions=("1.36" "1.37") + local version + local ok=0 + + version="$(vng --version)" + for tv in "${tested_versions[@]}"; do + if [[ "${version}" == *"${tv}"* ]]; then + ok=1 + break + fi + done + + if [[ "${ok}" -eq 0 ]]; then + printf "warning: vng version '%s' has not been tested and may " "${version}" >&2 + printf "not function properly.\n\tThe following versions have been tested: " >&2 + echo "${tested_versions[@]}" >&2 + fi +} + +build_selftests() { + make -C "${KERNEL_CHECKOUT}" headers_install \ + INSTALL_HDR_PATH="${TMP_DIR}/usr" -j"$(nproc)" + + local khdr="-isystem ${TMP_DIR}/usr/include" + + if ! make -C "${SCRIPT_DIR}" KHDR_INCLUDES="${khdr}" -j"$(nproc)"; then + die "failed to build dmabuf-heaps selftests" + fi + + if ! make -C "${CGROUP_DIR}" KHDR_INCLUDES="${khdr}" \ + "${MEMCONTROL_TEST}" -j"$(nproc)"; then + die "failed to build cgroup/test_memcontrol selftest" + fi +} + +handle_build() { + if ! ${BUILD}; then + return + fi + + if [[ ! -d "${KERNEL_CHECKOUT}" ]]; then + echo "-b requires vmtest.sh called from the kernel source tree" >&2 + exit 1 + fi + + pushd "${KERNEL_CHECKOUT}" &>/dev/null + + if ! vng --kconfig --config "${SCRIPT_DIR}/config"; then + die "failed to generate .config for kernel source tree (${KERNEL_CHECKOUT})" + fi + + local vng_args=("-v" "--config" "${SCRIPT_DIR}/config" "--build") + + if [[ -n "${BUILD_HOST}" ]]; then + vng_args+=("--build-host" "${BUILD_HOST}") + fi + + if [[ -n "${BUILD_HOST_PODMAN_CONTAINER_NAME}" ]]; then + vng_args+=("--build-host-exec-prefix" \ + "podman exec -ti ${BUILD_HOST_PODMAN_CONTAINER_NAME}") + fi + + if ! 
vng "${vng_args[@]}"; then + die "failed to build kernel from source tree (${KERNEL_CHECKOUT})" + fi + + build_selftests + + popd &>/dev/null +} + +make_runner() { + # virtme-ng shares the host filesystem, so TMP_DIR is accessible + # inside the VM at the same absolute path. + cat > "${TMP_DIR}/run_tests.sh" <<-EOF + #!/bin/sh + set -u + PASS=0; FAIL=0; SKIP=0; N=0 + + run() { + name="$1"; shift + N=$((N+1)) + "$@"; rc=$? + if [ $rc -eq 0 ]; then echo "ok $N $name"; PASS=$((PASS+1)) + elif [ $rc -eq 4 ]; then echo "ok $N $name # SKIP"; SKIP=$((SKIP+1)) + else echo "not ok $N $name"; FAIL=$((FAIL+1)) + fi + } + + run "dmabuf-heap charge_pid_fd ioctl" ${DMABUF_HEAP_TEST} + run "memcontrol dma-buf memcg" ${MEMCONTROL_TEST} test_memcg_dmabuf + echo "# PASS=$PASS SKIP=$SKIP FAIL=$FAIL" + [ $FAIL -eq 0 ] + EOF + chmod +x "${TMP_DIR}/run_tests.sh" +} + +run_vm() { + local verbose_opt="" + local kernel_opt="" + + ${VERBOSE} && verbose_opt="--verbose" + + # If we are running from within the kernel source tree, use the kernel + # source tree as the kernel to boot, otherwise use the running kernel. + if [[ "$(realpath "$(pwd)")" == "${KERNEL_CHECKOUT}"* ]]; then + kernel_opt="${KERNEL_CHECKOUT}" + fi + + vng --run ${kernel_opt} ${verbose_opt} --user root --memory 512M \ + --exec "${TMP_DIR}/run_tests.sh" +} + +while getopts :hvbH:p: o +do + case $o in + v) VERBOSE=true;; + b) BUILD=true;; + H) BUILD_HOST=$OPTARG;; + p) BUILD_HOST_PODMAN_CONTAINER_NAME=$OPTARG;; + h|*) usage;; + esac +done +shift $((OPTIND-1)) + +trap cleanup EXIT + +check_vng +handle_build +check_deps +make_runner + +echo "Booting VM and running tests..." +run_vm