Hi,
Here's preliminary work to enable dmem tracking for heavy users of DMA allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
It's not really meant for inclusion at the moment, because I really don't like it that much, and would like to discuss solutions on how to make it nicer.
In particular, the dma dmem region accessors don't feel that great to me. They duplicate the allocator selection logic of dma_alloc_attrs(), and that looks fragile and potentially buggy to me.
One solution I tried is to do the accounting in dma_alloc_attrs() directly, depending on a flag being set, similar to what __GFP_ACCOUNT is doing.
It didn't work because dmem initialises a state pointer when charging an allocation to a region, and expects that state pointer to be passed back when uncharging. Since dma_alloc_attrs() returns a void pointer to the allocated buffer, we need to put that state into a higher-level structure, such as drm_gem_object, or dma_buf.
Since we can't share the region selection logic, we need to get the region through some other means. Another thing I considered was to return the region as part of the allocated buffer (through struct page or folio), but those are lost across the calls and dma_alloc_attrs() will only return a void pointer. So that's not doable without some heavy rework, if it's a good idea at all.
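To make the asymmetry above concrete, here is a minimal sketch of the charge/uncharge pairing; the dmem calls are the ones used later in this series, everything else is purely illustrative:

/* Illustrative only: the pool state returned by the charge has to be
 * kept around until the buffer is freed, so it must live in some
 * higher-level object (drm_gem_object, dma_buf, ...), not behind the
 * void pointer returned by dma_alloc_attrs().
 */
struct dmem_cgroup_pool_state *pool;
int ret;

ret = dmem_cgroup_try_charge(region, size, &pool, NULL);
if (ret)
	return ret;

buf = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL, 0);
if (!buf) {
	dmem_cgroup_uncharge(pool, size);
	return -ENOMEM;
}

/* ... later, on release, the very same pool state is needed: */
dmem_cgroup_uncharge(pool, size);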
So yeah, I went for the dumbest possible solution with the accessors, hoping you could suggest a much smarter idea :)
Thanks, Maxime
Signed-off-by: Maxime Ripard mripard@kernel.org --- Maxime Ripard (12): cma: Register dmem region for each cma region cma: Provide accessor to cma dmem region dma: coherent: Register dmem region for each coherent region dma: coherent: Provide accessor to dmem region dma: contiguous: Provide accessor to dmem region dma: direct: Provide accessor to dmem region dma: Create default dmem region for DMA allocations dma: Provide accessor to dmem region dma-buf: Clear cgroup accounting on release dma-buf: cma: Account for allocations in dmem cgroup drm/gem: Add cgroup memory accounting media: videobuf2: Track buffer allocations through the dmem cgroup
drivers/dma-buf/dma-buf.c | 7 ++++ drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++-- drivers/gpu/drm/drm_gem.c | 5 +++ drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++ .../media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++ include/drm/drm_device.h | 1 + include/drm/drm_gem.h | 2 ++ include/linux/cma.h | 9 +++++ include/linux/dma-buf.h | 5 +++ include/linux/dma-direct.h | 2 ++ include/linux/dma-map-ops.h | 32 ++++++++++++++++++ include/linux/dma-mapping.h | 11 ++++++ kernel/dma/coherent.c | 26 +++++++++++++++ kernel/dma/direct.c | 8 +++++ kernel/dma/mapping.c | 39 ++++++++++++++++++++++ mm/cma.c | 21 +++++++++++- mm/cma.h | 3 ++ 17 files changed, 211 insertions(+), 3 deletions(-) --- base-commit: 55a2aa61ba59c138bd956afe0376ec412a7004cf change-id: 20250307-dmem-cgroups-73febced0989
Best regards,
Now that the dmem cgroup has been merged, we need to create memory regions for each allocator devices might allocate DMA memory from.
Since CMA is one of these allocators, we need to create such a region. CMA can deal with multiple regions though, so we'll need to create a dmem region per CMA region.
Signed-off-by: Maxime Ripard mripard@kernel.org --- mm/cma.c | 14 +++++++++++++- mm/cma.h | 3 +++ 2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/cma.c b/mm/cma.c index de5bc0c81fc232bf82cd7ef22f6097059ab605e2..41a9ae907dcf69a73e963830d2c5f589dfc44f22 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -21,10 +21,11 @@ #include <linux/mm.h> #include <linux/sizes.h> #include <linux/slab.h> #include <linux/log2.h> #include <linux/cma.h> +#include <linux/cgroup_dmem.h> #include <linux/highmem.h> #include <linux/io.h> #include <linux/kmemleak.h> #include <trace/events/cma.h>
@@ -89,16 +90,25 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, spin_unlock_irqrestore(&cma->lock, flags); }
static void __init cma_activate_area(struct cma *cma) { + struct dmem_cgroup_region *region; unsigned long base_pfn = cma->base_pfn, pfn; struct zone *zone;
+ region = dmem_cgroup_register_region(cma_get_size(cma), "cma/%s", cma->name); + if (IS_ERR(region)) + goto out_error; + +#ifdef CONFIG_CGROUP_DMEM + cma->dmem_cgrp_region = region; +#endif + cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL); if (!cma->bitmap) - goto out_error; + goto unreg_dmem;
/* * alloc_contig_range() requires the pfn range specified to be in the * same zone. Simplify by forcing the entire CMA resv range to be in the * same zone. @@ -124,10 +134,12 @@ static void __init cma_activate_area(struct cma *cma)
return;
not_in_zone: bitmap_free(cma->bitmap); +unreg_dmem: + dmem_cgroup_unregister_region(region); out_error: /* Expose all pages to the buddy, they are useless for CMA. */ if (!cma->reserve_pages_on_error) { for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++) free_reserved_page(pfn_to_page(pfn)); diff --git a/mm/cma.h b/mm/cma.h index 8485ef893e99d8da5ee41eb03194b5b00ff088ba..e05d3eb7c173f3fe75ad7808968925c77d190c80 100644 --- a/mm/cma.h +++ b/mm/cma.h @@ -29,10 +29,13 @@ struct cma { atomic64_t nr_pages_failed; /* the number of CMA page released */ atomic64_t nr_pages_released; /* kobject requires dynamic object */ struct cma_kobject *cma_kobj; +#endif +#ifdef CONFIG_CGROUP_DMEM + struct dmem_cgroup_region *dmem_cgrp_region; #endif bool reserve_pages_on_error; };
extern struct cma cma_areas[MAX_CMA_AREAS];
Consumers of the CMA API will have to know which CMA region their device allocates from in order to charge the memory allocation to the right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard mripard@kernel.org --- include/linux/cma.h | 9 +++++++++ mm/cma.c | 7 +++++++ 2 files changed, 16 insertions(+)
diff --git a/include/linux/cma.h b/include/linux/cma.h index d15b64f51336df18d17a4097e27961fd1ac8d79f..d7b2f13918e536aeb8bccebc1934d36f2f0b4cf4 100644 --- a/include/linux/cma.h +++ b/include/linux/cma.h @@ -66,6 +66,15 @@ static inline bool cma_free_folio(struct cma *cma, const struct folio *folio) { return false; } #endif
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) +struct dmem_cgroup_region *cma_get_dmem_cgroup_region(struct cma *cma); +#else /* CONFIG_CGROUP_DMEM */ +static inline struct dmem_cgroup_region *cma_get_dmem_cgroup_region(struct cma *cma) +{ + return NULL; +} +#endif /* CONFIG_CGROUP_DMEM */ + #endif diff --git a/mm/cma.c b/mm/cma.c index 41a9ae907dcf69a73e963830d2c5f589dfc44f22..4973a8c6bacb9d4924f4969be07757cf631304b8 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -49,10 +49,17 @@ unsigned long cma_get_size(const struct cma *cma) const char *cma_get_name(const struct cma *cma) { return cma->name; }
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) +struct dmem_cgroup_region *cma_get_dmem_cgroup_region(struct cma *cma) +{ + return cma->dmem_cgrp_region; +} +#endif /* CONFIG_CGROUP_DMEM */ + static unsigned long cma_bitmap_aligned_mask(const struct cma *cma, unsigned int align_order) { if (align_order <= cma->order_per_bit) return 0;
There can be several coherent memory regions in the system, and all of them might end up being used to allocate a DMA buffer.
Let's register a dmem region for each of them to make sure we can track those allocations.
Signed-off-by: Maxime Ripard mripard@kernel.org --- kernel/dma/coherent.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 3b2bdca9f1d4b0274bf4874892b94730cd05c5df..2a2d515e43acbdef19c14d8840ed90e48e7ebb43 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -5,10 +5,11 @@ */ #include <linux/io.h> #include <linux/slab.h> #include <linux/kernel.h> #include <linux/module.h> +#include <linux/cgroup_dmem.h> #include <linux/dma-direct.h> #include <linux/dma-map-ops.h>
struct dma_coherent_mem { void *virt_base; @@ -16,10 +17,12 @@ struct dma_coherent_mem { unsigned long pfn_base; int size; unsigned long *bitmap; spinlock_t spinlock; bool use_dev_dma_pfn_offset; + + struct dmem_cgroup_region *dmem_cgroup_region; };
static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device *dev) { if (dev && dev->dma_mem) @@ -335,16 +338,25 @@ static phys_addr_t dma_reserved_default_memory_size __initdata; #endif
static int rmem_dma_device_init(struct reserved_mem *rmem, struct device *dev) { if (!rmem->priv) { + struct dmem_cgroup_region *region; struct dma_coherent_mem *mem;
mem = dma_init_coherent_memory(rmem->base, rmem->base, rmem->size, true); if (IS_ERR(mem)) return PTR_ERR(mem); + + region = dmem_cgroup_register_region(rmem->size, + "dma/coherent/%s", + rmem->name); + if (IS_ERR(region)) + return PTR_ERR(region); + + mem->dmem_cgroup_region = region; rmem->priv = mem; } dma_assign_coherent_memory(dev, rmem->priv); return 0; }
Consumers of the coherent DMA API will have to know which coherent region their device allocates from in order to charge the memory allocation to the right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard mripard@kernel.org --- include/linux/dma-map-ops.h | 11 +++++++++++ kernel/dma/coherent.c | 14 ++++++++++++++ 2 files changed, 25 insertions(+)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index e172522cd93651594607e16461fac56e4d67cbce..a2c10ed186efb6e08f64df0954b4d389589b6e35 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -199,10 +199,21 @@ static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma, { return 0; } #endif /* CONFIG_DMA_GLOBAL_POOL */
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) && IS_ENABLED(CONFIG_DMA_DECLARE_COHERENT) +struct dmem_cgroup_region * +dma_coherent_get_dmem_cgroup_region(struct device *dev); +#else /* CONFIG_CGROUP_DMEM && CONFIG_DMA_DECLARE_COHERENT */ +static inline struct dmem_cgroup_region * +dma_coherent_get_dmem_cgroup_region(struct device *dev) +{ + return NULL; +} +#endif /* CONFIG_CGROUP_DMEM && CONFIG_DMA_DECLARE_COHERENT */ + int dma_common_get_sgtable(struct device *dev, struct sg_table *sgt, void *cpu_addr, dma_addr_t dma_addr, size_t size, unsigned long attrs); int dma_common_mmap(struct device *dev, struct vm_area_struct *vma, void *cpu_addr, dma_addr_t dma_addr, size_t size, diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 2a2d515e43acbdef19c14d8840ed90e48e7ebb43..74c5ff5105110487770c1b73812eefe8b3d7eb3c 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -28,10 +28,24 @@ static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device *de if (dev && dev->dma_mem) return dev->dma_mem; return NULL; }
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) +struct dmem_cgroup_region * +dma_coherent_get_dmem_cgroup_region(struct device *dev) +{ + struct dma_coherent_mem *mem; + + mem = dev_get_coherent_memory(dev); + if (!mem) + return NULL; + + return mem->dmem_cgroup_region; +} +#endif + static inline dma_addr_t dma_get_device_base(struct device *dev, struct dma_coherent_mem * mem) { if (mem->use_dev_dma_pfn_offset) return phys_to_dma(dev, PFN_PHYS(mem->pfn_base));
Consumers of the DMA contiguous API will have to know which region their device allocates from in order to charge the memory allocation to the right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard mripard@kernel.org --- include/linux/dma-map-ops.h | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index a2c10ed186efb6e08f64df0954b4d389589b6e35..bfc928d3bac37f3eece93d152abd57da513a1cc8 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -4,10 +4,11 @@ * It should not be included in drivers just using the DMA API. */ #ifndef _LINUX_DMA_MAP_OPS_H #define _LINUX_DMA_MAP_OPS_H
+#include <linux/cma.h> #include <linux/dma-mapping.h> #include <linux/pgtable.h> #include <linux/slab.h>
struct cma; @@ -153,10 +154,30 @@ static inline void dma_free_contiguous(struct device *dev, struct page *page, { __free_pages(page, get_order(size)); } #endif /* CONFIG_DMA_CMA*/
+#if IS_ENABLED(CONFIG_DMA_CMA) && IS_ENABLED(CONFIG_CGROUP_DMEM) + +static inline struct dmem_cgroup_region * +dma_contiguous_get_dmem_cgroup_region(struct device *dev) +{ + struct cma *cma = dev_get_cma_area(dev); + + return cma_get_dmem_cgroup_region(cma); +} + +#else /* IS_ENABLED(CONFIG_DMA_CMA) && IS_ENABLED(CONFIG_CGROUP_DMEM) */ + +static inline struct dmem_cgroup_region * +dma_contiguous_get_dmem_cgroup_region(struct device *dev) +{ + return NULL; +} + +#endif /* IS_ENABLED(CONFIG_DMA_CMA) && IS_ENABLED(CONFIG_CGROUP_DMEM) */ + #ifdef CONFIG_DMA_DECLARE_COHERENT int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, dma_addr_t device_addr, size_t size); void dma_release_coherent_memory(struct device *dev); int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size,
Consumers of the direct DMA API will have to know which region their device allocates from in order to charge the memory allocation to the right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard mripard@kernel.org --- include/linux/dma-direct.h | 2 ++ kernel/dma/direct.c | 8 ++++++++ 2 files changed, 10 insertions(+)
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index d7e30d4f7503a898a456df8eedf6a2cd284c35ff..2dd7cbccfaeed81c18c67aae877417fe89f2f2f5 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -145,6 +145,8 @@ void dma_direct_free_pages(struct device *dev, size_t size, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, size_t size, enum dma_data_direction dir, unsigned long attrs);
+struct dmem_cgroup_region *dma_direct_get_dmem_cgroup_region(struct device *dev); + #endif /* _LINUX_DMA_DIRECT_H */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 5b4e6d3bf7bcca8930877ba078aed4ce26828f06..ece1361077b6efeec5b202d838750afd967d473f 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -42,10 +42,18 @@ u64 dma_direct_get_required_mask(struct device *dev) u64 max_dma = phys_to_dma_direct(dev, phys);
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; }
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) +struct dmem_cgroup_region * +dma_direct_get_dmem_cgroup_region(struct device *dev) +{ + return dma_contiguous_get_dmem_cgroup_region(dev); +} +#endif + static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit) { u64 dma_limit = min_not_zero( dev->coherent_dma_mask, dev->bus_dma_limit);
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
Consumers of the direct DMA API will have to know which region their device allocates from in order to charge the memory allocation to the right one.
This doesn't seem to make much sense - dma-direct is not an allocator itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc. interfaces wherein the underlying allocations _could_ come from CMA, but also a per-device coherent/restricted pool, or a global coherent/atomic pool, or the regular page allocator, or in one weird corner case the SWIOTLB buffer, or...
Thanks, Robin.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard mripard@kernel.org
include/linux/dma-direct.h | 2 ++ kernel/dma/direct.c | 8 ++++++++ 2 files changed, 10 insertions(+)
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index d7e30d4f7503a898a456df8eedf6a2cd284c35ff..2dd7cbccfaeed81c18c67aae877417fe89f2f2f5 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -145,6 +145,8 @@ void dma_direct_free_pages(struct device *dev, size_t size, enum dma_data_direction dir); int dma_direct_supported(struct device *dev, u64 mask); dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, size_t size, enum dma_data_direction dir, unsigned long attrs); +struct dmem_cgroup_region *dma_direct_get_dmem_cgroup_region(struct device *dev);
#endif /* _LINUX_DMA_DIRECT_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 5b4e6d3bf7bcca8930877ba078aed4ce26828f06..ece1361077b6efeec5b202d838750afd967d473f 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -42,10 +42,18 @@ u64 dma_direct_get_required_mask(struct device *dev) u64 max_dma = phys_to_dma_direct(dev, phys); return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; } +#if IS_ENABLED(CONFIG_CGROUP_DMEM) +struct dmem_cgroup_region * +dma_direct_get_dmem_cgroup_region(struct device *dev) +{
+ return dma_contiguous_get_dmem_cgroup_region(dev);
+} +#endif
static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit) { u64 dma_limit = min_not_zero( dev->coherent_dma_mask, dev->bus_dma_limit);
On Mon, Mar 10, 2025 at 02:56:37PM +0000, Robin Murphy wrote:
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
Consumers of the direct DMA API will have to know which region their device allocates from in order to charge the memory allocation to the right one.
This doesn't seem to make much sense - dma-direct is not an allocator itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc. interfaces wherein the underlying allocations _could_ come from CMA, but also a per-device coherent/restricted pool, or a global coherent/atomic pool, or the regular page allocator, or in one weird corner case the SWIOTLB buffer, or...
I guess it wasn't super clear, but what I meant is that it's an allocator to the consumer: it gets called, and returns a buffer. How it does so is transparent to the device, and on the other side of the abstraction.
I do agree that the logic is complicated to follow, and that's what I was getting at in the cover letter.
Maxime
On 2025-03-10 4:28 pm, Maxime Ripard wrote:
On Mon, Mar 10, 2025 at 02:56:37PM +0000, Robin Murphy wrote:
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
Consumers of the direct DMA API will have to know which region their device allocates from in order to charge the memory allocation to the right one.
This doesn't seem to make much sense - dma-direct is not an allocator itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc. interfaces wherein the underlying allocations _could_ come from CMA, but also a per-device coherent/restricted pool, or a global coherent/atomic pool, or the regular page allocator, or in one weird corner case the SWIOTLB buffer, or...
I guess it wasn't super clear, but what I meant is that it's an allocator to the consumer: it gets called, and returns a buffer. How it does so is transparent to the device, and on the other side of the abstraction.
I do agree that the logic is complicated to follow, and that's what I was getting at in the cover letter.
Right, but ultimately my point is that when we later end up with:
struct dmem_cgroup_region * dma_get_dmem_cgroup_region(struct device *dev) { if (dma_alloc_direct(dev, get_dma_ops(dev))) return dma_direct_get_dmem_cgroup_region(dev);
= dma_contiguous_get_dmem_cgroup_region(dev);
it's objectively wrong given what dma_alloc_direct() means in context:
void *dma_alloc_attrs(...) { if (dma_alloc_direct(dev, ops)) cpu_addr = dma_direct_alloc(...);
where dma_direct_alloc() may then use at least 5 different allocation methods, only one of which is CMA. Accounting things which are not CMA to CMA seems to thoroughly defeat the purpose of having such fine-grained accounting at all.
This is why the very notion of "consumers of dma-direct" should fundamentally not be a thing IMO. Drivers consume the DMA API interfaces, and the DMA API ultimately consumes various memory allocators, but what happens in between is nobody else's business; dma-direct happens to represent *some* paths between the two, but there are plenty more paths to the same (and different) allocators through other DMA API implementations as well. Which route a particular call takes to end up at a particular allocator is not meaningful unless you are the DMA ops dispatch code.
Or to put it another way, to even go for the "dumbest possible correct solution", the plumbing of dma_get_dmem_cgroup_region() would need to be about as complex and widespread as the plumbing of dma_alloc_attrs() itself ;)
I think I see why a simple DMA attribute couldn't be made to work, as dmem_cgroup_uncharge() can't simply look up the pool the same way dmem_cgroup_try_charge() found it, since we still need a cg for that and get_current_dmemcs() can't be assumed to be stable over time, right? At that point I'm probably starting to lean towards a whole new DMA op with a properly encapsulated return type (and maybe a long-term goal of consolidating the 3 or 4 different allocation types we already have), or just have a single dmem region for "DMA API memory" and don't care where it came from (although I do see the issues with that too - you probably wouldn't want to ration a device-private pool the same way as global system memory, for example).
Thanks, Robin.
On Mon, Mar 10, 2025 at 06:44:51PM +0000, Robin Murphy wrote:
On 2025-03-10 4:28 pm, Maxime Ripard wrote:
On Mon, Mar 10, 2025 at 02:56:37PM +0000, Robin Murphy wrote:
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
Consumers of the direct DMA API will have to know which region their device allocates from in order to charge the memory allocation to the right one.
This doesn't seem to make much sense - dma-direct is not an allocator itself, it just provides the high-level dma_alloc_attrs/dma_alloc_pages/etc. interfaces wherein the underlying allocations _could_ come from CMA, but also a per-device coherent/restricted pool, or a global coherent/atomic pool, or the regular page allocator, or in one weird corner case the SWIOTLB buffer, or...
I guess it wasn't super clear, but what I meant is that it's an allocator to the consumer: it gets called, and returns a buffer. How it does so is transparent to the device, and on the other side of the abstraction.
I do agree that the logic is complicated to follow, and that's what I was getting at in the cover letter.
Right, but ultimately my point is that when we later end up with:
struct dmem_cgroup_region * dma_get_dmem_cgroup_region(struct device *dev) { if (dma_alloc_direct(dev, get_dma_ops(dev))) return dma_direct_get_dmem_cgroup_region(dev);
= dma_contiguous_get_dmem_cgroup_region(dev);
it's objectively wrong given what dma_alloc_direct() means in context:
void *dma_alloc_attrs(...) { if (dma_alloc_direct(dev, ops)) cpu_addr = dma_direct_alloc(...);
where dma_direct_alloc() may then use at least 5 different allocation methods, only one of which is CMA. Accounting things which are not CMA to CMA seems to thoroughly defeat the purpose of having such fine-grained accounting at all.
This is why the very notion of "consumers of dma-direct" should fundamentally not be a thing IMO. Drivers consume the DMA API interfaces, and the DMA API ultimately consumes various memory allocators, but what happens in between is nobody else's business; dma-direct happens to represent *some* paths between the two, but there are plenty more paths to the same (and different) allocators through other DMA API implementations as well. Which route a particular call takes to end up at a particular allocator is not meaningful unless you are the DMA ops dispatch code.
Or to put it another way, to even go for the "dumbest possible correct solution", the plumbing of dma_get_dmem_cgroup_region() would need to be about as complex and widespread as the plumbing of dma_alloc_attrs() itself ;)
I largely agree with the sentiment, and I think the very idea of dma_get_dmem_cgroup_region() is a bad one for that reason. But since I wasn't too sure what a good one might look like, I figured it would be a good way to start the discussion still :)
I think I see why a simple DMA attribute couldn't be made to work, as dmem_cgroup_uncharge() can't simply look up the pool the same way dmem_cgroup_try_charge() found it, since we still need a cg for that and get_current_dmemcs() can't be assumed to be stable over time, right? At that point I'm probably starting to lean towards a whole new DMA op with a properly encapsulated return type (and maybe a long-term goal of consolidating the 3 or 4 different allocation types we already have)
It felt like a good solution to me too, and what I alluded to with struct page or folio. My feeling was that the best way to do it would be to encapsulate it into the structure returned by the dma_alloc_* API. That's a pretty large rework though, so I wanted to make sure I was on the right path before doing so.
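To make that a bit more concrete, a rough sketch of such an encapsulated return type could look like this; every name here (struct dma_buffer, dma_buffer_free()) is made up for the sake of discussion and not an existing API:

struct dma_buffer {
	void *cpu_addr;
	dma_addr_t dma_handle;
	size_t size;
	/* Carried alongside the buffer so the free path can uncharge. */
	struct dmem_cgroup_pool_state *dmem_pool;
};

static void dma_buffer_free(struct device *dev, struct dma_buffer *buf)
{
	dma_free_attrs(dev, buf->size, buf->cpu_addr, buf->dma_handle, 0);
	if (buf->dmem_pool)
		dmem_cgroup_uncharge(buf->dmem_pool, buf->size);
	kfree(buf);
}

The allocation side would then select the region, charge it, and fill in dmem_pool before handing the descriptor back to the caller.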
or just have a single dmem region for "DMA API memory" and don't care where it came from (although I do see the issues with that too - you probably wouldn't want to ration a device-private pool the same way as global system memory, for example)
Yeah, the CMA pool is probably something you want to limit differently as well.
Maxime
Some DMA allocations are not going to be performed through dedicated sub-allocators but through the default path. We need to create a default region to track those as well.
Signed-off-by: Maxime Ripard mripard@kernel.org --- kernel/dma/mapping.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index cda127027e48a757f2d9fb04a49249d2b0238871..7bc3957512fd84e0bf3a89c210338be72457b5c9 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -5,10 +5,11 @@ * Copyright (c) 2006 SUSE Linux Products GmbH * Copyright (c) 2006 Tejun Heo teheo@suse.de */ #include <linux/memblock.h> /* for max_pfn */ #include <linux/acpi.h> +#include <linux/cgroup_dmem.h> #include <linux/dma-map-ops.h> #include <linux/export.h> #include <linux/gfp.h> #include <linux/iommu-dma.h> #include <linux/kmsan.h> @@ -25,10 +26,14 @@ defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \ defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) bool dma_default_coherent = IS_ENABLED(CONFIG_ARCH_DMA_DEFAULT_COHERENT); #endif
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) +static struct dmem_cgroup_region *default_dmem_cgroup_region; +#endif + /* * Managed DMA API */ struct dma_devres { size_t size; @@ -587,10 +592,28 @@ u64 dma_get_required_mask(struct device *dev) */ return DMA_BIT_MASK(32); } EXPORT_SYMBOL_GPL(dma_get_required_mask);
+#if IS_ENABLED(CONFIG_CGROUP_DMEM) +static int __init dma_init_dmem_cgroup(void) +{ + struct dmem_cgroup_region *region; + + if (default_dmem_cgroup_region) + return -EBUSY; + + region = dmem_cgroup_register_region(U64_MAX, "dma/global"); + if (IS_ERR(region)) + return PTR_ERR(region); + + default_dmem_cgroup_region = region; + return 0; +} +core_initcall(dma_init_dmem_cgroup); +#endif + void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag, unsigned long attrs) { const struct dma_map_ops *ops = get_dma_ops(dev); void *cpu_addr;
Consumers of the DMA API will have to know which DMA region their device allocates from in order to charge the memory allocation to the right one.
Let's provide an accessor for that region.
Signed-off-by: Maxime Ripard mripard@kernel.org --- include/linux/dma-mapping.h | 11 +++++++++++ kernel/dma/mapping.c | 16 ++++++++++++++++ 2 files changed, 27 insertions(+)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index b79925b1c4333ce25e66c57d8ac1dae5c7b7fe37..75f5ca1d11a6297720742cea1359c7f28c23d741 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -351,10 +351,21 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) { return false; } #endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
+#if IS_ENABLED(CONFIG_HAS_DMA) && IS_ENABLED(CONFIG_CGROUP_DMEM) +struct dmem_cgroup_region * +dma_get_dmem_cgroup_region(struct device *dev); +#else +static inline struct dmem_cgroup_region * +dma_get_dmem_cgroup_region(struct device *dev) +{ + return NULL; +} +#endif + struct page *dma_alloc_pages(struct device *dev, size_t size, dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp); void dma_free_pages(struct device *dev, size_t size, struct page *page, dma_addr_t dma_handle, enum dma_data_direction dir); int dma_mmap_pages(struct device *dev, struct vm_area_struct *vma, diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 7bc3957512fd84e0bf3a89c210338be72457b5c9..e45d63700183acb03c779f969ae33803dcf5cf1b 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -608,10 +608,26 @@ static int __init dma_init_dmem_cgroup(void)
default_dmem_cgroup_region = region; return 0; } core_initcall(dma_init_dmem_cgroup); + +struct dmem_cgroup_region * +dma_get_dmem_cgroup_region(struct device *dev) +{ + struct dmem_cgroup_region *region; + + region = dma_coherent_get_dmem_cgroup_region(dev); + if (region) + return region; + + if (dma_alloc_direct(dev, get_dma_ops(dev))) + return dma_direct_get_dmem_cgroup_region(dev); + + return default_dmem_cgroup_region; +} +EXPORT_SYMBOL(dma_get_dmem_cgroup_region); #endif
void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flag, unsigned long attrs) {
In order to clean things up when dma-heaps allocate and register buffers in the dmem cgroup, let's uncharge a released buffer from any (optional) cgroup controller.
Signed-off-by: Maxime Ripard mripard@kernel.org --- drivers/dma-buf/dma-buf.c | 7 +++++++ include/linux/dma-buf.h | 5 +++++ 2 files changed, 12 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5baa83b855156516a0a766bee0789b122473efb3..a95eef17f193454b018dc8177ddfd434d7b64473 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -11,10 +11,11 @@ * refining of this idea. */
#include <linux/fs.h> #include <linux/slab.h> +#include <linux/cgroup_dmem.h> #include <linux/dma-buf.h> #include <linux/dma-fence.h> #include <linux/dma-fence-unwrap.h> #include <linux/anon_inodes.h> #include <linux/export.h> @@ -97,10 +98,16 @@ static void dma_buf_release(struct dentry *dentry) * * dmabuf->cb_in/out.active are non-0 despite no pending fence callback */ BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
dma_buf_stats_teardown(dmabuf); + +#ifdef CONFIG_CGROUP_DMEM + if (dmabuf->cgroup_pool) + dmem_cgroup_uncharge(dmabuf->cgroup_pool, dmabuf->size); +#endif + dmabuf->ops->release(dmabuf);
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1]) dma_resv_fini(dmabuf->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 36216d28d8bdc01a9c9c47e27c392413f7f6c5fb..111ca5a738ae0a816ba1551313dfb0a958720b6c 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -437,10 +437,15 @@ struct dma_buf { struct dma_fence_cb cb; wait_queue_head_t *poll;
__poll_t active; } cb_in, cb_out; + +#ifdef CONFIG_CGROUP_DMEM + struct dmem_cgroup_pool_state *cgroup_pool; +#endif + #ifdef CONFIG_DMABUF_SYSFS_STATS /** * @sysfs_entry: * * For exposing information about this buffer in sysfs. See also
Now that we have a DMEM region per CMA region, we can track the allocations of the CMA heap through DMEM.
Signed-off-by: Maxime Ripard mripard@kernel.org --- drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c index 9512d050563a9ad0a735230c4870c3d3b3b01b25..4951c41db3ba0cbd903b6d62787f51b83f4a1e7e 100644 --- a/drivers/dma-buf/heaps/cma_heap.c +++ b/drivers/dma-buf/heaps/cma_heap.c @@ -7,10 +7,11 @@ * * Also utilizing parts of Andrew Davis' SRAM heap: * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com/ * Andrew F. Davis afd@ti.com */ +#include <linux/cgroup_dmem.h> #include <linux/cma.h> #include <linux/dma-buf.h> #include <linux/dma-heap.h> #include <linux/dma-map-ops.h> #include <linux/err.h> @@ -276,23 +277,31 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap, unsigned long len, u32 fd_flags, u64 heap_flags) { struct cma_heap *cma_heap = dma_heap_get_drvdata(heap); + struct dmem_cgroup_pool_state *pool; struct cma_heap_buffer *buffer; DEFINE_DMA_BUF_EXPORT_INFO(exp_info); size_t size = PAGE_ALIGN(len); pgoff_t pagecount = size >> PAGE_SHIFT; unsigned long align = get_order(size); struct page *cma_pages; struct dma_buf *dmabuf; int ret = -ENOMEM; pgoff_t pg;
+ ret = dmem_cgroup_try_charge(cma_get_dmem_cgroup_region(cma_heap->cma), + size, &pool, NULL); + if (ret) + return ERR_PTR(ret); + buffer = kzalloc(sizeof(*buffer), GFP_KERNEL); - if (!buffer) - return ERR_PTR(-ENOMEM); + if (!buffer) { + ret = -ENOMEM; + goto uncharge_cgroup; + }
INIT_LIST_HEAD(&buffer->attachments); mutex_init(&buffer->lock); buffer->len = size;
@@ -348,18 +357,23 @@ static struct dma_buf *cma_heap_allocate(struct dma_heap *heap, dmabuf = dma_buf_export(&exp_info); if (IS_ERR(dmabuf)) { ret = PTR_ERR(dmabuf); goto free_pages; } + + dmabuf->cgroup_pool = pool; + return dmabuf;
free_pages: kfree(buffer->pages); free_cma: cma_release(cma_heap->cma, cma_pages, pagecount); free_buffer: kfree(buffer); +uncharge_cgroup: + dmem_cgroup_uncharge(pool, size);
return ERR_PTR(ret); }
static const struct dma_heap_ops cma_heap_ops = {
In order to support any device using GEM, let's charge any GEM DMA allocation to the dmem cgroup.
Signed-off-by: Maxime Ripard mripard@kernel.org --- drivers/gpu/drm/drm_gem.c | 5 +++++ drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++++ include/drm/drm_device.h | 1 + include/drm/drm_gem.h | 2 ++ 4 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index ee811764c3df4b4e9c377a66afd4967512ba2001..e04733cb49353cf3ff9672d883b106a083f80d86 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -108,10 +108,11 @@ drm_gem_init(struct drm_device *dev) dev->vma_offset_manager = vma_offset_manager; drm_vma_offset_manager_init(vma_offset_manager, DRM_FILE_PAGE_OFFSET_START, DRM_FILE_PAGE_OFFSET_SIZE);
+ return drmm_add_action(dev, drm_gem_init_release, NULL); }
/** * drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM @@ -973,10 +974,14 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private) * drm_gem_object_init(). */ void drm_gem_object_release(struct drm_gem_object *obj) { + + if (obj->cgroup_pool_state) + dmem_cgroup_uncharge(obj->cgroup_pool_state, obj->size); + if (obj->filp) fput(obj->filp);
drm_gem_private_object_fini(obj);
diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c index 16988d316a6dc702310fa44c15c92dc67b82802b..6236feb67ddd6338f0f597a0606377e0352ca6ed 100644 --- a/drivers/gpu/drm/drm_gem_dma_helper.c +++ b/drivers/gpu/drm/drm_gem_dma_helper.c @@ -104,10 +104,16 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private) if (ret) { drm_gem_object_release(gem_obj); goto error; }
+ ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(drm->dev), + size, + &dma_obj->base.cgroup_pool_state, NULL); + if (ret) + goto error; + return dma_obj;
error: kfree(dma_obj); return ERR_PTR(ret); diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h index c91f87b5242d7a499917eb4aeb6ca8350f856eb3..58987f39ba8718eb768f6261fb0a1fbf16b38549 100644 --- a/include/drm/drm_device.h +++ b/include/drm/drm_device.h @@ -1,8 +1,9 @@ #ifndef _DRM_DEVICE_H_ #define _DRM_DEVICE_H_
+#include <linux/cgroup_dmem.h> #include <linux/list.h> #include <linux/kref.h> #include <linux/mutex.h> #include <linux/idr.h>
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index fdae947682cd0b7b06db5e35e120f049a0f30179..95fe8ed48a26204020bb47d6074689829c410465 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -430,10 +430,12 @@ struct drm_gem_object { * @lru: * * The current LRU list that the GEM object is on. */ struct drm_gem_lru *lru; + + struct dmem_cgroup_pool_state *cgroup_pool_state; };
/** * DRM_GEM_FOPS - Default drm GEM file operations *
On 2025-03-10 12:06 pm, Maxime Ripard wrote:
In order to support any device using GEM, let's charge any GEM DMA allocation to the dmem cgroup.
Signed-off-by: Maxime Ripard mripard@kernel.org
drivers/gpu/drm/drm_gem.c | 5 +++++ drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++++ include/drm/drm_device.h | 1 + include/drm/drm_gem.h | 2 ++ 4 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index ee811764c3df4b4e9c377a66afd4967512ba2001..e04733cb49353cf3ff9672d883b106a083f80d86 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -108,10 +108,11 @@ drm_gem_init(struct drm_device *dev) dev->vma_offset_manager = vma_offset_manager; drm_vma_offset_manager_init(vma_offset_manager, DRM_FILE_PAGE_OFFSET_START, DRM_FILE_PAGE_OFFSET_SIZE);
return drmm_add_action(dev, drm_gem_init_release, NULL); }
/**
* drm_gem_object_init_with_mnt - initialize an allocated shmem-backed GEM
@@ -973,10 +974,14 @@ drm_gem_release(struct drm_device *dev, struct drm_file *file_private)
* drm_gem_object_init().
*/ void drm_gem_object_release(struct drm_gem_object *obj) {
+ if (obj->cgroup_pool_state)
+ dmem_cgroup_uncharge(obj->cgroup_pool_state, obj->size);
if (obj->filp) fput(obj->filp);
drm_gem_private_object_fini(obj); diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c b/drivers/gpu/drm/drm_gem_dma_helper.c index 16988d316a6dc702310fa44c15c92dc67b82802b..6236feb67ddd6338f0f597a0606377e0352ca6ed 100644 --- a/drivers/gpu/drm/drm_gem_dma_helper.c +++ b/drivers/gpu/drm/drm_gem_dma_helper.c @@ -104,10 +104,16 @@ __drm_gem_dma_create(struct drm_device *drm, size_t size, bool private) if (ret) { drm_gem_object_release(gem_obj); goto error; }
+ ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(drm->dev),
+ size,
+ &dma_obj->base.cgroup_pool_state, NULL);
+ if (ret)
+ goto error;
Doesn't that miss cleaning up gem_obj? However, surely you want the accounting before the allocation anyway, like in the other cases. Otherwise userspace is still able to allocate massive amounts of memory and incur some of the associated side-effects of that; it just doesn't get to keep said memory for very long :)
Thanks, Robin.
return dma_obj;
error: kfree(dma_obj); return ERR_PTR(ret); diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h index c91f87b5242d7a499917eb4aeb6ca8350f856eb3..58987f39ba8718eb768f6261fb0a1fbf16b38549 100644 --- a/include/drm/drm_device.h +++ b/include/drm/drm_device.h @@ -1,8 +1,9 @@ #ifndef _DRM_DEVICE_H_ #define _DRM_DEVICE_H_ +#include <linux/cgroup_dmem.h> #include <linux/list.h> #include <linux/kref.h> #include <linux/mutex.h> #include <linux/idr.h> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index fdae947682cd0b7b06db5e35e120f049a0f30179..95fe8ed48a26204020bb47d6074689829c410465 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -430,10 +430,12 @@ struct drm_gem_object { * @lru: * * The current LRU list that the GEM object is on. */ struct drm_gem_lru *lru;
+ struct dmem_cgroup_pool_state *cgroup_pool_state; };
/**
* DRM_GEM_FOPS - Default drm GEM file operations
The dmem cgroup allows us to track any DMA memory allocation made by userspace. Let's charge our allocations in videobuf2 to enable proper memory tracking.
Signed-off-by: Maxime Ripard mripard@kernel.org --- drivers/media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c index a13ec569c82f6da2d977222b94af32e74c6c6c82..48384e18030812f4f89f1c225c38def2ac6aa3ca 100644 --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c @@ -8,10 +8,11 @@ * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation. */
+#include <linux/cgroup_dmem.h> #include <linux/dma-buf.h> #include <linux/module.h> #include <linux/refcount.h> #include <linux/scatterlist.h> #include <linux/sched.h> @@ -40,10 +41,14 @@ struct vb2_dc_buf { struct sg_table *sgt_base;
/* DMABUF related */ struct dma_buf_attachment *db_attach;
+#ifdef CONFIG_CGROUP_DMEM + struct dmem_cgroup_pool_state *cgroup_pool_state; +#endif + struct vb2_buffer *vb; bool non_coherent_mem; };
/*********************************************/ @@ -169,10 +174,14 @@ static void vb2_dc_put(void *buf_priv) struct vb2_dc_buf *buf = buf_priv;
if (!refcount_dec_and_test(&buf->refcount)) return;
+#ifdef CONFIG_CGROUP_DMEM + dmem_cgroup_uncharge(buf->cgroup_pool_state, buf->size); +#endif + if (buf->non_coherent_mem) { if (buf->vaddr) dma_vunmap_noncontiguous(buf->dev, buf->vaddr); dma_free_noncontiguous(buf->dev, buf->size, buf->dma_sgt, buf->dma_dir); @@ -230,10 +239,11 @@ static int vb2_dc_alloc_non_coherent(struct vb2_dc_buf *buf)
static void *vb2_dc_alloc(struct vb2_buffer *vb, struct device *dev, unsigned long size) { + struct dmem_cgroup_pool_state *pool; struct vb2_dc_buf *buf; int ret;
if (WARN_ON(!dev)) return ERR_PTR(-EINVAL); @@ -249,25 +259,34 @@ static void *vb2_dc_alloc(struct vb2_buffer *vb,
buf->size = size; /* Prevent the device from being released while the buffer is used */ buf->dev = get_device(dev);
+ ret = dmem_cgroup_try_charge(dma_get_dmem_cgroup_region(dev), size, &pool, NULL); + if (ret) + return ERR_PTR(ret); + if (buf->non_coherent_mem) ret = vb2_dc_alloc_non_coherent(buf); else ret = vb2_dc_alloc_coherent(buf);
if (ret) { dev_err(dev, "dma alloc of size %lu failed\n", size); + dmem_cgroup_uncharge(pool, size); kfree(buf); return ERR_PTR(-ENOMEM); }
buf->handler.refcount = &buf->refcount; buf->handler.put = vb2_dc_put; buf->handler.arg = buf;
+#ifdef CONFIG_CGROUP_DMEM + buf->cgroup_pool_state = pool; +#endif + refcount_set(&buf->refcount, 1);
return buf; }
On Mon, Mar 10, 2025 at 01:06:06PM +0100, Maxime Ripard wrote:
Here's preliminary work to enable dmem tracking for heavy users of DMA allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
It's not really meant for inclusion at the moment, because I really don't like it that much, and would like to discuss solutions on how to make it nicer.
In particular, the dma dmem region accessors don't feel that great to me. They duplicate the allocator selection logic of dma_alloc_attrs(), and that looks fragile and potentially buggy to me.
One solution I tried is to do the accounting in dma_alloc_attrs() directly, depending on a flag being set, similar to what __GFP_ACCOUNT is doing.
It didn't work because dmem initialises a state pointer when charging an allocation to a region, and expects that state pointer to be passed back when uncharging. Since dma_alloc_attrs() returns a void pointer to the allocated buffer, we need to put that state into a higher-level structure, such as drm_gem_object, or dma_buf.
Since we can't share the region selection logic, we need to get the region through some other means. Another thing I considered was to return the region as part of the allocated buffer (through struct page or folio), but those are lost across the calls and dma_alloc_attrs() will only return a void pointer. So that's not doable without some heavy rework, if it's a good idea at all.
One thing I forgot to mention is that it makes it harder than it could be for subsystems that can allocate from multiple allocators (like... all the ones included in this series at least).
I only added proper tracking in the backends using dma_alloc_attrs(), but they also support vmalloc. In what region vmalloc allocations should be tracked (if any) is an open question to me. Similarly, some use dma_alloc_noncontiguous().
Also, I've set the size of the "default" DMA allocation region to U64_MAX, but that's obviously wrong and will break any relative metric. I'm not sure what would be the correct size though.
Maxime
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Thanks, Christian.
Am 10.03.25 um 13:06 schrieb Maxime Ripard:
Hi,
Here's preliminary work to enable dmem tracking for heavy users of DMA allocations on behalf of userspace: v4l2, DRM, and dma-buf heaps.
It's not really meant for inclusion at the moment, because I really don't like it that much, and would like to discuss solutions on how to make it nicer.
In particular, the dma dmem region accessors don't feel that great to me. They duplicate the allocator selection logic of dma_alloc_attrs(), and that looks fragile and potentially buggy to me.
One solution I tried is to do the accounting in dma_alloc_attrs() directly, depending on a flag being set, similar to what __GFP_ACCOUNT is doing.
It didn't work because dmem initialises a state pointer when charging an allocation to a region, and expects that state pointer to be passed back when uncharging. Since dma_alloc_attrs() returns a void pointer to the allocated buffer, we need to put that state into a higher-level structure, such as drm_gem_object, or dma_buf.
Since we can't share the region selection logic, we need to get the region through some other means. Another thing I considered was to return the region as part of the allocated buffer (through struct page or folio), but those are lost across the calls and dma_alloc_attrs() will only return a void pointer. So that's not doable without some heavy rework, if it's a good idea at all.
So yeah, I went for the dumbest possible solution with the accessors, hoping you could suggest a much smarter idea :)
Thanks, Maxime
Signed-off-by: Maxime Ripard mripard@kernel.org
Maxime Ripard (12): cma: Register dmem region for each cma region cma: Provide accessor to cma dmem region dma: coherent: Register dmem region for each coherent region dma: coherent: Provide accessor to dmem region dma: contiguous: Provide accessor to dmem region dma: direct: Provide accessor to dmem region dma: Create default dmem region for DMA allocations dma: Provide accessor to dmem region dma-buf: Clear cgroup accounting on release dma-buf: cma: Account for allocations in dmem cgroup drm/gem: Add cgroup memory accounting media: videobuf2: Track buffer allocations through the dmem cgroup
drivers/dma-buf/dma-buf.c | 7 ++++ drivers/dma-buf/heaps/cma_heap.c | 18 ++++++++-- drivers/gpu/drm/drm_gem.c | 5 +++ drivers/gpu/drm/drm_gem_dma_helper.c | 6 ++++ .../media/common/videobuf2/videobuf2-dma-contig.c | 19 +++++++++++ include/drm/drm_device.h | 1 + include/drm/drm_gem.h | 2 ++ include/linux/cma.h | 9 +++++ include/linux/dma-buf.h | 5 +++ include/linux/dma-direct.h | 2 ++ include/linux/dma-map-ops.h | 32 ++++++++++++++++++ include/linux/dma-mapping.h | 11 ++++++ kernel/dma/coherent.c | 26 +++++++++++++++ kernel/dma/direct.c | 8 +++++ kernel/dma/mapping.c | 39 ++++++++++++++++++++++ mm/cma.c | 21 +++++++++++- mm/cma.h | 3 ++ 17 files changed, 211 insertions(+), 3 deletions(-)
base-commit: 55a2aa61ba59c138bd956afe0376ec412a7004cf change-id: 20250307-dmem-cgroups-73febced0989
Best regards,
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Yes. The main rationale being that it doesn't always make sense to register against memcg: a lot of devices are going to allocate from dedicated chunks of memory that are either carved out from the main memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the right tool.
Maxime
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard mripard@kernel.org wrote:
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Yes. The main rationale being that it doesn't always make sense to register against memcg: a lot of devices are going to allocate from dedicated chunks of memory that are either carved out from the main memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the right tool.
While I agree on that, if a user can cause a device driver to allocate memory that is also memory that memcg accounts, then we have to interface with memcg to account that memory.
The pathological case would be a single application wanting to use 90% of RAM for device allocations, freeing it all, then using 90% of RAM for normal usage. How to create a policy that would allow that with dmem and memcg is difficult, since if you say you can do 90% on both then the user can easily OOM the system.
Dave.
Maxime
Am 31.03.25 um 22:43 schrieb Dave Airlie:
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard mripard@kernel.org wrote:
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Yes. The main rationale being that it doesn't always make sense to register against memcg: a lot of devices are going to allocate from dedicated chunks of memory that are either carved out from the main memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the right tool.
While I agree on that, if a user can cause a device driver to allocate memory that is also memory that memcg accounts, then we have to interface with memcg to account that memory.
This assumes that memcg should be in control of device driver allocated memory. Which in some cases is intentionally not done.
E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
The pathological case would be a single application wanting to use 90% of RAM for device allocations, freeing it all, then using 90% of RAM for normal usage. How to create a policy that would allow that with dmem and memcg is difficult, since if you say you can do 90% on both then the user can easily OOM the system.
Yeah, completely agree.
That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them, see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
Another problem is when and to which process we account things when eviction happens? For example process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
If we account to process B then it can be that process A fails because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.
But process A never asked for system memory in the first place, so we can't account the memory to it either or otherwise we make the process responsible for things it didn't do.
There are good argument for all solutions and there are a couple of blocks which rule out one solution or another for a certain use case. To summarize I think the whole situation is a complete mess.
Maybe there is not this one solution and we need to make it somehow configurable?
Regards, Christian.
Dave.
Maxime
On Tue, 1 Apr 2025 at 21:03, Christian König christian.koenig@amd.com wrote:
Am 31.03.25 um 22:43 schrieb Dave Airlie:
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard mripard@kernel.org wrote:
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Yes. The main rationale being that it doesn't always make sense to register against memcg: a lot of devices are going to allocate from dedicated chunks of memory that are either carved out from the main memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the right tool.
While I agree on that, if a user can cause a device driver to allocate memory that is also memory that memcg accounts, then we have to interface with memcg to account that memory.
This assumes that memcg should be in control of device driver allocated memory. Which in some cases is intentionally not done.
E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
Yes we definitely need the ability to transfer an allocation between cgroups for this case.
That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
The pathological case would be a single application wanting to use 90% of RAM for device allocations, freeing it all, then using 90% of RAM for normal usage. How to create a policy that would allow that with dmem and memcg is difficult, since if you say you can do 90% on both then the user can easily OOM the system.
Yeah, completely agree.
That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them, see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
Another problem is when and to which process we account things when eviction happens? For example process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
If we account to process B then it can be that process A fails because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.
But process A never asked for system memory in the first place, so we can't account the memory to it either or otherwise we make the process responsible for things it didn't do.
There are good argument for all solutions and there are a couple of blocks which rule out one solution or another for a certain use case. To summarize I think the whole situation is a complete mess.
Maybe there is not this one solution and we need to make it somehow configurable?
My feeling is that we can't solve the VRAM eviction problem super effectively, but it's also probably not going to be a major common case. I don't think we should double account memcg/dmem just in case we have to evict all of a user's dmem at some point. Maybe if there was some kind of soft memcg limit we could add as an accounted-but-not-enforced overhead, it might be useful to track evictions. But yes, we can't have A allocating memory causing B to fall over because we evict memory into its memcg space and it fails to allocate the next time it tries, or having A fail in that case.
For the UMA GPU case, where there is no device memory or eviction problem, perhaps a configurable option to just say "account memory in memcg for all allocations done by this process", and state that yes, you can work around it with allocation servers or whatever, but the behaviour for well-behaved things is at least somewhat defined.
Dave.
Am 03.04.25 um 08:07 schrieb Dave Airlie:
On Tue, 1 Apr 2025 at 21:03, Christian König christian.koenig@amd.com wrote:
Am 31.03.25 um 22:43 schrieb Dave Airlie:
On Tue, 11 Mar 2025 at 00:26, Maxime Ripard mripard@kernel.org wrote:
Hi,
On Mon, Mar 10, 2025 at 03:16:53PM +0100, Christian König wrote:
[Adding Ben since we are currently in the middle of a discussion regarding exactly that problem]
Just for my understanding before I deep dive into the code: This uses a separate dmem cgroup and does not account against memcg, doesn't it?
Yes. The main rationale being that it doesn't always make sense to register against memcg: a lot of devices are going to allocate from dedicated chunks of memory that are either carved out from the main memory allocator, or not under Linux supervision at all.
And if there's no way to make it consistent across drivers, it's not the right tool.
While I agree on that, if a user can cause a device driver to allocate memory of the kind that memcg accounts, then we have to interface with memcg to account that memory.
This assumes that memcg should be in control of device-driver-allocated memory, which in some cases is intentionally avoided.
E.g. a server application which allocates buffers on behalf of clients gets a nice denial-of-service problem if we suddenly start to account those buffers.
Yes we definitely need the ability to transfer an allocation between cgroups for this case.
The bigger issue is that you break UAPI for things which are "older" (X server, older Android approaches, etc.); fixing all of this is certainly possible, but most likely simply not worth it.
There are simpler approaches to handle all this, I think, but see below for further thoughts on that topic.
That was one of the reasons why my OOM killer improvement patches never landed (e.g. you could trivially kill X/Wayland or systemd with that).
The pathological case would be a single application wanting to use 90% of RAM for device allocations, freeing it all, then using 90% of RAM for normal usage. How to create a policy that would allow that with dmem and memcg is difficult, since if you say you can do 90% on both then the user can easily OOM the system.
Yeah, completely agree.
That's why the GTT size limit we already have per device and the global 50% TTM limit don't work as expected. People also didn't like those limits, and because of that we even have flags to circumvent them, see AMDGPU_GEM_CREATE_PREEMPTIBLE and TTM_TT_FLAG_EXTERNAL.
Another problem is when, and to which process, we account things when eviction happens. For example, process A wants to use VRAM that process B currently occupies. In this case we would give both processes a mix of VRAM and system memory, but how do we account that?
If we account to process B, then process A can fail because of process B's memcg limit. This creates a situation which is absolutely not traceable for a system administrator.
But process A never asked for system memory in the first place, so we can't account the memory to it either, or otherwise we make the process responsible for things it didn't do.
There are good arguments for all solutions, and there are a couple of blockers which rule out one solution or another for a certain use case. To summarize, I think the whole situation is a complete mess.
Maybe there is no single solution and we need to make it somehow configurable?
My feeling is that we can't solve the VRAM eviction problem super effectively, but it's also probably not going to be a major common case. I don't think we should double-account memcg/dmem just in case we have to evict all of a user's dmem at some point. Maybe if there was some kind of soft memcg limit we could add as an accounted-but-not-enforced overhead, it might be useful to track evictions. But yes, we can't have A allocating memory causing B to fall over because we evict memory into B's memcg space and it then fails to allocate the next time it tries, or have A fail in that case.
+1 yeah, exactly my thinking as well.
For the UMA GPU case, where there is no device memory or eviction problem, perhaps add a configurable option to just account memory in memcg for all allocations done by this process, and state that yes, you can work around it with allocation servers or whatever, but the behaviour for well-behaved things is at least somewhat defined.
We can have that as a workaround, but I think we should approach that differently.
With upcoming CXL even coherent device memory is exposed to the core OS as NUMA memory with just a high latency.
So both in the CXL and UMA case it actually doesn't make sense to allocate the memory through the driver interfaces any more. With AMDGPU for example we are just replicating mbind()/madvise() within the driver.
Instead what the DRM subsystem should aim for is to allocate memory using the normal core OS functionality and then import it into the driver.
AMD, NVidia and Intel have had HMM working for quite a while now, but it has some limitations, especially on the performance side.
So for AMDGPU we are currently evaluating udmabuf as an alternative. That seems to be working fine with different NUMA nodes, is perfectly memcg-accounted, and gives you a DMA-buf which can be imported everywhere.
The only showstopper might be the allocation performance, but even if that's the case I think the ongoing folio work will properly resolve that.
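For context, the userspace side of that udmabuf path is pretty small; roughly (error handling trimmed, size assumed to be page-aligned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Back a buffer with ordinary anonymous memory (memcg-accounted via the
 * memfd), then turn it into a DMA-buf that drivers can import. */
static int dmabuf_from_anon_memory(size_t size)
{
        struct udmabuf_create create = { 0 };
        int memfd, devfd, dmabuf_fd;

        memfd = memfd_create("buffer", MFD_ALLOW_SEALING);
        ftruncate(memfd, size);
        /* udmabuf requires the memfd to be sealed against shrinking. */
        fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

        devfd = open("/dev/udmabuf", O_RDWR);
        create.memfd  = memfd;
        create.flags  = UDMABUF_FLAGS_CLOEXEC;
        create.offset = 0;
        create.size   = size;
        /* Returns a new DMA-buf file descriptor on success. */
        dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

        close(devfd);
        close(memfd);
        return dmabuf_fd;
}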
With that in mind, I think for the CXL/UMA use case we should use dmem to limit the driver-allocated memory to just a few megabytes for legacy things, and let the vast majority of memory allocations go through the normal core OS channels instead.
Regards, Christian.
On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
For the UMA GPU case, where there is no device memory or eviction problem, perhaps add a configurable option to just account memory in memcg for all allocations done by this process, and state that yes, you can work around it with allocation servers or whatever, but the behaviour for well-behaved things is at least somewhat defined.
We can have that as a workaround, but I think we should approach that differently.
With upcoming CXL even coherent device memory is exposed to the core OS as NUMA memory with just a high latency.
So both in the CXL and UMA case it actually doesn't make sense to allocate the memory through the driver interfaces any more. With AMDGPU for example we are just replicating mbind()/madvise() within the driver.
Instead what the DRM subsystem should aim for is to allocate memory using the normal core OS functionality and then import it into the driver.
AMD, NVidia and Intel have had HMM working for quite a while now, but it has some limitations, especially on the performance side.
So for AMDGPU we are currently evaluating udmabuf as an alternative. That seems to be working fine with different NUMA nodes, is perfectly memcg-accounted, and gives you a DMA-buf which can be imported everywhere.
The only showstopper might be the allocation performance, but even if that's the case I think the ongoing folio work will properly resolve that.
I mean, no, the showstopper is that udmabuf assumes you have an IOMMU for every device doing DMA, which is absolutely not true on !x86 platforms.
It might be true for all GPUs, but it certainly isn't for display controllers, and it isn't for codecs, ISPs, and cameras either.
And then there's the other assumption that all memory is under the memory allocator's control, which isn't the case on most recent platforms either.
We *need* to take into account CMA, all the carved-out, device-specific memory regions, and the memory regions that aren't even under Linux supervision, like protected memory that is typically handled by the firmware, where all you get is a dma-buf.
Saying that this is how you want to work around it on AMD is absolutely fine, but DRM as a whole should certainly not aim for that, because it can't.
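And for those regions, the usual way userspace gets such a buffer is through a dma-buf heap, along these lines (a sketch; the heap name is platform-dependent, "system" is just an example, error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Allocate a DMA-buf from a dma-buf heap (system, CMA, carve-out, ...). */
static int dmabuf_from_heap(const char *heap, size_t size)
{
        struct dma_heap_allocation_data data = {
                .len = size,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        char path[64];
        int heap_fd, ret;

        snprintf(path, sizeof(path), "/dev/dma_heap/%s", heap);
        heap_fd = open(path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        /* On success the DMA-buf fd is returned in data.fd. */
        return ret < 0 ? -1 : (int)data.fd;
}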
Maxime
Hi Maxime,
On 03.04.25 at 17:47, Maxime wrote:
On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
For the UMA GPU case, where there is no device memory or eviction problem, perhaps add a configurable option to just account memory in memcg for all allocations done by this process, and state that yes, you can work around it with allocation servers or whatever, but the behaviour for well-behaved things is at least somewhat defined.
We can have that as a workaround, but I think we should approach that differently.
With upcoming CXL even coherent device memory is exposed to the core OS as NUMA memory with just a high latency.
So both in the CXL and UMA case it actually doesn't make sense to allocate the memory through the driver interfaces any more. With AMDGPU for example we are just replicating mbind()/madvise() within the driver.
Instead what the DRM subsystem should aim for is to allocate memory using the normal core OS functionality and then import it into the driver.
AMD, NVidia and Intel have had HMM working for quite a while now, but it has some limitations, especially on the performance side.
So for AMDGPU we are currently evaluating udmabuf as an alternative. That seems to be working fine with different NUMA nodes, is perfectly memcg-accounted, and gives you a DMA-buf which can be imported everywhere.
The only showstopper might be the allocation performance, but even if that's the case I think the ongoing folio work will properly resolve that.
I mean, no, the showstopper is that udmabuf assumes you have an IOMMU for every device doing DMA, which is absolutely not true on !x86 platforms.
It might be true for all GPUs, but it certainly isn't for display controllers, and it isn't for codecs, ISPs, and cameras either.
And then there's the other assumption that all memory is under the memory allocator's control, which isn't the case on most recent platforms either.
We *need* to take into account CMA, all the carved-out, device-specific memory regions, and the memory regions that aren't even under Linux supervision, like protected memory that is typically handled by the firmware, where all you get is a dma-buf.
Saying that this is how you want to work around it on AMD is absolutely fine, but DRM as a whole should certainly not aim for that, because it can't.
You bring up a bunch of good points here, but it sounds like you misunderstood me a bit.
I'm certainly *not* saying that we should push for udmabuf for everything, that is clearly use case specific.
For use cases like CMA or protected carve-out, the question of what to do doesn't even arise in the first place.
When you have CMA, which dynamically steals memory from the core OS, then of course it should be accounted to memcg.
When you have a carve-out which the core OS memory management doesn't even know about, then it should certainly be handled by dmem.
The problematic use cases are the ones where a buffer can sometimes be backed by system memory and sometimes by something special. For those we don't have a good approach yet, since every option seems to have a drawback for some use case.
Regards, Christian.