This patch series revisits the proposal for a GPU cgroup controller to track and limit memory allocations by various device/allocator subsystems. The patch series also contains a simple prototype to illustrate how Android intends to implement DMA-BUF allocator attribution using the GPU cgroup controller. The prototype does not include resource limit enforcement.
Changelog:

v5:
Rebase on top of v5.18-rc3
Drop the global GPU cgroup "total" (sum of all device totals) portion of the design since there is no currently known use for this per Tejun Heo.
Fix commit message which still contained the old name for dma_buf_transfer_charge per Michal Koutný.
Remove all GPU cgroup code except what's necessary to support charge transfer from dma_buf. Previously charging was done in export, but for non-Android graphics use-cases this is not ideal since there may be a delay between allocation and export, during which time there is no accounting.
Merge dmabuf: Use the GPU cgroup charge/uncharge APIs patch into dmabuf: heaps: export system_heap buffers with GPU cgroup charging as a result of above.
Put the charge and uncharge code in the same file (system_heap_allocate, system_heap_dma_buf_release) instead of splitting them between the heap and the dma_buf_release. This avoids asymmetric management of the gpucg charges.
Modify the dma_buf_transfer_charge API to accept a task_struct instead of a gpucg. This avoids requiring the caller to manage the refcount of the gpucg upon failure and confusing ownership transfer logic.
Support all strings for gpucg_register_bucket instead of just string literals.
Enforce globally unique gpucg_bucket names.
Constrain gpucg_bucket name lengths to 64 bytes.
Append "-heap" to gpucg_bucket names from dmabuf-heaps.
Drop patch 7 from the series, which changed the types of binder_transaction_data's sender_pid and sender_euid fields. This was done in another commit here: https://lore.kernel.org/all/20220210021129.3386083-4-masahiroy@kernel.org/
Rename:
gpucg_try_charge -> gpucg_charge
find_cg_rpool_locked -> cg_rpool_find_locked
init_cg_rpool -> cg_rpool_init
get_cg_rpool_locked -> cg_rpool_get_locked
"gpu cgroup controller" -> "GPU controller"
gpucg_device -> gpucg_bucket
usage -> size
Support both binder_fd_array_object and binder_fd_object. This is necessary because new versions of Android will use binder_fd_object instead of binder_fd_array_object, and we need to support both.

Tests:
Tests for both binder_fd_array_object and binder_fd_object.
For binder_utils return error codes instead of struct binder{fs}_ctx.
Use ifdef __ANDROID__ to choose platform-dependent temp path instead of a runtime fallback.
Ensure binderfs_mntpt ends with a trailing '/' character instead of prepending it where used.
v4:
Skip test if not run as root per Shuah Khan
Add better test logging for abnormal child termination per Shuah Khan
Adjust ordering of charge/uncharge during transfer to avoid potentially hitting cgroup limit per Michal Koutný
Adjust gpucg_try_charge critical section for charge transfer functionality
Fix uninitialized return code error for dmabuf_try_charge error case
v3:
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz
Use more common dual author commit message format per John Stultz
Remove android from binder changes title per Todd Kjos
Add a kselftest for this new behavior per Greg Kroah-Hartman
Include details on behavior for all combinations of kernel/userspace versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
Fix pid and uid types in binder UAPI header
v2:
See the previous revision of this change submitted by Hridya Valsaraju at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König. Pointers to struct gpucg and struct gpucg_device tracking the current associations were added to the dma_buf struct to achieve this.
Fix incorrect Kconfig help section indentation per Randy Dunlap.
History of the GPU cgroup controller
====================================

The GPU/DRM cgroup controller came into being when a consensus[1] was reached that the resources it tracked were unsuitable to be integrated into memcg. Originally, the proposed controller was specific to the DRM subsystem and was intended to track GEM buffers and GPU-specific resources[2]. In order to help establish a unified memory accounting model for all GPU and all related subsystems, Daniel Vetter put forth a suggestion to move it out of the DRM subsystem so that it can be used by other DMA-BUF exporters as well[3]. This RFC proposes an interface that does the same.
[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-b...
[2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com...
[3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
Hridya Valsaraju (3):
  gpu: rfc: Proposal for a GPU cgroup controller
  cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
  binder: Add flags to relinquish ownership of fds
T.J. Mercier (3):
  dmabuf: heaps: export system_heap buffers with GPU cgroup charging
  dmabuf: Add gpu cgroup charge transfer function
  selftests: Add binder cgroup gpu memory transfer tests
 Documentation/gpu/rfc/gpu-cgroup.rst               | 190 ++++++++
 Documentation/gpu/rfc/index.rst                    |   4 +
 drivers/android/binder.c                           |  27 +-
 drivers/dma-buf/dma-buf.c                          |  80 ++-
 drivers/dma-buf/dma-heap.c                         |  39 ++
 drivers/dma-buf/heaps/system_heap.c                |  28 +-
 include/linux/cgroup_gpu.h                         | 137 +++++
 include/linux/cgroup_subsys.h                      |   4 +
 include/linux/dma-buf.h                            |  49 +-
 include/linux/dma-heap.h                           |  15 +
 include/uapi/linux/android/binder.h                |  23 +-
 init/Kconfig                                       |   7 +
 kernel/cgroup/Makefile                             |   1 +
 kernel/cgroup/gpu.c                                | 386 ++++++++++++++
 .../selftests/drivers/android/binder/Makefile      |   8 +
 .../drivers/android/binder/binder_util.c           | 250 ++++++++++
 .../drivers/android/binder/binder_util.h           |  32 ++
 .../selftests/drivers/android/binder/config        |   4 +
 .../binder/test_dmabuf_cgroup_transfer.c           | 526 +++++++++++++++++++
 19 files changed, 1787 insertions(+), 23 deletions(-)
 create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
 create mode 100644 tools/testing/selftests/drivers/android/binder/config
 create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
All DMA heaps now register a new GPU cgroup bucket upon creation, and the system_heap now exports buffers associated with its GPU cgroup bucket for tracking purposes.
In order to support GPU cgroup charge transfer on a dma-buf, the current GPU cgroup information must be stored inside the dma-buf struct. For tracked buffers, exporters include the struct gpucg and struct gpucg_bucket pointers in the export info which can later be modified if the charge is migrated to another cgroup.
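To make the flow concrete, here is a minimal sketch of what an exporter does with these APIs, mirroring the system_heap changes in this patch (error handling elided; variable names are illustrative):

        /* Charge the allocation to the allocating task's GPU cgroup... */
        struct gpucg *gpucg = gpucg_get(current);
        struct gpucg_bucket *bucket = dma_heap_get_gpucg_bucket(heap);
        int ret = gpucg_charge(gpucg, bucket, len);

        /*
         * ...then record the association in the export info so that
         * dma_buf_export() stores it in the new dma_buf for later
         * uncharging or charge transfer.
         */
        dma_buf_exp_info_set_gpucg(&exp_info, gpucg, bucket);
        dmabuf = dma_buf_export(&exp_info);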
Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
v5 changes
Merge "dmabuf: Use the GPU cgroup charge/uncharge APIs" into this patch.
Remove all GPU cgroup code from dma-buf except what's necessary to support charge transfer. Previously charging was done in export, but for non-Android graphics use-cases this is not ideal since there may be a delay between allocation and export, during which time there is no accounting.
Append "-heap" to gpucg_bucket names.
Charge on allocation instead of export. This should more closely mirror non-Android use-cases where there is potentially a delay between allocation and export.
Put the charge and uncharge code in the same file (system_heap_allocate, system_heap_dma_buf_release) instead of splitting them between the heap and the dma_buf_release.
Move no-op code to header file to match other files in the series.
v3 changes
Use more common dual author commit message format per John Stultz.
v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König.
---
 drivers/dma-buf/dma-buf.c           | 19 +++++++++++++
 drivers/dma-buf/dma-heap.c          | 39 +++++++++++++++++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 28 +++++++++++++++++---
 include/linux/dma-buf.h             | 41 +++++++++++++++++++++++------
 include/linux/dma-heap.h            | 15 +++++++++++
 5 files changed, 130 insertions(+), 12 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index df23239b04fc..bc89c44bd9b9 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -462,6 +462,24 @@ static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
  * &dma_buf_ops.
  */

+#ifdef CONFIG_CGROUP_GPU
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf_export_info *exp)
+{
+        dmabuf->gpucg = exp->gpucg;
+        dmabuf->gpucg_bucket = exp->gpucg_bucket;
+}
+
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+                                struct gpucg *gpucg,
+                                struct gpucg_bucket *gpucg_bucket)
+{
+        exp_info->gpucg = gpucg;
+        exp_info->gpucg_bucket = gpucg_bucket;
+}
+#else
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf_export_info *exp) {}
+#endif
+
 /**
  * dma_buf_export - Creates a new dma_buf, and associates an anon file
  * with this buffer, so it can be exported.
@@ -527,6 +545,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
         init_waitqueue_head(&dmabuf->poll);
         dmabuf->cb_in.poll = dmabuf->cb_out.poll = &dmabuf->poll;
         dmabuf->cb_in.active = dmabuf->cb_out.active = 0;
+        dma_buf_set_gpucg(dmabuf, exp_info);

         if (!resv) {
                 resv = (struct dma_resv *)&dmabuf[1];

diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index 8f5848aa144f..b81015548314 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,10 +7,12 @@
  */

 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/debugfs.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/err.h>
+#include <linux/kconfig.h>
 #include <linux/xarray.h>
 #include <linux/list.h>
 #include <linux/slab.h>
@@ -21,6 +23,7 @@
 #include <uapi/linux/dma-heap.h>

 #define DEVNAME "dma_heap"
+#define HEAP_NAME_SUFFIX "-heap"

 #define NUM_HEAP_MINORS 128

@@ -31,6 +34,7 @@
  * @heap_devt           heap device node
  * @list                list head connecting to list of heaps
  * @heap_cdev           heap char device
+ * @gpucg_bucket        gpu cgroup bucket for memory accounting
  *
  * Represents a heap of memory from which buffers can be made.
  */
@@ -41,6 +45,9 @@ struct dma_heap {
         dev_t heap_devt;
         struct list_head list;
         struct cdev heap_cdev;
+#ifdef CONFIG_CGROUP_GPU
+        struct gpucg_bucket gpucg_bucket;
+#endif
 };

 static LIST_HEAD(heap_list);

@@ -216,6 +223,19 @@ const char *dma_heap_get_name(struct dma_heap *heap)
         return heap->name;
 }

+/**
+ * dma_heap_get_gpucg_bucket() - get struct gpucg_bucket for the heap.
+ * @heap: DMA-Heap to get the gpucg_bucket struct for.
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap. NULL if the GPU cgroup controller is
+ * not enabled.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{
+        return &heap->gpucg_bucket;
+}
+
 struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 {
         struct dma_heap *heap, *h, *err_ret;
@@ -228,6 +248,12 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
                 return ERR_PTR(-EINVAL);
         }

+        if (IS_ENABLED(CONFIG_CGROUP_GPU) &&
+            strlen(exp_info->name) + strlen(HEAP_NAME_SUFFIX) >= GPUCG_BUCKET_NAME_MAX_LEN) {
+                pr_err("dma_heap: Name is too long for GPU cgroup\n");
+                return ERR_PTR(-ENAMETOOLONG);
+        }
+
         if (!exp_info->ops || !exp_info->ops->allocate) {
                 pr_err("dma_heap: Cannot add heap with invalid ops struct\n");
                 return ERR_PTR(-EINVAL);
@@ -253,6 +279,19 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
         heap->ops = exp_info->ops;
         heap->priv = exp_info->priv;

+        if (IS_ENABLED(CONFIG_CGROUP_GPU)) {
+                char gpucg_bucket_name[GPUCG_BUCKET_NAME_MAX_LEN];
+
+                snprintf(gpucg_bucket_name, sizeof(gpucg_bucket_name), "%s%s",
+                         exp_info->name, HEAP_NAME_SUFFIX);
+
+                ret = gpucg_register_bucket(dma_heap_get_gpucg_bucket(heap), gpucg_bucket_name);
+                if (ret < 0) {
+                        err_ret = ERR_PTR(ret);
+                        goto err0;
+                }
+        }
+
         /* Find unused minor number */
         ret = xa_alloc(&dma_heap_minors, &minor, heap,
                        XA_LIMIT(0, NUM_HEAP_MINORS - 1), GFP_KERNEL);

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index fcf836ba9c1f..27f686faef00 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -297,6 +297,11 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf)
         }
         sg_free_table(table);
         kfree(buffer);
+
+        if (dmabuf->gpucg && dmabuf->gpucg_bucket) {
+                gpucg_uncharge(dmabuf->gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+                gpucg_put(dmabuf->gpucg);
+        }
 }

 static const struct dma_buf_ops system_heap_buf_ops = {
@@ -346,11 +351,21 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
         struct scatterlist *sg;
         struct list_head pages;
         struct page *page, *tmp_page;
-        int i, ret = -ENOMEM;
+        struct gpucg *gpucg;
+        struct gpucg_bucket *gpucg_bucket;
+        int i, ret;
+
+        gpucg = gpucg_get(current);
+        gpucg_bucket = dma_heap_get_gpucg_bucket(heap);
+        ret = gpucg_charge(gpucg, gpucg_bucket, len);
+        if (ret)
+                goto put_gpucg;

         buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
-        if (!buffer)
-                return ERR_PTR(-ENOMEM);
+        if (!buffer) {
+                ret = -ENOMEM;
+                goto uncharge_gpucg;
+        }

         INIT_LIST_HEAD(&buffer->attachments);
         mutex_init(&buffer->lock);
@@ -396,6 +411,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
         exp_info.size = buffer->len;
         exp_info.flags = fd_flags;
         exp_info.priv = buffer;
+        dma_buf_exp_info_set_gpucg(&exp_info, gpucg, gpucg_bucket);
+
         dmabuf = dma_buf_export(&exp_info);
         if (IS_ERR(dmabuf)) {
                 ret = PTR_ERR(dmabuf);
@@ -414,7 +431,10 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
         list_for_each_entry_safe(page, tmp_page, &pages, lru)
                 __free_pages(page, compound_order(page));
         kfree(buffer);
-
+uncharge_gpucg:
+        gpucg_uncharge(gpucg, gpucg_bucket, len);
+put_gpucg:
+        gpucg_put(gpucg);
         return ERR_PTR(ret);
 }
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 2097760e8e95..8e7c55c830b3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -13,6 +13,7 @@
 #ifndef __DMA_BUF_H__
 #define __DMA_BUF_H__

+#include <linux/cgroup_gpu.h>
 #include <linux/iosys-map.h>
 #include <linux/file.h>
 #include <linux/err.h>
@@ -303,7 +304,7 @@ struct dma_buf {
         /**
          * @size:
          *
-         * Size of the buffer; invariant over the lifetime of the buffer.
+         * Size of the buffer in bytes; invariant over the lifetime of the buffer.
          */
         size_t size;

@@ -453,6 +454,14 @@ struct dma_buf {
                 struct dma_buf *dmabuf;
         } *sysfs_entry;
 #endif
+
+#ifdef CONFIG_CGROUP_GPU
+        /** @gpucg: Pointer to the GPU cgroup this buffer currently belongs to. */
+        struct gpucg *gpucg;
+
+        /** @gpucg_bucket: Pointer to the GPU cgroup bucket whence this buffer originates. */
+        struct gpucg_bucket *gpucg_bucket;
+#endif
 };

 /**
@@ -526,13 +535,15 @@ struct dma_buf_attachment {

 /**
  * struct dma_buf_export_info - holds information needed to export a dma_buf
- * @exp_name:     name of the exporter - useful for debugging.
- * @owner:        pointer to exporter module - used for refcounting kernel module
- * @ops:          Attach allocator-defined dma buf ops to the new buffer
- * @size:         Size of the buffer - invariant over the lifetime of the buffer
- * @flags:        mode flags for the file
- * @resv:         reservation-object, NULL to allocate default one
- * @priv:         Attach private data of allocator to this buffer
+ * @exp_name:     name of the exporter - useful for debugging.
+ * @owner:        pointer to exporter module - used for refcounting kernel module
+ * @ops:          Attach allocator-defined dma buf ops to the new buffer
+ * @size:         Size of the buffer in bytes - invariant over the lifetime of the buffer
+ * @flags:        mode flags for the file
+ * @resv:         reservation-object, NULL to allocate default one
+ * @priv:         Attach private data of allocator to this buffer
+ * @gpucg:        Pointer to GPU cgroup this buffer is charged to, or NULL if not charged
+ * @gpucg_bucket: Pointer to GPU cgroup bucket this buffer comes from, or NULL if not charged
  *
  * This structure holds the information required to export the buffer. Used
  * with dma_buf_export() only.
@@ -545,6 +556,10 @@ struct dma_buf_export_info {
         int flags;
         struct dma_resv *resv;
         void *priv;
+#ifdef CONFIG_CGROUP_GPU
+        struct gpucg *gpucg;
+        struct gpucg_bucket *gpucg_bucket;
+#endif
 };

 /**
@@ -630,4 +645,14 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
                  unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
+
+#ifdef CONFIG_CGROUP_GPU
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+                                struct gpucg *gpucg,
+                                struct gpucg_bucket *gpucg_bucket);
+#else /* CONFIG_CGROUP_GPU */
+static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+                                              struct gpucg *gpucg,
+                                              struct gpucg_bucket *gpucg_bucket) {}
+#endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */

diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 0c05561cad6e..6321e7636538 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -10,6 +10,7 @@
 #define _DMA_HEAPS_H

 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/types.h>

 struct dma_heap;

@@ -59,6 +60,20 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
  */
 const char *dma_heap_get_name(struct dma_heap *heap);

+#ifdef CONFIG_CGROUP_GPU
+/**
+ * dma_heap_get_gpucg_bucket() - get a pointer to the struct gpucg_bucket for the heap.
+ * @heap: DMA-Heap to retrieve gpucg_bucket for
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap);
+#else /* CONFIG_CGROUP_GPU */
+static inline struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{
+        return NULL;
+}
+#endif /* CONFIG_CGROUP_GPU */
+
 /**
  * dma_heap_add - adds a heap to dmabuf heaps
  * @exp_info:           information needed to register this heap
The dma_buf_transfer_charge function provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory.
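As a sketch of the calling convention (binder in the next patch is the real consumer; @dmabuf and @target here are assumed to have been resolved by the caller):

        /* Migrate the buffer's charge to @target's GPU cgroup. */
        int ret = dma_buf_transfer_charge(dmabuf, target);

        if (ret)
                pr_warn("dma-buf charge transfer failed: %d\n", ret);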
Originally-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
v5 changes
Fix commit message which still contained the old name for dma_buf_transfer_charge per Michal Koutný.
Modify the dma_buf_transfer_charge API to accept a task_struct instead of a gpucg. This avoids requiring the caller to manage the refcount of the gpucg upon failure and confusing ownership transfer logic.
v4 changes
Adjust ordering of charge/uncharge during transfer to avoid potentially hitting cgroup limit per Michal Koutný.
v3 changes
Use more common dual author commit message format per John Stultz.
v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König.
---
 drivers/dma-buf/dma-buf.c  | 57 +++++++++++++++++++++++++++++++++++
 include/linux/cgroup_gpu.h | 14 +++++++++
 include/linux/dma-buf.h    |  6 ++++
 kernel/cgroup/gpu.c        | 62 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 139 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index bc89c44bd9b9..f3fb844925e2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1341,6 +1341,63 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap, DMA_BUF);

+/**
+ * dma_buf_transfer_charge - Change the GPU cgroup to which the provided dma_buf is charged.
+ * @dmabuf:   [in]    buffer whose charge will be migrated to a different GPU cgroup
+ * @target:   [in]    the task_struct of the destination process for the GPU cgroup charge
+ *
+ * Only tasks that belong to the same cgroup the buffer is currently charged to
+ * may call this function, otherwise it will return -EPERM.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target)
+{
+        struct gpucg *current_gpucg, *target_gpucg, *to_release;
+        int ret;
+
+        if (!dmabuf->gpucg || !dmabuf->gpucg_bucket) {
+                /* This dmabuf is not tracked under GPU cgroup accounting */
+                return 0;
+        }
+
+        current_gpucg = gpucg_get(current);
+        target_gpucg = gpucg_get(target);
+        to_release = target_gpucg;
+
+        /* If the source and destination cgroups are the same, don't do anything. */
+        if (current_gpucg == target_gpucg) {
+                ret = 0;
+                goto skip_transfer;
+        }
+
+        /*
+         * Verify that the cgroup of the process requesting the transfer
+         * is the same as the one the buffer is currently charged to.
+         */
+        mutex_lock(&dmabuf->lock);
+        if (current_gpucg != dmabuf->gpucg) {
+                ret = -EPERM;
+                goto err;
+        }
+
+        ret = gpucg_transfer_charge(dmabuf->gpucg, target_gpucg,
+                                    dmabuf->gpucg_bucket, dmabuf->size);
+        if (ret)
+                goto err;
+
+        to_release = dmabuf->gpucg;
+        dmabuf->gpucg = target_gpucg;
+
+err:
+        mutex_unlock(&dmabuf->lock);
+skip_transfer:
+        gpucg_put(current_gpucg);
+        gpucg_put(to_release);
+        return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_transfer_charge, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {

diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
index 4dfe633d6ec7..f5973ef9f926 100644
--- a/include/linux/cgroup_gpu.h
+++ b/include/linux/cgroup_gpu.h
@@ -83,7 +83,13 @@ static inline struct gpucg *gpucg_parent(struct gpucg *cg)
 }

 int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
 void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+int gpucg_transfer_charge(struct gpucg *source,
+                          struct gpucg *dest,
+                          struct gpucg_bucket *bucket,
+                          u64 size);
 int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name);
 #else /* CONFIG_CGROUP_GPU */

@@ -118,6 +124,14 @@ static inline void gpucg_uncharge(struct gpucg *gpucg,
                                   struct gpucg_bucket *bucket,
                                   u64 size) {}

+static inline int gpucg_transfer_charge(struct gpucg *source,
+                                        struct gpucg *dest,
+                                        struct gpucg_bucket *bucket,
+                                        u64 size)
+{
+        return 0;
+}
+
 static inline int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name) {}
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* _CGROUP_GPU_H */

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8e7c55c830b3..438ad8577b76 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -18,6 +18,7 @@
 #include <linux/file.h>
 #include <linux/err.h>
 #include <linux/scatterlist.h>
+#include <linux/sched.h>
 #include <linux/list.h>
 #include <linux/dma-mapping.h>
 #include <linux/fs.h>
@@ -650,9 +651,14 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
                                 struct gpucg *gpucg,
                                 struct gpucg_bucket *gpucg_bucket);
+
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target);
 #else /* CONFIG_CGROUP_GPU */
 static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
                                               struct gpucg *gpucg,
                                               struct gpucg_bucket *gpucg_bucket) {}
+
+static inline int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target)
+{
+        return 0;
+}
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */

diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
index 34d0a5b85834..7dfbe0fd7e45 100644
--- a/kernel/cgroup/gpu.c
+++ b/kernel/cgroup/gpu.c
@@ -252,6 +252,68 @@ void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
         css_put(&gpucg->css);
 }

+/**
+ * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to another.
+ *
+ * @source:   [in]    The GPU cgroup the charge will be transferred from.
+ * @dest:     [in]    The GPU cgroup the charge will be transferred to.
+ * @bucket:   [in]    The GPU cgroup bucket corresponding to the charge.
+ * @size:     [in]    The size of the memory in bytes.
+ *                    This size will be rounded up to the nearest page size.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int gpucg_transfer_charge(struct gpucg *source,
+                          struct gpucg *dest,
+                          struct gpucg_bucket *bucket,
+                          u64 size)
+{
+        struct page_counter *counter;
+        u64 nr_pages;
+        struct gpucg_resource_pool *rp_source, *rp_dest;
+        int ret = 0;
+
+        nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+        mutex_lock(&gpucg_mutex);
+        rp_source = cg_rpool_find_locked(source, bucket);
+        if (unlikely(!rp_source)) {
+                ret = -ENOENT;
+                goto exit_early;
+        }
+
+        rp_dest = cg_rpool_get_locked(dest, bucket);
+        if (IS_ERR(rp_dest)) {
+                ret = PTR_ERR(rp_dest);
+                goto exit_early;
+        }
+
+        /*
+         * First uncharge from the pool it's currently charged to. This ordering avoids double
+         * charging while the transfer is in progress, which could cause us to hit a limit.
+         * If the try_charge fails for this transfer, we need to be able to reverse this uncharge,
+         * so we continue to hold the gpucg_mutex here.
+         */
+        page_counter_uncharge(&rp_source->total, nr_pages);
+        css_put(&source->css);
+
+        /* Now attempt the new charge */
+        if (page_counter_try_charge(&rp_dest->total, nr_pages, &counter)) {
+                css_get(&dest->css);
+        } else {
+                /*
+                 * The new charge failed, so reverse the uncharge from above. This should always
+                 * succeed since charges on source are blocked by gpucg_mutex.
+                 */
+                WARN_ON(!page_counter_try_charge(&rp_source->total, nr_pages, &counter));
+                css_get(&source->css);
+                ret = -ENOMEM;
+        }
+exit_early:
+        mutex_unlock(&gpucg_mutex);
+        return ret;
+}
+
 /**
  * gpucg_register_bucket - Registers a bucket for memory accounting using the
  * GPU cgroup controller.
From: Hridya Valsaraju <hridya@google.com>
This patch introduces flags BINDER_FD_FLAG_SENDER_NO_NEED, and BINDER_FDA_FLAG_SENDER_NO_NEED that a process sending an individual fd or fd array to another process over binder IPC can set to relinquish ownership of the fds being sent for memory accounting purposes. If the flag is found to be set during the fd or fd array translation and the fd is for a DMA-BUF, the buffer is uncharged from the sender's cgroup and charged to the receiving process's cgroup instead.
It is up to the sending process to ensure that it closes the fds regardless of whether the transfer failed or succeeded.
Most graphics shared memory allocations in Android are done by the graphics allocator HAL process. On requests from clients, the HAL process allocates memory and sends the fds to the clients over binder IPC. The graphics allocator HAL will not retain any references to the buffers. When the HAL sets *_FLAG_SENDER_NO_NEED for fd arrays holding DMA-BUF fds, or individual fd objects, the gpu cgroup controller will be able to correctly charge the buffers to the client processes instead of the graphics allocator HAL.
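For example, a sender could mark a DMA-BUF fd it is handing off roughly as follows (a hypothetical userspace snippet; dmabuf_fd and the surrounding transaction setup are assumed):

        struct binder_fd_object obj = {
                .hdr.type = BINDER_TYPE_FD,
                /* relinquish the cgroup charge along with the fd */
                .flags = BINDER_FD_FLAG_SENDER_NO_NEED,
                .fd = dmabuf_fd,
        };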
Since this is a new feature exposed to userspace, the kernel and userspace must be compatible for the accounting to work for transfers. In all cases the allocation and transport of DMA buffers via binder will succeed, but only when both the kernel supports, and userspace depends on this feature will the transfer accounting work. The possible scenarios are detailed below:
1. new kernel + old userspace
The kernel supports the feature but userspace does not use it. The old userspace won't mount the new cgroup controller, accounting is not performed, charge is not transferred.

2. old kernel + new userspace
The new cgroup controller is not supported by the kernel, accounting is not performed, charge is not transferred.

3. old kernel + old userspace
Same as #2

4. new kernel + new userspace
Cgroup is mounted, feature is supported and used.
Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
v5 changes
Support both binder_fd_array_object and binder_fd_object. This is necessary because new versions of Android will use binder_fd_object instead of binder_fd_array_object, and we need to support both.
Use the new, simpler dma_buf_transfer_charge API.
v3 changes
Remove android from title per Todd Kjos.
Use more common dual author commit message format per John Stultz.
Include details on behavior for all combinations of kernel/userspace versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König.
---
 drivers/android/binder.c            | 27 +++++++++++++++++++++++----
 drivers/dma-buf/dma-buf.c           |  4 ++--
 include/linux/dma-buf.h             |  2 +-
 include/uapi/linux/android/binder.h | 23 +++++++++++++++++++----
 4 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 8351c5638880..b07d50fe1c80 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -42,6 +42,7 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/dma-buf.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
@@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
         return ret;
 }

-static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
+static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
                                struct binder_transaction *t,
                                struct binder_thread *thread,
                                struct binder_transaction *in_reply_to)
@@ -2208,6 +2209,23 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
                 goto err_security;
         }

+        if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_SENDER_NO_NEED)) {
+                if (is_dma_buf_file(file)) {
+                        struct dma_buf *dmabuf = file->private_data;
+
+                        ret = dma_buf_transfer_charge(dmabuf, target_proc->tsk);
+                        if (ret)
+                                pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
+                                        proc->pid, thread->pid, target_proc->pid);
+                } else {
+                        binder_user_error(
+                                "%d:%d got transaction with SENDER_NO_NEED for non-dmabuf fd, %d\n",
+                                proc->pid, thread->pid, fd);
+                        ret = -EINVAL;
+                        goto err_noneed;
+                }
+        }
+
         /*
          * Add fixup record for this transaction. The allocation
          * of the fd in the target needs to be done from a
@@ -2226,6 +2244,7 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
         return ret;

 err_alloc:
+err_noneed:
 err_security:
         fput(file);
 err_fget:
@@ -2528,7 +2547,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,

                 ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
                 if (!ret)
-                        ret = binder_translate_fd(fd, offset, t, thread,
+                        ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
                                                   in_reply_to);
                 if (ret)
                         return ret > 0 ? -EINVAL : ret;
@@ -3179,8 +3198,8 @@ static void binder_transaction(struct binder_proc *proc,
                         struct binder_fd_object *fp = to_binder_fd_object(hdr);
                         binder_size_t fd_offset = object_offset +
                                 (uintptr_t)&fp->fd - (uintptr_t)fp;
-                        int ret = binder_translate_fd(fp->fd, fd_offset, t,
-                                                      thread, in_reply_to);
+                        int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
+                                                      t, thread, in_reply_to);

                         fp->pad_binder = 0;
                         if (ret < 0 ||

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f3fb844925e2..36ed6cd4ddcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -31,7 +31,6 @@

 #include "dma-buf-sysfs-stats.h"

-static inline int is_dma_buf_file(struct file *);

 struct dma_buf_list {
         struct list_head head;
@@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops = {
 /*
  * is_dma_buf_file - Check if struct file* is associated with dma_buf
  */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
 {
         return file->f_op == &dma_buf_fops;
 }
+EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);

 static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
 {

diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 438ad8577b76..2b9812758fee 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
 {
         return !!attach->importer_ops;
 }
-
+int is_dma_buf_file(struct file *file);
 struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
                                           struct device *dev);
 struct dma_buf_attachment *

diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index 11157fae8a8e..b263cbb603ea 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -91,14 +91,14 @@ struct flat_binder_object {
 /**
  * struct binder_fd_object - describes a filedescriptor to be fixed up.
  * @hdr:        common header structure
- * @pad_flags:  padding to remain compatible with old userspace code
+ * @flags:      One or more BINDER_FD_FLAG_* flags
  * @pad_binder: padding to remain compatible with old userspace code
  * @fd:         file descriptor
  * @cookie:     opaque data, used by user-space
  */
 struct binder_fd_object {
         struct binder_object_header     hdr;
-        __u32                           pad_flags;
+        __u32                           flags;
         union {
                 binder_uintptr_t        pad_binder;
                 __u32                   fd;
@@ -107,6 +107,17 @@ struct binder_fd_object {
         binder_uintptr_t                cookie;
 };

+enum {
+        /**
+         * @BINDER_FD_FLAG_SENDER_NO_NEED
+         *
+         * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
+         * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
+         * sender's cgroup and charged to the receiving process's cgroup instead.
+         */
+        BINDER_FD_FLAG_SENDER_NO_NEED = 0x2000,
+};
+
 /* struct binder_buffer_object - object describing a userspace buffer
  * @hdr:                common header structure
  * @flags:              one or more BINDER_BUFFER_* flags
@@ -141,7 +152,7 @@ enum {

 /* struct binder_fd_array_object - object describing an array of fds in a buffer
  * @hdr:                common header structure
- * @pad:                padding to ensure correct alignment
+ * @flags:              One or more BINDER_FDA_FLAG_* flags
  * @num_fds:            number of file descriptors in the buffer
  * @parent:             index in offset array to buffer holding the fd array
  * @parent_offset:      start offset of fd array in the buffer
@@ -162,12 +173,16 @@ enum {
  */
 struct binder_fd_array_object {
         struct binder_object_header     hdr;
-        __u32                           pad;
+        __u32                           flags;
         binder_size_t                   num_fds;
         binder_size_t                   parent;
         binder_size_t                   parent_offset;
 };

+enum {
+        BINDER_FDA_FLAG_SENDER_NO_NEED = BINDER_FD_FLAG_SENDER_NO_NEED,
+};
+
 /*
  * On 64-bit platforms where user code may run in 32-bits the driver must
  * translate the buffer (and local binder) addresses appropriately.
On Wed, Apr 20, 2022 at 11:52:23PM +0000, T.J. Mercier wrote:
> From: Hridya Valsaraju <hridya@google.com>
>
> This patch introduces flags BINDER_FD_FLAG_SENDER_NO_NEED, and BINDER_FDA_FLAG_SENDER_NO_NEED that a process sending an individual fd or fd array to another process over binder IPC can set to relinquish ownership of the fds being sent for memory accounting purposes. If the flag is found to be set during the fd or fd array translation and the fd is for a DMA-BUF, the buffer is uncharged from the sender's cgroup and charged to the receiving process's cgroup instead.
>
> It is up to the sending process to ensure that it closes the fds regardless of whether the transfer failed or succeeded.
>
> Most graphics shared memory allocations in Android are done by the graphics allocator HAL process. On requests from clients, the HAL process allocates memory and sends the fds to the clients over binder IPC. The graphics allocator HAL will not retain any references to the buffers. When the HAL sets *_FLAG_SENDER_NO_NEED for fd arrays holding DMA-BUF fds, or individual fd objects, the gpu cgroup controller will be able to correctly charge the buffers to the client processes instead of the graphics allocator HAL.
>
> Since this is a new feature exposed to userspace, the kernel and userspace must be compatible for the accounting to work for transfers. In all cases the allocation and transport of DMA buffers via binder will succeed, but only when both the kernel supports, and userspace depends on this feature will the transfer accounting work. The possible scenarios are detailed below:
New binder driver features which require userspace coordination can be "advertised" by the kernel via binderfs. You can see an example of how oneway_spam_detection is exposed in commit fc470abf54b2 ("binderfs: add support for feature files"). This is just an option to consider if it makes things easier in userspace. Although it seems that for the second scenario (old kernel + new userspace) the flags would just be ignored.
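For reference, a userspace probe against such a feature file might look roughly like this (a hypothetical sketch: the "gpu_cgroup_transfer" name and the /dev/binderfs mount point are illustrative only, not part of this series):

        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Modeled on the feature files added in commit fc470abf54b2:
         * returns true if the kernel advertises the named feature. */
        static bool binder_feature_supported(const char *name)
        {
                char path[256];
                char val = '0';
                int fd;

                snprintf(path, sizeof(path), "/dev/binderfs/features/%s", name);
                fd = open(path, O_RDONLY);
                if (fd < 0)
                        return false;
                read(fd, &val, 1);
                close(fd);
                return val == '1';
        }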
> 1. new kernel + old userspace
> The kernel supports the feature but userspace does not use it. The old userspace won't mount the new cgroup controller, accounting is not performed, charge is not transferred.
>
> 2. old kernel + new userspace
> The new cgroup controller is not supported by the kernel, accounting is not performed, charge is not transferred.
>
> 3. old kernel + old userspace
> Same as #2
>
> 4. new kernel + new userspace
> Cgroup is mounted, feature is supported and used.
> Signed-off-by: Hridya Valsaraju <hridya@google.com>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> ---
> v5 changes
> Support both binder_fd_array_object and binder_fd_object. This is necessary because new versions of Android will use binder_fd_object instead of binder_fd_array_object, and we need to support both.
>
> Use the new, simpler dma_buf_transfer_charge API.
>
> v3 changes
> Remove android from title per Todd Kjos.
>
> Use more common dual author commit message format per John Stultz.
>
> Include details on behavior for all combinations of kernel/userspace versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
>
> v2 changes
> Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König.
> ---
>  drivers/android/binder.c            | 27 +++++++++++++++++++++++----
>  drivers/dma-buf/dma-buf.c           |  4 ++--
>  include/linux/dma-buf.h             |  2 +-
>  include/uapi/linux/android/binder.h | 23 +++++++++++++++++++----
>  4 files changed, 45 insertions(+), 11 deletions(-)
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index 8351c5638880..b07d50fe1c80 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -42,6 +42,7 @@
>
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> +#include <linux/dma-buf.h>
>  #include <linux/fdtable.h>
>  #include <linux/file.h>
>  #include <linux/freezer.h>
> @@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
>          return ret;
>  }
>
> -static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> +static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
>                                 struct binder_transaction *t,
>                                 struct binder_thread *thread,
>                                 struct binder_transaction *in_reply_to)
> @@ -2208,6 +2209,23 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
>                  goto err_security;
>          }
>
> +        if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_SENDER_NO_NEED)) {
> +                if (is_dma_buf_file(file)) {
> +                        struct dma_buf *dmabuf = file->private_data;
> +
> +                        ret = dma_buf_transfer_charge(dmabuf, target_proc->tsk);
> +                        if (ret)
> +                                pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
> +                                        proc->pid, thread->pid, target_proc->pid);
If we fail to transfer the charge, it seems we continue with the fixup allocation and then propagate the error. Shouldn't the translation be aborted at this point instead? Or is this supposed to be handled?
> +                } else {
nit: negating is_dma_buf_file() check eliminates the "else" here.
> +                        binder_user_error(
> +                                "%d:%d got transaction with SENDER_NO_NEED for non-dmabuf fd, %d\n",
> +                                proc->pid, thread->pid, fd);
> +                        ret = -EINVAL;
> +                        goto err_noneed;
> +                }
> +        }
> +
>          /*
>           * Add fixup record for this transaction. The allocation
>           * of the fd in the target needs to be done from a
> @@ -2226,6 +2244,7 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
>          return ret;
>
>  err_alloc:
> +err_noneed:
>  err_security:
>          fput(file);
>  err_fget:
> @@ -2528,7 +2547,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
>
>                  ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
>                  if (!ret)
> -                        ret = binder_translate_fd(fd, offset, t, thread,
> +                        ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
>                                                    in_reply_to);
>                  if (ret)
>                          return ret > 0 ? -EINVAL : ret;
> @@ -3179,8 +3198,8 @@ static void binder_transaction(struct binder_proc *proc,
>                          struct binder_fd_object *fp = to_binder_fd_object(hdr);
>                          binder_size_t fd_offset = object_offset +
>                                  (uintptr_t)&fp->fd - (uintptr_t)fp;
> -                        int ret = binder_translate_fd(fp->fd, fd_offset, t,
> -                                                      thread, in_reply_to);
> +                        int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
> +                                                      t, thread, in_reply_to);
>
>                          fp->pad_binder = 0;
>                          if (ret < 0 ||
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index f3fb844925e2..36ed6cd4ddcc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -31,7 +31,6 @@
>
>  #include "dma-buf-sysfs-stats.h"
>
> -static inline int is_dma_buf_file(struct file *);
>
>  struct dma_buf_list {
>          struct list_head head;
> @@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops = {
>  /*
>   * is_dma_buf_file - Check if struct file* is associated with dma_buf
>   */
> -static inline int is_dma_buf_file(struct file *file)
> +int is_dma_buf_file(struct file *file)
>  {
>          return file->f_op == &dma_buf_fops;
>  }
> +EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);
>
>  static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
>  {
>
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 438ad8577b76..2b9812758fee 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
>  {
>          return !!attach->importer_ops;
>  }
> -
> +int is_dma_buf_file(struct file *file);
>  struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
>                                            struct device *dev);
>  struct dma_buf_attachment *
>
> diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> index 11157fae8a8e..b263cbb603ea 100644
> --- a/include/uapi/linux/android/binder.h
> +++ b/include/uapi/linux/android/binder.h
> @@ -91,14 +91,14 @@ struct flat_binder_object {
>  /**
>   * struct binder_fd_object - describes a filedescriptor to be fixed up.
>   * @hdr:        common header structure
> - * @pad_flags:  padding to remain compatible with old userspace code
Does this mean we no longer need to keep the compatibility with the "old userspace code"? Maybe these old flags are all less than 0x2000?
> + * @flags:      One or more BINDER_FD_FLAG_* flags
>   * @pad_binder: padding to remain compatible with old userspace code
>   * @fd:         file descriptor
>   * @cookie:     opaque data, used by user-space
>   */
>  struct binder_fd_object {
>          struct binder_object_header     hdr;
> -        __u32                           pad_flags;
> +        __u32                           flags;
>          union {
>                  binder_uintptr_t        pad_binder;
>                  __u32                   fd;
> @@ -107,6 +107,17 @@ struct binder_fd_object {
>          binder_uintptr_t                cookie;
>  };
>
> +enum {
> +        /**
> +         * @BINDER_FD_FLAG_SENDER_NO_NEED
> +         *
> +         * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
> +         * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
> +         * sender's cgroup and charged to the receiving process's cgroup instead.
> +         */
> +        BINDER_FD_FLAG_SENDER_NO_NEED = 0x2000,
SENDER_NO_NEED wasn't straight-forward for me. Perhaps RELINQUISH or XFER_{OWNER|CHARGE|CGROUP} could be some other options to consider.
> +};
> +
>  /* struct binder_buffer_object - object describing a userspace buffer
>   * @hdr:                common header structure
>   * @flags:              one or more BINDER_BUFFER_* flags
> @@ -141,7 +152,7 @@ enum {
>
>  /* struct binder_fd_array_object - object describing an array of fds in a buffer
>   * @hdr:                common header structure
> - * @pad:                padding to ensure correct alignment
> + * @flags:              One or more BINDER_FDA_FLAG_* flags
>   * @num_fds:            number of file descriptors in the buffer
>   * @parent:             index in offset array to buffer holding the fd array
>   * @parent_offset:      start offset of fd array in the buffer
> @@ -162,12 +173,16 @@ enum {
>   */
>  struct binder_fd_array_object {
>          struct binder_object_header     hdr;
> -        __u32                           pad;
> +        __u32                           flags;
>          binder_size_t                   num_fds;
>          binder_size_t                   parent;
>          binder_size_t                   parent_offset;
>  };
>
> +enum {
> +        BINDER_FDA_FLAG_SENDER_NO_NEED = BINDER_FD_FLAG_SENDER_NO_NEED,
> +};
> +
>  /*
>   * On 64-bit platforms where user code may run in 32-bits the driver must
>   * translate the buffer (and local binder) addresses appropriately.
>
> --
> 2.36.0.rc0.470.gd361397f0d-goog
Other than included minor comments:
Reviewed-by: Carlos Llamas <cmllamas@google.com>
--
Carlos Llamas
On Thu, Apr 21, 2022 at 11:28 AM Carlos Llamas <cmllamas@google.com> wrote:
> On Wed, Apr 20, 2022 at 11:52:23PM +0000, T.J. Mercier wrote:
> > From: Hridya Valsaraju <hridya@google.com>
> >
> > This patch introduces flags BINDER_FD_FLAG_SENDER_NO_NEED, and BINDER_FDA_FLAG_SENDER_NO_NEED that a process sending an individual fd or fd array to another process over binder IPC can set to relinquish ownership of the fds being sent for memory accounting purposes. If the flag is found to be set during the fd or fd array translation and the fd is for a DMA-BUF, the buffer is uncharged from the sender's cgroup and charged to the receiving process's cgroup instead.
> >
> > It is up to the sending process to ensure that it closes the fds regardless of whether the transfer failed or succeeded.
> >
> > Most graphics shared memory allocations in Android are done by the graphics allocator HAL process. On requests from clients, the HAL process allocates memory and sends the fds to the clients over binder IPC. The graphics allocator HAL will not retain any references to the buffers. When the HAL sets *_FLAG_SENDER_NO_NEED for fd arrays holding DMA-BUF fds, or individual fd objects, the gpu cgroup controller will be able to correctly charge the buffers to the client processes instead of the graphics allocator HAL.
> >
> > Since this is a new feature exposed to userspace, the kernel and userspace must be compatible for the accounting to work for transfers. In all cases the allocation and transport of DMA buffers via binder will succeed, but only when both the kernel supports, and userspace depends on this feature will the transfer accounting work. The possible scenarios are detailed below:
> New binder driver features which require userspace coordination can be "advertised" by the kernel via binderfs. You can see an example of how oneway_spam_detection is exposed in commit fc470abf54b2 ("binderfs: add support for feature files"). This is just an option to consider if it makes things easier in userspace. Although it seems that for the second scenario (old kernel + new userspace) the flags would just be ignored.
This is a cool idea. You're right that the flags would be ignored. Since this isn't a binder feature that can be toggled like BINDER_ENABLE_ONEWAY_SPAM_DETECTION, I think the presence of the GPU cgroup controller in the cgroup.controllers file would also tell us the same thing from userspace.
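That check could be as simple as scanning cgroup.controllers, e.g. (a sketch; assumes cgroup2 is mounted at /sys/fs/cgroup and that the controller shows up under the name "gpu"):

        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        /* Returns true if the "gpu" controller is available on the
         * unified cgroup hierarchy. */
        static bool gpu_cgroup_available(void)
        {
                char line[512];
                FILE *f = fopen("/sys/fs/cgroup/cgroup.controllers", "r");
                bool found = false;

                if (!f)
                        return false;
                if (fgets(line, sizeof(line), f)) {
                        /* controllers are space-separated on one line */
                        for (char *tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n"))
                                found |= !strcmp(tok, "gpu");
                }
                fclose(f);
                return found;
        }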
> > 1. new kernel + old userspace
> > The kernel supports the feature but userspace does not use it. The old userspace won't mount the new cgroup controller, accounting is not performed, charge is not transferred.
> >
> > 2. old kernel + new userspace
> > The new cgroup controller is not supported by the kernel, accounting is not performed, charge is not transferred.
> >
> > 3. old kernel + old userspace
> > Same as #2
> >
> > 4. new kernel + new userspace
> > Cgroup is mounted, feature is supported and used.
> > Signed-off-by: Hridya Valsaraju <hridya@google.com>
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> > v5 changes
> > Support both binder_fd_array_object and binder_fd_object. This is necessary because new versions of Android will use binder_fd_object instead of binder_fd_array_object, and we need to support both.
> >
> > Use the new, simpler dma_buf_transfer_charge API.
> >
> > v3 changes
> > Remove android from title per Todd Kjos.
> >
> > Use more common dual author commit message format per John Stultz.
> >
> > Include details on behavior for all combinations of kernel/userspace versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
> >
> > v2 changes
> > Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König.
> > ---
> >  drivers/android/binder.c            | 27 +++++++++++++++++++++++----
> >  drivers/dma-buf/dma-buf.c           |  4 ++--
> >  include/linux/dma-buf.h             |  2 +-
> >  include/uapi/linux/android/binder.h | 23 +++++++++++++++++++----
> >  4 files changed, 45 insertions(+), 11 deletions(-)
> > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > index 8351c5638880..b07d50fe1c80 100644
> > --- a/drivers/android/binder.c
> > +++ b/drivers/android/binder.c
> > @@ -42,6 +42,7 @@
> >
> >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> >
> > +#include <linux/dma-buf.h>
> >  #include <linux/fdtable.h>
> >  #include <linux/file.h>
> >  #include <linux/freezer.h>
> > @@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
> >          return ret;
> >  }
> >
> > -static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> > +static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
> >                                 struct binder_transaction *t,
> >                                 struct binder_thread *thread,
> >                                 struct binder_transaction *in_reply_to)
> > @@ -2208,6 +2209,23 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> >                  goto err_security;
> >          }
> >
> > +        if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_SENDER_NO_NEED)) {
> > +                if (is_dma_buf_file(file)) {
> > +                        struct dma_buf *dmabuf = file->private_data;
> > +
> > +                        ret = dma_buf_transfer_charge(dmabuf, target_proc->tsk);
> > +                        if (ret)
> > +                                pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
> > +                                        proc->pid, thread->pid, target_proc->pid);
> If we fail to transfer the charge, it seems we continue with the fixup allocation and then propagate the error. Shouldn't the translation be aborted at this point instead? Or is this supposed to be handled?
I took the position that it was better to have incorrect accounting along with this log statement than potentially causing lots of crashes due to failed transactions. However if limiting gets added for the GPU cgroup, then we really should kill the transaction here. I'll go ahead and add that goto now.
> > +                } else {
> nit: negating is_dma_buf_file() check eliminates the "else" here.
Thanks.
> > +                        binder_user_error(
> > +                                "%d:%d got transaction with SENDER_NO_NEED for non-dmabuf fd, %d\n",
> > +                                proc->pid, thread->pid, fd);
> > +                        ret = -EINVAL;
> > +                        goto err_noneed;
> > +                }
> > +        }
> > +
> >          /*
> >           * Add fixup record for this transaction. The allocation
> >           * of the fd in the target needs to be done from a
> > @@ -2226,6 +2244,7 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> >          return ret;
> >
> >  err_alloc:
> > +err_noneed:
> >  err_security:
> >          fput(file);
> >  err_fget:
> > @@ -2528,7 +2547,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
> >
> >                  ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
> >                  if (!ret)
> > -                        ret = binder_translate_fd(fd, offset, t, thread,
> > +                        ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
> >                                                    in_reply_to);
> >                  if (ret)
> >                          return ret > 0 ? -EINVAL : ret;
> > @@ -3179,8 +3198,8 @@ static void binder_transaction(struct binder_proc *proc,
> >                          struct binder_fd_object *fp = to_binder_fd_object(hdr);
> >                          binder_size_t fd_offset = object_offset +
> >                                  (uintptr_t)&fp->fd - (uintptr_t)fp;
> > -                        int ret = binder_translate_fd(fp->fd, fd_offset, t,
> > -                                                      thread, in_reply_to);
> > +                        int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
> > +                                                      t, thread, in_reply_to);
> >
> >                          fp->pad_binder = 0;
> >                          if (ret < 0 ||
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index f3fb844925e2..36ed6cd4ddcc 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -31,7 +31,6 @@
> >
> >  #include "dma-buf-sysfs-stats.h"
> >
> > -static inline int is_dma_buf_file(struct file *);
> >
> >  struct dma_buf_list {
> >          struct list_head head;
> > @@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops = {
> >  /*
> >   * is_dma_buf_file - Check if struct file* is associated with dma_buf
> >   */
> > -static inline int is_dma_buf_file(struct file *file)
> > +int is_dma_buf_file(struct file *file)
> >  {
> >          return file->f_op == &dma_buf_fops;
> >  }
> > +EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);
> >
> >  static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
> >  {
> >
> > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > index 438ad8577b76..2b9812758fee 100644
> > --- a/include/linux/dma-buf.h
> > +++ b/include/linux/dma-buf.h
> > @@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
> >  {
> >          return !!attach->importer_ops;
> >  }
> > -
> > +int is_dma_buf_file(struct file *file);
> >  struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
> >                                            struct device *dev);
> >  struct dma_buf_attachment *
> >
> > diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> > index 11157fae8a8e..b263cbb603ea 100644
> > --- a/include/uapi/linux/android/binder.h
> > +++ b/include/uapi/linux/android/binder.h
> > @@ -91,14 +91,14 @@ struct flat_binder_object {
> >  /**
> >   * struct binder_fd_object - describes a filedescriptor to be fixed up.
> >   * @hdr:        common header structure
> > - * @pad_flags:  padding to remain compatible with old userspace code
> Does this mean we no longer need to keep the compatibility with the "old userspace code"? Maybe these old flags are all less than 0x2000?
This comes from before binder_fd{_array}_object existed as a distinct type from flat_binder_object. With this layout, it's possible to cast between flat_binder_object and the binder_fd{_array}_object types. I don't think there were ever any binder_fd{_array}_object specific flags before now, but yes the value of 0x2000 was chosen to be sure that the FLAT_BINDER_* flags do not conflict. I did try smaller values (0x02) but found that occasionally this bit was set when I was not expecting it to be.
https://lore.kernel.org/lkml/1486161652-2612-2-git-send-email-john.stultz@li...
> > + * @flags:      One or more BINDER_FD_FLAG_* flags
> >   * @pad_binder: padding to remain compatible with old userspace code
> >   * @fd:         file descriptor
> >   * @cookie:     opaque data, used by user-space
> >   */
> >  struct binder_fd_object {
> >          struct binder_object_header     hdr;
> > -        __u32                           pad_flags;
> > +        __u32                           flags;
> >          union {
> >                  binder_uintptr_t        pad_binder;
> >                  __u32                   fd;
> > @@ -107,6 +107,17 @@ struct binder_fd_object {
> >          binder_uintptr_t                cookie;
> >  };
> >
> > +enum {
> > +        /**
> > +         * @BINDER_FD_FLAG_SENDER_NO_NEED
> > +         *
> > +         * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
> > +         * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
> > +         * sender's cgroup and charged to the receiving process's cgroup instead.
> > +         */
> > +        BINDER_FD_FLAG_SENDER_NO_NEED = 0x2000,
> SENDER_NO_NEED wasn't straight-forward for me. Perhaps RELINQUISH or XFER_{OWNER|CHARGE|CGROUP} could be some other options to consider.
I'm happy to change this up. I like _XFER_CHARGE the best out of these.
> > +};
> > +
> >  /* struct binder_buffer_object - object describing a userspace buffer
> >   * @hdr:                common header structure
> >   * @flags:              one or more BINDER_BUFFER_* flags
> > @@ -141,7 +152,7 @@ enum {
> >
> >  /* struct binder_fd_array_object - object describing an array of fds in a buffer
> >   * @hdr:                common header structure
> > - * @pad:                padding to ensure correct alignment
> > + * @flags:              One or more BINDER_FDA_FLAG_* flags
> >   * @num_fds:            number of file descriptors in the buffer
> >   * @parent:             index in offset array to buffer holding the fd array
> >   * @parent_offset:      start offset of fd array in the buffer
> > @@ -162,12 +173,16 @@ enum {
> >   */
> >  struct binder_fd_array_object {
> >          struct binder_object_header     hdr;
> > -        __u32                           pad;
> > +        __u32                           flags;
> >          binder_size_t                   num_fds;
> >          binder_size_t                   parent;
> >          binder_size_t                   parent_offset;
> >  };
> >
> > +enum {
> > +        BINDER_FDA_FLAG_SENDER_NO_NEED = BINDER_FD_FLAG_SENDER_NO_NEED,
> > +};
> > +
> >  /*
> >   * On 64-bit platforms where user code may run in 32-bits the driver must
> >   * translate the buffer (and local binder) addresses appropriately.
> >
> > --
> > 2.36.0.rc0.470.gd361397f0d-goog
> Other than included minor comments:
>
> Reviewed-by: Carlos Llamas <cmllamas@google.com>
>
> --
> Carlos Llamas
On Wed, Apr 20, 2022 at 11:52:18PM +0000, T.J. Mercier wrote:
> This patch series revisits the proposal for a GPU cgroup controller to track and limit memory allocations by various device/allocator subsystems. The patch series also contains a simple prototype to illustrate how Android intends to implement DMA-BUF allocator attribution using the GPU cgroup controller. The prototype does not include resource limit enforcement.
>
> Changelog:
> v5:
> Rebase on top of v5.18-rc3
Why is a "RFC" series on v5? I treat "RFC" as "not ready to be merged, if people are interested, please look at it". But v5 seems like you think this is real.
confused,
greg k-h
On Fri, Apr 22, 2022 at 7:53 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> On Wed, Apr 20, 2022 at 11:52:18PM +0000, T.J. Mercier wrote:
> > This patch series revisits the proposal for a GPU cgroup controller to track and limit memory allocations by various device/allocator subsystems. The patch series also contains a simple prototype to illustrate how Android intends to implement DMA-BUF allocator attribution using the GPU cgroup controller. The prototype does not include resource limit enforcement.
> >
> > Changelog:
> > v5:
> > Rebase on top of v5.18-rc3
>
> Why is a "RFC" series on v5? I treat "RFC" as "not ready to be merged, if people are interested, please look at it". But v5 seems like you think this is real.
>
> confused,
>
> greg k-h
I'm sorry for the confusion. I'll change this to PATCH in future revisions.