On 16.04.25 at 23:51, Juan Yescas wrote:
> On Wed, Apr 16, 2025 at 4:34 AM Christian König
> <christian.koenig(a)amd.com> wrote:
>>
>>
>> On 15.04.25 at 19:19, Juan Yescas wrote:
>>> This change sets the allocation orders for the different page sizes
>>> (4k, 16k, 64k) based on PAGE_SHIFT. Before this change, the orders
>>> for large page sizes were calculated incorrectly, which caused the system
>>> heap to allocate from 2% to 4% more memory on 16KiB page size kernels.
>>>
>>> This change was tested on 4k/16k page size kernels.
>>>
>>> Signed-off-by: Juan Yescas <jyescas(a)google.com>
>>> ---
>>> drivers/dma-buf/heaps/system_heap.c | 9 ++++++++-
>>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>> index 26d5dc89ea16..54674c02dcb4 100644
>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>> @@ -50,8 +50,15 @@ static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
>>> * to match with the sizes often found in IOMMUs. Using order 4 pages instead
>>> * of order 0 pages can significantly improve the performance of many IOMMUs
>>> * by reducing TLB pressure and time spent updating page tables.
>>> + *
>>> + * Note: When the order is 0, the minimum allocation is PAGE_SIZE. The possible
>>> + * page sizes for ARM devices could be 4K, 16K and 64K.
>>> */
>>> -static const unsigned int orders[] = {8, 4, 0};
>>> +#define ORDER_1M (20 - PAGE_SHIFT)
>>> +#define ORDER_64K (16 - PAGE_SHIFT)
>>> +#define ORDER_FOR_PAGE_SIZE (0)
>>> +static const unsigned int orders[] = {ORDER_1M, ORDER_64K, ORDER_FOR_PAGE_SIZE};
>>> +#
>> Good catch, but I think the defines are just overkill.
>>
>> What you should do instead is subtract PAGE_SHIFT when using the array.
>>
> There are several occurrences of orders in the file, and I think it is
> better to do the calculations up front in the array. Would you be OK
> if we get rid of the defines, as per your suggestion, and do the
> calculations in the definition of the array? Something like this:
>
> static const unsigned int orders[] = {20 - PAGE_SHIFT, 16 - PAGE_SHIFT, 0};
Yeah that totally works for me as well. Just make sure that a code comment nearby explains why 20, 16 and 0.
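For illustration, a sketch of how that could look, with the kind of comment asked for above (the values are taken straight from the suggestion):

/*
 * Allocation orders, largest to smallest. They are expressed relative to
 * PAGE_SHIFT so the chunk sizes stay the same on 4K/16K/64K kernels:
 *   20 - PAGE_SHIFT -> 1 MiB chunks
 *   16 - PAGE_SHIFT -> 64 KiB chunks
 *   0               -> single pages (PAGE_SIZE), the fallback
 */
static const unsigned int orders[] = {20 - PAGE_SHIFT, 16 - PAGE_SHIFT, 0};
#define NUM_ORDERS ARRAY_SIZE(orders)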
>> Apart from that, using 1M, 64K and then falling back to 4K just sounds random to me. We have pushed back on 64K more than once because it is actually not beneficial in almost all cases.
>>
> In the hardware where the driver is used, 64K is beneficial for:
>
> Arm SMMUv3 (4KB granule size):
> 64KB mappings via the contiguous bit, which groups 16 contiguous 4KB pages.
>
> SysMMU, which benefits from 64KB (“large” page) mappings on 4k/16k page sizes.
Question: could this HW also work with pages larger than 64K? (I strongly expect yes.)
>> I suggest fixing the code in system_heap_allocate to not over-allocate, and instead just try the available orders like TTM does. This has proven to work independent of the architecture.
>>
> Do you mean to have an implementation similar to __ttm_pool_alloc()?
Yeah something like that.
> https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree…
>
> If that is the case, we can try the change, run some benchmarks and
> submit the patch if we see positive results.
Could be that this doesn't matter for your platform, but asking for larger chunks first adds only minimal overhead, and it avoids fragmenting everything into 64K chunks.
That is kind of important since the code in DMA-heaps should be platform neutral if possible.
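For reference, a minimal sketch of that "try the available orders" pattern, hand-written here rather than copied from TTM:

/*
 * For each chunk, take the largest order that still fits in the remaining
 * size, so a small allocation is never rounded up to a large chunk while
 * large allocations still get large pages first.
 */
static struct page *alloc_largest_available(unsigned long remaining,
					    unsigned int max_order)
{
	struct page *page;
	int i;

	for (i = 0; i < NUM_ORDERS; i++) {
		if (remaining < (PAGE_SIZE << orders[i]))
			continue;	/* would over-allocate */
		if (orders[i] > max_order)
			continue;	/* don't retry an order that already failed */

		page = alloc_pages(order_flags[i], orders[i]);
		if (page)
			return page;
	}
	return NULL;	/* even order 0 failed */
}

The caller would loop over the buffer size, subtracting each chunk and lowering max_order whenever an allocation fails, so a failed order is never retried.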
Regards,
Christian.
>
> Thanks
> Juan
>
>> Regards,
>> Christian.
>>
>>> #define NUM_ORDERS ARRAY_SIZE(orders)
>>>
>>> static struct sg_table *dup_sg_table(struct sg_table *table)
On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 4:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 4:08 PM Song Liu <song(a)kernel.org> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > [...]
> > > > >
> > > > > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > > > > overkill to implement a new BPF iterator for it.
> > > >
> > > > Like other BPF iterators such as kmem_cache_iter or task_iter.
> > > > Cgroup_iter iterates trees instead of lists. This is iterating over
> > > > kernel objects just like the docs say, "A BPF iterator is a type of
> > > > BPF program that allows users to iterate over specific types of kernel
> > > > objects". More complicated iteration should not be a requirement here.
> > > >
> > > > > Maybe we simply
> > > > > use debugging tools like crash or drgn for this? The access with
> > > > > these tools will not be protected by the mutex. But from my personal
> > > > > experience, this is not a big issue for user space debugging tools.
> > > >
> > > > drgn is *way* too slow, and even if it weren't, the dependencies for
> > > > running it aren't available. crash needs debug symbols, which also
> > > > aren't available on user builds. This is not just for manual
> > > > debugging, it's for reporting memory use in production. Or anything
> > > > else someone might care to extract like attachment info or refcounts.
> > >
> > > Could you please share more information about the use cases and
> > > the time constraint here, and why drgn is too slow. Does most of the delay
> > > come from parsing DWARF? This is mostly for my curiosity, because
> > > I have been thinking about using drgn to do some monitoring in
> > > production.
> > >
> > > Thanks,
> > > Song
> >
> > These RunCommands have 10-second timeouts, for example. It's rare that
> > I see them get exceeded, but it happens occasionally:
> > https://cs.android.com/android/platform/superproject/main/+/main:frameworks…
>
> Thanks for sharing this information.
>
> > The last time I used drgn (admittedly back in 2023) it took over a
> > minute to iterate through less than 200 cgroups. I'm not sure what the
> > root cause of the slowness was, but I'd expect the DWARF processing to
> > be done up-front once and the slowness I experienced was not just at
> > startup. Eventually I switched over to tracefs for that issue, which
> > we still use for some telemetry.
>
> I haven't tried drgn on Android. On the server side, iterating over 200
> cgroups should be fairly fast (< 5 seconds, where DWARF parsing is
> the most expensive part).
>
> > Other uses are by statsd for telemetry, memory reporting on app kills
> > or death, and for "dumpsys meminfo".
>
> Here is another rookie question: it appears to me there is a file descriptor
> associated with each DMA buffer. Can we achieve the same goal with
> a task-file iterator?
That would find almost all of them, but not the kernel-only
allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
If there's a leak, it's likely to show up in kernel_rss because some
driver forgot to release its reference(s).) Also wouldn't that be a
ton more iterations since we'd have to visit every FD to find the
small portion that are dmabufs? I'm not actually sure if buffers that
have been mapped, and then have had their file descriptors closed
would show up in task_struct->files; if not I think that would mean
scanning both files and vmas for each task.
On Wed, Apr 16, 2025 at 4:08 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> [...]
> > >
> > > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > > overkill to implement a new BPF iterator for it.
> >
> > Like other BPF iterators such as kmem_cache_iter or task_iter.
> > Cgroup_iter iterates trees instead of lists. This is iterating over
> > kernel objects just like the docs say, "A BPF iterator is a type of
> > BPF program that allows users to iterate over specific types of kernel
> > objects". More complicated iteration should not be a requirement here.
> >
> > > Maybe we simply
> > > use debugging tools like crash or drgn for this? The access with
> > > these tools will not be protected by the mutex. But from my personal
> > > experience, this is not a big issue for user space debugging tools.
> >
> > drgn is *way* too slow, and even if it weren't, the dependencies for
> > running it aren't available. crash needs debug symbols, which also
> > aren't available on user builds. This is not just for manual
> > debugging, it's for reporting memory use in production. Or anything
> > else someone might care to extract like attachment info or refcounts.
>
> Could you please share more information about the use cases and
> the time constraint here, and why drgn is too slow. Does most of the delay
> come from parsing DWARF? This is mostly for my curiosity, because
> I have been thinking about using drgn to do some monitoring in
> production.
>
> Thanks,
> Song
These RunCommands have 10-second timeouts, for example. It's rare that
I see them get exceeded, but it happens occasionally:
https://cs.android.com/android/platform/superproject/main/+/main:frameworks…
The last time I used drgn (admittedly back in 2023) it took over a
minute to iterate through less than 200 cgroups. I'm not sure what the
root cause of the slowness was, but I'd expect the DWARF processing to
be done up-front once and the slowness I experienced was not just at
startup. Eventually I switched over to tracefs for that issue, which
we still use for some telemetry.
Other uses are by statsd for telemetry, memory reporting on app kills
or death, and for "dumpsys meminfo".
Thanks,
T.J.
On Wed, Apr 16, 2025 at 3:02 PM Song Liu <song(a)kernel.org> wrote:
>
> On Mon, Apr 14, 2025 at 3:53 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> [...]
> > +
> > +BTF_ID_LIST_GLOBAL_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)
> > +DEFINE_BPF_ITER_FUNC(dmabuf, struct bpf_iter_meta *meta, struct dma_buf *dmabuf)
> > +
> > +static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct dma_buf *dmabuf, *ret = NULL;
> > +
> > + if (*pos) {
> > + *pos = 0;
> > + return NULL;
> > + }
> > + /* Look for the first buffer we can obtain a reference to.
> > + * The list mutex does not protect a dmabuf's refcount, so it can be
> > + * zeroed while we are iterating. Therefore we cannot call get_dma_buf()
> > + * since the caller of this program may not already own a reference to
> > + * the buffer.
> > + */
> > + mutex_lock(&dmabuf_debugfs_list_mutex);
> > + list_for_each_entry(dmabuf, &dmabuf_debugfs_list, list_node) {
> > + if (file_ref_get(&dmabuf->file->f_ref)) {
> > + ret = dmabuf;
> > + break;
> > + }
> > + }
> > + mutex_unlock(&dmabuf_debugfs_list_mutex);
>
> IIUC, the iterator simply traverses elements in a linked list. I feel it is
> overkill to implement a new BPF iterator for it.
Like other BPF iterators such as kmem_cache_iter or task_iter.
Cgroup_iter iterates trees instead of lists. This is iterating over
kernel objects just like the docs say, "A BPF iterator is a type of
BPF program that allows users to iterate over specific types of kernel
objects". More complicated iteration should not be a requirement here.
> Maybe we simply
> use debugging tools like crash or drgn for this? The access with
> these tools will not be protected by the mutex. But from my personal
> experience, this is not a big issue for user space debugging tools.
drgn is *way* too slow, and even if it weren't, the dependencies for
running it aren't available. crash needs debug symbols, which also
aren't available on user builds. This is not just for manual
debugging, it's for reporting memory use in production. Or anything
else someone might care to extract like attachment info or refcounts.
> Thanks,
> Song
>
>
> > +
> > + return ret;
> > +}
Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
perform per-buffer accounting with debugfs, which is not suitable for
production environments. Eventually we discovered the overhead with
per-buffer sysfs file creation/removal was significantly impacting
allocation and free times, and exacerbated kernfs lock contention. [2]
dma_buf_stats_setup() is responsible for 39% of single-page buffer
creation duration, or 74% of single-page dma_buf_export() duration when
stressing dmabuf allocations and frees.
I prototyped a change from per-buffer to per-exporter statistics with an
RCU-protected list of exporter allocations that accommodates most (but
not all) of our use-cases and avoids almost all of the sysfs overhead.
While that adds less overhead than per-buffer sysfs, and less even than
the maintenance of the dmabuf debugfs_list, it's still *additional*
overhead on top of the debugfs_list and doesn't give us per-buffer info.
This series uses the existing dmabuf debugfs_list to implement a BPF
dmabuf iterator, which adds no overhead to buffer allocation/free and
provides per-buffer info. While the kernel must have CONFIG_DEBUG_FS for
the dmabuf_iter to be available, debugfs does not need to be mounted.
The BPF program loaded by userspace that extracts per-buffer information
gets to define its own interface which avoids the lack of ABI stability
with debugfs (even if it were mounted).
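To make that concrete, a hedged sketch of such a program: the "iter/dmabuf" section name and context layout follow DEFINE_BPF_ITER_FUNC() in patch 2, but the output format here is made up, since the point is that userspace chooses it.

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/dmabuf")
int dump_dmabuf(struct bpf_iter__dmabuf *ctx)
{
	struct dma_buf *dmabuf = ctx->dmabuf;

	/* A NULL buffer marks the end of the iteration. */
	if (!dmabuf)
		return 0;

	/* Emit whichever per-buffer fields this deployment cares about. */
	BPF_SEQ_PRINTF(ctx->meta->seq, "%s %lu\n",
		       dmabuf->exp_name, (unsigned long)dmabuf->size);
	return 0;
}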
As this is a replacement for our use of CONFIG_DMABUF_SYSFS_STATS, the
last patch is an RFC for removing it from the kernel. Please see my
suggestion there regarding the timeline for that.
[1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.…
[2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com/
T.J. Mercier (4):
dma-buf: Rename and expose debugfs symbols
bpf: Add dmabuf iterator
selftests/bpf: Add test for dmabuf_iter
RFC: dma-buf: Remove DMA-BUF statistics
.../ABI/testing/sysfs-kernel-dmabuf-buffers | 24 ---
Documentation/driver-api/dma-buf.rst | 5 -
drivers/dma-buf/Kconfig | 15 --
drivers/dma-buf/Makefile | 1 -
drivers/dma-buf/dma-buf-sysfs-stats.c | 202 ------------------
drivers/dma-buf/dma-buf-sysfs-stats.h | 35 ---
drivers/dma-buf/dma-buf.c | 40 +---
include/linux/btf_ids.h | 1 +
include/linux/dma-buf.h | 6 +
kernel/bpf/Makefile | 3 +
kernel/bpf/dmabuf_iter.c | 130 +++++++++++
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/dmabuf_iter.c | 116 ++++++++++
.../testing/selftests/bpf/progs/dmabuf_iter.c | 31 +++
14 files changed, 299 insertions(+), 311 deletions(-)
delete mode 100644 Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.c
delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.h
create mode 100644 kernel/bpf/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
--
2.49.0.604.gff1f9ca942-goog
On 15.04.25 at 19:19, Juan Yescas wrote:
> This change sets the allocation orders for the different page sizes
> (4k, 16k, 64k) based on PAGE_SHIFT. Before this change, the orders
> for large page sizes were calculated incorrectly, which caused the system
> heap to allocate from 2% to 4% more memory on 16KiB page size kernels.
>
> This change was tested on 4k/16k page size kernels.
>
> Signed-off-by: Juan Yescas <jyescas(a)google.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index 26d5dc89ea16..54674c02dcb4 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -50,8 +50,15 @@ static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
> * to match with the sizes often found in IOMMUs. Using order 4 pages instead
> * of order 0 pages can significantly improve the performance of many IOMMUs
> * by reducing TLB pressure and time spent updating page tables.
> + *
> + * Note: When the order is 0, the minimum allocation is PAGE_SIZE. The possible
> + * page sizes for ARM devices could be 4K, 16K and 64K.
> */
> -static const unsigned int orders[] = {8, 4, 0};
> +#define ORDER_1M (20 - PAGE_SHIFT)
> +#define ORDER_64K (16 - PAGE_SHIFT)
> +#define ORDER_FOR_PAGE_SIZE (0)
> +static const unsigned int orders[] = {ORDER_1M, ORDER_64K, ORDER_FOR_PAGE_SIZE};
> +#
Good catch, but I think the defines are just overkill.
What you should do instead is subtract PAGE_SHIFT when using the array.
Apart from that, using 1M, 64K and then falling back to 4K just sounds random to me. We have pushed back on 64K more than once because it is actually not beneficial in almost all cases.
I suggest fixing the code in system_heap_allocate to not over-allocate, and instead just try the available orders like TTM does. This has proven to work independent of the architecture.
Regards,
Christian.
> #define NUM_ORDERS ARRAY_SIZE(orders)
>
> static struct sg_table *dup_sg_table(struct sg_table *table)
On 14/04/2025 21:41, Adrián Larumbe wrote:
> On 14.04.2025 11:01, Steven Price wrote:
>> On 11/04/2025 16:03, Adrián Larumbe wrote:
>>> Allow UM to label a BO for which it possesses a DRM handle.
>>>
>>> Signed-off-by: Adrián Larumbe <adrian.larumbe(a)collabora.com>
>>> Reviewed-by: Liviu Dudau <liviu.dudau(a)arm.com>
>>> Reviewed-by: Boris Brezillon <boris.brezillon(a)collabora.com>
>>
>> Reviewed-by: Steven Price <steven.price(a)arm.com>
>>
>> Although very minor NITs below which you can consider.
>>
>>> ---
>>> drivers/gpu/drm/panthor/panthor_drv.c | 42 ++++++++++++++++++++++++++-
>>> drivers/gpu/drm/panthor/panthor_gem.h | 2 ++
>>> include/uapi/drm/panthor_drm.h | 23 +++++++++++++++
>>> 3 files changed, 66 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
>>> index 06fe46e32073..983b24f1236c 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
>>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
>>> @@ -1331,6 +1331,44 @@ static int panthor_ioctl_vm_get_state(struct drm_device *ddev, void *data,
>>> return 0;
>>> }
>>>
>>> +static int panthor_ioctl_bo_set_label(struct drm_device *ddev, void *data,
>>> + struct drm_file *file)
>>> +{
>>> + struct drm_panthor_bo_set_label *args = data;
>>> + struct drm_gem_object *obj;
>>> + const char *label;
>>> + int ret = 0;
>>> +
>>> + obj = drm_gem_object_lookup(file, args->handle);
>>> + if (!obj)
>>> + return -ENOENT;
>>> +
>>> + if (args->size && args->label) {
>>> + if (args->size > PANTHOR_BO_LABEL_MAXLEN) {
>>> + ret = -E2BIG;
>>> + goto err_label;
>>> + }
>>> +
>>> + label = strndup_user(u64_to_user_ptr(args->label), args->size);
>>> + if (IS_ERR(label)) {
>>> + ret = PTR_ERR(label);
>>> + goto err_label;
>>> + }
>>> + } else if (args->size && !args->label) {
>>> + ret = -EINVAL;
>>> + goto err_label;
>>> + } else {
>>> + label = NULL;
>>> + }
>>> +
>>> + panthor_gem_bo_set_label(obj, label);
>>> +
>>> +err_label:
>>> + drm_gem_object_put(obj);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> static int
>>> panthor_open(struct drm_device *ddev, struct drm_file *file)
>>> {
>>> @@ -1400,6 +1438,7 @@ static const struct drm_ioctl_desc panthor_drm_driver_ioctls[] = {
>>> PANTHOR_IOCTL(TILER_HEAP_CREATE, tiler_heap_create, DRM_RENDER_ALLOW),
>>> PANTHOR_IOCTL(TILER_HEAP_DESTROY, tiler_heap_destroy, DRM_RENDER_ALLOW),
>>> PANTHOR_IOCTL(GROUP_SUBMIT, group_submit, DRM_RENDER_ALLOW),
>>> + PANTHOR_IOCTL(BO_SET_LABEL, bo_set_label, DRM_RENDER_ALLOW),
>>> };
>>>
>>> static int panthor_mmap(struct file *filp, struct vm_area_struct *vma)
>>> @@ -1509,6 +1548,7 @@ static void panthor_debugfs_init(struct drm_minor *minor)
>>> * - 1.2 - adds DEV_QUERY_GROUP_PRIORITIES_INFO query
>>> * - adds PANTHOR_GROUP_PRIORITY_REALTIME priority
>>> * - 1.3 - adds DRM_PANTHOR_GROUP_STATE_INNOCENT flag
>>> + * - 1.4 - adds DRM_IOCTL_PANTHOR_BO_SET_LABEL ioctl
>>> */
>>> static const struct drm_driver panthor_drm_driver = {
>>> .driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
>>> @@ -1522,7 +1562,7 @@ static const struct drm_driver panthor_drm_driver = {
>>> .name = "panthor",
>>> .desc = "Panthor DRM driver",
>>> .major = 1,
>>> - .minor = 3,
>>> + .minor = 4,
>>>
>>> .gem_create_object = panthor_gem_create_object,
>>> .gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
>>> diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
>>> index af0d77338860..beba066b4974 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_gem.h
>>> +++ b/drivers/gpu/drm/panthor/panthor_gem.h
>>> @@ -13,6 +13,8 @@
>>>
>>> struct panthor_vm;
>>>
>>> +#define PANTHOR_BO_LABEL_MAXLEN PAGE_SIZE
>>> +
>>> /**
>>> * struct panthor_gem_object - Driver specific GEM object.
>>> */
>>> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
>>> index 97e2c4510e69..12b1994499a9 100644
>>> --- a/include/uapi/drm/panthor_drm.h
>>> +++ b/include/uapi/drm/panthor_drm.h
>>> @@ -127,6 +127,9 @@ enum drm_panthor_ioctl_id {
>>>
>>> /** @DRM_PANTHOR_TILER_HEAP_DESTROY: Destroy a tiler heap. */
>>> DRM_PANTHOR_TILER_HEAP_DESTROY,
>>> +
>>> + /** @DRM_PANTHOR_BO_SET_LABEL: Label a BO. */
>>> + DRM_PANTHOR_BO_SET_LABEL,
>>> };
>>>
>>> /**
>>> @@ -977,6 +980,24 @@ struct drm_panthor_tiler_heap_destroy {
>>> __u32 pad;
>>> };
>>>
>>> +/**
>>> + * struct drm_panthor_bo_set_label - Arguments passed to DRM_IOCTL_PANTHOR_BO_SET_LABEL
>>> + */
>>> +struct drm_panthor_bo_set_label {
>>> + /** @handle: Handle of the buffer object to label. */
>>> + __u32 handle;
>>> +
>>> + /**
>>> + * @size: Length of the label, including the NULL terminator.
>>> + *
>>> + * Cannot be greater than the OS page size.
>>> + */
>>> + __u32 size;
>>> +
>>> + /** @label: User pointer to a NULL-terminated string */
>>> + __u64 label;
>>> +};
>>
>> First very minor NIT:
>> * NULL is a pointer, i.e. (void*)0
>> * NUL is the ASCII code point '\0'.
>> So it's a NUL-terminated string.
>
> Fixed
>
>> Second NIT: We don't actually need 'size' - since the string is
>> NUL-terminated we can just strndup_user(__user_pointer__, PAGE_SIZE).
>> As things stand we validate that strlen(label) < size <= PAGE_SIZE -
>> which is a little odd (user space might as well just pass PAGE_SIZE
>> rather than calculate the actual length).
>
> The snag I see in this approach is that the only way to make sure
> strlen(label) + 1 <= PAGE_SIZE would be doing something like
>
> label = strndup_user(u64_to_user_ptr(args->label), PAGE_SIZE);
> if (strlen(label) + 1 > PAGE_SIZE) {
>         kfree(label);
>         return -E2BIG;
> }
You can just do:
strndup_user(u64_to_user_ptr(args->label), PAGE_SIZE)
If you look at the kernel's implementation, it handles an overly long
string by returning an error:
> length = strnlen_user(s, n);
>
> if (!length)
> return ERR_PTR(-EFAULT);
>
> if (length > n)
> return ERR_PTR(-EINVAL);
>
> p = memdup_user(s, length);
So there's no need to call strlen() on the result.
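Put together, the ioctl body could then shrink to something like this sketch, keeping the patch's error path and its PANTHOR_BO_LABEL_MAXLEN (PAGE_SIZE) limit:

	if (args->label) {
		/* strndup_user() returns ERR_PTR(-EINVAL) by itself once the
		 * string (including its NUL) exceeds the limit, so no separate
		 * length validation or strlen() call is needed. */
		label = strndup_user(u64_to_user_ptr(args->label),
				     PANTHOR_BO_LABEL_MAXLEN);
		if (IS_ERR(label)) {
			ret = PTR_ERR(label);
			goto err_label;
		}
	} else {
		label = NULL;
	}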
> In the meantime, we've duplicated the string and traversed a whole page
> of bytes, all to be discarded at once.
>
> In this case, I think it's alright to expect some cooperation from UM
> in supplying the actual size, although I'm really not an expert in
> designing elegant uAPIs, so if you think this looks very odd I'd be
> glad to replace it.
>
> Actually, as I was writing this, I realised that strndup_user() calls
> strnlen_user(), which is publicly available for other drivers, so
> I might check the length first, and if it falls within bounds, do
> the actual user stringdup.
Again, no need (and strnlen_user() has a comment basically saying "don't
call this"). strndup_user() has all the magic we need here.
As I say this is just a 'nit' - so no big deal. But I don't really see
the point of the size in the uAPI.
Take a look at dma_buf_set_name() for an example which sets the name on
a dma_buf (and doesn't take a size argument) - although that wasn't an
example of good uAPI design as the type in the ioctl was botched ;(
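(For comparison, a usage sketch of that dma-buf ioctl: DMA_BUF_SET_NAME_B is the 64-bit-clean variant added after the original DMA_BUF_SET_NAME encoded a pointer type into the ioctl number, and dmabuf_fd is assumed to be a valid dma-buf fd.)

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Name the buffer; no size argument. The kernel rejects strings longer
 * than DMA_BUF_NAME_LEN rather than truncating them. */
ioctl(dmabuf_fd, DMA_BUF_SET_NAME_B, "camera-preview");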
Thanks,
Steve
> I shall also mention the size bound on the uAPI for the 'label' pointer.
>
>> Thanks,
>> Steve
>>
>> +
>> /**
>> * DRM_IOCTL_PANTHOR() - Build a Panthor IOCTL number
>> * @__access: Access type. Must be R, W or RW.
>> @@ -1019,6 +1040,8 @@ enum {
>> DRM_IOCTL_PANTHOR(WR, TILER_HEAP_CREATE, tiler_heap_create),
>> DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY =
>> DRM_IOCTL_PANTHOR(WR, TILER_HEAP_DESTROY, tiler_heap_destroy),
>> + DRM_IOCTL_PANTHOR_BO_SET_LABEL =
>> + DRM_IOCTL_PANTHOR(WR, BO_SET_LABEL, bo_set_label),
>> };
>>
>> #if defined(__cplusplus)
>
>
> Adrian Larumbe
On 14/04/2025 20:43, Adrián Larumbe wrote:
> Hi Steven,
>
> On 14.04.2025 10:50, Steven Price wrote:
>> Hi Adrián,
>>
>> On 11/04/2025 16:03, Adrián Larumbe wrote:
>>> Add a new character string Panthor BO field, and a function that allows
>>> setting it from within the driver.
>>>
>>> Driver takes care of freeing the string when it's replaced or no longer
>>> needed at object destruction time, but allocating it is the responsibility
>>> of callers.
>>>
>>> Signed-off-by: Adrián Larumbe <adrian.larumbe(a)collabora.com>
>>> Reviewed-by: Boris Brezillon <boris.brezillon(a)collabora.com>
>>> ---
>>> drivers/gpu/drm/panthor/panthor_gem.c | 39 +++++++++++++++++++++++++++
>>> drivers/gpu/drm/panthor/panthor_gem.h | 17 ++++++++++++
>>> 2 files changed, 56 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
>>> index 8244a4e6c2a2..af0ac17f357f 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_gem.c
>>> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
>>> @@ -2,6 +2,7 @@
>>> /* Copyright 2019 Linaro, Ltd, Rob Herring <robh(a)kernel.org> */
>>> /* Copyright 2023 Collabora ltd. */
>>>
>>> +#include <linux/cleanup.h>
>>> #include <linux/dma-buf.h>
>>> #include <linux/dma-mapping.h>
>>> #include <linux/err.h>
>>> @@ -18,6 +19,14 @@ static void panthor_gem_free_object(struct drm_gem_object *obj)
>>> struct panthor_gem_object *bo = to_panthor_bo(obj);
>>> struct drm_gem_object *vm_root_gem = bo->exclusive_vm_root_gem;
>>>
>>> + /*
>>> + * Label might have been allocated with kstrdup_const(),
>>> + * we need to take that into account when freeing the memory
>>> + */
>>> + kfree_const(bo->label.str);
>>> +
>>> + mutex_destroy(&bo->label.lock);
>>> +
>>> drm_gem_free_mmap_offset(&bo->base.base);
>>> mutex_destroy(&bo->gpuva_list_lock);
>>> drm_gem_shmem_free(&bo->base);
>>> @@ -196,6 +205,7 @@ struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t
>>> obj->base.map_wc = !ptdev->coherent;
>>> mutex_init(&obj->gpuva_list_lock);
>>> drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
>>> + mutex_init(&obj->label.lock);
>>>
>>> return &obj->base.base;
>>> }
>>> @@ -247,3 +257,32 @@ panthor_gem_create_with_handle(struct drm_file *file,
>>>
>>> return ret;
>>> }
>>> +
>>> +void
>>> +panthor_gem_bo_set_label(struct drm_gem_object *obj, const char *label)
>>> +{
>>> + struct panthor_gem_object *bo = to_panthor_bo(obj);
>>> + const char *old_label;
>>> +
>>> + scoped_guard(mutex, &bo->label.lock) {
>>> + old_label = bo->label.str;
>>> + bo->label.str = label;
>>> + }
>>> +
>>> + kfree(old_label);
>>
>> Shouldn't this be kfree_const()? I suspect as things stand we can't
>> trigger this bug but it's inconsistent.
>
> This could only be called either from the set_label() ioctl, in which case
> old_label could be NULL or a pointer to a string obtained from strndup_user(); or from
> panthor_gem_kernel_bo_set_label(). In the latter case, it could only ever be
> NULL, since we don't reassign kernel BO labels, so it'd be safe too.
Yeah I thought it probably doesn't cause problems now, but it's a foot
gun for the future.
> However I do agree that it's not consistent, and then in the future perhaps
> relabelling kernel BO's might be justified, so I'll change it to kfree_const().
Thanks!
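(As an aside, a short sketch of why the pairing matters: kstrdup_const() skips the copy when the source string lives in kernel .rodata and just returns the original pointer, so its result must be freed with kfree_const(), which detects that case.)

#include <linux/slab.h>
#include <linux/string.h>

static void label_demo(const char *runtime_str)
{
	const char *a = kstrdup_const("tiler-heap", GFP_KERNEL); /* no copy, aliases .rodata */
	const char *b = kstrdup_const(runtime_str, GFP_KERNEL);  /* real copy */

	kfree_const(a); /* no-op: pointer is in .rodata, nothing was allocated */
	kfree_const(b); /* not in .rodata, falls through to kfree() */
}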
>>> +}
>>> +
>>> +void
>>> +panthor_gem_kernel_bo_set_label(struct panthor_kernel_bo *bo, const char *label)
>>> +{
>>> + const char *str;
>>> +
>>> + str = kstrdup_const(label, GFP_KERNEL);
>>> + if (!str) {
>>
>> In the next patch you permit user space to clear the label
>> (args->size==0) which means label==NULL and AFAICT kstrdup_const() will
>> return NULL in this case triggering this warning.
>
> Kernel and UM-exposed BOs should never experience cross-labelling, so in theory
> this scenario ought to be impossible. The only way panthor_gem_kernel_bo_set_label()
> might take NULL in the 'label' argument is if someone called panthor_kernel_bo_create()
> with NULL for its 'name' argument.
You're absolutely correct - I somehow got confused between the kernel
and user paths. It's the user path above which needs to handle NULL (and
does).
> I think as a defensive check, I could do something as follows
>
> void
> panthor_gem_kernel_bo_set_label(struct panthor_kernel_bo *bo, const char *label)
> {
> const char *str;
>
> /* We should never attempt labelling a UM-exposed GEM object */
> if (drm_WARN_ON(bo->obj->dev, bo->obj->handle_count > 0))
> return;
>
> if (!label)
> return;
>
> [...]
> }
I'm happy for you to do nothing here - that was my mistake getting the
two functions muddled. Sorry for the noise. I'm equally happy for the
defensive checks above.
Steve
>> Thanks,
>> Steve
>>
>>> + /* Failing to allocate memory for a label isn't a fatal condition */
>>> + drm_warn(bo->obj->dev, "Not enough memory to allocate BO label");
>>> + return;
>>> + }
>>> +
>>> + panthor_gem_bo_set_label(bo->obj, str);
>>> +}
>>> diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
>>> index 1a363bb814f4..af0d77338860 100644
>>> --- a/drivers/gpu/drm/panthor/panthor_gem.h
>>> +++ b/drivers/gpu/drm/panthor/panthor_gem.h
>>> @@ -46,6 +46,20 @@ struct panthor_gem_object {
>>>
>>> /** @flags: Combination of drm_panthor_bo_flags flags. */
>>> u32 flags;
>>> +
>>> + /**
>>> + * @label: BO tagging fields. The label can be assigned within the
>>> + * driver itself or through a specific IOCTL.
>>> + */
>>> + struct {
>>> + /**
>>> + * @label.str: Pointer to NULL-terminated string,
>>> + */
>>> + const char *str;
>>> +
>>> + /** @label.lock: Protects access to the @label.str field. */
>>> + struct mutex lock;
>>> + } label;
>>> };
>>>
>>> /**
>>> @@ -91,6 +105,9 @@ panthor_gem_create_with_handle(struct drm_file *file,
>>> struct panthor_vm *exclusive_vm,
>>> u64 *size, u32 flags, uint32_t *handle);
>>>
>>> +void panthor_gem_bo_set_label(struct drm_gem_object *obj, const char *label);
>>> +void panthor_gem_kernel_bo_set_label(struct panthor_kernel_bo *bo, const char *label);
>>> +
>>> static inline u64
>>> panthor_kernel_bo_gpuva(struct panthor_kernel_bo *bo)
>>> {
>
> Adrian Larumbe