On 14.10.25 17:10, Petr Tesarik wrote:
> On Tue, 14 Oct 2025 15:04:14 +0200
> Christian König <christian.koenig(a)amd.com> wrote:
>
>> On 14.10.25 14:44, Zhaoyang Huang wrote:
>>> On Tue, Oct 14, 2025 at 7:59 PM Christian König
>>> <christian.koenig(a)amd.com> wrote:
>>>>
>>>> On 14.10.25 10:32, zhaoyang.huang wrote:
>>>>> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>>>>
>>>>> A single dma-buf allocation can be dozens of MB or more, which
>>>>> introduces a loop allocating several thousand order-0 pages.
>>>>> Furthermore, concurrent allocations can push a dma-buf allocation into
>>>>> direct reclaim during that loop. This commit eliminates both effects
>>>>> by introducing alloc_pages_bulk_list into dma-buf's order-0 allocation
>>>>> path. The patch proved conditionally helpful for an 18MB allocation,
>>>>> decreasing the time from 24604us to 6555us, and does no harm when bulk
>>>>> allocation can't be done (it falls back to single-page allocation).
>>>>
>>>> Well that sounds like an absolutely horrible idea.
>>>>
>>>> See, the handling of allocating only from a specific order is *exactly* there to avoid the behavior of bulk allocation.
>>>>
>>>> What you seem to do with this patch is to layer, on top of the behavior that avoids allocating large chunks from the buddy, a behavior that allocates large chunks from the buddy because that is faster.
>>> Hmm, this patch doesn't change the order-8 and order-4 allocation
>>> behaviour; it just replaces the loop of order-0 allocations with a
>>> single bulk allocation in the fallback path. What is your concern
>>> about this?
>>
>> As far as I know, bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
>>
>> But that is exactly what we try to avoid here by allocating only certain orders of pages.
>
> This is a good question, actually. Yes, bulk alloc will split large
> pages if there are insufficient pages on the pcp free list. But is
> dma-buf indeed trying to avoid it, or is it merely using an inefficient
> API? And does it need the extra speed? Even if it leads to increased
> fragmentation?
DMA-buf heaps quite intentionally try rather hard to avoid splitting large pages. That's why you have the distinction between HIGH_ORDER_GFP and LOW_ORDER_GFP as well.
Keep in mind that this is mostly used on embedded systems with only small amounts of memory.
Not entering direct reclaim and instead preferring to split large pages until they are used up is an absolute no-go for most use cases, as far as I can see.
It could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
Regards,
Christian.
>
> Petr T
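For readers following along: the order/GFP distinction Christian refers to looks roughly like this in upstream drivers/dma-buf/heaps/system_heap.c (a paraphrased sketch; check the current tree for the exact flags). High orders are allocated opportunistically and never reclaim; only order 0 is allowed to enter reclaim.

#define LOW_ORDER_GFP	(GFP_HIGHUSER | __GFP_ZERO)
#define HIGH_ORDER_GFP	(((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
				| __GFP_NORETRY) & ~__GFP_RECLAIM) \
				| __GFP_COMP)

/* High-order attempts fail fast and fall back to the next smaller order. */
static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
/* 1MB, 64K and 4K chunks, chosen to match common IOMMU page sizes. */
static const unsigned int orders[] = {8, 4, 0};
#define NUM_ORDERS ARRAY_SIZE(orders)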
On Tue, Oct 14, 2025 at 04:32:28PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>
> This series of patches introduces alloc_pages_bulk_list into dma-buf,
> which needs to call that API for its page allocation.
Start with the problem you're trying to solve.
On 14.10.25 14:44, Zhaoyang Huang wrote:
> On Tue, Oct 14, 2025 at 7:59 PM Christian König
> <christian.koenig(a)amd.com> wrote:
>>
>> On 14.10.25 10:32, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>>
>>> A single dma-buf allocation can be dozens of MB or more, which
>>> introduces a loop allocating several thousand order-0 pages.
>>> Furthermore, concurrent allocations can push a dma-buf allocation into
>>> direct reclaim during that loop. This commit eliminates both effects
>>> by introducing alloc_pages_bulk_list into dma-buf's order-0 allocation
>>> path. The patch proved conditionally helpful for an 18MB allocation,
>>> decreasing the time from 24604us to 6555us, and does no harm when bulk
>>> allocation can't be done (it falls back to single-page allocation).
>>
>> Well that sounds like an absolutely horrible idea.
>>
>> See, the handling of allocating only from a specific order is *exactly* there to avoid the behavior of bulk allocation.
>>
>> What you seem to do with this patch is to layer, on top of the behavior that avoids allocating large chunks from the buddy, a behavior that allocates large chunks from the buddy because that is faster.
> Hmm, this patch doesn't change the order-8 and order-4 allocation
> behaviour; it just replaces the loop of order-0 allocations with a
> single bulk allocation in the fallback path. What is your concern
> about this?
As far as I know, bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
But that is exactly what we try to avoid here by allocating only certain orders of pages.
Regards,
Christian.
>>
>> So this change here doesn't look like it will fly very high. Please explain what you're actually trying to do: just optimize allocation time?
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>>> ---
>>> drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
>>> 1 file changed, 24 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>> index bbe7881f1360..71b028c63bd8 100644
>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
>>> .release = system_heap_dma_buf_release,
>>> };
>>>
>>> -static struct page *alloc_largest_available(unsigned long size,
>>> - unsigned int max_order)
>>> +static void alloc_largest_available(unsigned long size,
>>> + unsigned int max_order, unsigned int *num_pages, struct list_head *list)
>>> {
>>> struct page *page;
>>> int i;
>>> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
>>> if (max_order < orders[i])
>>> continue;
>>>
>>> - page = alloc_pages(order_flags[i], orders[i]);
>>> - if (!page)
>>> + if (orders[i]) {
>>> + page = alloc_pages(order_flags[i], orders[i]);
>>> + if (page) {
>>> + list_add(&page->lru, list);
>>> + *num_pages = 1;
>>> + }
>>> + } else
>>> + *num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
>>> +
>>> + if (list_empty(list))
>>> continue;
>>> - return page;
>>> + return;
>>> }
>>> - return NULL;
>>> }
>>>
>>> static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> struct list_head pages;
>>> struct page *page, *tmp_page;
>>> int i, ret = -ENOMEM;
>>> + unsigned int num_pages;
>>> + LIST_HEAD(head);
>>>
>>> buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
>>> if (!buffer)
>>> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> INIT_LIST_HEAD(&pages);
>>> i = 0;
>>> while (size_remaining > 0) {
>>> + num_pages = 0;
>>> + INIT_LIST_HEAD(&head);
>>> /*
>>> * Avoid trying to allocate memory if the process
>>> * has been killed by SIGKILL
>>> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> goto free_buffer;
>>> }
>>>
>>> - page = alloc_largest_available(size_remaining, max_order);
>>> - if (!page)
>>> + alloc_largest_available(size_remaining, max_order, &num_pages, &head);
>>> + if (!num_pages)
>>> goto free_buffer;
>>>
>>> - list_add_tail(&page->lru, &pages);
>>> - size_remaining -= page_size(page);
>>> - max_order = compound_order(page);
>>> - i++;
>>> + list_splice_tail(&head, &pages);
>>> + max_order = folio_order(lru_to_folio(&head));
>>> + size_remaining -= PAGE_SIZE * (num_pages << max_order);
>>> + i += num_pages;
>>> +
>>> }
>>>
>>> table = &buffer->sg_table;
>>
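For reference, the fallback loop the patch rewrites looks roughly like this in current mainline (a paraphrased sketch, relying on the orders[]/order_flags[] arrays shown earlier):

static struct page *alloc_largest_available(unsigned long size,
					    unsigned int max_order)
{
	struct page *page;
	int i;

	for (i = 0; i < NUM_ORDERS; i++) {
		if (size < (PAGE_SIZE << orders[i]))
			continue;	/* order too large for what is left */
		if (max_order < orders[i])
			continue;	/* a larger order already failed once */

		/* HIGH_ORDER_GFP gives up without reclaiming, so this
		 * steps down to order 0 instead of splitting large pages. */
		page = alloc_pages(order_flags[i], orders[i]);
		if (!page)
			continue;
		return page;
	}
	return NULL;
}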
On 14.10.25 10:32, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
>
> A single dma-buf allocation can be dozens of MB or more, which
> introduces a loop allocating several thousand order-0 pages.
> Furthermore, concurrent allocations can push a dma-buf allocation into
> direct reclaim during that loop. This commit eliminates both effects
> by introducing alloc_pages_bulk_list into dma-buf's order-0 allocation
> path. The patch proved conditionally helpful for an 18MB allocation,
> decreasing the time from 24604us to 6555us, and does no harm when bulk
> allocation can't be done (it falls back to single-page allocation).
Well that sounds like an absolutely horrible idea.
See, the handling of allocating only from a specific order is *exactly* there to avoid the behavior of bulk allocation.
What you seem to do with this patch is to layer, on top of the behavior that avoids allocating large chunks from the buddy, a behavior that allocates large chunks from the buddy because that is faster.
So this change here doesn't look like it will fly very high. Please explain what you're actually trying to do: just optimize allocation time?
Regards,
Christian.
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang(a)unisoc.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index bbe7881f1360..71b028c63bd8 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
> .release = system_heap_dma_buf_release,
> };
>
> -static struct page *alloc_largest_available(unsigned long size,
> - unsigned int max_order)
> +static void alloc_largest_available(unsigned long size,
> + unsigned int max_order, unsigned int *num_pages, struct list_head *list)
> {
> struct page *page;
> int i;
> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
> if (max_order < orders[i])
> continue;
>
> - page = alloc_pages(order_flags[i], orders[i]);
> - if (!page)
> + if (orders[i]) {
> + page = alloc_pages(order_flags[i], orders[i]);
> + if (page) {
> + list_add(&page->lru, list);
> + *num_pages = 1;
> + }
> + } else
> + *num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
> +
> + if (list_empty(list))
> continue;
> - return page;
> + return;
> }
> - return NULL;
> }
>
> static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> struct list_head pages;
> struct page *page, *tmp_page;
> int i, ret = -ENOMEM;
> + unsigned int num_pages;
> + LIST_HEAD(head);
>
> buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
> if (!buffer)
> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> INIT_LIST_HEAD(&pages);
> i = 0;
> while (size_remaining > 0) {
> + num_pages = 0;
> + INIT_LIST_HEAD(&head);
> /*
> * Avoid trying to allocate memory if the process
> * has been killed by SIGKILL
> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> goto free_buffer;
> }
>
> - page = alloc_largest_available(size_remaining, max_order);
> - if (!page)
> + alloc_largest_available(size_remaining, max_order, &num_pages, &head);
> + if (!num_pages)
> goto free_buffer;
>
> - list_add_tail(&page->lru, &pages);
> - size_remaining -= page_size(page);
> - max_order = compound_order(page);
> - i++;
> + list_splice_tail(&head, &pages);
> + max_order = folio_order(lru_to_folio(&head));
> + size_remaining -= PAGE_SIZE * (num_pages << max_order);
> + i += num_pages;
> +
> }
>
> table = &buffer->sg_table;
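If the bulk path were ever made conditional, as Christian suggests, one purely hypothetical shape (the knob name is invented here, and it reuses the patch's alloc_pages_bulk_list call) would be a fragment like this inside the patched alloc_largest_available():

/* Hypothetical sketch only: keep today's behavior by default and let a
 * module parameter opt in to the order-0 bulk path. */
static bool order0_bulk_alloc;
module_param(order0_bulk_alloc, bool, 0644);
MODULE_PARM_DESC(order0_bulk_alloc,
		 "Use bulk allocation for the order-0 fallback (may split large pages)");

		if (orders[i] == 0 && order0_bulk_alloc) {
			/* bulk path, as in the patch under discussion */
			*num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP,
							   size / PAGE_SIZE, list);
		} else {
			page = alloc_pages(order_flags[i], orders[i]);
			if (page) {
				list_add(&page->lru, list);
				*num_pages = 1;
			}
		}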
On Mon, Sep 29, 2025 at 1:54 PM Frank Li <Frank.li(a)nxp.com> wrote:
>
> On Fri, Sep 26, 2025 at 03:00:48PM -0500, Rob Herring (Arm) wrote:
> > Add a binding schema for Arm Ethos-U65/U85 NPU. The Arm Ethos-U NPUs are
> > designed for edge AI inference applications.
> >
> > Signed-off-by: Rob Herring (Arm) <robh(a)kernel.org>
> > ---
> > .../devicetree/bindings/npu/arm,ethos.yaml | 79 ++++++++++++++++++++++
> > 1 file changed, 79 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/npu/arm,ethos.yaml b/Documentation/devicetree/bindings/npu/arm,ethos.yaml
> > new file mode 100644
> > index 000000000000..716c4997f976
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/npu/arm,ethos.yaml
> > @@ -0,0 +1,79 @@
> > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/npu/arm,ethos.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: Arm Ethos U65/U85
> > +
> > +maintainers:
> > + - Rob Herring <robh(a)kernel.org>
> > +
> > +description: >
> > + The Arm Ethos-U NPUs are designed for IoT inference applications. The NPUs
> > + can accelerate 8-bit and 16-bit integer quantized networks:
> > +
> > + Transformer networks (U85 only)
> > + Convolutional Neural Networks (CNN)
> > + Recurrent Neural Networks (RNN)
> > +
> > + Further documentation is available here:
> > +
> > + U65 TRM: https://developer.arm.com/documentation/102023/
> > + U85 TRM: https://developer.arm.com/documentation/102685/
> > +
> > +properties:
> > + compatible:
> > + oneOf:
> > + - items:
> > + - enum:
> > + - fsl,imx93-npu
> > + - const: arm,ethos-u65
> > + - items:
> > + - {}
>
> What does {} mean here? Just to not allow arm,ethos-u85 alone?
Yes, U85 support currently exists only on an FVP model. The naming for it
isn't really clear yet, nor is it clear whether it ever will be. So this
is really just a placeholder until there is a chip using it. It keeps
folks from using just the fallback.
>
> Reviewed-by: Frank Li <Frank.Li(a)nxp.com>
Thanks,
Rob
On Tue, Oct 07, 2025 at 11:10:32PM -0700, Kees Cook wrote:
> The dma-buf pseudo-filesystem should never have executable mappings nor
> device nodes. Set SB_I_NOEXEC and SB_I_NODEV on the superblock to enforce
> this at the filesystem level, similar to secretmem, commit 98f99394a104
> ("secretmem: use SB_I_NOEXEC").
>
> Fix the syzbot-reported warning from the exec code to enforce this
> requirement:
Can you please just enforce this in init_pseudo()? If a file system
really wants to support devices or executables it can clear the flags,
but a quick grep suggests that none of them should.
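A rough sketch of the kind of change being discussed, assuming dma-buf keeps its init_pseudo()-based fs_context setup and that fc->s_iflags is merged into the superblock as for other pseudo filesystems (an illustration, not the actual patch):

static int dma_buf_fs_init_context(struct fs_context *fc)
{
	struct pseudo_fs_context *ctx;

	ctx = init_pseudo(fc, DMA_BUF_MAGIC);
	if (!ctx)
		return -ENOMEM;
	ctx->dops = &dma_buf_dentry_ops;
	/* Refuse executable mappings and device nodes on this superblock,
	 * mirroring what secretmem does with SB_I_NOEXEC. */
	fc->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
	return 0;
}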
On Thu, Oct 02, 2025 at 05:12:35PM +0900, Byungchul Park wrote:
> Functionally no change. This patch is a preparation for DEPT(DEPendency
> Tracker) to track dependencies related to a scheduler API,
> wait_for_completion().
>
> Unfortunately, struct i2c_algo_pca_data has a callback member named
> wait_for_completion, the same name as the scheduler API, which makes
> it hard to change the scheduler API into a macro form because of the
> ambiguity.
>
> Add a postfix _cb to the callback member to remove the ambiguity.
>
> Signed-off-by: Byungchul Park <byungchul(a)sk.com>
This patch seems reasonable in any case. I'll pick it, so you have one
dependency less. Good luck with the series!
Applied to for-next, thanks!
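To illustrate the ambiguity being removed (a simplified sketch, not the real i2c-algo-pca declaration; the DEPT wrapper name is invented):

/* If the scheduler API becomes a function-like macro, any struct member
 * of the same name followed by '(' is expanded by the preprocessor too. */
#define wait_for_completion(x) dept_wait_for_completion(x)	/* hypothetical */

struct pca_chip {
	int (*wait_for_completion)(void *pd);	/* callback, before the rename */
};

static int poll_done(struct pca_chip *chip, void *pd)
{
	/* Expands to chip->dept_wait_for_completion(pd) and fails to build;
	 * renaming the member to wait_for_completion_cb avoids this. */
	return chip->wait_for_completion(pd);
}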
On Fri, Oct 03, 2025 at 10:46:41AM +0900, Byungchul Park wrote:
> On Thu, Oct 02, 2025 at 12:39:31PM +0100, Mark Brown wrote:
> > On Thu, Oct 02, 2025 at 05:12:09PM +0900, Byungchul Park wrote:
> > > dept needs to notice every entrance from user to kernel mode to treat
> > > every kernel context independently when tracking wait-event dependencies.
> > > Roughly, system calls and user-originated faults are the cases.
> > > Make dept aware of these entrances on arm64 and add
> > > CONFIG_ARCH_HAS_DEPT_SUPPORT support to arm64.
> > The description of what needs to be tracked probably needs some
> > tightening up here; it's not clear to me, for example, why exceptions
> > for mops or the vector extensions aren't included, or what the
> > distinction is that leaves error faults like BTI or GCS untracked.
> Thanks for the feedback, but I'm afraid I don't follow. Can you explain
> in more detail with an example?
Your commit log says we need to track every entrance from user mode to
kernel mode but the code only adds tracking to syscalls and some memory
faults. The exception types listed above (and some others) also result
in entries to the kernel from userspace.
> JFYI, pairs of a wait and its event need to be tracked to see whether
> each event can be prevented from being reached by other waits, like:
> context X context Y
>
> lock L
> ...
> initiate event A context start toward event A
> ... ...
> wait A // wait for event A and lock L // wait for unlock L and
> // prevent unlock L // prevent event A
> ... ...
> unlock L unlock L
> ...
> event A
> I meant things like this need to be tracked.
I don't think that's at all clear from the above context, and the
handling for some of the above exception types (e.g. the vector
extensions) includes taking locks.
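A minimal, hypothetical driver-style example of the wait/event inversion sketched in the diagram above (names invented; this is the kind of dependency DEPT is meant to flag):

#include <linux/mutex.h>
#include <linux/completion.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(lock_l);
static DECLARE_COMPLETION(event_a);

static void context_y(struct work_struct *work)
{
	mutex_lock(&lock_l);		/* waits for L, so A can never fire */
	/* ... */
	mutex_unlock(&lock_l);
	complete(&event_a);		/* event A */
}
static DECLARE_WORK(work_y, context_y);

static void context_x(void)
{
	mutex_lock(&lock_l);		/* lock L */
	schedule_work(&work_y);		/* initiate event A in context Y */
	wait_for_completion(&event_a);	/* waits for A while holding L: deadlock */
	mutex_unlock(&lock_l);
}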