On 20/09/2024 23:36, Adrián Larumbe wrote:
> Hi Steve, thanks for the review.
Hi Adrián,
> I've applied all of your suggestions for the next patch series revision, so I'll
> only be answering your question about the calc_profiling_ringbuf_num_slots
> function further down below.
>
[...]
>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>> .free_job = queue_free_job,
>>> };
>>>
>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>> + u32 cs_ringbuf_size)
>>> +{
>>> + u32 min_profiled_job_instrs = U32_MAX;
>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>> +
>>> + /*
>>> + * We want to calculate the minimum size of a profiled job's CS:
>>> + * because profiled jobs need additional instructions to sample
>>> + * performance metrics, they take up more slots in the queue's
>>> + * ring buffer, which means fewer of them fit in it at once, and
>>> + * so we need fewer slots to keep track of their profiling
>>> + * information. What we need is the maximum number of slots to
>>> + * allocate to this end, which matches the maximum number of
>>> + * profiled jobs we can place simultaneously in the queue's ring
>>> + * buffer.
>>> + * That has to be calculated separately for every single job
>>> + * profiling flag, but not when job profiling is disabled, since
>>> + * unprofiled jobs don't need to keep track of this at all.
>>> + */
>>> + for (u32 i = 0; i < last_flag; i++) {
>>> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>> + min_profiled_job_instrs =
>>> + min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>> + }
>>> +
>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> +}
>>
>> I may be missing something, but is there a situation where this is
>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>> can only add extra instructions to the no-flags case - whereas this
>> implies you're thinking that instructions may also be removed (or replaced).
>>
>> Steve
>
> Since we create a separate kernel BO to hold the profiling information slots, we
> need one that can accommodate as many slots as the maximum number of profiled
> jobs we can insert simultaneously into the queue's ring buffer. Because profiled
> jobs always take more instructions than unprofiled ones, we will usually need
> fewer slots than the number of unprofiled jobs we could insert at once in the
> ring buffer.
>
> Because we represent profiling metrics with a bit mask, we need to test the
> size of the CS for every single metric enabled in isolation, since enabling more
> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
> in the queue's ring buffer.
>
> In our case, calling calc_job_credits(0) would simply tell us the number of
> instructions we need for a normal job with no profiled features enabled, which
> is always fewer than a profiled job requires, and would therefore mean more
> slots in the profiling info kernel BO. But we don't need to keep track of
> profiling numbers for unprofiled jobs, so there's no point in calculating this
> number.
>
> At first I was simply allocating a profiling info kernel BO with as many slots
> as unprofiled jobs that fit simultaneously in the ring queue, but Boris pointed
> out that since queue ring buffers can be as big as 2 GiB, a lot of that memory
> would be wasted: profiled jobs always require more instructions and therefore
> more ring buffer slots, so fewer profiling slots are ever needed in said
> kernel BO.
>
> The value of this approach will eventually manifest if we decide to keep track
> of more profiling metrics, since this code won't have to change at all, other
> than adding new profiling flags to the panthor_device_profiling_flags enum.
Thanks for the detailed explanation. I think what I was missing is that
the loop is checking each bit flag independently and *not* checking
calc_job_credits(0).
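To check my understanding with some made-up numbers (both instruction
counts here are hypothetical): if calc_job_credits() returned 32
instructions for one profiling flag and 24 for another, the loop would
keep 24, so a 4 KiB ring buffer gets DIV_ROUND_UP(4096, 24 * 8) = 22
profiling slots - fewer than the 4096 / (16 * 8) = 32 slots you would
have to allocate if you sized the BO for, say, 16-instruction
unprofiled jobs.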
The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
confused me - that should be completely redundant. Or at least we need
something more intelligent if we have profiling bits which are not
mutually compatible.
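To spell out what I mean: as long as the profiling flags remain
contiguous from bit 0, PANTHOR_DEVICE_PROFILING_ALL is just
GENMASK(last_flag - 1, 0) and the condition always holds, so the loop
could shrink to (a sketch, assuming that invariant):

	for (u32 i = 0; i < last_flag; i++)
		min_profiled_job_instrs =
			min(min_profiled_job_instrs, calc_job_credits(BIT(i)));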
I'm also not entirely sure that the amount of RAM saved is significant,
but you've already written the code so we might as well have the saving ;)
Thanks,
Steve
> Regards,
> Adrian
>
>>> +
>>> static struct panthor_queue *
>>> group_create_queue(struct panthor_group *group,
>>> const struct drm_panthor_queue_create *args)
>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>> goto err_free_queue;
>>> }
>>>
>>> + queue->profiling.slot_count =
>>> + calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>> +
>>> + queue->profiling.slots =
>>> + panthor_kernel_bo_create(group->ptdev, group->vm,
>>> + queue->profiling.slot_count *
>>> + sizeof(struct panthor_job_profiling_data),
>>> + DRM_PANTHOR_BO_NO_MMAP,
>>> + DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>> + DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>> + PANTHOR_VM_KERNEL_AUTO_VA);
>>> +
>>> + if (IS_ERR(queue->profiling.slots)) {
>>> + ret = PTR_ERR(queue->profiling.slots);
>>> + goto err_free_queue;
>>> + }
>>> +
>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>> + if (ret)
>>> + goto err_free_queue;
>>> +
>>> + /*
>>> + * Credit limit argument tells us the total number of instructions
>>> + * across all CS slots in the ringbuffer, with some jobs requiring
>>> + * twice as many as others, depending on their profiling status.
>>> + */
>>> ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>> group->ptdev->scheduler->wq, 1,
>>> - args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>> + args->ringbuf_size / sizeof(u64),
>>> 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>> group->ptdev->reset.wq,
>>> NULL, "panthor-queue", group->ptdev->base.dev);
>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>> {
>>> struct panthor_group_pool *gpool = pfile->groups;
>>> struct panthor_job *job;
>>> + u32 credits;
>>> int ret;
>>>
>>> if (qsubmit->pad)
>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>> }
>>> }
>>>
>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>> + credits = calc_job_credits(job->profiling.mask);
>>> + if (credits == 0) {
>>> + ret = -EINVAL;
>>> + goto err_put_job;
>>> + }
>>> +
>>> ret = drm_sched_job_init(&job->base,
>>> &job->group->queues[job->queue_idx]->entity,
>>> - 1, job->group);
>>> + credits, job->group);
>>> if (ret)
>>> goto err_put_job;
>>>
>
Consider the following call sequence:
/* Upper layer */
dma_fence_begin_signalling();
lock(tainted_shared_lock);
/* Driver callback */
dma_fence_begin_signalling();
...
Here the driver might use a utility that is annotated as intended for the
dma-fence signalling critical path. Now, if the upper layer isn't correctly
annotated yet for whatever reason, we instead get
/* Upper layer */
lock(tainted_shared_lock);
/* Driver callback */
dma_fence_begin_signalling();
We will receive a false lockdep locking-order violation notification from
dma_fence_begin_signalling(). However, entering a dma-fence signalling
critical section itself doesn't block and cannot cause a deadlock.
So use a successful read_trylock() annotation instead for
dma_fence_begin_signalling(). That will make sure that the locking order
is correctly registered in the first case, and doesn't register any
locking order in the second case.
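For reference, the change below flips the third argument of
lock_acquire(), which is the trylock flag; the prototype in
include/linux/lockdep.h is:

	void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
			  int trylock, int read, int check,
			  struct lockdep_map *nest_lock, unsigned long ip);

A successful trylock is still recorded as held, so locks taken while
inside the critical section order against it, but because a trylock
never blocks, lockdep records no ordering against the locks already
held when it is entered.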
The alternative is of course to make sure that the "Upper layer" is always
correctly annotated. But experience shows that's not easily achievable
in all cases.
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
drivers/dma-buf/dma-fence.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index f177c56269bb..17f632768ef9 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -308,8 +308,8 @@ bool dma_fence_begin_signalling(void)
if (in_atomic())
return true;
- /* ... and non-recursive readlock */
- lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+ /* ... and non-recursive successful read_trylock */
+ lock_acquire(&dma_fence_lockdep_map, 0, 1, 1, 1, NULL, _RET_IP_);
return false;
}
@@ -340,7 +340,7 @@ void __dma_fence_might_wait(void)
lock_map_acquire(&dma_fence_lockdep_map);
lock_map_release(&dma_fence_lockdep_map);
if (tmp)
- lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+ lock_acquire(&dma_fence_lockdep_map, 0, 1, 1, 1, NULL, _THIS_IP_);
}
#endif
--
2.39.2
Hi Adrián,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.11-rc7 next-20240913]
[cannot apply to drm-misc/drm-misc-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Adri-n-Larumbe/drm-panthor-i…
base: linus/master
patch link: https://lore.kernel.org/r/20240913124857.389630-2-adrian.larumbe%40collabor…
patch subject: [PATCH v6 1/5] drm/panthor: introduce job cycle and timestamp accounting
config: i386-buildonly-randconfig-003-20240915 (https://download.01.org/0day-ci/archive/20240915/202409152243.r3t2jdOJ-lkp@…)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240915/202409152243.r3t2jdOJ-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409152243.r3t2jdOJ-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/gpu/drm/panthor/panthor_sched.c:2885:12: error: call to '__compiletime_assert_371' declared with 'error' attribute: min(ringbuf_size - start, size) signedness error
2885 | written = min(ringbuf_size - start, size);
| ^
include/linux/minmax.h:129:19: note: expanded from macro 'min'
129 | #define min(x, y) __careful_cmp(min, x, y)
| ^
include/linux/minmax.h:105:2: note: expanded from macro '__careful_cmp'
105 | __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
| ^
include/linux/minmax.h:100:2: note: expanded from macro '__careful_cmp_once'
100 | BUILD_BUG_ON_MSG(!__types_ok(x,y,ux,uy), \
| ^
note: (skipping 2 expansions in backtrace; use -fmacro-backtrace-limit=0 to see all)
include/linux/compiler_types.h:498:2: note: expanded from macro '_compiletime_assert'
498 | __compiletime_assert(condition, msg, prefix, suffix)
| ^
include/linux/compiler_types.h:491:4: note: expanded from macro '__compiletime_assert'
491 | prefix ## suffix(); \
| ^
<scratch space>:68:1: note: expanded from here
68 | __compiletime_assert_371
| ^
1 error generated.
vim +2885 drivers/gpu/drm/panthor/panthor_sched.c
2862
2863 #define JOB_INSTR(__prof, __instr) \
2864 { \
2865 .profile_mask = __prof, \
2866 .instr = __instr, \
2867 }
2868
2869 static void
2870 copy_instrs_to_ringbuf(struct panthor_queue *queue,
2871 struct panthor_job *job,
2872 struct panthor_job_ringbuf_instrs *instrs)
2873 {
2874 ssize_t ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
2875 u32 start = job->ringbuf.start & (ringbuf_size - 1);
2876 ssize_t size, written;
2877
2878 /*
2879 * We need to write a whole slot, including any trailing zeroes
2880 * that may come at the end of it. Also, because instrs.buffer has
2881 * been zero-initialised, there's no need to pad it with 0's
2882 */
2883 instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
2884 size = instrs->count * sizeof(u64);
> 2885 written = min(ringbuf_size - start, size);
2886
2887 memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);
2888
2889 if (written < size)
2890 memcpy(queue->ringbuf->kmap,
2891 &instrs->buffer[(ringbuf_size - start)/sizeof(u64)],
2892 size - written);
2893 }
2894
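
The underlying problem is that on i386 ssize_t is a 32-bit int, so
ringbuf_size - start (int minus u32) is promoted to unsigned while size
stays signed, and min()'s type check rejects the mixed comparison. A
minimal sketch of one way to make the types agree, using u64 throughout
(whether this matches the fix adopted in the next revision of the patch
is an assumption):

	static void
	copy_instrs_to_ringbuf(struct panthor_queue *queue,
			       struct panthor_job *job,
			       struct panthor_job_ringbuf_instrs *instrs)
	{
		/* All-unsigned arithmetic keeps both sides of min() the same type. */
		u64 ringbuf_size = panthor_kernel_bo_size(queue->ringbuf);
		u64 start = job->ringbuf.start & (ringbuf_size - 1);
		u64 size, written;

		/*
		 * Write a whole slot, including any trailing zeroes; instrs->buffer
		 * is zero-initialised, so no explicit padding is needed.
		 */
		instrs->count = ALIGN(instrs->count, NUM_INSTRS_PER_CACHE_LINE);
		size = instrs->count * sizeof(u64);
		written = min(ringbuf_size - start, size);

		memcpy(queue->ringbuf->kmap + start, instrs->buffer, written);

		/* Wrap around to the buffer start if the slot straddles the end. */
		if (written < size)
			memcpy(queue->ringbuf->kmap,
			       &instrs->buffer[(ringbuf_size - start) / sizeof(u64)],
			       size - written);
	}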
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki