Re: [Linaro-mm-sig] [RFC] Add DMA_RESV_USAGE flags

19 May 2021

      On Wed, May 19, 2021 at 6:43 AM Christian König
ckoenig.leichtzumerken@gmail.com wrote:
...
Am 19.05.21 um 00:06 schrieb Jason Ekstrand:
...
[SNIP]
...
...
E.g. we can't add a fence which doesn't wait for the exclusive one as
shared.
Ok I think that's a real problem, and  guess it's also related to all
the ttm privatization tricks and all that. So essentially we'd need
the opposite of ttm_bo->moving, as in you can't ignore it, but
otherwise it completely ignores all the userspace implicit fence
stuff.
Would you mind explaining it to the rest of the class?  I get the need
to do a TLB flush after a BO is removed from the processes address
space and I get that it may be super-heavy and that it has to be
delayed.  I also get that the driver needs to hold a reference to the
underlying pages until that TLB flush is done.  What I don't get is
what this has to do with the exclusive fence.  Why can't the driver
just gather up all the dma_resv fences on the current object (or,
better yet, just the ones from the current amdgpu process) and wait on
them all?  Why does it need to insert an exclusive fence that then
clogs up the whole works?
Because we have mixed up resource management with implicit syncing.
When I sum up all fences in (for example) a dma_fence_array container
and add that as explicit fence to the dma_resv object resource
management will do what I want and wait for everything to finish before
moving or freeing the buffer. But implicit sync will just horrible over
sync and wait for stuff it shouldn't wait for in the first place.
When I add the fence as shared fence I can run into the problem the the
TLB flush might finish before the exclusive fence. Which is not allowed
according to the DMA-buf fencing rules.
I'm starting to feel a bit dense here, sorry...  So the problem is
that the TLB flush really wants to just wait on memory management
fences and not implicit sync fences?  Or is it that you need to wait
on the exclusive fence in case it actually matters but you don't want
to if it was stuffed in there for implicit sync and doesn't have any
memory implications?  Also, how bad is it for the TLB flush to come in
late?  Is other stuff blocking on it?
...
We currently have some rather crude workarounds to make use cases like
this work as expected. E.g. by using a
dma_fence_chain()/dma_fence_array() and/or adding the explusive fence to
the shared fences etc etc...
...
...
...
...
...
>> Let's say that you have a buffer which is shared between two drivers A
>> and B and let's say driver A has thrown a fence on it just to ensure
>> that the BO doesn't get swapped out to disk until it's at a good
>> stopping point.  Then driver B comes along and wants to throw a
>> write-fence on it.  Suddenly, your memory fence from driver A causes
>> driver B to have to stall waiting for a "good" time to throw in a
>> fence.  It sounds like this is the sort of scenario that Christian is
>> running into.  And, yes, with certain Vulkan drivers being a bit
>> sloppy about exactly when they throw in write fences, I could see it
>> being a real problem.
> Yes this is a potential problem, and on the i915 side we need to do
> some shuffling here most likely. Especially due to discrete, but the
> problem is pre-existing. tbh I forgot about the implications here
> until I pondered this again yesterday evening.
>
> But afaiui the amdgpu code and winsys in mesa, this isn't (yet) the
> problem amd vk drivers have. The issue is that with amdgpu, all you
> supply are the following bits at CS time:
> - list of always mapped private buffers, which is implicit and O(1) in
> the kernel fastpath
> - additional list of shared buffers that are used by the current CS
>
> I didn't check how exactly that works wrt winsys buffer ownership, but
> the thing is that on the kernel side _any_ buffer in there is treated
> as a implicit sync'ed write. Which means if you render your winsys
> with a bunch of command submission split over 3d and compute pipes,
> you end up with horrendous amounts of oversync.
What are you talking about? We have no sync at all for submissions from
the same client.
Yes. Except when the buffer is shared with another driver, at which
point you sync a _lot_ and feel the pain.
Yes, exactly that's the problem.
We basically don't know during CS if a BO is shared or not.
We do know that during importing or exporting the BO thought.
No you don't. Or at least that's massively awkward, see Jason's reply.
Please.  In Vulkan, we know explicitly whether or not any BO will ever
be shared and, if a BO is ever flagged as shared even though it's not,
that's the app being stupid and they can eat the perf hit.
Yeah, that's not a problem at all. We already have the per BO flag in
amdgpu for this as well.
...
In GL, things are more wishy-washy but GL has so many stupid cases where we
have to throw a buffer away and re-allocate that one more isn't going
to be all that bad.  Even there, you could do something where you add
an in-fence to the BO export operation so that the driver knows when
to switch from the shared internal dma_resv to the external one
without having to create a new BO and copy.
Hui what? What do you mean with in-fence here?
I could imagine wanting to have a sort of mode-switch on BO export
rather than the userspace driver creating a new exportable buffer and
copying into it.  That would get sticky if there was any outstanding
usage of said buffer that would need to be converted to from the
internal model of usage with explicit fences to a more
dma-buf-friendly model.  I was just saying that you could add
something to BO export to give amdgpu the information it needs to do
the switch.  If you're too confused, feel free to ignore this whole
idea.
...
...
[SNIP]
...
Yeah but why does your userspace not know when a bo is used?
We always know when a BO is exported because we're the ones doing the
export call.  Always.  Of course, we don't know if that BO is shared
with another driver or re-imported back into the same one but is that
really the case we're optimizing for?
Yes, unfortunately. Exactly that's one of the reasons we couldn't go
with the per CS per BO flag if it should be shared or exclusive.
...
...
Or very bluntly, why cant radv do what anv does (or amdvlk if you care
more about that, it's the same)? What's missing with lots of blantant
lying?
I'm also not buying this.  You keep claiming that userspace doesn't
know but GL definitely does know and Vulkan knows well enough.  You
say that it's motivated by Vulkan and use RADV as an example but the
only reason why the RADV guys haven't followed the ANV design is to
work around limitations in amdgpu.  We shouldn't then use RADV to
justify why this is the right uAPI and why i915 is wrong.
Well, I never said that this is because of RADV. The main motivation we
had is because of MM engines, e.g. VA-API, VDPAU and OpenMax.
Sorry, RADV was mentioned many e-mails ago and this is the first time
I've seen video mentioned.  I've just been going with what I read.
If this is an internal driver over-sync issue with video, that's a
whole different card game.  The ways this is tying you in knots makes
a lot more sense to me now, I think.
...
And when we expose a BO with the DMA-buf functions we simply doesn't
know in userspace if that is then re-imported into VA-API or send to a
different process.
...
...
...
[SNIP]
Yeah, and that is exactly the reason why I will NAK this uAPI change.
This doesn't works for amdgpu at all for the reasons outlined above.
Uh that's really not how uapi works. "my driver is right, everyone
else is wrong" is not how cross driver contracts are defined. If that
means a perf impact until you've fixed your rules, that's on you.
Also you're a few years too late with nacking this, it's already uapi
in the form of the dma-buf poll() support.
^^  My fancy new ioctl doesn't expose anything that isn't already
there.  It just lets you take a snap-shot of a wait instead of doing
an active wait which might end up with more fences added depending on
interrupts and retries.  The dma-buf poll waits on all fences for
POLLOUT and only the exclusive fence for POLLIN.  It's already uAPI.
Well that's not the stuff I'm concerned about. But rather that you want
to add that as exclusive fence from the shared ones once more.
This prevents the TLB flush case I've outlined from working correctly.
...
...
...
...
So the way I see things right now:

exclusive fence slot is for implicit sync. kmd should only set it

when userspace indicates, otherwise you will suffer. Explicit syncing
userspace needs to tell the kernel with a flag in the CS ioctl when it
should sync against this exclusive fence and when it should ignore it,
otherwise you'll suffer badly once more.
That is not sufficient. The explicit sync slot is for kernel internal
memory management.
Then we need to split it. But what I discussed with Thomas Hellstrom
is that at least for anything except p2p dma-buf ttm_bo->moving should
be enough.
This is starting to sound like maybe roughly the right direction to me
but I'm still unclear on exactly what problem we're trying to solve
for TLB invalidates.  I'd like to understand that better before giving
strong opinions.  I'm also not super-familiar with ttm_bo->moving but
it sounds like we need some third category of fence somewhere.
Well I would rather say that we should separate the use cases.
E.g. clear APIs for resource management vs. implicit sync.
No arguments there.  I'm all for separating those two use-cases.  I'm
just trying to understand the problem space.  I think I may finally be
getting there. :-)
--Jason

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [Linaro-mm-sig] [RFC] Add DMA_RESV_USAGE flags