On Wed, May 19, 2021 at 6:43 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 19.05.21 um 00:06 schrieb Jason Ekstrand:
[SNIP]
E.g. we can't add a fence which doesn't wait for the exclusive one as shared.
Ok I think that's a real problem, and guess it's also related to all the ttm privatization tricks and all that. So essentially we'd need the opposite of ttm_bo->moving, as in you can't ignore it, but otherwise it completely ignores all the userspace implicit fence stuff.
Would you mind explaining it to the rest of the class? I get the need to do a TLB flush after a BO is removed from the processes address space and I get that it may be super-heavy and that it has to be delayed. I also get that the driver needs to hold a reference to the underlying pages until that TLB flush is done. What I don't get is what this has to do with the exclusive fence. Why can't the driver just gather up all the dma_resv fences on the current object (or, better yet, just the ones from the current amdgpu process) and wait on them all? Why does it need to insert an exclusive fence that then clogs up the whole works?
Because we have mixed up resource management with implicit syncing.
When I sum up all fences in (for example) a dma_fence_array container and add that as explicit fence to the dma_resv object resource management will do what I want and wait for everything to finish before moving or freeing the buffer. But implicit sync will just horrible over sync and wait for stuff it shouldn't wait for in the first place.
When I add the fence as shared fence I can run into the problem the the TLB flush might finish before the exclusive fence. Which is not allowed according to the DMA-buf fencing rules.
I'm starting to feel a bit dense here, sorry... So the problem is that the TLB flush really wants to just wait on memory management fences and not implicit sync fences? Or is it that you need to wait on the exclusive fence in case it actually matters but you don't want to if it was stuffed in there for implicit sync and doesn't have any memory implications? Also, how bad is it for the TLB flush to come in late? Is other stuff blocking on it?
We currently have some rather crude workarounds to make use cases like this work as expected. E.g. by using a dma_fence_chain()/dma_fence_array() and/or adding the explusive fence to the shared fences etc etc...
>> Let's say that you have a buffer which is shared between two drivers A >> and B and let's say driver A has thrown a fence on it just to ensure >> that the BO doesn't get swapped out to disk until it's at a good >> stopping point. Then driver B comes along and wants to throw a >> write-fence on it. Suddenly, your memory fence from driver A causes >> driver B to have to stall waiting for a "good" time to throw in a >> fence. It sounds like this is the sort of scenario that Christian is >> running into. And, yes, with certain Vulkan drivers being a bit >> sloppy about exactly when they throw in write fences, I could see it >> being a real problem. > Yes this is a potential problem, and on the i915 side we need to do > some shuffling here most likely. Especially due to discrete, but the > problem is pre-existing. tbh I forgot about the implications here > until I pondered this again yesterday evening. > > But afaiui the amdgpu code and winsys in mesa, this isn't (yet) the > problem amd vk drivers have. The issue is that with amdgpu, all you > supply are the following bits at CS time: > - list of always mapped private buffers, which is implicit and O(1) in > the kernel fastpath > - additional list of shared buffers that are used by the current CS > > I didn't check how exactly that works wrt winsys buffer ownership, but > the thing is that on the kernel side _any_ buffer in there is treated > as a implicit sync'ed write. Which means if you render your winsys > with a bunch of command submission split over 3d and compute pipes, > you end up with horrendous amounts of oversync. What are you talking about? We have no sync at all for submissions from the same client.
Yes. Except when the buffer is shared with another driver, at which point you sync a _lot_ and feel the pain.
Yes, exactly that's the problem.
We basically don't know during CS if a BO is shared or not.
We do know that during importing or exporting the BO thought.
No you don't. Or at least that's massively awkward, see Jason's reply.
Please. In Vulkan, we know explicitly whether or not any BO will ever be shared and, if a BO is ever flagged as shared even though it's not, that's the app being stupid and they can eat the perf hit.
Yeah, that's not a problem at all. We already have the per BO flag in amdgpu for this as well.
In GL, things are more wishy-washy but GL has so many stupid cases where we have to throw a buffer away and re-allocate that one more isn't going to be all that bad. Even there, you could do something where you add an in-fence to the BO export operation so that the driver knows when to switch from the shared internal dma_resv to the external one without having to create a new BO and copy.
Hui what? What do you mean with in-fence here?
I could imagine wanting to have a sort of mode-switch on BO export rather than the userspace driver creating a new exportable buffer and copying into it. That would get sticky if there was any outstanding usage of said buffer that would need to be converted to from the internal model of usage with explicit fences to a more dma-buf-friendly model. I was just saying that you could add something to BO export to give amdgpu the information it needs to do the switch. If you're too confused, feel free to ignore this whole idea.
[SNIP]
Yeah but why does your userspace not know when a bo is used?
We always know when a BO is exported because we're the ones doing the export call. Always. Of course, we don't know if that BO is shared with another driver or re-imported back into the same one but is that really the case we're optimizing for?
Yes, unfortunately. Exactly that's one of the reasons we couldn't go with the per CS per BO flag if it should be shared or exclusive.
Or very bluntly, why cant radv do what anv does (or amdvlk if you care more about that, it's the same)? What's missing with lots of blantant lying?
I'm also not buying this. You keep claiming that userspace doesn't know but GL definitely does know and Vulkan knows well enough. You say that it's motivated by Vulkan and use RADV as an example but the only reason why the RADV guys haven't followed the ANV design is to work around limitations in amdgpu. We shouldn't then use RADV to justify why this is the right uAPI and why i915 is wrong.
Well, I never said that this is because of RADV. The main motivation we had is because of MM engines, e.g. VA-API, VDPAU and OpenMax.
Sorry, RADV was mentioned many e-mails ago and this is the first time I've seen video mentioned. I've just been going with what I read.
If this is an internal driver over-sync issue with video, that's a whole different card game. The ways this is tying you in knots makes a lot more sense to me now, I think.
And when we expose a BO with the DMA-buf functions we simply doesn't know in userspace if that is then re-imported into VA-API or send to a different process.
[SNIP] Yeah, and that is exactly the reason why I will NAK this uAPI change.
This doesn't works for amdgpu at all for the reasons outlined above.
Uh that's really not how uapi works. "my driver is right, everyone else is wrong" is not how cross driver contracts are defined. If that means a perf impact until you've fixed your rules, that's on you.
Also you're a few years too late with nacking this, it's already uapi in the form of the dma-buf poll() support.
^^ My fancy new ioctl doesn't expose anything that isn't already there. It just lets you take a snap-shot of a wait instead of doing an active wait which might end up with more fences added depending on interrupts and retries. The dma-buf poll waits on all fences for POLLOUT and only the exclusive fence for POLLIN. It's already uAPI.
Well that's not the stuff I'm concerned about. But rather that you want to add that as exclusive fence from the shared ones once more.
This prevents the TLB flush case I've outlined from working correctly.
So the way I see things right now:
- exclusive fence slot is for implicit sync. kmd should only set it
when userspace indicates, otherwise you will suffer. Explicit syncing userspace needs to tell the kernel with a flag in the CS ioctl when it should sync against this exclusive fence and when it should ignore it, otherwise you'll suffer badly once more.
That is not sufficient. The explicit sync slot is for kernel internal memory management.
Then we need to split it. But what I discussed with Thomas Hellstrom is that at least for anything except p2p dma-buf ttm_bo->moving should be enough.
This is starting to sound like maybe roughly the right direction to me but I'm still unclear on exactly what problem we're trying to solve for TLB invalidates. I'd like to understand that better before giving strong opinions. I'm also not super-familiar with ttm_bo->moving but it sounds like we need some third category of fence somewhere.
Well I would rather say that we should separate the use cases.
E.g. clear APIs for resource management vs. implicit sync.
No arguments there. I'm all for separating those two use-cases. I'm just trying to understand the problem space. I think I may finally be getting there. :-)
--Jason