On Wed, May 19, 2021 at 1:24 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 18.05.21 um 23:17 schrieb Daniel Vetter:
[SNIP]
The problem in this case is not starting a new CS, but synchronizing to the existing ones.
See a heavy TLB flush is made completely out of sync. E.g. it doesn't want to wait for any previous operation.
In other words imagine the following example:
- Both process A and B have a BO mapped.
- Process A is heavily using the BO and doing all kind of rendering.
- Process B is unmapping the BO.
Now that process B unmaps the BO needs to trigger page table updates and a heavy TLB flush, but since this can take really long we want to do it asynchronously on the hardware.
With the current approach you basically can't do that because you can't note that a fence should not participate in synchronization at all.
E.g. we can't add a fence which doesn't wait for the exclusive one as shared.
Ok I think that's a real problem, and guess it's also related to all the ttm privatization tricks and all that. So essentially we'd need the opposite of ttm_bo->moving, as in you can't ignore it, but otherwise it completely ignores all the userspace implicit fence stuff.
It goes into that direction, but doesn't sounds like the full solution either.
[SNIP]
Can we please stop with the "amdgpu is right, everyone else is wrong" approach?
Well the approach I do here is not "amdgpu is right, everyone else is wrong". But rather we had DRM uAPI for i915, nouveau and radeon and unfortunately leaked that into DMA-buf without much thinking about it.
I'm also not saying that the approach amdgpu is right. It's just what amdgpu needs in it's CS interface.
What I'm saying is that DMA-buf is a device driver independent subsystem and we shouldn't make any assumption which come from just a handful of DRM driver on it's implicit sync implementation.
Like I'm pretty much going to type up the patch that does a full drm subsytem audit of everything and whack amdgpu into compliance. Perf hit be damned, you had a few years to fix this with better uapi. Or I find out that there's a giant inconsistent mess, but at least we'd gain some clarity about where exactly we are here and maybe what to do next.
Ok to let us move forward please take a look at the first patches of the set. It cleans up quite a bunch of the mess we have in there before even coming to adding flags to the shared slots.
I think you will agree on that we should do is cleaning up the use cases further and separate implicit sync from resource management.
Just replying on this because I'm a bit busy with reviewing everything we have in upstream right now.
I agree there's some useful stuff in there, but we have a fundamental disagreement on how this works. That needs to be resolved first, and as part of that we need to come up with a plan how to get everyone on the same page.
Then next thing is a plan how to get the various issues you're raising around dma_resv rules sorted out.
Once we have that, and only then, does it imo make sense to review/merge cleanup patches. As long as we have fundamental disagreements along the lines like we have here there's no point.
I should have a patch set maybe tomorrow or early next week with my results of the drm subsystem review of how exactly dma_resv is used currently. Thus far it's a few pages of code analysis, but not yet complete. Also I found some smaller issues in a few places, so the discussion is going to involve a few more people until we're settled here :-/
Cheers, Daniel
In other words we forbid touching the exclusive and shared fences directly and have separate APIs for resource management and implicit sync.
This makes sense anyway, no matter what implicit synchronization framework we will install underneath.
Regards, Christian.
-Daniel
Regards, Christian.
After that I think we can look at what exact oversync issue remains and why and solve it, but until we have this this just feels like another rehash of "amgpu insist its own dma_resv interpration is the right one and everyone else should move one over".
Or maybe I've just become real garbage at reading random driver code, wouldn't be the first time :-)
Cheers, Daniel
Regards, Christian.
Cheers, Daniel
> --Jason > > >>> That's also the reason the Valve guys came up with a solution where each >>> BO gets a flag for explicit sync, but that only works for exports and >>> not for imports. >>> >>>> I915 and iirc msm has explicit flags for this, panfrost was designed to >>>> support this correctly from the start (also with flags I think). That's at >>>> least what I remember from all the discussions at XDC and #dri-devel, but >>>> didn't check the code again to give you the list of uapi flags you need >>>> for each driver. >>>> >>>> The other piece is making sure you're only picking up implicit fences when >>>> you should, and not any later ones, for which Jason has a solution: >>>> >>>> https://lore.kernel.org/dri-devel/20210317221940.2146688-1-jason@jlekstrand.... >>> Yes, I helped with that as well. But I think that this is just another >>> workaround without really addressing the underlying problem. >>> >>>> If amdgpu isn't using those, then you will suffer from >>>> over-synchronization in vulkan and pay a price. The entire point of vulkan >>>> is that you pick up sync points very explicitly, and we also need to have >>>> very explicit uapi for userspace to pick up/set the implicit fences. >>>> >>>> Trying to paper over this with more implicit magic is imo just wrong, and >>>> definitely not the long term explicit sync model we want. >>> I completely disagree. >>> >>> In my opinion the implicit sync model we have for dma_resv currently is >>> just not well designed at all, since it always requires cooperation from >>> userspace. >>> >>> In other words you need to know when to enable implicit sync in >>> userspace and that information is simply not present all of the time. >>> >>> What we have done here is just keeping the old reader/writer flags i915, >>> radeon and nouveau once had and pushed that out to everybody else making >>> the assumption that everybody would follow that without documenting the >>> actual rules of engagement you need to follow here. >>> >>> That was a really big mistake and we should try to fix that sooner or >>> later. The only other clean alternative I see is to use a flag on the >>> exporter to tell the importer if it should sync to shared fences or not. >>> >>> Additional to that I'm perfectly fine with implicit sync. Explicit sync >>> certainly has some use cases as well, but I don't see it as an absolute >>> advantage over the implicit model. >> Ok this stops making sense. Somehow you claim userspace doesn't know >> when to sync, but somehow the kernel does? By guessing, and getting it >> wrong mostly, except for the one case that you benchmarked? >> >> Aside from silly userspace which exports a buffer to a dma-buf, but >> then never imports it anywhere else, there isn't a case I know of >> where the kernel actually knows more than userspace. But there's lots >> of cases where the kernel definitely knows less, especially if >> userspace doesn't tell it about what's going on with each rendering >> and buffer. >> >> So here's the 2 things you need to make this work like every other driver: >> >> 1. A way to set the explicit fence on a buffer. CS ioctl is perfectly >> fine, but also can be seperate. Userspace uses this only on a) shared >> buffers b) when there's a flush/swap on that shared buffer. Not when >> rendering any of the interim stuff, that only leads to oversync. >> Anything non-shared is handled explicitly in userspace (at least for >> modern-ish drivers). This is the only thing that ever sets an >> exclusive fence (aside from ttm moving buffers around ofc). >> >> 2. A way to sync with the implicit fences, either all of them (for >> upcoming write access) or just the write fence (for read access). At >> first we thought it's good enough to do this in the CS ioctl, but >> that's a wee bit too late, hence the patches from Jason. My >> understanding is that vulkan converts this into an vk syncobj/fence of >> some sorts, so really can't make this more explicit and intentional >> than that. >> >> None of this is something the kernel has the slightest idea about when >> it happens, so you have to have explicit uapi for it. Trying to fake >> it in the kernel just doesn't work. >> -Daniel >> -- >> Daniel Vetter >> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch