On 20.05.21 at 19:23, Jason Ekstrand wrote:
[SNIP]
I'd argue then that making amdgpu poll semantics match those of other drivers is a pre-requisite for the new ioctl, otherwise it seems unlikely that the ioctl will be widely adopted.
This seems backwards, because that means useful improvements in all other drivers are stalled until amdgpu is fixed.
Well, there is nothing to fix in amdgpu; what we need is to come up with a DMA-buf implicit syncing model which works for everyone.
I pointed this problem out at FOSDEM roughly 6 years ago, before DMA-buf was even merged upstream and way before amdgpu even existed. And the response was: yeah, maybe we need to look at this as well.
Over the years I've mentioned at least 5 times that this isn't going to work in some situations, and I've come up with different approaches to fix it.
And you still have the nerve to tell me that this isn't a problem and we should fix amdgpu instead? Sorry, but I'm really running out of ideas for how to explain why this isn't working for everybody.
I'm trying really hard to not fuel a flame war here but I tend to lean Daniel's direction on this. Stepping back from the individual needs of amdgpu and looking at things from the PoV of Linux as a whole, AMD being a special snowflake here is bad. I think we have two problems: amdgpu doesn't play by the established rules, and the rules don't work well for amdgpu. We need to solve BOTH problems. Does that mean we need to smash something into amdgpu to force it into the dma-buf model today? Maybe not; stuff's working well enough, I guess. But we can't just rewrite all the rules and break everyone else either.
Totally agree. The key point is that I think I've clearly explained why some of the rules need to change, and that at least requires an audit of everything currently using the dma_resv object.
That amdgpu wants to be special is true, but the fundamental problem is that we designed the implicit sync in DMA-buf only around the needs of the DRM drivers of that time, instead of taking a step back and asking what approach would work for everyone.
How else was it supposed to be designed? Based on the needs of non-existent future drivers? That's just not fair. We (Intel) are being burned by various aspects of dma-buf these days too. It does no good to blame past developers or our past selves for not knowing the future. It sucks but it's what we have. And, to move forward, we need to fix it. Let's do that.
Yeah, coming up with a design which also works for future needs is always hard.
But what annoys me is that I noted those problems way before DMA-buf was merged or amdgpu even existed. I could really kick my own ass for not having pushed back on this harder.
My concern with the flags approach as I'm beginning to digest it is that it's a bit too much of an attempt to rewrite history for my liking. What do I mean by that? I mean that any solution we come up with needs to ensure that legacy drivers and modern drivers can play nicely together. Either that or we need to modernize all the users of dma-buf implicit sync. I really don't like the "as long as AMD+Intel works, we're good" approach.
Seconded. That's why I'm saying that we need to take a step back and look at what would be a good design for drivers in general.
After sleeping on it, I think what Daniel suggested, having something similar to the moving fence of TTM inside the dma_resv object, is a really good step in the right direction.
When we combine that with the ability to add fences which never participate in implicit sync, only in resource management, I think we could solve this.
This essentially untangles resource management from implicit sync and results in the following four categories:
1. A moving fence used by resource management only. Userspace can't in any way mess with that one.
2. The existing exclusive fence which is set by CS and/or your new IOCTL.
3. The existing shared fences which can be added by CS.
4. A new group of fences which participate in resource management, but not in implicit sync.
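As a rough illustration of the four categories above, here is a minimal C sketch. All names in it are hypothetical and do not correspond to an existing kernel API. It also assumes that resource management has to wait on every category, including the existing exclusive and shared fences; the list above only states this explicitly for categories 1 and 4.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only; not an existing kernel API. */
enum fence_category {
	FENCE_MOVING,       /* 1: resource management only, kernel-internal */
	FENCE_EXCLUSIVE,    /* 2: set by CS and/or the new ioctl */
	FENCE_SHARED,       /* 3: added by CS */
	FENCE_BOOKKEEPING,  /* 4: resource management, never implicit sync */
	FENCE_CATEGORY_COUNT
};

/* True if userspace has no way to influence fences of this category. */
static bool fence_kernel_only(enum fence_category c)
{
	assert(c < FENCE_CATEGORY_COUNT);
	return c == FENCE_MOVING;
}

/* True if implicit sync between clients considers this category. */
static bool fence_in_implicit_sync(enum fence_category c)
{
	assert(c < FENCE_CATEGORY_COUNT);
	return c == FENCE_EXCLUSIVE || c == FENCE_SHARED;
}

/*
 * True if resource management (e.g. moving or freeing the backing
 * memory) must wait for this category. Assumption: this holds for
 * all four categories.
 */
static bool fence_in_resource_management(enum fence_category c)
{
	assert(c < FENCE_CATEGORY_COUNT);
	return true;
}
```

The point of the split is that the two predicates no longer have to agree: category 4 is visible to resource management while staying invisible to implicit sync.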
Number 1 requires an audit of all places which currently do CS or page flip.
Number 4 requires an audit of all places which do resource management.
I can tackle those and I'm perfectly aware that it might take some time.
Regards, Christian.