On 09/03/2023 02.57, Christian König wrote:
Am 08.03.23 um 17:44 schrieb Asahi Lina:
On 09/03/2023 00.30, Christian König wrote:
Am 08.03.23 um 15:53 schrieb Asahi Lina:
[SNIP]
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circle dependencies or what we call infinite fences.
And hardware progress is exactly the only dependency here...
Well then you should have a fence for that hardware progress.
I do, it's the prior job hardware completion fences that drm_sched already knows about!
Yes, I could return those in the prepare callback, it just means I need to start stashing fence references in the underlying firmware job queue command objects so I can find out what is the oldest pending fence is, and return it when a queue is full. As long as drm_sched doesn't mind if I keep giving it fences (since multiple commands can have to complete before there is space) or the occasional already signaled fence (because this process is inherently racy), it should work fine.
Well this handling is intentional and necessary, but see below for a more in deep explanation.
If you think this is the better way, I'll do it that way and drop this patch. It just seemed simpler to do it with another callback, since drm_sched is already tracking those fences and doing a hardware queue limit check anyway, and that way I can avoid tracking the fences down into the hardware queue code... *
Well it's not the better way, it's the only way that works.
I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
See when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
The result would be that the scheduler goes to sleep and never wakes up again.
That can't happen, because it will just go into another iteration of the drm_sched main loop since there is an entity available still.
Rather there is probably the opposite bug in this patch: the can_run_job logic should be moved into the wait_event_interruptible() condition check, otherwise I think it can end up busy-looping since the condition itself can be true even when the can_run_job check blocks it.
But there is no risk of it going to sleep and never waking up because job completions will wake up the waitqueue by definition, and that happens after the driver-side queues are popped. If this problem could happen, then the existing hw_submission_limit logic would be broken in the same way. It is logically equivalent in how it works.
Basically, if properly done in wait_event_interruptible, it is exactly the logic of that macro that prevents this race condition and makes everything work at all. Without it, drm_sched would be completely broken.
As I said we exercised those ideas before and yes this approach here came up before as well and no it doesn't work.
It can never deadlock with this patch as it stands (though it could busy loop), and if properly moved into the wait_event_interruptible(), it would also never busy loop and work entirely as intended. The actual API change is sound.
I don't know why you're trying so hard to convince everyone that this approach is fundamentally broken... It might be a bad idea for other reasons, it might encourage incorrect usage, it might not be the best option, there are plenty of arguments you can make... but you just keep trying to make an argument that it just can't work at all for some reason. Why? I already said I'm happy dropping it in favor of the fences...
It's intended to mirror the hw_submission_limit logic. If you think this is broken, then that's broken too. They are equivalent mechanisms.
This particular issue aside, fairness in global resource allocation is a conversation I'd love to have! Right now the driver doesn't try to ensure that, a queue can easily monopolize certain hardware resources (though one queue can only monopolize one of each, so you'd need something like 63 queues with 63 distinct VMs all submitting free-running jobs back to back in order to starve other queues of resources forever). For starters, one thing I'm thinking of doing is reserving certain subsets of hardware resources for queues with a given priority, so you can at least guarantee forward progress of higher-priority queues when faced with misbehaving lower-priority queues. But if we want to guarantee proper fairness, I think I'll have to start doing things like switching to a CPU-roundtrip submission model when resources become scarce (to guarantee that queues actually release the resources once in a while) and then figure out how to add fairness to the allocation code...
But let's have that conversation when we talk about the driver (or maybe on IRC or something?), right now I'm more interested in getting the abstractions reviewed ^^
Well that stuff is highly problematic as well. The fairness aside you risk starvation which in turn breaks the guarantee of forward progress.
In this particular case you can catch this with a timeout for the hw operation, but you should consider blocking that from the sw side as well.
In the current state I actually think it's not really that problematic, because the resources are acquired directly in the ioctl path. So that can block if starved, but if that can cause overall forward progress to stop because some fence doesn't get signaled, then so can just not doing the ioctl in the first place, so there's not much point (userspace can always misbehave with its fence usage...). By the time anything gets submitted to drm_sched, the resources are already guaranteed to be acquired, we never block in the run callback.
It needs to be fixed of course, but if the threat model is a malicious GPU process, well, there are many other ways to DoS your system... and I don't think it's very likely that 63+ queues (which usually means 63+ processes with OpenGL) will end up accidentally starving the GPU in a tight loop at the same time. I'd love to hear about real-world scenarios where this kind of thing has been a real problem and not just a theoretical one though... maybe I'm missing something?
Basically my priorities with the driver are:
1. Make sure it never crashes 2. Make sure it works well for real users 3. Make it work smoothly for real users under reasonable load (priorities, CPU scheduler interactions, etc.) 4. Make it handle accidental problems more gracefully (OOMs etc, I need to look into private GEM BO accounting to processes so the OOM killer has better data to work with) 5. Make it more robust against deliberate abuse/starvation (this should matter more once we have some kind of paravirtualization solution...)
And right now we're somewhere between 2 and 3. So if there are cases where this resource acquisition stuff can cause a problem for real users, I'll want to fix it earlier. But if this is more theoretical than anything (with the resource limits of AGX GPUs), I'd rather focus on things like memory accounting and shrinker support first.
We don't even have a shrinker yet, and I'm sure that's going to be a lot of fun when we add it too... but yes, if we can't do any memory allocations in some of these callbacks (is this documented anywhere?), that's going to be interesting...
Yes, that is all part of the dma_fence documentation. It's just absolutely not obvious what all this means.
I mean is there any documentation on how this interacts with drm_sched? Like, am I not allowed to allocate memory in prepare()? What about run()? What about GPU interrupt work? (not a raw IRQ handler context, I mean the execution path from GPU IRQ to drm_sched run() fences getting signaled)
It's not all bad news though! All memory allocations are fallible in kernel Rust (and therefore explicit, and also failures have to be explicitly handled or propagated), so it's pretty easy to point out where they are, and there are already discussions of higher-level tooling to enforce rules like that (and things like wait contexts). Also, Rust makes it a lot easier to refactor code in general and not be scared that you're going to regress everything, so I'm not really worried if I need to turn a chunk of the driver on its head to solve some of these problems in the future ^^ (I already did that when I switched it from the "demo" synchronous submission model to the proper explicit sync + fences one.)
Yeah, well the problem isn't that you run into memory allocation failure.
What I mean is that the mandatory failure handling means it's relatively easy to audit where memory allocations can actually happen.
The problem is rather something like this:
- You try to allocate memory to signal your fence.
- This memory allocation can't be fulfilled and goes to sleep to wait
for reclaim. 3. On another CPU reclaim is running and through the general purpose shrinker, page fault or MMU notifier ends up wait for your dma_fence.
You don't even need to implement the shrinker for this to go boom extremely easy.
Hmm, can you actually get something waiting on a dma_fence like that today with this driver? We don't have a shrinker, we don't have synchronous page faults or MMU notifications for the GPU, and this is explicit sync so all in/out fences cross over into userspace so surely they can't be trusted anyway?
I'm definitely not familiar with the intricacies of DMA fences and how they interact with everything else yet, but it's starting to sound like either this isn't quite broken for our simple driver yet, or it must be broken pretty much everywhere in some way...
So everything involved with signaling the fence can allocate memory only with GFP_ATOMIC and only if you absolutely have to.
I don't think we even have a good story for passing around gfp_flags in Rust code so that will be interesting... though I need to actually audit the code paths and see how many allocations we really do. I know I alloc some vectors for holding completed commands and stuff like that, but I'm pretty sure I can fix that one with some reworking, and I'm not sure how many other random things there really are...? Obviously most allocations happen at command creation time, on completion you mostly get a lot of freeing, so maybe I can just eliminate all allocs and not worry about GFP_ATOMIC.
~~ Lina