On Fri, 2023-03-10 at 18:58 +0900, Asahi Lina wrote:
On 10/03/2023 04.59, Faith Ekstrand wrote:
On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
On 09/03/2023 17.42, Christian König wrote:
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Yeah, we tried to do a bit of that in the GL days. It was a bad idea.
I think I should clarify: I was proposing re-queueing innocent jobs from innocent queues/VMs that were impacted by a fault. The reason is that we may be able to tweak firmware state to force it to do that safely, during the firmware recovery cycle, such that an aborted job restarts and then subsequent jobs/commands continue as normal. We can't leave it to userspace because if we do nothing, the affected job ends up incomplete but then everything after it that is already queued still runs, and that is definitely a recipe for a bigger mess if userspace wants to seamlessly recover. The firmware recovery cycle is a "stop-the-world" situation for the GPU (the firmware literally busy-loops waiting for the driver to set a continue flag in memory...), so that's the only real chance that the driver gets to make decisions about what is going to happen next.
Ok, that makes sense. Yes, if you have other jobs on other queues and are able to recover everything that isn't in the faulting VM, that's a good thing. I wasn't sure how hang/fault recovery worked on AGX. In that case, I don't think there's a dma_fence problem. As long as you keep recovering and killing off any faulting contexts, eventually the good contexts should make progress and those fences should signal.
Of course, the firmware recovery cycle may be complex and need (or at least appear to need) memory allocation or similar, and that's where everything gets hairy. Hopefully, though, if you've already got the resources from the old context, you can re-use them after a bit of clean-up work and still get deterministic and reliable recovery cycles.
Of course, that only works if individual possibly concurrently running commands are idempotent, but I think a lot of typical GPU work is?
No, that's not a valid assumption. For a single 3D render pass which doesn't do any image or SSBO access, it may be possible to re-run it. However, that won't be true of compute work and isn't necessarily true of back-to-back passes. Lots of modern apps do temporal stuff where one frame depends on the previous and a re-run might screw that up. Also, with Vulkan's memory aliasing, it's hard to tell just from which resources are accessed whether or not a command buffer leaves its input memory undamaged.
(E.g. any render pass without side effects other than the render targets and where the background shader does no loads, or even render passes that do loads but where all draws are opaque, which are all things the current Gallium driver is intimately familiar with since Crazy Tiler Optimizations™ need that info to be provided anyway). So I was wondering whether it'd make sense to have such an idempotency/restartable flag on job submission, and then the driver would do its best to recover and rerun it if it gets killed by an unrelated concurrent bad job.
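To make that concrete, I was imagining a per-command flag in the submit uAPI, something like this (entirely made-up names, just a sketch of the idea, not actual uAPI):

#include <drm/drm.h>

/* Hypothetical flag: userspace promises this command produces the same
 * result if it is re-run from scratch, so the kernel may transparently
 * restart it after an unrelated fault killed it mid-flight. */
#define DRM_ASAHI_SUBMIT_RESTARTABLE (1 << 0)

struct drm_asahi_cmd {
        __u32 flags;      /* DRM_ASAHI_SUBMIT_RESTARTABLE if idempotent */
        __u32 cmd_type;
        __u64 cmd_buffer; /* GPU pointer to the actual command descriptor */
        /* ... */
};

Mesa would only set it for the cases it can already prove (no loads, all-opaque draws, no side effects), and the kernel would simply ignore it if recovery turns out not to be possible.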
Then again this all depends on an investigation into what we *can* do during firmware recovery that hasn't happened at all yet. It might be that it isn't safe to do anything really, or that doing things depends on touching even deeper firmware state structs that we treat as opaque right now and we really don't want to have to touch...
But maybe none of this is worth it in practice, it just sounded like it could be useful maybe?
Maybe? It's not clear to me that such a flag would be useful or even practical to provide from the Mesa side. Ideally, you'd be able to figure out when a fault happens, what VM it happened in and exactly what work was in-flight when it happened and only kill the one guilty VM. However, it sounds like your understanding of the firmware is currently rough enough that doing so may not be practical. In that case, the best thing to do is to kill any VMs which were on the GPU at the time and hope the individual apps are able to recover.
Now that I look at it, we have a lovely "what is this flag doing anyway" bit already passed from Mesa through to the firmware, which we called ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S. It is actually getting set when any attachment (color, Z, or S) is not being cleared for that pass (so it's loaded). That could very well be an "is not idempotent" flag... and maybe that means the firmware does this for us already? Sounds like something to test... I might have some 16Kx16K GLmark runs to do concurrent with an evil faulting job now ^^ (and then that also means we need to set it when shaders have side effects and stuff, which right now we don't).
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc.) would rather just start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
The guarantee that Vulkan makes is that, if you idle the GPU and you haven't gotten a DEVICE_LOST yet, your data is good. If you get a DEVICE_LOST, all bets are off. The problem is that, no matter how fast the error propagation may be in the kernel or userspace driver, errors can still show up in strange ways. An OOB buffer access could end up modifying a shader binary which gets run 3 frames later and causes a corruption. Once you've faulted, you really have no idea how far back is good or what memory is corrupted. You have to assume that everything mapped to the GPU VA space is potentially toast.
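In code terms, the contract is roughly this (plain Vulkan, nothing driver-specific):

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* After idling the queue, either there was no DEVICE_LOST and everything
 * written so far is intact, or there was and nothing mapped to the GPU VA
 * space can be trusted anymore. */
static bool gpu_results_trustworthy(VkQueue queue)
{
        VkResult res = vkQueueWaitIdle(queue);

        if (res == VK_ERROR_DEVICE_LOST)
                return false; /* all bets are off; recreate the device */

        return res == VK_SUCCESS; /* idle and no DEVICE_LOST: data is good */
}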
Yes of course, for the actually faulting VM all bets are off after a fault (though we can try a bit harder at least... I have a READ_ONLY BO flag now, I should set it on the shader pools!).
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional.
This is fine, in principle. Because of the nature of errors, async is fine as long as the error shows up eventually. Faster is better, for sure, but error latency doesn't really matter in practice.
However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job completion fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched-side fence as errored myself, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc. paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
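For reference, what I do on the hw fence side boils down to the standard dma_fence pattern, roughly this in C terms (the driver itself is Rust, and the function name here is just a placeholder):

#include <linux/dma-fence.h>

/* When the firmware reports a fault for a command, record the error on the
 * hardware fence before signaling it; dma_fence_set_error() only has an
 * effect if it is called before dma_fence_signal(). */
static void fault_complete_fence(struct dma_fence *hw_fence, int err)
{
        dma_fence_set_error(hw_fence, err); /* e.g. -ENODEV, -EIO, ... */
        dma_fence_signal(hw_fence);
}

The question is just whether anything on the drm_sched side ever looks at that error.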
The end behavior needs to be that all fences for all jobs submitted to the queue get signaled. That's needed to satisfy the finite-time guarantees of dma_fence. Exactly how that happens (let the job run, abort all the jobs, etc.) is an implementation detail for the driver to decide. You can also set a bit on the context (or queue) to mark it as dead and start returning EIO or similar from any ioctls that try to submit more work. Not required, but you can.
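i.e. something like this in the submit path (sketch only; the struct and lookup helper are made up):

#include <linux/types.h>
#include <linux/errno.h>
#include <drm/drm_device.h>
#include <drm/drm_file.h>

struct my_queue {
        bool dead; /* set when a fault killed this queue */
        /* ... */
};

/* Placeholder for however the driver resolves its queue object. */
struct my_queue *my_queue_lookup(struct drm_file *file, void *data);

static int my_submit_ioctl(struct drm_device *dev, void *data,
                           struct drm_file *file)
{
        struct my_queue *queue = my_queue_lookup(file, data);

        if (!queue)
                return -ENOENT;
        if (queue->dead)
                return -EIO; /* userspace maps this to VK_ERROR_DEVICE_LOST */

        /* ... normal job submission ... */
        return 0;
}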
Fences have an error flag though, does that get reported to userspace somehow? I thought it did, but maybe not, or maybe only drm_sched not propagating it is the issue?
In other words, absent my fancy stats reporting BO system, what is the normal way that an explicit sync driver signals to userspace that the job associated with a syncobj has failed?
One is via the return value from exec/submit. Often there's also a query mechanism for more detailed information. It's not particularly standard at the moment, I'm afraid. I could point you at i915 but I wouldn't call that uAPI something to be emulated, in general.
(If there is no way, then I'll probably want to change the stats BO system to be configurable, so if you ask for no stats/time info, you only get overall job status and faults, which has less overhead.)
There is an error but it doesn't automatically get propagated to userspace. So, for instance, a SYNCOBJ_WAIT ioctl won't return an error if it sees a fence error. It needs to get caught by the driver and returned through a driver ioctl somehow.
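i.e. something along these lines in the driver (only dma_fence_get_status() is real API here; where the fence comes from and which driver ioctl hands the result back out is up to you):

#include <linux/dma-fence.h>

/* dma_fence_get_status() returns <0 if the fence signaled with an error,
 * 0 if it is still pending, and 1 if it signaled cleanly. A driver-specific
 * query ioctl can use it to report the job's error to userspace. */
static int my_job_error(struct dma_fence *finished_fence)
{
        int status = dma_fence_get_status(finished_fence);

        return status < 0 ? status : 0; /* what the query ioctl would report */
}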
~Faith