On Fri, 2023-03-10 at 18:58 +0900, Asahi Lina wrote:
On 10/03/2023 04.59, Faith Ekstrand wrote:
On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
On 09/03/2023 17.42, Christian König wrote:
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Yeah, we tried to do a bit of that in the GL days. It was a bad idea.
I think I should clarify: I was proposing re-queueing innocent jobs from innocent queues/VMs that were impacted by a fault. The reason is that we may be able to tweak firmware state to force it to do that safely, during the firmware recovery cycle, such that an aborted job restarts and then subsequent jobs/commands continue as normal. We can't leave it to userspace because if we do nothing, the affected job ends up incomplete but then everything after it that is already queued still runs, and that is definitely a recipe for a bigger mess if userspace wants to seamlessly recover. The firmware recovery cycle is a "stop-the-world" situation for the GPU (the firmware literally busy-loops waiting for the driver to set a continue flag in memory...), so that's the only real chance that the driver gets to make decisions about what is going to happen next.
Ok, that makes sense. Yes, if you have other jobs on other queues and are able to recover everything that isn't in the faulting VM, that's a good thing. I wasn't sure how hang/fault recovery worked on AGX. In that case, I don't think there's a dma_fence problem. As long as you keep recovering and killing off any faulting contexts, eventually the good contexts should make progress and those fences should signal.
Of course, the firmware recovery cycle may be complex and need (or at least appear to need) memory allocation or similar, and that's where everything gets hairy. Hopefully, though, if you've already got the resources from the old context, you can re-use them after a bit of clean-up work and still get deterministic and reliable recovery cycles.
Of course, that only works if individual possibly concurrently running commands are idempotent, but I think a lot of typical GPU work is?
No, that's not a valid assumption. For a single 3D render pass which doesn't do any image or SSBO access, it may be possible to re-run it. However, that won't be true of compute work and isn't necessarily true of back-to-back passes. Lots of modern apps do temporal stuff where one frame depends on the previous and a re-run might screw that up. Also, with Vulkan's memory aliasing, it's hard to tell just from which resources are accessed whether or not a command buffer leaves its input memory undamaged.
(E.g. any render pass without side effects other than the render targets and where the background shader does no loads, or even render passes that do loads but where all draws are opaque, which are all things the current Gallium driver is intimately familiar with since Crazy Tiler Optimizations™ need that info to be provided anyway). So I was wondering whether it'd make sense to have such an idempotency/restartable flag on job submission, and then the driver would do its best to recover and rerun it if it gets killed by an unrelated concurrent bad job.
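To make that concrete, I was imagining a per-command flag in the submit uAPI, something like this (entirely made-up names, just a sketch of the idea, not actual uAPI):

#include <drm/drm.h>

/* Hypothetical flag: userspace promises this command produces the same
 * result if it is re-run from scratch, so the kernel may transparently
 * restart it after an unrelated fault killed it mid-flight. */
#define DRM_ASAHI_SUBMIT_RESTARTABLE (1 << 0)

struct drm_asahi_cmd {
        __u32 flags;      /* DRM_ASAHI_SUBMIT_RESTARTABLE if idempotent */
        __u32 cmd_type;
        __u64 cmd_buffer; /* GPU pointer to the actual command descriptor */
        /* ... */
};

Mesa would only set it for the cases it can already prove (no loads, all-opaque draws, no side effects), and the kernel would simply ignore it if recovery turns out not to be possible.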
Then again this all depends on an investigation into what we *can* do during firmware recovery that hasn't happened at all yet. It might be that it isn't safe to do anything really, or that doing things depends on touching even deeper firmware state structs that we treat as opaque right now and we really don't want to have to touch...
But maybe none of this is worth it in practice, it just sounded like it could be useful maybe?
Maybe? It's not clear to me that such a flag would be useful or even practical to provide from the Mesa side. Ideally, you'd be able to figure out when a fault happens, what VM it happened in and exactly what work was in-flight when it happened and only kill the one guilty VM. However, it sounds like your understanding of the firmware is currently rough enough that doing so may not be practical. In that case, the best thing to do is to kill any VMs which were on the GPU at the time and hope the individual apps are able to recover.
Now that I look at it, we have a lovely "what is this flag doing anyway" bit already passed from Mesa through to the firmware, which we called ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S. It is actually getting set when any attachment (color, Z, or S) is not being cleared for that pass (so it's loaded). That could very well be an "is not idempotent" flag... and maybe that means the firmware does this for us already? Sounds like something to test... I might have some 16Kx16K GLmark runs to do concurrent with an evil faulting job now ^^ (and then that also means we need to set it when shaders have side effects and stuff, which right now we don't).
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc.) would rather just start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
The guarantee that Vulkan makes is that, if you idle the GPU and you haven't gotten a DEVICE_LOST yet, your data is good. If you get a DEVICE_LOST, all bets are off. The problem is that, no matter how fast the error propagation may be in the kernel or userspace driver, errors can still show up in strange ways. An OOB buffer access could end up modifying a shader binary which gets run 3 frames later and causes a corruption. Once you've faulted, you really have no idea how far back is good or what memory is corrupted. You have to assume that everything mapped to the GPU VA space is potentially toast.
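In code terms, the contract is roughly this (plain Vulkan, nothing driver-specific):

#include <stdbool.h>
#include <vulkan/vulkan.h>

/* After idling the queue, either there was no DEVICE_LOST and everything
 * written so far is intact, or there was and nothing mapped to the GPU VA
 * space can be trusted anymore. */
static bool gpu_results_trustworthy(VkQueue queue)
{
        VkResult res = vkQueueWaitIdle(queue);

        if (res == VK_ERROR_DEVICE_LOST)
                return false; /* all bets are off; recreate the device */

        return res == VK_SUCCESS; /* idle and no DEVICE_LOST: data is good */
}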
Yes of course, for the actually faulting VM all bets are off after a fault (though we can try a bit harder at least... I have a READ_ONLY BO flag now, I should set it on the shader pools!).
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional.
This is fine, in principle. Because of the nature of errors, async is fine as long as the error shows up eventually. Faster is better, for sure, but error latency doesn't really matter in practice.
However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job completion fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched-side fence as errored myself, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc. paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
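For reference, what I do on the hw fence side boils down to the standard dma_fence pattern, roughly this in C terms (the driver itself is Rust, and the function name here is just a placeholder):

#include <linux/dma-fence.h>

/* When the firmware reports a fault for a command, record the error on the
 * hardware fence before signaling it; dma_fence_set_error() only has an
 * effect if it is called before dma_fence_signal(). */
static void fault_complete_fence(struct dma_fence *hw_fence, int err)
{
        dma_fence_set_error(hw_fence, err); /* e.g. -ENODEV, -EIO, ... */
        dma_fence_signal(hw_fence);
}

The question is just whether anything on the drm_sched side ever looks at that error.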
The end behavior needs to be that all fences for all jobs submitted to the queue get signaled. That's needed to satisfy the finite-time guarantees of dma_fence. Exactly how that happens (let the job run, abort all the jobs, etc.) is an implementation detail for the driver to decide. You can also set a bit on the context (or queue) to mark it as dead and start returning EIO or similar from any ioctls that try to submit more work. Not required, but you can.
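i.e. something like this in the submit path (sketch only; the struct and lookup helper are made up):

#include <linux/types.h>
#include <linux/errno.h>
#include <drm/drm_device.h>
#include <drm/drm_file.h>

struct my_queue {
        bool dead; /* set when a fault killed this queue */
        /* ... */
};

/* Placeholder for however the driver resolves its queue object. */
struct my_queue *my_queue_lookup(struct drm_file *file, void *data);

static int my_submit_ioctl(struct drm_device *dev, void *data,
                           struct drm_file *file)
{
        struct my_queue *queue = my_queue_lookup(file, data);

        if (!queue)
                return -ENOENT;
        if (queue->dead)
                return -EIO; /* userspace maps this to VK_ERROR_DEVICE_LOST */

        /* ... normal job submission ... */
        return 0;
}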
Fences have an error flag though, does that get reported to userspace somehow? I thought it did, but maybe not, or maybe only drm_sched not propagating it is the issue?
In other words, absent my fancy stats reporting BO system, what is the normal way that an explicit sync driver signals to userspace that the job associated with a syncobj has failed?
One is via the return value from exec/submit. Often there's also a query mechanism for more detailed information. It's not particularly standard at the moment, I'm afraid. I could point you at i915 but I wouldn't call that uAPI something to be emulated, in general.
(If there is no way, then I'll probably want to change the stats BO system to be configurable, so if you ask for no stats/time info, you only get overall job status and faults, which has less overhead.)
There is an error but it doesn't automatically get propagated to userspace. So, for instance, a SYNCOBJ_WAIT ioctl won't return an error if it sees a fence error. It needs to get caught by the driver and returned through a driver ioctl somehow.
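i.e. something along these lines in the driver (only dma_fence_get_status() is real API here; where the fence comes from and which driver ioctl hands the result back out is up to you):

#include <linux/dma-fence.h>

/* dma_fence_get_status() returns <0 if the fence signaled with an error,
 * 0 if it is still pending, and 1 if it signaled cleanly. A driver-specific
 * query ioctl can use it to report the job's error to userspace. */
static int my_job_error(struct dma_fence *finished_fence)
{
        int status = dma_fence_get_status(finished_fence);

        return status < 0 ? status : 0; /* what the query ioctl would report */
}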
~Faith