On 05/02/2020 14:01, Boris Brezillon wrote:
On Wed, 5 Feb 2020 13:39:21 +0000 Steven Price steven.price@arm.com wrote:
On 04/02/2020 14:35, Boris Brezillon wrote:
Jobs can be in-flight when the file descriptor is closed (either because the process did not terminate properly, or because it didn't wait for all GPU jobs to be finished), and apparently panfrost_job_close() does not cancel already running jobs. Let's refcount the MMU context object so it's lifetime is no longer bound to the FD lifetime and running jobs can finish properly without generating spurious page faults.
Is there any good reason not to just make panfrost_job_close() kill off any running jobs?
Nope, I just didn't know how to do that without stopping all other jobs (should have looked at how mali_kbase is doing that before posting this patch :)).
Yeah - this is an area Panfrost might need some improvement, you need to target the HARD_STOP carefully to ensure you kill of the correct jobs (and only the correct ones). It's not too difficult at the moment because we still don't have the _NEXT registers actively in use (I must get round to reposting that having fixed up the devfreq integration).
I'm not sure what the benefit is of allowing the jobs to still run after the file descriptor has closed.
None that I can think of.
In particular this could cause problems when(/if) Panfrost starts trying to deal with "compute" work loads that might have long runtimes. It's quite possible to produce a job which never (naturally) exits, currently we have a simplistic timeout which kills anything which doesn't complete promptly. However there is nothing conceptually wrong with a job which takes seconds (or even minutes) to complete.
Absolutely. That was also one of my concerns.
The hardware has support for task switching ('soft stopping') between jobs so this can be done to prevent blocking other applications.
Okay. I guess it's implemented in mali_kbase. I'll have a look.
If panfrost_job_close() doesn't kill the jobs then removing the timeouts could lead to the situation where there is an 'infinite' job with no owner and no way of killing it off. Which doesn't seem like a great feature ;)
Didn't know you were planning to remove the timeouts.
Well I don't have any immediate plans, but when(/if) there's interest in compute the relatively short timeouts for graphics tend to get in the way. I'm not in a hurry to do it because it will make the scheduling more complex :)
Another approach could be simply to silence the page fault output in this case - switching the address space to UNMAPPED is actually an effective way of killing jobs - at some point I think this was a workaround to a hardware bug, but IIRC that was unreleased hardware :)
Okay. I'll check how it's done in mali_kbase.
Thanks for the feedback.
Let me know if you need any pointers about how mali_kbase does it - the code is not exactly the prettiest ;)
Steve