On 04/02/2020 14:35, Boris Brezillon wrote:
Jobs can be in-flight when the file descriptor is closed (either because the process did not terminate properly, or because it didn't wait for all GPU jobs to be finished), and apparently panfrost_job_close() does not cancel already running jobs. Let's refcount the MMU context object so it's lifetime is no longer bound to the FD lifetime and running jobs can finish properly without generating spurious page faults.
Is there any good reason not to just make panfrost_job_close() kill off any running jobs? I'm not sure what the benefit is of allowing the jobs to still run after the file descriptor has closed.
In particular this could cause problems when(/if) Panfrost starts trying to deal with "compute" work loads that might have long runtimes. It's quite possible to produce a job which never (naturally) exits, currently we have a simplistic timeout which kills anything which doesn't complete promptly. However there is nothing conceptually wrong with a job which takes seconds (or even minutes) to complete. The hardware has support for task switching ('soft stopping') between jobs so this can be done to prevent blocking other applications.
If panfrost_job_close() doesn't kill the jobs then removing the timeouts could lead to the situation where there is an 'infinite' job with no owner and no way of killing it off. Which doesn't seem like a great feature ;)
Another approach could be simply to silence the page fault output in this case - switching the address space to UNMAPPED is actually an effective way of killing jobs - at some point I think this was a workaround to a hardware bug, but IIRC that was unreleased hardware :)
Steve