On Mon, Feb 10, 2025 at 10:43:23AM +0100, Johan Hovold wrote:
On Mon, Feb 10, 2025 at 02:50:18PM +0530, Mukesh Ojha wrote:
On Thu, Feb 06, 2025 at 04:13:25PM -0600, Bjorn Andersson wrote:
On Wed, Jan 29, 2025 at 09:25:44PM +0530, Mukesh Ojha wrote:
When some client process A call pdr_add_lookup() to add the look up for the service and does schedule locator work, later a process B got a new server packet indicating locator is up and call pdr_locator_new_server() which eventually sets pdr->locator_init_complete to true which process A sees and takes list lock and queries domain list but it will timeout due to deadlock as the response will queued to the same qmi->wq and it is ordered workqueue and process B is not able to complete new server request work due to deadlock on list lock.
Process A Process B process_scheduled_works()
pdr_add_lookup() qmi_data_ready_work() process_scheduled_works() pdr_locator_new_server() pdr->locator_init_complete=true; pdr_locator_work() mutex_lock(&pdr->list_lock);
pdr_locate_service() mutex_lock(&pdr->list_lock); pdr_get_domain_list() pr_err("PDR: %s get domain list txn wait failed: %d\n", req->service_name, ret);
Fix it by removing the unnecessary list iteration as the list iteration is already being done inside locator work, so avoid it here and just call schedule_work() here.
I came to the same patch while looking into the issue related to in-kernel pd-mapper reported here: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/
So: Reviewed-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com Tested-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com
Should i add this in next version ?
I was gonna ask if you have confirmed that this indeed fixes the audio regression with the in-kernel pd-mapper?
Is this how you discovered the issue as well, Mukesh and Saranya?
No, we are not using in kernel pd-mapper yet in downstream..
If so, please mention that in the commit message, but in any case also include the corresponding error messages directly so that people running into this can find the fix more easily. (I see the pr_err now, but it's not as greppable).
Below is the sample log which got in downstream when we hit this issue
13.799119: PDR: tms/servreg get domain list txn wait failed: -110 13.799146: PDR: service lookup for msm/adsp/sensor_pd:tms/servreg failed: -110
A Link tag to my report would be good to have as well if this fixes the audio regression.
I see this is somehow matching the logs you have reported, but this deadlock is there from the very first day of pdr_interface driver.
[ 14.565059] PDR: avs/audio get domain list txn wait failed: -110 [ 14.571943] PDR: service lookup for avs/audio failed: -110
-Mukesh
Johan