[PATCH v2] soc: qcom: pdr: Fix the potential deadlock

List overview All Threads
Download

newer

older

[PATCH v2] net: microchip: sparx5:...

[PATCH] arm64: dts: socfpga:...

Mukesh Ojha

29 Jan 2025 29 Jan '25

3:55 p.m.

When some client process A call pdr_add_lookup() to add the look up for the service and does schedule locator work, later a process B got a new server packet indicating locator is up and call pdr_locator_new_server() which eventually sets pdr->locator_init_complete to true which process A sees and takes list lock and queries domain list but it will timeout due to deadlock as the response will queued to the same qmi->wq and it is ordered workqueue and process B is not able to complete new server request work due to deadlock on list lock.

Process A Process B

process_scheduled_works() pdr_add_lookup() qmi_data_ready_work() process_scheduled_works() pdr_locator_new_server() pdr->locator_init_complete=true; pdr_locator_work() mutex_lock(&pdr->list_lock);

pdr_locate_service() mutex_lock(&pdr->list_lock);

pdr_get_domain_list() pr_err("PDR: %s get domain list txn wait failed: %d\n", req->service_name, ret);

Fix it by removing the unnecessary list iteration as the list iteration is already being done inside locator work, so avoid it here and just call schedule_work() here.

Fixes: fbe639b44a82 ("soc: qcom: Introduce Protection Domain Restart helpers") CC: stable@vger.kernel.org Signed-off-by: Saranya R quic_sarar@quicinc.com Signed-off-by: Mukesh Ojha mukesh.ojha@oss.qualcomm.com --- Changes in v2: - Added Fixes tag,

drivers/soc/qcom/pdr_interface.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/soc/qcom/pdr_interface.c b/drivers/soc/qcom/pdr_interface.c index 328b6153b2be..71be378d2e43 100644 --- a/drivers/soc/qcom/pdr_interface.c +++ b/drivers/soc/qcom/pdr_interface.c @@ -75,7 +75,6 @@ static int pdr_locator_new_server(struct qmi_handle *qmi, { struct pdr_handle *pdr = container_of(qmi, struct pdr_handle, locator_hdl); - struct pdr_service *pds;

mutex_lock(&pdr->lock); /* Create a local client port for QMI communication */ @@ -87,12 +86,7 @@ static int pdr_locator_new_server(struct qmi_handle *qmi, mutex_unlock(&pdr->lock);

/* Service pending lookup requests */ - mutex_lock(&pdr->list_lock); - list_for_each_entry(pds, &pdr->lookups, node) { - if (pds->need_locator_lookup) - schedule_work(&pdr->locator_work); - } - mutex_unlock(&pdr->list_lock); + schedule_work(&pdr->locator_work);

return 0; }

-- 2.34.1

Show replies by date

Bjorn Andersson

6 Feb 6 Feb

10:13 p.m.

On Wed, Jan 29, 2025 at 09:25:44PM +0530, Mukesh Ojha wrote:

...

When some client process A call pdr_add_lookup() to add the look up for the service and does schedule locator work, later a process B got a new server packet indicating locator is up and call pdr_locator_new_server() which eventually sets pdr->locator_init_complete to true which process A sees and takes list lock and queries domain list but it will timeout due to deadlock as the response will queued to the same qmi->wq and it is ordered workqueue and process B is not able to complete new server request work due to deadlock on list lock.
   Process A                        Process B

                                 process_scheduled_works()
pdr_add_lookup() qmi_data_ready_work() process_scheduled_works() pdr_locator_new_server() pdr->locator_init_complete=true; pdr_locator_work() mutex_lock(&pdr->list_lock);
 pdr_locate_service()                  mutex_lock(&pdr->list_lock);

  pdr_get_domain_list()
   pr_err("PDR: %s get domain list
           txn wait failed: %d\n",
           req->service_name,
           ret);
Fix it by removing the unnecessary list iteration as the list iteration is already being done inside locator work, so avoid it here and just call schedule_work() here.

I came to the same patch while looking into the issue related to in-kernel pd-mapper reported here: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/

So: Reviewed-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com Tested-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com

...

Fixes: fbe639b44a82 ("soc: qcom: Introduce Protection Domain Restart helpers") CC: stable@vger.kernel.org Signed-off-by: Saranya R quic_sarar@quicinc.com

Can we please use full names?

...

Signed-off-by: Mukesh Ojha mukesh.ojha@oss.qualcomm.com

Unfortunately I can't merge this; Saranya's S-o-b comes first which implies that she authored the patch, but you're listed as author.

Regards, Bjorn

...

Changes in v2:

Added Fixes tag,

drivers/soc/qcom/pdr_interface.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/soc/qcom/pdr_interface.c b/drivers/soc/qcom/pdr_interface.c index 328b6153b2be..71be378d2e43 100644 --- a/drivers/soc/qcom/pdr_interface.c +++ b/drivers/soc/qcom/pdr_interface.c @@ -75,7 +75,6 @@ static int pdr_locator_new_server(struct qmi_handle *qmi, { struct pdr_handle *pdr = container_of(qmi, struct pdr_handle, locator_hdl);

struct pdr_service *pds;

mutex_lock(&pdr->lock); /* Create a local client port for QMI communication */ @@ -87,12 +86,7 @@ static int pdr_locator_new_server(struct qmi_handle *qmi, mutex_unlock(&pdr->lock); /* Service pending lookup requests */
mutex_lock(&pdr->list_lock);

list_for_each_entry(pds, &pdr->lookups, node) {
if (pds->need_locator_lookup)
	schedule_work(&pdr->locator_work);
}

mutex_unlock(&pdr->list_lock);
schedule_work(&pdr->locator_work);

return 0; } -- 2.34.1

Mukesh Ojha

10 Feb 10 Feb

9:20 a.m.

On Thu, Feb 06, 2025 at 04:13:25PM -0600, Bjorn Andersson wrote:

...

On Wed, Jan 29, 2025 at 09:25:44PM +0530, Mukesh Ojha wrote:

...
When some client process A call pdr_add_lookup() to add the look up for the service and does schedule locator work, later a process B got a new server packet indicating locator is up and call pdr_locator_new_server() which eventually sets pdr->locator_init_complete to true which process A sees and takes list lock and queries domain list but it will timeout due to deadlock as the response will queued to the same qmi->wq and it is ordered workqueue and process B is not able to complete new server request work due to deadlock on list lock.
   Process A                        Process B

                                 process_scheduled_works()
pdr_add_lookup() qmi_data_ready_work() process_scheduled_works() pdr_locator_new_server() pdr->locator_init_complete=true; pdr_locator_work() mutex_lock(&pdr->list_lock);
 pdr_locate_service()                  mutex_lock(&pdr->list_lock);

  pdr_get_domain_list()
   pr_err("PDR: %s get domain list
           txn wait failed: %d\n",
           req->service_name,
           ret);
Fix it by removing the unnecessary list iteration as the list iteration is already being done inside locator work, so avoid it here and just call schedule_work() here.
I came to the same patch while looking into the issue related to in-kernel pd-mapper reported here: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/

So: Reviewed-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com Tested-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com

Thanks.,

...

...
Fixes: fbe639b44a82 ("soc: qcom: Introduce Protection Domain Restart helpers") CC: stable@vger.kernel.org Signed-off-by: Saranya R quic_sarar@quicinc.com

Can we please use full names?

I have informed her about this and she wants the same name to be mentioned here as well.

...

...
Signed-off-by: Mukesh Ojha mukesh.ojha@oss.qualcomm.com

Unfortunately I can't merge this; Saranya's S-o-b comes first which implies that she authored the patch, but you're listed as author.

I will correct it.

-Mukesh

...

Regards, Bjorn

...
Changes in v2:

Added Fixes tag,

drivers/soc/qcom/pdr_interface.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/soc/qcom/pdr_interface.c b/drivers/soc/qcom/pdr_interface.c index 328b6153b2be..71be378d2e43 100644 --- a/drivers/soc/qcom/pdr_interface.c +++ b/drivers/soc/qcom/pdr_interface.c @@ -75,7 +75,6 @@ static int pdr_locator_new_server(struct qmi_handle *qmi, { struct pdr_handle *pdr = container_of(qmi, struct pdr_handle, locator_hdl);

struct pdr_service *pds;

mutex_lock(&pdr->lock); /* Create a local client port for QMI communication */ @@ -87,12 +86,7 @@ static int pdr_locator_new_server(struct qmi_handle *qmi, mutex_unlock(&pdr->lock); /* Service pending lookup requests */
mutex_lock(&pdr->list_lock);

list_for_each_entry(pds, &pdr->lookups, node) {
if (pds->need_locator_lookup)
	schedule_work(&pdr->locator_work);
}

mutex_unlock(&pdr->list_lock);
schedule_work(&pdr->locator_work);

return 0; } -- 2.34.1

Johan Hovold

9:43 a.m.

On Mon, Feb 10, 2025 at 02:50:18PM +0530, Mukesh Ojha wrote:

...

On Thu, Feb 06, 2025 at 04:13:25PM -0600, Bjorn Andersson wrote:

...
On Wed, Jan 29, 2025 at 09:25:44PM +0530, Mukesh Ojha wrote:

...
When some client process A call pdr_add_lookup() to add the look up for the service and does schedule locator work, later a process B got a new server packet indicating locator is up and call pdr_locator_new_server() which eventually sets pdr->locator_init_complete to true which process A sees and takes list lock and queries domain list but it will timeout due to deadlock as the response will queued to the same qmi->wq and it is ordered workqueue and process B is not able to complete new server request work due to deadlock on list lock.
   Process A                        Process B

                                 process_scheduled_works()
pdr_add_lookup() qmi_data_ready_work() process_scheduled_works() pdr_locator_new_server() pdr->locator_init_complete=true; pdr_locator_work() mutex_lock(&pdr->list_lock);
 pdr_locate_service()                  mutex_lock(&pdr->list_lock);

  pdr_get_domain_list()
   pr_err("PDR: %s get domain list
           txn wait failed: %d\n",
           req->service_name,
           ret);
Fix it by removing the unnecessary list iteration as the list iteration is already being done inside locator work, so avoid it here and just call schedule_work() here.
I came to the same patch while looking into the issue related to in-kernel pd-mapper reported here: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/

So: Reviewed-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com Tested-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com

I was gonna ask if you have confirmed that this indeed fixes the audio regression with the in-kernel pd-mapper?

Is this how you discovered the issue as well, Mukesh and Saranya?

If so, please mention that in the commit message, but in any case also include the corresponding error messages directly so that people running into this can find the fix more easily. (I see the pr_err now, but it's not as greppable).

A Link tag to my report would be good to have as well if this fixes the audio regression.

Johan

Mukesh Ojha

11 Feb 11 Feb

5:07 p.m.

On Mon, Feb 10, 2025 at 10:43:23AM +0100, Johan Hovold wrote:

...

On Mon, Feb 10, 2025 at 02:50:18PM +0530, Mukesh Ojha wrote:

...
On Thu, Feb 06, 2025 at 04:13:25PM -0600, Bjorn Andersson wrote:

...
On Wed, Jan 29, 2025 at 09:25:44PM +0530, Mukesh Ojha wrote:

...
When some client process A call pdr_add_lookup() to add the look up for the service and does schedule locator work, later a process B got a new server packet indicating locator is up and call pdr_locator_new_server() which eventually sets pdr->locator_init_complete to true which process A sees and takes list lock and queries domain list but it will timeout due to deadlock as the response will queued to the same qmi->wq and it is ordered workqueue and process B is not able to complete new server request work due to deadlock on list lock.
   Process A                        Process B

                                 process_scheduled_works()
pdr_add_lookup() qmi_data_ready_work() process_scheduled_works() pdr_locator_new_server() pdr->locator_init_complete=true; pdr_locator_work() mutex_lock(&pdr->list_lock);
 pdr_locate_service()                  mutex_lock(&pdr->list_lock);

  pdr_get_domain_list()
   pr_err("PDR: %s get domain list
           txn wait failed: %d\n",
           req->service_name,
           ret);
Fix it by removing the unnecessary list iteration as the list iteration is already being done inside locator work, so avoid it here and just call schedule_work() here.
I came to the same patch while looking into the issue related to in-kernel pd-mapper reported here: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/

So: Reviewed-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com Tested-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com

Should i add this in next version ?

...

I was gonna ask if you have confirmed that this indeed fixes the audio regression with the in-kernel pd-mapper?

Is this how you discovered the issue as well, Mukesh and Saranya?

No, we are not using in kernel pd-mapper yet in downstream..

...

If so, please mention that in the commit message, but in any case also include the corresponding error messages directly so that people running into this can find the fix more easily. (I see the pr_err now, but it's not as greppable).

Below is the sample log which got in downstream when we hit this issue

13.799119: PDR: tms/servreg get domain list txn wait failed: -110 13.799146: PDR: service lookup for msm/adsp/sensor_pd:tms/servreg failed: -110

...

A Link tag to my report would be good to have as well if this fixes the audio regression.

I see this is somehow matching the logs you have reported, but this deadlock is there from the very first day of pdr_interface driver.

[ 14.565059] PDR: avs/audio get domain list txn wait failed: -110 [ 14.571943] PDR: service lookup for avs/audio failed: -110

-Mukesh

...

Johan

Johan Hovold

12 Feb 12 Feb

9:37 a.m.

On Tue, Feb 11, 2025 at 10:37:11PM +0530, Mukesh Ojha wrote:

...

On Mon, Feb 10, 2025 at 10:43:23AM +0100, Johan Hovold wrote:

...
On Mon, Feb 10, 2025 at 02:50:18PM +0530, Mukesh Ojha wrote:

...
On Thu, Feb 06, 2025 at 04:13:25PM -0600, Bjorn Andersson wrote:

...

...
...
...
I came to the same patch while looking into the issue related to in-kernel pd-mapper reported here: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/

So: Reviewed-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com Tested-by: Bjorn Andersson bjorn.andersson@oss.qualcomm.com

Should i add this in next version ?

Yes, if there is another revision.

...

...
I was gonna ask if you have confirmed that this indeed fixes the audio regression with the in-kernel pd-mapper?

Is this how you discovered the issue as well, Mukesh and Saranya?

No, we are not using in kernel pd-mapper yet in downstream..

Ok, thanks for confirming.

...

...
If so, please mention that in the commit message, but in any case also include the corresponding error messages directly so that people running into this can find the fix more easily. (I see the pr_err now, but it's not as greppable).

Below is the sample log which got in downstream when we hit this issue

13.799119: PDR: tms/servreg get domain list txn wait failed: -110 13.799146: PDR: service lookup for msm/adsp/sensor_pd:tms/servreg failed: -110

I think it would be good to include this (without the time stamp) as an example as it would make it easier to find this fix even if the failure happens for another service.

...

...
A Link tag to my report would be good to have as well if this fixes the audio regression.

I see this is somehow matching the logs you have reported, but this deadlock is there from the very first day of pdr_interface driver.

[ 14.565059] PDR: avs/audio get domain list txn wait failed: -110 [ 14.571943] PDR: service lookup for avs/audio failed: -110

Yes, but using the in-kernel pd-mapper has exposed a number of existing bugs since it changes the timing of events enough to make it easier to hit them.

The audio regression is a very real regression for users of Snapdragon based laptops like, for example, the Lenovo Yoga Slim 7x.

If Bjorn has confirmed that this is the same issue (I can try to instrument the code based on your analysis to confirm this too), then I think it would be good to mention this in the commit message and link to the report, for example:

This specifically also fixes an audio regression when using the in-kernel pd-mapper as that makes it easier to hit this race. [1]

Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ # [1]

or similar.

Johan

Johan Hovold

10:59 a.m.

On Wed, Feb 12, 2025 at 10:37:35AM +0100, Johan Hovold wrote:

...

On Tue, Feb 11, 2025 at 10:37:11PM +0530, Mukesh Ojha wrote:

...
On Mon, Feb 10, 2025 at 10:43:23AM +0100, Johan Hovold wrote:

...

...
...
A Link tag to my report would be good to have as well if this fixes the audio regression.

I see this is somehow matching the logs you have reported, but this deadlock is there from the very first day of pdr_interface driver.

[ 14.565059] PDR: avs/audio get domain list txn wait failed: -110 [ 14.571943] PDR: service lookup for avs/audio failed: -110

Yes, but using the in-kernel pd-mapper has exposed a number of existing bugs since it changes the timing of events enough to make it easier to hit them.

The audio regression is a very real regression for users of Snapdragon based laptops like, for example, the Lenovo Yoga Slim 7x.

If Bjorn has confirmed that this is the same issue (I can try to instrument the code based on your analysis to confirm this too), then I think it would be good to mention this in the commit message and link to the report, for example:

This specifically also fixes an audio regression when using the in-kernel pd-mapper as that makes it easier to hit this race. [1]

Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ # [1]

or similar.

I can confirm that audio regression with the in-kernel pd-mapper appears to be caused by the race that this patch fixes.

If I insert a short (100-200 ms) sleep before taking the list lock in pdr_locator_new_server() to increase the race window, I see the audio service failing to register on both the X1E CRD and Lenovo ThinkPad X13s:

[ 11.118557] pdr_add_lookup - tms/servreg / msm/adsp/charger_pd [ 11.443632] pdr_locator_new_server [ 11.558122] pdr_locator_new_server - taking list lock [ 11.563939] pdr_locator_new_server - releasing list lock [ 11.582178] pdr_locator_work - taking list lock [ 11.594468] pdr_locator_work - releasing list lock [ 11.992018] pdr_add_lookup - avs/audio / msm/adsp/audio_pd [ 11.992034] pdr_add_lookup - avs/audio / msm/adsp/audio_pd [ 11.992224] pdr_locator_new_server < 100 ms sleep inserted before taking lock in pdr_locator_new_server() > [ 11.997937] pdr_locator_work - taking list lock [ 12.093984] pdr_locator_new_server - taking list lock [ 17.120169] PDR: avs/audio get domain list txn wait failed: -110 [ 17.127066] PDR: service lookup for avs/audio failed: -110 [ 17.132893] pdr_locator_work - releasing list lock [ 17.139885] pdr_locator_new_server - releasing list lock

[ On the X13s, where I have not hit this issue with the in-kernel pd-mapper, I had to make sure to insert the sleep only on the second call, possibly because of interaction with the charger_pd registration which happened closer to the audio registration. ]

Please add a comment and link to the audio regression report as I suggested above, and feel free to include my:

Tested-by: Johan Hovold johan+linaro@kernel.org

Thanks for fixing this!

Johan

Mukesh Ojha

4:13 p.m.

On Wed, Feb 12, 2025 at 11:59:36AM +0100, Johan Hovold wrote:

...

On Wed, Feb 12, 2025 at 10:37:35AM +0100, Johan Hovold wrote:

...
On Tue, Feb 11, 2025 at 10:37:11PM +0530, Mukesh Ojha wrote:

...
On Mon, Feb 10, 2025 at 10:43:23AM +0100, Johan Hovold wrote:

...
...
...
A Link tag to my report would be good to have as well if this fixes the audio regression.

I see this is somehow matching the logs you have reported, but this deadlock is there from the very first day of pdr_interface driver.

[ 14.565059] PDR: avs/audio get domain list txn wait failed: -110 [ 14.571943] PDR: service lookup for avs/audio failed: -110

Yes, but using the in-kernel pd-mapper has exposed a number of existing bugs since it changes the timing of events enough to make it easier to hit them.

The audio regression is a very real regression for users of Snapdragon based laptops like, for example, the Lenovo Yoga Slim 7x.

If Bjorn has confirmed that this is the same issue (I can try to instrument the code based on your analysis to confirm this too), then I think it would be good to mention this in the commit message and link to the report, for example:

This specifically also fixes an audio regression when using the in-kernel pd-mapper as that makes it easier to hit this race. [1]

Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ # [1]

or similar.

I can confirm that audio regression with the in-kernel pd-mapper appears to be caused by the race that this patch fixes.

If I insert a short (100-200 ms) sleep before taking the list lock in pdr_locator_new_server() to increase the race window, I see the audio service failing to register on both the X1E CRD and Lenovo ThinkPad X13s:

[ 11.118557] pdr_add_lookup - tms/servreg / msm/adsp/charger_pd [ 11.443632] pdr_locator_new_server [ 11.558122] pdr_locator_new_server - taking list lock [ 11.563939] pdr_locator_new_server - releasing list lock [ 11.582178] pdr_locator_work - taking list lock [ 11.594468] pdr_locator_work - releasing list lock [ 11.992018] pdr_add_lookup - avs/audio / msm/adsp/audio_pd [ 11.992034] pdr_add_lookup - avs/audio / msm/adsp/audio_pd [ 11.992224] pdr_locator_new_server < 100 ms sleep inserted before taking lock in pdr_locator_new_server() > [ 11.997937] pdr_locator_work - taking list lock [ 12.093984] pdr_locator_new_server - taking list lock [ 17.120169] PDR: avs/audio get domain list txn wait failed: -110 [ 17.127066] PDR: service lookup for avs/audio failed: -110 [ 17.132893] pdr_locator_work - releasing list lock [ 17.139885] pdr_locator_new_server - releasing list lock

[ On the X13s, where I have not hit this issue with the in-kernel pd-mapper, I had to make sure to insert the sleep only on the second call, possibly because of interaction with the charger_pd registration which happened closer to the audio registration. ]

Please add a comment and link to the audio regression report as I suggested above, and feel free to include my:

Tested-by: Johan Hovold johan+linaro@kernel.org

Thanks for fixing this!

Thanks for putting extra effort in reproducing the issue and testing.

-Mukesh

...

Johan

321

days inactive

335

days old

linux-stable-mirror@lists.linaro.org

7 comments

participants

tags (0)

participants (3)

Bjorn Andersson
Johan Hovold
Mukesh Ojha