Francois reported that VMware balloon gets stuck after a balloon reset, when the VMCI doorbell is removed. A similar error can occur when the balloon driver is removed with the following splat:
[ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds. [ 1088.622035] Tainted: G W 5.2.0 #4 [ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1088.622205] modprobe D 0 3565 1450 0x00000000 [ 1088.622210] Call Trace: [ 1088.622246] __schedule+0x2a8/0x690 [ 1088.622248] schedule+0x2d/0x90 [ 1088.622250] schedule_timeout+0x1d3/0x2f0 [ 1088.622252] wait_for_completion+0xba/0x140 [ 1088.622320] ? wake_up_q+0x80/0x80 [ 1088.622370] vmci_resource_remove+0xb9/0xc0 [vmw_vmci] [ 1088.622373] vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci] [ 1088.622379] vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon] [ 1088.622381] vmballoon_exit+0x18/0xcc8 [vmw_balloon] [ 1088.622394] __x64_sys_delete_module+0x146/0x280 [ 1088.622408] do_syscall_64+0x5a/0x130 [ 1088.622410] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1088.622415] RIP: 0033:0x7f54f62791b7 [ 1088.622421] Code: Bad RIP value. [ 1088.622421] RSP: 002b:00007fff2a949008 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 1088.622426] RAX: ffffffffffffffda RBX: 000055dff8b55d00 RCX: 00007f54f62791b7 [ 1088.622426] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055dff8b55d68 [ 1088.622427] RBP: 000055dff8b55d00 R08: 00007fff2a947fb1 R09: 0000000000000000 [ 1088.622427] R10: 00007f54f62f5cc0 R11: 0000000000000206 R12: 000055dff8b55d68 [ 1088.622428] R13: 0000000000000001 R14: 000055dff8b55d68 R15: 00007fff2a94a3f0
The cause for the bug is that when the "delayed" doorbell is invoked, it takes a reference on the doorbell entry and schedules work that is supposed to run the appropriate code and drop the doorbell entry reference. The code ignores the fact that if the work is already queued, it will not be scheduled to run one more time. As a result one of the references would not be dropped. When the code waits for the reference to get to zero, during balloon reset or module removal, it gets stuck.
Fix it. Drop the reference if schedule_work() indicates that the work is already queued.
Note that this bug got more apparent (or apparent at all) due to commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").
Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.") Reported-by: Francois Rigault rigault.francois@gmail.com Cc: Jorgen Hansen jhansen@vmware.com Cc: Adit Ranadive aditr@vmware.com Cc: Alexios Zavras alexios.zavras@intel.com Cc: Vishnu DASA vdasa@vmware.com Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit namit@vmware.com --- drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/misc/vmw_vmci/vmci_doorbell.c b/drivers/misc/vmw_vmci/vmci_doorbell.c index bad89b6e0802..345addd9306d 100644 --- a/drivers/misc/vmw_vmci/vmci_doorbell.c +++ b/drivers/misc/vmw_vmci/vmci_doorbell.c @@ -310,7 +310,8 @@ int vmci_dbell_host_context_notify(u32 src_cid, struct vmci_handle handle)
entry = container_of(resource, struct dbell_entry, resource); if (entry->run_delayed) { - schedule_work(&entry->work); + if (!schedule_work(&entry->work)) + vmci_resource_put(resource); } else { entry->notify_cb(entry->client_data); vmci_resource_put(resource); @@ -361,7 +362,8 @@ static void dbell_fire_entries(u32 notify_idx) atomic_read(&dbell->active) == 1) { if (dbell->run_delayed) { vmci_resource_get(&dbell->resource); - schedule_work(&dbell->work); + if (!schedule_work(&dbell->work)) + vmci_resource_put(&dbell->resource); } else { dbell->notify_cb(dbell->client_data); }
On 8/20/19, 8:48 PM, "Nadav Amit" namit@vmware.com wrote:
Francois reported that VMware balloon gets stuck after a balloon reset, when the VMCI doorbell is removed. A similar error can occur when the balloon driver is removed with the following splat:
[ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds. [ 1088.622035] Tainted: G W 5.2.0 #4 [ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1088.622205] modprobe D 0 3565 1450 0x00000000 [ 1088.622210] Call Trace: [ 1088.622246] __schedule+0x2a8/0x690 [ 1088.622248] schedule+0x2d/0x90 [ 1088.622250] schedule_timeout+0x1d3/0x2f0 [ 1088.622252] wait_for_completion+0xba/0x140 [ 1088.622320] ? wake_up_q+0x80/0x80 [ 1088.622370] vmci_resource_remove+0xb9/0xc0 [vmw_vmci] [ 1088.622373] vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci] [ 1088.622379] vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon] [ 1088.622381] vmballoon_exit+0x18/0xcc8 [vmw_balloon] [ 1088.622394] __x64_sys_delete_module+0x146/0x280 [ 1088.622408] do_syscall_64+0x5a/0x130 [ 1088.622410] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1088.622415] RIP: 0033:0x7f54f62791b7 [ 1088.622421] Code: Bad RIP value. [ 1088.622421] RSP: 002b:00007fff2a949008 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 1088.622426] RAX: ffffffffffffffda RBX: 000055dff8b55d00 RCX: 00007f54f62791b7 [ 1088.622426] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055dff8b55d68 [ 1088.622427] RBP: 000055dff8b55d00 R08: 00007fff2a947fb1 R09: 0000000000000000 [ 1088.622427] R10: 00007f54f62f5cc0 R11: 0000000000000206 R12: 000055dff8b55d68 [ 1088.622428] R13: 0000000000000001 R14: 000055dff8b55d68 R15: 00007fff2a94a3f0
The cause for the bug is that when the "delayed" doorbell is invoked, it takes a reference on the doorbell entry and schedules work that is supposed to run the appropriate code and drop the doorbell entry reference. The code ignores the fact that if the work is already queued, it will not be scheduled to run one more time. As a result one of the references would not be dropped. When the code waits for the reference to get to zero, during balloon reset or module removal, it gets stuck.
Fix it. Drop the reference if schedule_work() indicates that the work is already queued.
Note that this bug got more apparent (or apparent at all) due to commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").
Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.") Reported-by: Francois Rigault rigault.francois@gmail.com Cc: Jorgen Hansen jhansen@vmware.com Cc: Adit Ranadive aditr@vmware.com Cc: Alexios Zavras alexios.zavras@intel.com Cc: Vishnu DASA vdasa@vmware.com Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit namit@vmware.com
drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
Thanks for the fix, looks good to me. Reviewed-by: Vishnu Dasa vdasa@vmware.com
-- vishnu
On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
Francois reported that VMware balloon gets stuck after a balloon reset, when the VMCI doorbell is removed. A similar error can occur when the balloon driver is removed with the following splat:
<snip>
Note, google thinks your email is spam as you are not sending this from a valid vmware.com email server. Please fix this up if you want to make sure your patches actually make it through...
thanks,
greg k-h
On Sep 7, 2019, at 9:47 PM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
Francois reported that VMware balloon gets stuck after a balloon reset, when the VMCI doorbell is removed. A similar error can occur when the balloon driver is removed with the following splat:
<snip>
Note, google thinks your email is spam as you are not sending this from a valid vmware.com email server. Please fix this up if you want to make sure your patches actually make it through...
I used gmail servers since I get complaints (mainly from Linus) that our email servers use DMARC, and this causes mailing list postings to be rejected.
I will check again if our open source people got to a solution, or I would just send it using my gmail identity, while keeping the vmware authorship.
On Sun, Sep 08, 2019 at 04:04:39AM +0000, Nadav Amit wrote:
On Sep 7, 2019, at 9:47 PM, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
Francois reported that VMware balloon gets stuck after a balloon reset, when the VMCI doorbell is removed. A similar error can occur when the balloon driver is removed with the following splat:
<snip>
Note, google thinks your email is spam as you are not sending this from a valid vmware.com email server. Please fix this up if you want to make sure your patches actually make it through...
I used gmail servers since I get complaints (mainly from Linus) that our email servers use DMARC, and this causes mailing list postings to be rejected.
Well the email servers might not reject it now, but gmail itself sure does as you are lying about what your domain is :(
I will check again if our open source people got to a solution, or I would just send it using my gmail identity, while keeping the vmware authorship.
Please do, this is not a viable workflow.
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org