Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Signed-off-by: Valentin Vidic Valentin.Vidic@CARNet.hr Cc: stable@vger.kernel.org --- drivers/block/xen-blkback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..43bddc996709 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -323,6 +323,7 @@ static void xen_blkif_free(struct xen_blkif *blkif) { WARN_ON(xen_blkif_disconnect(blkif)); xen_vbd_free(&blkif->vbd); + xenbus_switch_state(blkif->be->dev, XenbusStateClosed); kfree(blkif->be->mode); kfree(blkif->be);
@@ -814,7 +815,6 @@ static void frontend_changed(struct xenbus_device *dev,
case XenbusStateClosed: xen_blkif_disconnect(be->blkif); - xenbus_switch_state(dev, XenbusStateClosed); if (xenbus_dev_is_online(dev)) break; /* fall through */
On 29/08/18 08:52, Valentin Vidic wrote:
Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Signed-off-by: Valentin Vidic Valentin.Vidic@CARNet.hr Cc: stable@vger.kernel.org
drivers/block/xen-blkback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..43bddc996709 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -323,6 +323,7 @@ static void xen_blkif_free(struct xen_blkif *blkif) { WARN_ON(xen_blkif_disconnect(blkif)); xen_vbd_free(&blkif->vbd);
- xenbus_switch_state(blkif->be->dev, XenbusStateClosed); kfree(blkif->be->mode); kfree(blkif->be);
@@ -814,7 +815,6 @@ static void frontend_changed(struct xenbus_device *dev, case XenbusStateClosed: xen_blkif_disconnect(be->blkif);
if (xenbus_dev_is_online(dev)) break; /* fall through */xenbus_switch_state(dev, XenbusStateClosed);
Did you test whether it is okay to not change state in case the device is still online?
Juergen
On 29/08/18 10:27, Valentin Vidic wrote:
On Wed, Aug 29, 2018 at 10:16:09AM +0200, Juergen Gross wrote:
Did you test whether it is okay to not change state in case the device is still online?
Not sure how to simulate that. Maybe using xl block-detach or something else?
I think this should be an action triggered by the frontend, not by dom0 (xen tools will always set "online" to 0 when removing a device, AFAIK).
I'm not sure this is relevant, but I realized this behavior is changed by your patch.
Juergen
On Wed, Aug 29, 2018 at 10:43:48AM +0200, Juergen Gross wrote:
I think this should be an action triggered by the frontend, not by dom0 (xen tools will always set "online" to 0 when removing a device, AFAIK).
I'm not sure this is relevant, but I realized this behavior is changed by your patch.
Can't find any place in xen-blkfront.c (or the rest of the kernel) where the "online" value gets set, so it seems xen tools in dom0 is the only thing that modifies this Xenbus node?
On 29/08/18 11:23, Valentin Vidic wrote:
On Wed, Aug 29, 2018 at 10:43:48AM +0200, Juergen Gross wrote:
I think this should be an action triggered by the frontend, not by dom0 (xen tools will always set "online" to 0 when removing a device, AFAIK).
I'm not sure this is relevant, but I realized this behavior is changed by your patch.
Can't find any place in xen-blkfront.c (or the rest of the kernel) where the "online" value gets set, so it seems xen tools in dom0 is the only thing that modifies this Xenbus node?
I think so, yes.
It will be set to "1" at device creation and to "0" when removing the device, both times by Xen tools.
Juergen
On Wed, Aug 29, 2018 at 08:52:14AM +0200, Valentin Vidic wrote:
Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Signed-off-by: Valentin Vidic Valentin.Vidic@CARNet.hr Cc: stable@vger.kernel.org
drivers/block/xen-blkback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..43bddc996709 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -323,6 +323,7 @@ static void xen_blkif_free(struct xen_blkif *blkif) { WARN_ON(xen_blkif_disconnect(blkif)); xen_vbd_free(&blkif->vbd);
- xenbus_switch_state(blkif->be->dev, XenbusStateClosed); kfree(blkif->be->mode); kfree(blkif->be);
@@ -814,7 +815,6 @@ static void frontend_changed(struct xenbus_device *dev, case XenbusStateClosed: xen_blkif_disconnect(be->blkif);
if (xenbus_dev_is_online(dev)) break;xenbus_switch_state(dev, XenbusStateClosed);
AFAICT, this will cause the backend to never switch to 'Closed' state until the toolstack sets online to 0, which is not good IMO.
If for example a frontend decides to close a device, the backend will stay in state 'Closing' until the toolstack actually removes the disk by setting online to 0.
This will prevent resetting blk connections, as blkback will refuse to switch to state XenbusStateInitWait unless it's at XenbusStateClosed (see the XenbusStateInitialising case in frontend_changed), which will never be reached with your patch.
Maybe the easiest solution would be to wait in the block-drbd script until the device is released? Maybe using fstat in a loop or one of the drbd tools?
Thanks, Roger.
On Wed, Sep 05, 2018 at 12:36:49PM +0200, Roger Pau Monné wrote:
On Wed, Aug 29, 2018 at 08:52:14AM +0200, Valentin Vidic wrote:
Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Signed-off-by: Valentin Vidic Valentin.Vidic@CARNet.hr Cc: stable@vger.kernel.org
drivers/block/xen-blkback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..43bddc996709 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -323,6 +323,7 @@ static void xen_blkif_free(struct xen_blkif *blkif) { WARN_ON(xen_blkif_disconnect(blkif)); xen_vbd_free(&blkif->vbd);
- xenbus_switch_state(blkif->be->dev, XenbusStateClosed); kfree(blkif->be->mode); kfree(blkif->be);
@@ -814,7 +815,6 @@ static void frontend_changed(struct xenbus_device *dev, case XenbusStateClosed: xen_blkif_disconnect(be->blkif);
if (xenbus_dev_is_online(dev)) break;xenbus_switch_state(dev, XenbusStateClosed);
AFAICT, this will cause the backend to never switch to 'Closed' state until the toolstack sets online to 0, which is not good IMO.
If for example a frontend decides to close a device, the backend will stay in state 'Closing' until the toolstack actually removes the disk by setting online to 0.
This will prevent resetting blk connections, as blkback will refuse to switch to state XenbusStateInitWait unless it's at XenbusStateClosed (see the XenbusStateInitialising case in frontend_changed), which will never be reached with your patch.
Ok, is it possible to test this somehow?
Maybe the easiest solution would be to wait in the block-drbd script until the device is released? Maybe using fstat in a loop or one of the drbd tools?
That is an option, but I don't know if it is possible to check device release like that. Including drbd-user in CC, maybe they have some idea how to solve this.
On Wed, Sep 05, 2018 at 06:27:56PM +0200, Valentin Vidic wrote:
On Wed, Sep 05, 2018 at 12:36:49PM +0200, Roger Pau Monné wrote:
On Wed, Aug 29, 2018 at 08:52:14AM +0200, Valentin Vidic wrote:
Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Signed-off-by: Valentin Vidic Valentin.Vidic@CARNet.hr Cc: stable@vger.kernel.org
drivers/block/xen-blkback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..43bddc996709 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -323,6 +323,7 @@ static void xen_blkif_free(struct xen_blkif *blkif) { WARN_ON(xen_blkif_disconnect(blkif)); xen_vbd_free(&blkif->vbd);
- xenbus_switch_state(blkif->be->dev, XenbusStateClosed); kfree(blkif->be->mode); kfree(blkif->be);
@@ -814,7 +815,6 @@ static void frontend_changed(struct xenbus_device *dev, case XenbusStateClosed: xen_blkif_disconnect(be->blkif);
if (xenbus_dev_is_online(dev)) break;xenbus_switch_state(dev, XenbusStateClosed);
AFAICT, this will cause the backend to never switch to 'Closed' state until the toolstack sets online to 0, which is not good IMO.
If for example a frontend decides to close a device, the backend will stay in state 'Closing' until the toolstack actually removes the disk by setting online to 0.
This will prevent resetting blk connections, as blkback will refuse to switch to state XenbusStateInitWait unless it's at XenbusStateClosed (see the XenbusStateInitialising case in frontend_changed), which will never be reached with your patch.
Ok, is it possible to test this somehow?
Yes, you can try booting a PV guest with pvgrub, that will cause pvgrub to open a blk connection, then close it and afterwards the OS kernel will start a new connection.
Thanks, Roger.
On Thu, Sep 06, 2018 at 06:14:21PM +0200, Roger Pau Monné wrote:
On Wed, Sep 05, 2018 at 06:27:56PM +0200, Valentin Vidic wrote:
On Wed, Sep 05, 2018 at 12:36:49PM +0200, Roger Pau Monné wrote:
On Wed, Aug 29, 2018 at 08:52:14AM +0200, Valentin Vidic wrote:
Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Signed-off-by: Valentin Vidic Valentin.Vidic@CARNet.hr Cc: stable@vger.kernel.org
drivers/block/xen-blkback/xenbus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..43bddc996709 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -323,6 +323,7 @@ static void xen_blkif_free(struct xen_blkif *blkif) { WARN_ON(xen_blkif_disconnect(blkif)); xen_vbd_free(&blkif->vbd);
- xenbus_switch_state(blkif->be->dev, XenbusStateClosed); kfree(blkif->be->mode); kfree(blkif->be);
@@ -814,7 +815,6 @@ static void frontend_changed(struct xenbus_device *dev, case XenbusStateClosed: xen_blkif_disconnect(be->blkif);
if (xenbus_dev_is_online(dev)) break;xenbus_switch_state(dev, XenbusStateClosed);
AFAICT, this will cause the backend to never switch to 'Closed' state until the toolstack sets online to 0, which is not good IMO.
If for example a frontend decides to close a device, the backend will stay in state 'Closing' until the toolstack actually removes the disk by setting online to 0.
This will prevent resetting blk connections, as blkback will refuse to switch to state XenbusStateInitWait unless it's at XenbusStateClosed (see the XenbusStateInitialising case in frontend_changed), which will never be reached with your patch.
Ok, is it possible to test this somehow?
Yes, you can try booting a PV guest with pvgrub, that will cause pvgrub to open a blk connection, then close it and afterwards the OS kernel will start a new connection.
Indeed the boot hangs after pvgrub loads the kernel and initrd, so it seems this does not work as expected.
On Wed, Sep 05, 2018 at 06:27:56PM +0200, Valentin Vidic wrote:
On Wed, Sep 05, 2018 at 12:36:49PM +0200, Roger Pau Monné wrote:
On Wed, Aug 29, 2018 at 08:52:14AM +0200, Valentin Vidic wrote:
Switching to closed state earlier can cause the block-drbd script to fail with 'Device is held open by someone':
root: /etc/xen/scripts/block-drbd: remove XENBUS_PATH=backend/vbd/6/51712 kernel: [ 2222.278235] block drbd6: State change failed: Device is held open by someone kernel: [ 2222.278304] block drbd6: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } kernel: [ 2222.278340] block drbd6: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- } root: /etc/xen/scripts/block-drbd: Writing backend/vbd/6/51712/hotplug-error /etc/xen/scripts/block-drbd failed; error detected. backend/vbd/6/51712/hotplug-status error to xenstore. root: /etc/xen/scripts/block-drbd: /etc/xen/scripts/block-drbd failed; error detected.
Very frequently it is *NOT* the "original user", that "still" holds it open, but udev, or something triggered-by-udev.
So double-checking the udev rules, or the "lvm global_filter" settings may help. You could instrument DRBD to log current->{pid,comm} on open and close, so you can better detect who the "someone" is in the message above.
Adding a small retry loop in the script may help as well.
As a side note, with DRBD 9 (which still moves too fast to be "in tree") we have the "auto-promote" feature, where no explicit promotion/demotion is necessary anymore, but DRBD (tries to) promote itself with the first "RW" opener, and demote itself once the last opener is gone.
Hth, Lars
On Fri, Sep 07, 2018 at 02:03:37PM +0200, Lars Ellenberg wrote:
Very frequently it is *NOT* the "original user", that "still" holds it open, but udev, or something triggered-by-udev.
So double-checking the udev rules, or the "lvm global_filter" settings may help. You could instrument DRBD to log current->{pid,comm} on open and close, so you can better detect who the "someone" is in the message above.
Don't think there is anything else holding the device open, because it is possible to change state to Secondary a few seconds later. But I will try to print those values in case anything interesting comes up.
Adding a small retry loop in the script may help as well.
Yes, that is an option, but it would still leave those nasty "State change failed" messages in the log. I guess there is no way to check the value of DRBD device->open_cnt from userspace?
On Fri, Sep 07, 2018 at 02:13:48PM +0200, Valentin Vidic wrote:
On Fri, Sep 07, 2018 at 02:03:37PM +0200, Lars Ellenberg wrote:
Very frequently it is *NOT* the "original user", that "still" holds it open, but udev, or something triggered-by-udev.
So double-checking the udev rules, or the "lvm global_filter" settings may help. You could instrument DRBD to log current->{pid,comm} on open and close, so you can better detect who the "someone" is in the message above.
Don't think there is anything else holding the device open, because it is possible to change state to Secondary a few seconds later. But I will try to print those values in case anything interesting comes up.
Adding a small retry loop in the script may help as well.
Yes, that is an option, but it would still leave those nasty "State change failed" messages in the log. I guess there is no way to check the value of DRBD device->open_cnt from userspace?
We don't expose that, no. But even if we did, that would not be racefree :-)
The last (or even: any?) "close" of a block device that used to be open for WRITE triggeres a udev "change" event, thus a udev run, and the minimal action will be some read-only open and ioctl from (systemd-)udev itself, more likely there also will be blkid and possibly pvscan and similar actions. All of them should be read-only openers, and all of them should be "short". But they will race with the drbd demotion.
You can watch that happen with # udevadm monitor & then open and close for write: # : > /dev/sda (or any other block device ...)
To excercise the drbd vs udev race, (you can leave the udevadm monitor running if you want): ## clear dmesg # dmesg -c > /dev/null ## promote # drbdadm primary s0 ; ## open/close for write # : > /dev/drbd0 ; ## the close has triggered a udev run, ## which will now race with demotion, ## which will "sometimes" fail: # drbdadm secondary s0 ; ## wait a bit, and retry # sleep 1; drbdadm secondary s0 ; dmesg
### --- example output on some test box right now: ---------------- @root@ava:~# udevadm monitor & ... monitor will print the received events for: UDEV - the event which udev sends out after rule processing KERNEL - the kernel uevent ... @root@ava:~# dmesg -c > /dev/null; drbdadm primary s0 ; : > /dev/drbd0 ; drbdadm secondary s0 ; sleep 1; drbdadm secondary s0 ; dmesg KERNEL[609638.990320] change /devices/virtual/block/drbd0 (block) KERNEL[609638.991364] change /devices/virtual/block/drbd0 (block) UDEV [609639.008879] change /devices/virtual/block/drbd0 (block) 0: State change failed: (-12) Device is held open by someone Command 'drbdsetup-84 secondary 0' terminated with exit code 11 UDEV [609639.011652] change /devices/virtual/block/drbd0 (block) KERNEL[609640.017356] change /devices/virtual/block/drbd0 (block) KERNEL[609640.018074] change /devices/virtual/block/drbd0 (block) [609613.882751] block drbd0: role( Secondary -> Primary ) [609613.889998] block drbd0: State change failed: Device is held open by someone [609613.894280] block drbd0: state = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated r----- } [609613.897609] block drbd0: wanted = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated r----- } [609614.909537] block drbd0: role( Primary -> Secondary ) [609614.909662] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. UDEV [609640.024588] change /devices/virtual/block/drbd0 (block) UDEV [609640.028731] change /devices/virtual/block/drbd0 (block) @root@ava:~# ### --------------------------------------------------------------
Obviously change s0 and drbd0 appropriately; to exactly hit the time window where udev has it open may be tricky, and on a busy system even the sleep 1 can be not enough (so even the second invocation may fail still), on an idle system udev may already be done, or may not even have started yet when the first "secondary" runs.
Your best bet is to review your udev rules, and make sure "drbd*" will not be looked at by blkid or pvscan or similar.
See also: https://github.com/systemd/systemd/commit/fee854ee8ccde respectively https://github.com/systemd/systemd/issues/9371 to get rid even of the unconditional open by systemd-udevd itself.
Note, I'm not saying blkback has no issue; I don't know. I'm just pointing out that there are other things that may cause the same effects.
Lars
On Fri, Sep 07, 2018 at 03:28:28PM +0200, Lars Ellenberg wrote:
We don't expose that, no. But even if we did, that would not be racefree :-)
The last (or even: any?) "close" of a block device that used to be open for WRITE triggeres a udev "change" event, thus a udev run, and the minimal action will be some read-only open and ioctl from (systemd-)udev itself, more likely there also will be blkid and possibly pvscan and similar actions. All of them should be read-only openers, and all of them should be "short". But they will race with the drbd demotion.
True, but did not find any strange interaction with udev during VM shutdown.
This is what udevadm monitor reports:
KERNEL[1174.220256] remove /devices/vbd-10-51712 (xen-backend) UDEV [1174.222484] remove /devices/vbd-10-51712 (xen-backend) KERNEL[1174.224405] remove /devices/console-10-0 (xen-backend) UDEV [1174.226964] remove /devices/console-10-0 (xen-backend) KERNEL[1174.287215] change /devices/virtual/block/drbd0 (block) KERNEL[1174.287267] change /devices/virtual/block/drbd0 (block) UDEV [1174.295811] change /devices/virtual/block/drbd0 (block) UDEV [1174.301983] change /devices/virtual/block/drbd0 (block)
Strace on the udev daemon gives only these:
[pid 7416] execve("/sbin/drbdadm", ["/sbin/drbdadm", "sh-udev", "minor-0"], [/* 10 vars */]) = 0 [pid 7416] execve("/lib/drbd/drbdadm-84", ["drbdadm", "sh-udev", "minor-0"], [/* 12 vars */]) = 0 [pid 7418] execve("/sbin/drbdadm", ["/sbin/drbdadm", "sh-udev", "minor-0"], [/* 9 vars */]) = 0 [pid 7418] execve("/lib/drbd/drbdadm-84", ["drbdadm", "sh-udev", "minor-0"], [/* 11 vars */]) = 0
and this should be 65-drbd.rules to add/remove symlinks in /dev.
Adding a dump_stack in drbd_release gives two possible code paths, both from xen_blkback and the first one from workqueue being the problematic one:
[ 530.698782] CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u4 [ 530.698783] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 09/12/2016 [ 530.698784] Workqueue: events xen_blkif_deferred_free [xen_blkback] [ 530.698785] 0000000000000000 ffffffff81331e54 ffff883f82143800 ffff883f741b5660 [ 530.698787] ffffffffc03a309e ffff883f741b5580 ffffffff81245ca8 0000000281a186b5 [ 530.698789] 000000008935f100 8412ffa31a64cc4c 000000000000020a ffff883fa6218280 [ 530.698791] Call Trace: [ 530.698792] [<ffffffff81331e54>] ? dump_stack+0x5c/0x78 [ 530.698805] [<ffffffffc03a309e>] ? drbd_release+0x1e/0x40 [drbd] [ 530.698810] [<ffffffff81245ca8>] ? __blkdev_put+0x1e8/0x2a0 [ 530.698813] [<ffffffffc055d6b8>] ? xen_vbd_free.isra.9+0x48/0x60 [xen_blkback] [ 530.698814] [<ffffffffc055d6f7>] ? xen_blkif_deferred_free+0x27/0x70 [xen_blkback] [ 530.698816] [<ffffffff81092fea>] ? process_one_work+0x18a/0x420 [ 530.698817] [<ffffffff810932cd>] ? worker_thread+0x4d/0x490 [ 530.698818] [<ffffffff81093280>] ? process_one_work+0x420/0x420 [ 530.698820] [<ffffffff81099329>] ? kthread+0xd9/0xf0 [ 530.698822] [<ffffffff81099250>] ? kthread_park+0x60/0x60 [ 530.698823] [<ffffffff81615df7>] ? ret_from_fork+0x57/0x70
[ 1216.251924] CPU: 14 PID: 297 Comm: xenwatch Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u4 [ 1216.251925] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 09/12/2016 [ 1216.251926] 0000000000000000 ffffffff81331e54 ffff883f82143800 ffff883f741b5660 [ 1216.251928] ffffffffc03a309e ffff883f741b5580 ffffffff81245ca8 0000000281a186b5 [ 1216.251930] ffffffffc0562360 97e48f39d448f082 000000000000020a 0000000000000006 [ 1216.251933] Call Trace: [ 1216.251935] [<ffffffff81331e54>] ? dump_stack+0x5c/0x78 [ 1216.251947] [<ffffffffc03a309e>] ? drbd_release+0x1e/0x40 [drbd] [ 1216.251951] [<ffffffff81245ca8>] ? __blkdev_put+0x1e8/0x2a0 [ 1216.251954] [<ffffffff81414910>] ? register_xenbus_watch+0xe0/0xe0 [ 1216.251956] [<ffffffffc055d6b8>] ? xen_vbd_free.isra.9+0x48/0x60 [xen_blkback] [ 1216.251959] [<ffffffffc055e9fe>] ? frontend_changed+0x9e/0x660 [xen_blkback] [ 1216.251961] [<ffffffff814120d9>] ? xenbus_read_driver_state+0x39/0x60 [ 1216.251962] [<ffffffff8141542c>] ? xenbus_otherend_changed+0x8c/0x120 [ 1216.251964] [<ffffffff81414910>] ? register_xenbus_watch+0xe0/0xe0 [ 1216.251967] [<ffffffff81414995>] ? xenwatch_thread+0x85/0x120 [ 1216.251969] [<ffffffff810bbfb0>] ? prepare_to_wait_event+0xf0/0xf0 [ 1216.251971] [<ffffffff81099329>] ? kthread+0xd9/0xf0 [ 1216.251973] [<ffffffff81099250>] ? kthread_park+0x60/0x60 [ 1216.251975] [<ffffffff81615df7>] ? ret_from_fork+0x57/0x70
On Fri, Sep 07, 2018 at 06:45:00PM +0200, Valentin Vidic wrote:
Adding a dump_stack in drbd_release gives two possible code paths, both from xen_blkback and the first one from workqueue being the problematic one:
In fact the first one is the original code path before I modified blkback. The problem is it gets executed async from workqueue so it might not always run before the call to drbdadm secondary.
[ 530.698782] CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u4 [ 530.698783] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 09/12/2016 [ 530.698784] Workqueue: events xen_blkif_deferred_free [xen_blkback] [ 530.698785] 0000000000000000 ffffffff81331e54 ffff883f82143800 ffff883f741b5660 [ 530.698787] ffffffffc03a309e ffff883f741b5580 ffffffff81245ca8 0000000281a186b5 [ 530.698789] 000000008935f100 8412ffa31a64cc4c 000000000000020a ffff883fa6218280 [ 530.698791] Call Trace: [ 530.698792] [<ffffffff81331e54>] ? dump_stack+0x5c/0x78 [ 530.698805] [<ffffffffc03a309e>] ? drbd_release+0x1e/0x40 [drbd] [ 530.698810] [<ffffffff81245ca8>] ? __blkdev_put+0x1e8/0x2a0 [ 530.698813] [<ffffffffc055d6b8>] ? xen_vbd_free.isra.9+0x48/0x60 [xen_blkback] [ 530.698814] [<ffffffffc055d6f7>] ? xen_blkif_deferred_free+0x27/0x70 [xen_blkback] [ 530.698816] [<ffffffff81092fea>] ? process_one_work+0x18a/0x420 [ 530.698817] [<ffffffff810932cd>] ? worker_thread+0x4d/0x490 [ 530.698818] [<ffffffff81093280>] ? process_one_work+0x420/0x420 [ 530.698820] [<ffffffff81099329>] ? kthread+0xd9/0xf0 [ 530.698822] [<ffffffff81099250>] ? kthread_park+0x60/0x60 [ 530.698823] [<ffffffff81615df7>] ? ret_from_fork+0x57/0x70
On Fri, Sep 07, 2018 at 07:14:59PM +0200, Valentin Vidic wrote:
In fact the first one is the original code path before I modified blkback. The problem is it gets executed async from workqueue so it might not always run before the call to drbdadm secondary.
As the DRBD device gets released only when the last IO request has finished, I found a way to check and wait for this in the block-drbd script:
--- block-drbd.orig 2018-09-08 09:07:23.499648515 +0200 +++ block-drbd 2018-09-08 09:28:12.892193649 +0200 @@ -230,6 +230,24 @@ and so cannot be mounted ${m2}${when}." }
+wait_for_inflight() +{ + local dev="$1" + local inflight="/sys/block/${dev#/dev/}/inflight" + local rd wr + + if ! [ -f "$inflight" ]; then + return + fi + + while true; do + read rd wr < $inflight + if [ "$rd" = "0" -a "$wr" = "0" ]; then + return + fi + sleep 1 + done +}
t=$(xenstore_read_default "$XENBUS_PATH/type" 'MISSING')
@@ -285,6 +303,8 @@ drbd_lrole="${drbd_role%%/*}" drbd_dev="$(drbdadm sh-dev $drbd_resource)"
+ wait_for_inflight $drbd_dev + if [ "$drbd_lrole" != 'Secondary' ]; then drbdadm secondary $drbd_resource fi
On Sat, Sep 08, 2018 at 09:34:32AM +0200, Valentin Vidic wrote:
On Fri, Sep 07, 2018 at 07:14:59PM +0200, Valentin Vidic wrote:
In fact the first one is the original code path before I modified blkback. The problem is it gets executed async from workqueue so it might not always run before the call to drbdadm secondary.
As the DRBD device gets released only when the last IO request has finished, I found a way to check and wait for this in the block-drbd script:
--- block-drbd.orig 2018-09-08 09:07:23.499648515 +0200 +++ block-drbd 2018-09-08 09:28:12.892193649 +0200 @@ -230,6 +230,24 @@ and so cannot be mounted ${m2}${when}." } +wait_for_inflight() +{
- local dev="$1"
- local inflight="/sys/block/${dev#/dev/}/inflight"
- local rd wr
- if ! [ -f "$inflight" ]; then
- return
- fi
- while true; do
- read rd wr < $inflight
- if [ "$rd" = "0" -a "$wr" = "0" ]; then
If it is "idle" now, but still "open", this will not sleep, and still fail the demotion below.
return
- fi
- sleep 1
- done
+} t=$(xenstore_read_default "$XENBUS_PATH/type" 'MISSING') @@ -285,6 +303,8 @@ drbd_lrole="${drbd_role%%/*}" drbd_dev="$(drbdadm sh-dev $drbd_resource)"
wait_for_inflight $drbd_dev
if [ "$drbd_lrole" != 'Secondary' ]; then drbdadm secondary $drbd_resource
You try to help it by "waiting forever until it appears to be idle". I suggest to at least limit the retries by iteration or time. And also (or, instead; but you'd potentially get a number of "scary messages" in the logs) add something like: for i in 1 2 3 5 7 x; do drbdadm secondary $drbd_resource && exit 0 if [ $i = x ]; then # ... "appears to still be in use, maybe by" ... fuser -v $drbd_dev exit 1 # else ... "will retry in $i seconds" ... fi sleep $i done ...
Or, well, yes, fix blkback to not "defer" the final close "too long", if at all possible.
Lars
On Mon, Sep 10, 2018 at 02:45:31PM +0200, Lars Ellenberg wrote:
On Sat, Sep 08, 2018 at 09:34:32AM +0200, Valentin Vidic wrote:
On Fri, Sep 07, 2018 at 07:14:59PM +0200, Valentin Vidic wrote:
In fact the first one is the original code path before I modified blkback. The problem is it gets executed async from workqueue so it might not always run before the call to drbdadm secondary.
As the DRBD device gets released only when the last IO request has finished, I found a way to check and wait for this in the block-drbd script:
--- block-drbd.orig 2018-09-08 09:07:23.499648515 +0200 +++ block-drbd 2018-09-08 09:28:12.892193649 +0200 @@ -230,6 +230,24 @@ and so cannot be mounted ${m2}${when}." } +wait_for_inflight() +{
- local dev="$1"
- local inflight="/sys/block/${dev#/dev/}/inflight"
- local rd wr
- if ! [ -f "$inflight" ]; then
- return
- fi
- while true; do
- read rd wr < $inflight
- if [ "$rd" = "0" -a "$wr" = "0" ]; then
If it is "idle" now, but still "open", this will not sleep, and still fail the demotion below.
True, but in this case blkback is holding it open until all the writes have finished and the last write closes the device. Since fuser can't check blkback this is an approximation that seems to work because I don't get any failed drbdadm calls now.
You try to help it by "waiting forever until it appears to be idle". I suggest to at least limit the retries by iteration or time. And also (or, instead; but you'd potentially get a number of "scary messages" in the logs) add something like:
Ok, should I open a PR to discuss this change further?
Or, well, yes, fix blkback to not "defer" the final close "too long", if at all possible.
blkback needs to finish the writes on shutdown or I get a fsck errors on next boot. Ideally XenbusStateClosed should be delayed until the device release but currently it does not seem possible without breaking other things.
On Mon, Sep 10, 2018 at 03:22:52PM +0200, Valentin Vidic wrote:
On Mon, Sep 10, 2018 at 02:45:31PM +0200, Lars Ellenberg wrote:
On Sat, Sep 08, 2018 at 09:34:32AM +0200, Valentin Vidic wrote:
On Fri, Sep 07, 2018 at 07:14:59PM +0200, Valentin Vidic wrote:
In fact the first one is the original code path before I modified blkback. The problem is it gets executed async from workqueue so it might not always run before the call to drbdadm secondary.
As the DRBD device gets released only when the last IO request has finished, I found a way to check and wait for this in the block-drbd script:
--- block-drbd.orig 2018-09-08 09:07:23.499648515 +0200 +++ block-drbd 2018-09-08 09:28:12.892193649 +0200 @@ -230,6 +230,24 @@ and so cannot be mounted ${m2}${when}." } +wait_for_inflight() +{
- local dev="$1"
- local inflight="/sys/block/${dev#/dev/}/inflight"
- local rd wr
- if ! [ -f "$inflight" ]; then
- return
- fi
- while true; do
- read rd wr < $inflight
- if [ "$rd" = "0" -a "$wr" = "0" ]; then
If it is "idle" now, but still "open", this will not sleep, and still fail the demotion below.
True, but in this case blkback is holding it open until all the writes have finished and the last write closes the device. Since fuser can't check blkback this is an approximation that seems to work because I don't get any failed drbdadm calls now.
You try to help it by "waiting forever until it appears to be idle". I suggest to at least limit the retries by iteration or time. And also (or, instead; but you'd potentially get a number of "scary messages" in the logs) add something like:
Ok, should I open a PR to discuss this change further?
Or, well, yes, fix blkback to not "defer" the final close "too long", if at all possible.
blkback needs to finish the writes on shutdown or I get a fsck errors on next boot. Ideally XenbusStateClosed should be delayed until the device release but currently it does not seem possible without breaking other things.
I can try to take a look at this and attempt to make sure the state is only changed to closed in blkback _after_ the device has been released, but it might take me a couple of days to get you a patch.
I'm afraid that other hotplug scripts will also have issues with such behavior, and we shouldn't force all users of hotplug scripts to add such workarounds.
Roger.
On Mon, Sep 10, 2018 at 05:00:58PM +0200, Roger Pau Monné wrote:
I can try to take a look at this and attempt to make sure the state is only changed to closed in blkback _after_ the device has been released, but it might take me a couple of days to get you a patch.
Thanks, I have two test nodes now where I can try different approaches.
I'm afraid that other hotplug scripts will also have issues with such behavior, and we shouldn't force all users of hotplug scripts to add such workarounds.
True, iSCSI and other "network" disks might have similar problems.
On Mon, Sep 10, 2018 at 06:18:31PM +0200, Valentin Vidic wrote:
On Mon, Sep 10, 2018 at 05:00:58PM +0200, Roger Pau Monné wrote:
I can try to take a look at this and attempt to make sure the state is only changed to closed in blkback _after_ the device has been released, but it might take me a couple of days to get you a patch.
Thanks, I have two test nodes now where I can try different approaches.
I'm afraid that other hotplug scripts will also have issues with such behavior, and we shouldn't force all users of hotplug scripts to add such workarounds.
True, iSCSI and other "network" disks might have similar problems.
So I have the following patch which I think might solve your issues while keeping the reset logic working. Would you mind giving it a try with your use case and pvgrub?
Thanks, Roger. ---8<---
From c6a76a6265f29440c886b422c9d1b8549ffd6ed8 Mon Sep 17 00:00:00 2001
From: Roger Pau Monne roger.pau@citrix.com Date: Thu, 13 Sep 2018 11:27:57 +0200 Subject: [PATCH] xen-blkback: only set Closed state when resources have been released MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Don't set the backend to Closed state until all the resources have been released and the blkback instance is gone. Setting the backend state to Closed will be done by xenbus_dev_remove after removing the device instance.
Note that this requires disconnecting and allowing reconnection with state Closing, since the Closed state will only be reached when detaching the device.
Reported-by: Valentin Vidic Valentin.Vidic@carnet.hr Signed-off-by: Roger Pau Monné roger.pau@citrix.com --- drivers/block/xen-blkback/xenbus.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c index a4bc74e72c39..b0629f1b0511 100644 --- a/drivers/block/xen-blkback/xenbus.c +++ b/drivers/block/xen-blkback/xenbus.c @@ -770,7 +770,8 @@ static void frontend_changed(struct xenbus_device *dev,
switch (frontend_state) { case XenbusStateInitialising: - if (dev->state == XenbusStateClosed) { + if (dev->state == XenbusStateClosed || + dev->state == XenbusStateClosing) { pr_info("%s: prepare for reconnect\n", dev->nodename); xenbus_switch_state(dev, XenbusStateInitWait); } @@ -809,12 +810,12 @@ static void frontend_changed(struct xenbus_device *dev, break;
case XenbusStateClosing: + xen_blkif_disconnect(be->blkif); xenbus_switch_state(dev, XenbusStateClosing); break;
case XenbusStateClosed: xen_blkif_disconnect(be->blkif); - xenbus_switch_state(dev, XenbusStateClosed); if (xenbus_dev_is_online(dev)) break; /* fall through */
On Thu, Sep 13, 2018 at 05:08:00PM +0200, Roger Pau Monné wrote:
So I have the following patch which I think might solve your issues while keeping the reset logic working. Would you mind giving it a try with your use case and pvgrub?
Thanks for the patch. It seems to be having some problems with pvgrub: machines don't boot past pvgrub and they are using 100% CPU. Also in dom0 xenstored is usind 100% CPU - strace reports it is looping in this:
read(26, "\\240\0\0", 4) = 4 read(26, "\0\0\0\0L\0\0\0)\0\0\0\35\0\0\0+\\232`\231\31\1&", 24) = 24 read(26, "/local/domain/0/backend/vbd/1/51712/state", 41) = 41 read(26, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0004", 29) = 29 read(26, "\0\330\0\0", 4) = 4 read(26, "\0\0\0\0L\0\0\0)\0\0\0\35\0\0\0+\JG\231\31\1&", 24) = 24 read(26, "/local/domain/0/backend/vbd/2/51712/state", 41) = 41 read(26, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\1\0\0\0004", 29) = 29 read(26, "\0\200\1\0", 4) = 4 read(26, "\0\0\0\0L\0\0\0)\0\0\0\35\0\0\0+\Z\373\231\31\1&", 24) = 24 read(26, "/local/domain/0/backend/vbd/5/51712/state", 41) = 41 read(26, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\1\0\0\0004", 29) = 29 read(12, "\321\1\0\0", 4) = 4 write(12, "\321\1\0\0", 4) = 4
On Fri, Sep 14, 2018 at 01:49:51PM +0200, Valentin Vidic wrote:
On Thu, Sep 13, 2018 at 05:08:00PM +0200, Roger Pau Monné wrote:
So I have the following patch which I think might solve your issues while keeping the reset logic working. Would you mind giving it a try with your use case and pvgrub?
Thanks for the patch. It seems to be having some problems with pvgrub: machines don't boot past pvgrub and they are using 100% CPU. Also in dom0 xenstored is usind 100% CPU - strace reports it is looping in this:
read(26, "\\240\0\0", 4) = 4 read(26, "\0\0\0\0L\0\0\0)\0\0\0\35\0\0\0+\\232`\231\31\1&", 24) = 24 read(26, "/local/domain/0/backend/vbd/1/51712/state", 41) = 41 read(26, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0004", 29) = 29 read(26, "\0\330\0\0", 4) = 4 read(26, "\0\0\0\0L\0\0\0)\0\0\0\35\0\0\0+\JG\231\31\1&", 24) = 24 read(26, "/local/domain/0/backend/vbd/2/51712/state", 41) = 41 read(26, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\1\0\0\0004", 29) = 29 read(26, "\0\200\1\0", 4) = 4 read(26, "\0\0\0\0L\0\0\0)\0\0\0\35\0\0\0+\Z\373\231\31\1&", 24) = 24 read(26, "/local/domain/0/backend/vbd/5/51712/state", 41) = 41 read(26, "\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\1\0\0\0004", 29) = 29 read(12, "\321\1\0\0", 4) = 4 write(12, "\321\1\0\0", 4) = 4
I see the issue, blkfront in pvgrub is waiting for the backend to switch to Closed state before booting the OS kernel, so my fix is not going to work. I guess I will have to find a proper way to switch to Closed state.
Thanks, Roger.
linux-stable-mirror@lists.linaro.org