Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Check if the XHCI_ETRON_HOST quirk flag is set before invoking the workaround in process_isoc_td().
Fixes: 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") Cc: stable@vger.kernel.org Signed-off-by: Kuangyi Chiang ki.chiang65@gmail.com --- drivers/usb/host/xhci-ring.c | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..936fd9151ba8 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2452,8 +2452,10 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, switch (trb_comp_code) { case COMP_SUCCESS: /* Don't overwrite status if TD had an error, see xHCI 4.9.1 */ - if (td->error_mid_td) + if (td->error_mid_td) { + td->error_mid_td = false; break; + } if (remaining) { frame->status = short_framestatus; sum_trbs_for_length = true; @@ -2468,25 +2470,36 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, case COMP_BANDWIDTH_OVERRUN_ERROR: frame->status = -ECOMM; break; + case COMP_USB_TRANSACTION_ERROR: case COMP_BABBLE_DETECTED_ERROR: sum_trbs_for_length = true; fallthrough; case COMP_ISOCH_BUFFER_OVERRUN: frame->status = -EOVERFLOW; + if (trb_comp_code == COMP_USB_TRANSACTION_ERROR) + frame->status = -EPROTO; if (ep_trb != td->end_trb) td->error_mid_td = true; + else + td->error_mid_td = false; + + /* + * If an error is detected on the last TRB of the TD, + * wait for the final event. + */ + if ((xhci->quirks & XHCI_ETRON_HOST) && + td->urb->dev->speed >= USB_SPEED_SUPER && + ep_trb == td->end_trb) + td->error_mid_td = true; break; case COMP_INCOMPATIBLE_DEVICE_ERROR: case COMP_STALL_ERROR: frame->status = -EPROTO; break; - case COMP_USB_TRANSACTION_ERROR: - frame->status = -EPROTO; - sum_trbs_for_length = true; - if (ep_trb != td->end_trb) - td->error_mid_td = true; - break; case COMP_STOPPED: + /* Think of it as the final event if TD had an error */ + if (td->error_mid_td) + td->error_mid_td = false; sum_trbs_for_length = true; break; case COMP_STOPPED_SHORT_PACKET: @@ -2519,7 +2532,7 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
finish_td: /* Don't give back TD yet if we encountered an error mid TD */ - if (td->error_mid_td && ep_trb != td->end_trb) { + if (td->error_mid_td) { xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n"); td->urb_length_set = true; return;
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Thanks Mathias
On 5.2.2025 16.17, Mathias Nyman wrote:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Would something like this work: (untested, compiles)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..81d45ddebace 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2644,6 +2644,22 @@ static int handle_transferless_tx_event(struct xhci_hcd *xhci, struct xhci_virt_ return 0; }
+static bool xhci_spurious_success_tx_event(struct xhci_hcd *xhci, + struct xhci_ring *ring) +{ + switch(ring->last_td_comp_code) { + case COMP_SHORT_PACKET: + return xhci->quirks & XHCI_SPURIOUS_SUCCESS; + case COMP_USB_TRANSACTION_ERROR: + case COMP_BABBLE_DETECTED_ERROR: + case COMP_ISOCH_BUFFER_OVERRUN: + return xhci->quirks & XHCI_ETRON_HOST && + ring->type == TYPE_ISOC; + default: + return false; + } +} + /* * If this function returns an error condition, it means it got a Transfer * event with a corrupted Slot ID, Endpoint ID, or TRB DMA address. @@ -2697,8 +2713,8 @@ static int handle_tx_event(struct xhci_hcd *xhci, case COMP_SUCCESS: if (EVENT_TRB_LEN(le32_to_cpu(event->transfer_len)) != 0) { trb_comp_code = COMP_SHORT_PACKET; - xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td short %d\n", - slot_id, ep_index, ep_ring->last_td_was_short); + xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td comp code %d\n", + slot_id, ep_index, ep_ring->last_td_comp_code); } break; case COMP_SHORT_PACKET: @@ -2846,7 +2862,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ if (trb_comp_code != COMP_STOPPED && trb_comp_code != COMP_STOPPED_LENGTH_INVALID && - !ep_ring->last_td_was_short) { + !xhci_spurious_success_tx_event(xhci, ep_ring)) { xhci_warn(xhci, "Event TRB for slot %u ep %u with no TDs queued\n", slot_id, ep_index); } @@ -2890,11 +2906,10 @@ static int handle_tx_event(struct xhci_hcd *xhci,
/* * Some hosts give a spurious success event after a short - * transfer. Ignore it. + * transfer or error on last TRB. Ignore it. */ - if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) && - ep_ring->last_td_was_short) { - ep_ring->last_td_was_short = false; + if (xhci_spurious_success_tx_event(xhci, ep_ring)) { + ep_ring->last_td_comp_code = trb_comp_code; return 0; }
@@ -2922,10 +2937,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ } while (ep->skip);
- if (trb_comp_code == COMP_SHORT_PACKET) - ep_ring->last_td_was_short = true; - else - ep_ring->last_td_was_short = false; + ep_ring->last_td_comp_code = trb_comp_code;
ep_trb = &ep_seg->trbs[(ep_trb_dma - ep_seg->dma) / sizeof(*ep_trb)]; trace_xhci_handle_transfer(ep_ring, (struct xhci_generic_trb *) ep_trb, ep_trb_dma); diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 779b01dee068..acc8d3c7a199 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1371,7 +1371,7 @@ struct xhci_ring { unsigned int num_trbs_free; /* used only by xhci DbC */ unsigned int bounce_buf_len; enum xhci_ring_type type; - bool last_td_was_short; + u32 last_td_comp_code; struct radix_tree_root *trb_address_map; };
Hi,
Mathias Nyman mathias.nyman@linux.intel.com 於 2025年2月5日 週三 下午11:16寫道:
On 5.2.2025 16.17, Mathias Nyman wrote:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Would something like this work: (untested, compiles)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..81d45ddebace 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2644,6 +2644,22 @@ static int handle_transferless_tx_event(struct xhci_hcd *xhci, struct xhci_virt_ return 0; }
+static bool xhci_spurious_success_tx_event(struct xhci_hcd *xhci,
struct xhci_ring *ring)
+{
switch(ring->last_td_comp_code) {
case COMP_SHORT_PACKET:
return xhci->quirks & XHCI_SPURIOUS_SUCCESS;
case COMP_USB_TRANSACTION_ERROR:
case COMP_BABBLE_DETECTED_ERROR:
case COMP_ISOCH_BUFFER_OVERRUN:
return xhci->quirks & XHCI_ETRON_HOST &&
ring->type == TYPE_ISOC;
default:
return false;
}
+}
- /*
- If this function returns an error condition, it means it got a Transfer
- event with a corrupted Slot ID, Endpoint ID, or TRB DMA address.
@@ -2697,8 +2713,8 @@ static int handle_tx_event(struct xhci_hcd *xhci, case COMP_SUCCESS: if (EVENT_TRB_LEN(le32_to_cpu(event->transfer_len)) != 0) { trb_comp_code = COMP_SHORT_PACKET;
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td short %d\n",
slot_id, ep_index, ep_ring->last_td_was_short);
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td comp code %d\n",
slot_id, ep_index, ep_ring->last_td_comp_code); } break; case COMP_SHORT_PACKET:
@@ -2846,7 +2862,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ if (trb_comp_code != COMP_STOPPED && trb_comp_code != COMP_STOPPED_LENGTH_INVALID &&
!ep_ring->last_td_was_short) {
!xhci_spurious_success_tx_event(xhci, ep_ring)) { xhci_warn(xhci, "Event TRB for slot %u ep %u with no TDs queued\n", slot_id, ep_index); }
@@ -2890,11 +2906,10 @@ static int handle_tx_event(struct xhci_hcd *xhci,
/* * Some hosts give a spurious success event after a short
* transfer. Ignore it.
* transfer or error on last TRB. Ignore it. */
if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
ep_ring->last_td_was_short) {
ep_ring->last_td_was_short = false;
if (xhci_spurious_success_tx_event(xhci, ep_ring)) {
ep_ring->last_td_comp_code = trb_comp_code; return 0; }
@@ -2922,10 +2937,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ } while (ep->skip);
if (trb_comp_code == COMP_SHORT_PACKET)
ep_ring->last_td_was_short = true;
else
ep_ring->last_td_was_short = false;
ep_ring->last_td_comp_code = trb_comp_code; ep_trb = &ep_seg->trbs[(ep_trb_dma - ep_seg->dma) / sizeof(*ep_trb)]; trace_xhci_handle_transfer(ep_ring, (struct xhci_generic_trb *) ep_trb, ep_trb_dma);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 779b01dee068..acc8d3c7a199 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1371,7 +1371,7 @@ struct xhci_ring { unsigned int num_trbs_free; /* used only by xhci DbC */ unsigned int bounce_buf_len; enum xhci_ring_type type;
bool last_td_was_short;
};u32 last_td_comp_code; struct radix_tree_root *trb_address_map;
It looks better than my patch v2, I will test it later.
Thanks, Kuangyi Chiang
Kuangyi Chiang ki.chiang65@gmail.com 於 2025年2月7日 週五 上午9:38寫道:
Hi,
Mathias Nyman mathias.nyman@linux.intel.com 於 2025年2月5日 週三 下午11:16寫道:
On 5.2.2025 16.17, Mathias Nyman wrote:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Are there any other issues besides the error message seen?
Would something like this work: (untested, compiles)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..81d45ddebace 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2644,6 +2644,22 @@ static int handle_transferless_tx_event(struct xhci_hcd *xhci, struct xhci_virt_ return 0; }
+static bool xhci_spurious_success_tx_event(struct xhci_hcd *xhci,
struct xhci_ring *ring)
+{
switch(ring->last_td_comp_code) {
case COMP_SHORT_PACKET:
return xhci->quirks & XHCI_SPURIOUS_SUCCESS;
case COMP_USB_TRANSACTION_ERROR:
case COMP_BABBLE_DETECTED_ERROR:
case COMP_ISOCH_BUFFER_OVERRUN:
return xhci->quirks & XHCI_ETRON_HOST &&
ring->type == TYPE_ISOC;
default:
return false;
}
+}
- /*
- If this function returns an error condition, it means it got a Transfer
- event with a corrupted Slot ID, Endpoint ID, or TRB DMA address.
@@ -2697,8 +2713,8 @@ static int handle_tx_event(struct xhci_hcd *xhci, case COMP_SUCCESS: if (EVENT_TRB_LEN(le32_to_cpu(event->transfer_len)) != 0) { trb_comp_code = COMP_SHORT_PACKET;
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td short %d\n",
slot_id, ep_index, ep_ring->last_td_was_short);
xhci_dbg(xhci, "Successful completion on short TX for slot %u ep %u with last td comp code %d\n",
slot_id, ep_index, ep_ring->last_td_comp_code); } break; case COMP_SHORT_PACKET:
@@ -2846,7 +2862,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ if (trb_comp_code != COMP_STOPPED && trb_comp_code != COMP_STOPPED_LENGTH_INVALID &&
!ep_ring->last_td_was_short) {
!xhci_spurious_success_tx_event(xhci, ep_ring)) { xhci_warn(xhci, "Event TRB for slot %u ep %u with no TDs queued\n", slot_id, ep_index); }
@@ -2890,11 +2906,10 @@ static int handle_tx_event(struct xhci_hcd *xhci,
/* * Some hosts give a spurious success event after a short
* transfer. Ignore it.
* transfer or error on last TRB. Ignore it. */
if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
ep_ring->last_td_was_short) {
ep_ring->last_td_was_short = false;
if (xhci_spurious_success_tx_event(xhci, ep_ring)) {
ep_ring->last_td_comp_code = trb_comp_code; return 0; }
@@ -2922,10 +2937,7 @@ static int handle_tx_event(struct xhci_hcd *xhci, */ } while (ep->skip);
if (trb_comp_code == COMP_SHORT_PACKET)
ep_ring->last_td_was_short = true;
else
ep_ring->last_td_was_short = false;
ep_ring->last_td_comp_code = trb_comp_code; ep_trb = &ep_seg->trbs[(ep_trb_dma - ep_seg->dma) / sizeof(*ep_trb)]; trace_xhci_handle_transfer(ep_ring, (struct xhci_generic_trb *) ep_trb, ep_trb_dma);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 779b01dee068..acc8d3c7a199 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1371,7 +1371,7 @@ struct xhci_ring { unsigned int num_trbs_free; /* used only by xhci DbC */ unsigned int bounce_buf_len; enum xhci_ring_type type;
bool last_td_was_short;
};u32 last_td_comp_code; struct radix_tree_root *trb_address_map;
It looks better than my patch v2, I will test it later.
Yes, it works. I applied and tested it under Linux-6.13.1. This issue has been resolved. I have tested Etron EJ168 and EJ188, both have the same problem.
What should I do next?
Thanks, Kuangyi Chiang
Thanks, Kuangyi Chiang
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
error_mid_td can cope with hosts which don't produce the extra success event, it was done this way to deal with buggy NECs. The cost is one more ESIT of latency on TDs with error.
I don't know how many Etron chips Kuangyi Chiang tested, but I have one here and it definitely has this double event bug.
These are old chips, not sure if Etron is still making them or if anyone would want to buy (large chip, USB 3.0 only, barely functional streams, no UAS on Linux and on Windows with stock MS drivers).
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
That's a little dodgy because it frees the TD before the HC is completely done with it. *Probably* no problem with data buffers (no sensible reason to DMA into them after an earlier error), but we could overwrite the transfer ring in rare cases and IDK if it would or wouldn't cause problems in this particular case.
Same applies to the "short packet" case existing today. I thought about fixing it, but IIRC I ran into some differences between HCs or out of spec behavior and it got tricky.
Maybe it would make sense to separate giveback (and freeing of the data buffer by class drivers) from transfer ring inc_deq(). Do the former when we reasonably believe the HC won't touch the buffers anymore, do the latter when we are sure that it's in the next TD.
Not ideal, but easier and better than the status quo.
Regards, Michal
On 6.2.2025 0.42, Michał Pecio wrote:
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
error_mid_td can cope with hosts which don't produce the extra success event, it was done this way to deal with buggy NECs. The cost is one more ESIT of latency on TDs with error.
It makes giving back the TD depend on a future event we can't guarantee.
I still think it better fits the spurious success case. It's not an error mid TD, it's a spurious success event sent by host after a completion (error) event for the last TRB in the TD.
Making this change to error_mid_td code also makes that code more confusing and harder to follow.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
That's a little dodgy because it frees the TD before the HC is completely done with it. *Probably* no problem with data buffers (no sensible reason to DMA into them after an earlier error), but we could overwrite the transfer ring in rare cases and IDK if it would or wouldn't cause problems in this particular case.
We did get an event for the last TRB in the TD, so in normal cases this TD should be considered complete, and given back.
I don't think the controller has any reason to touch data buffers at this stage either. Can't recall any iommu/dma issues related to this.
Same applies to the "short packet" case existing today. I thought about fixing it, but IIRC I ran into some differences between HCs or out of spec behavior and it got tricky.
For the short transfer case this is more valid concern. Here we give back the TD after an event mid TD, and we know hardware might still walk the rest of the TD. It shouldn't touch data buffers either as short transfer indicates all data has been written.
Maybe it would make sense to separate giveback (and freeing of the data buffer by class drivers) from transfer ring inc_deq(). Do the former when we reasonably believe the HC won't touch the buffers anymore, do the latter when we are sure that it's in the next TD.
This sounds reasonable, makes sense to keep the software dequeue pointer where hardware last reported its position. Currently we advance it to where we assume hardware will be next.
But this is a separate project. Might need some work around in the driver.
Thanks Mathias
On Fri, 7 Feb 2025 14:06:54 +0200, Mathias Nyman wrote:
On 6.2.2025 0.42, Michał Pecio wrote:
error_mid_td can cope with hosts which don't produce the extra success event, it was done this way to deal with buggy NECs. The cost is one more ESIT of latency on TDs with error.
It makes giving back the TD depend on a future event we can't guarantee.
For the record, this is not the same disaster as failing to give back an unlinked URB. Situation here is no different from usual 'error mid TD' case on buggy HCs (known so far: NEC uPD720200 and most if not all of ASMedia, including at least 1st gen AMD Promontory).
We are owed an event in the next ESIT, worst case it will be something weird like Missed Service or Ring Underrun. I've sent you some patches for that, they also apply to the existing NEC/ASMedia problem.
I still think it better fits the spurious success case. It's not an error mid TD, it's a spurious success event sent by host after a completion (error) event for the last TRB in the TD.
Legally you are right of course, but materially we know what happens: the damn thing still holds an internal reference to the last TRB for some unknown reason. We would need to know that it doesn't actually use it for anything and will not mind the TRB being overwritten.
This may well be true, but I guess I prefer known evils over unknown ones so I immediately suggested using 'erorr mid TD' instead.
Making this change to error_mid_td code also makes that code more confusing and harder to follow.
I will see if I can come up with something clean.
I will also try the patch you sent, it looks like it would work.
One thing I don't like is that it fails to distinguish the known spurious events from other invalid events due to hardware or kernel bugs. In the past last_td_was_short caused me similar problems.
Regards, Michal
xHCI 4.9.1 requires HCs to obey the IOC flag we set on the last TRB even after an error has been reported on an earlier TRB. This typically means that an error mid TD is followed by a success event for the last TRB.
On SuperSpeed (and only SS) isochronous endpoints Etron hosts were found to emit a success event also after an error on the last TRB of a TD.
Reuse the machinery for handling errors mid TD to handle these otherwise unexpected events. Avoid printing "TRB not part of current TD" errors, ensure proper tracking of HC's internal dequeue pointer and distinguish this known quirk from other bogus events caused by ordinary bugs.
This patch was found to eliminate all related warnings and errors while running for 30 minutes with a UVC camera on a flaky cable which produces transaction errors about every second. An altsetting was chosen which causes some TDs to be multi-TRB, dynamic debug was used to confirm that errors both mid TD and on the last TRB are handled as expected:
[ 6028.439776] xhci_hcd 0000:06:00.0: Transfer error for slot 1 ep 2 on endpoint [ 6028.439784] xhci_hcd 0000:06:00.0: Error 4 mid isoc TD, wait for final completion event, is_last_trb=1 [ 6028.440268] xhci_hcd 0000:06:00.0: Successful completion on short TX for slot 1 ep 2 with last td short 0 [ 6028.440270] xhci_hcd 0000:06:00.0: Got event 1 after mid TD error [ 6029.123683] xhci_hcd 0000:06:00.0: Transfer error for slot 1 ep 2 on endpoint [ 6029.123694] xhci_hcd 0000:06:00.0: Error 4 mid isoc TD, wait for final completion event, is_last_trb=0 [ 6029.123697] xhci_hcd 0000:06:00.0: Successful completion on short TX for slot 1 ep 2 with last td short 0 [ 6029.123700] xhci_hcd 0000:06:00.0: Got event 1 after mid TD error
Handling of Stopped events is unaffected: finish_td() is called but it does nothing and the TD waits until it's unlinked:
[ 7081.705544] xhci_hcd 0000:06:00.0: Transfer error for slot 1 ep 2 on endpoint [ 7081.705546] xhci_hcd 0000:06:00.0: Error 4 mid isoc TD, wait for final completion event, is_last_trb=1 [ 7081.705630] xhci_hcd 0000:06:00.0: Stopped on Transfer TRB for slot 1 ep 2 [ 7081.705633] xhci_hcd 0000:06:00.0: Got event 26 after mid TD error [ 7081.705678] xhci_hcd 0000:06:00.0: Stopped on Transfer TRB for slot 1 ep 2 [ 7081.705680] xhci_hcd 0000:06:00.0: Got event 26 after mid TD error [ 7081.705759] xhci_hcd 0000:06:00.0: Stopped on No-op or Link TRB for slot 1 ep 2 [ 7081.705799] xhci_hcd 0000:06:00.0: Stopped on No-op or Link TRB for slot 1 ep 2
Reported-by: Kuangyi Chiang ki.chiang65@gmail.com Closes: https://lore.kernel.org/linux-usb/20250205053750.28251-1-ki.chiang65@gmail.c... Signed-off-by: Michal Pecio michal.pecio@gmail.com ---
Hi Mathias,
This is the best I was able to do. It does add a few lines, but I don't think it's too scary and IMO the switch looks even better this way. It accurately predicts those events while not breaking anything else that I can see or think of, save for the risk of firmware bugfix adding one ESIT of latency on errors.
I tried to also test your Etron patch but it has whitespace damage all over the place and would be hard to apply.
Regards, Michal
drivers/usb/host/xhci-ring.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 965bffce301e..7ff5075e5890 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2437,6 +2437,7 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, bool sum_trbs_for_length = false; u32 remaining, requested, ep_trb_len; int short_framestatus; + bool error_event = false, etron_quirk = false;
trb_comp_code = GET_COMP_CODE(le32_to_cpu(event->transfer_len)); urb_priv = td->urb->hcpriv; @@ -2473,8 +2474,7 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, fallthrough; case COMP_ISOCH_BUFFER_OVERRUN: frame->status = -EOVERFLOW; - if (ep_trb != td->end_trb) - td->error_mid_td = true; + error_event = true; break; case COMP_INCOMPATIBLE_DEVICE_ERROR: case COMP_STALL_ERROR: @@ -2483,8 +2483,7 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, case COMP_USB_TRANSACTION_ERROR: frame->status = -EPROTO; sum_trbs_for_length = true; - if (ep_trb != td->end_trb) - td->error_mid_td = true; + error_event = true; break; case COMP_STOPPED: sum_trbs_for_length = true; @@ -2518,8 +2517,17 @@ static void process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, td->urb->actual_length += frame->actual_length;
finish_td: + /* An error event mid TD will be followed by more events, xHCI 4.9.1 */ + td->error_mid_td |= error_event && (ep_trb != td->end_trb); + + /* Etron treats *all* SuperSpeed isoc errors like errors mid TD */ + if (xhci->quirks & XHCI_ETRON_HOST && td->urb->dev->speed == USB_SPEED_SUPER) { + td->error_mid_td |= error_event; + etron_quirk |= error_event; + } + /* Don't give back TD yet if we encountered an error mid TD */ - if (td->error_mid_td && ep_trb != td->end_trb) { + if (td->error_mid_td && (ep_trb != td->end_trb || etron_quirk)) { xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n"); td->urb_length_set = true; return;
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree. Subject: [PATCH] usb: xhci: Handle quirky SuperSpeed isoc error reporting by Etron HCs Link: https://lore.kernel.org/stable/20250211133614.5d64301f%40foxbook
Hi,
Thank you for the review.
Mathias Nyman mathias.nyman@linux.intel.com 於 2025年2月5日 週三 下午10:16寫道:
On 5.2.2025 7.37, Kuangyi Chiang wrote:
Unplugging a USB3.0 webcam while streaming results in errors like this:
[ 132.646387] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646446] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8630 trb-start 000000002fdf8640 trb-end 000000002fdf8650 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0 [ 132.646560] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 18 comp_code 13 [ 132.646568] xhci_hcd 0000:03:00.0: Looking for event-dma 000000002fdf8660 trb-start 000000002fdf8670 trb-end 000000002fdf8670 seg-start 000000002fdf8000 seg-end 000000002fdf8ff0
If an error is detected while processing the last TRB of an isoc TD, the Etron xHC generates two transfer events for the TRB where the error was detected. The first event can be any sort of error (like USB Transaction or Babble Detected, etc), and the final event is Success.
The xHCI driver will handle the TD after the first event and remove it from its internal list, and then print an "Transfer event TRB DMA ptr not part of current TD" error message after the final event.
Commit 5372c65e1311 ("xhci: process isoc TD properly when there was a transaction error mid TD.") is designed to address isoc transaction errors, but unfortunately it doesn't account for this scenario.
To work around this by reusing the logic that handles isoc transaction errors, but continuing to wait for the final event when this condition occurs. Sometimes we see the Stopped event after an error mid TD, this is a normal event for a pending TD and we can think of it as the final event we are waiting for.
Not giving back the TD when we get an event for the last TRB in the TD sounds risky. With this change we assume all old and future ETRON hosts will trigger this additional spurious success event.
I think we could handle this more like the XHCI_SPURIOUS_SUCCESS case seen with short transfers, and just silence the error message.
Yes, this was my initial solution as well. See my patch v2 [1].
[1] https://lore.kernel.org/all/20241028025337.6372-6-ki.chiang65@gmail.com
Are there any other issues besides the error message seen?
There are no other issues.
Thanks Mathias
Thanks, Kuangyi Chiang
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
sum_trbs_for_length = true; break;td->error_mid_td = false;
What was the reason for this part?
As written it is going to cause problems, the driver will forget about earlier errors if the endpoint is stopped and resumed on the same TD.
I think that the whole patch could be much simpler, like:
case X_ERROR: frame->status = X; td->error_mid_td = true; break; case Y_ERROR: frame->status = Y; td->error_mid_td = true; break;
and then
if (error_mid_td && (ep_trb != td->end_trb || ETRON && SUPERSPEED)) { // error mid TD, wait for final event }
finish_td()
Regards, Michal
Hi,
Thank you for the review.
Michał Pecio michal.pecio@gmail.com 於 2025年2月6日 週四 上午5:45寫道:
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
td->error_mid_td = false; sum_trbs_for_length = true; break;
What was the reason for this part?
To prevent the driver from printing the following debug message twice:
"Error mid isoc TD, wait for final completion event"
This can happen if the driver queues a Stop Endpoint command after mid isoc TD error occurred, see my debug messages below:
[ 110.514149] xhci_hcd 0000:01:00.0: xhci_queue_isoc_tx [ 110.514164] xhci_hcd 0000:01:00.0: @000000002d119510 59600000 00000000 00009000 800b1725 [ 110.514169] xhci_hcd 0000:01:00.0: @000000002d119520 59609000 00000000 00107000 800b1515 [ 110.514173] xhci_hcd 0000:01:00.0: @000000002d119530 59610000 00000000 00002000 00000625 ... [ 110.530263] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530266] xhci_hcd 0000:01:00.0: @000000010afe6350 2d119510 00000000 04009000 01138001 [ 110.530271] xhci_hcd 0000:01:00.0: Error mid isoc TD, wait for final completion event [ 110.530373] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530378] xhci_hcd 0000:01:00.0: @000000010afe6360 2d119510 00000000 01009000 01138001 [ 110.530387] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530391] xhci_hcd 0000:01:00.0: @000000010afe6370 2d119520 00000000 04007000 01138001 [ 110.530395] xhci_hcd 0000:01:00.0: Error mid isoc TD, wait for final completion event [ 110.530430] xhci_hcd 0000:01:00.0: queue_command [ 110.530434] xhci_hcd 0000:01:00.0: @000000010afe5230 00000000 00000000 00000000 01133c01 [ 110.530470] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530473] xhci_hcd 0000:01:00.0: @000000010afe6380 2d119520 00000000 1a000000 01138001 [ 110.530478] xhci_hcd 0000:01:00.0: Error mid isoc TD, wait for final completion event [ 110.530481] xhci_hcd 0000:01:00.0: xhci_handle_event_trb [ 110.530484] xhci_hcd 0000:01:00.0: @000000010afe6390 0afe5230 00000001 01000000 01008401 ...
This may become confusing.
As written it is going to cause problems, the driver will forget about earlier errors if the endpoint is stopped and resumed on the same TD.
Yes, this can happen, I didn't account for this scenario.
I think that the whole patch could be much simpler, like:
case X_ERROR: frame->status = X; td->error_mid_td = true; break; case Y_ERROR: frame->status = Y; td->error_mid_td = true; break;
and then
if (error_mid_td && (ep_trb != td->end_trb || ETRON && SUPERSPEED)) { // error mid TD, wait for final event }
finish_td()
Yes, this is much simpler than my patch, but we still need to fix the issue with debug message being printed twice.
Maybe silence the debug message like this:
if (error_mid_td && (ep_trb != td->end_trb || ETRON && SUPERSPEED)) { if (trb_comp_code != COMP_STOPPED) xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n"); ... }
, or do nothing. Could you help with some suggestions?
Regards, Michal
Thanks, Kuangyi Chiang
On Fri, 7 Feb 2025 14:59:25 +0800, Kuangyi Chiang wrote:
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
td->error_mid_td = false; sum_trbs_for_length = true; break;
What was the reason for this part?
To prevent the driver from printing the following debug message twice:
"Error mid isoc TD, wait for final completion event"
This can happen if the driver queues a Stop Endpoint command after mid isoc TD error occurred, see my debug messages below:
I see. Not sure if it's a big problem, dynamic debug is disabled by default and anyone using it needs to read the code anyway to understand what those messages mean. And when you read the code it becomes obvious why the message can show up twice (or even more, in fact).
I would even say that it is helpful, because it shows that control flow passes exactly as expected when the Stopped event is handled. And it's nothing new, this debug code always worked like that on all HCs.
Regards, Michal
Michał Pecio michal.pecio@gmail.com 於 2025年2月7日 週五 下午5:51寫道:
On Fri, 7 Feb 2025 14:59:25 +0800, Kuangyi Chiang wrote:
case COMP_STOPPED:
/* Think of it as the final event if TD had an error */
if (td->error_mid_td)
td->error_mid_td = false; sum_trbs_for_length = true; break;
What was the reason for this part?
To prevent the driver from printing the following debug message twice:
"Error mid isoc TD, wait for final completion event"
This can happen if the driver queues a Stop Endpoint command after mid isoc TD error occurred, see my debug messages below:
I see. Not sure if it's a big problem, dynamic debug is disabled by default and anyone using it needs to read the code anyway to understand what those messages mean. And when you read the code it becomes obvious why the message can show up twice (or even more, in fact).
I would even say that it is helpful, because it shows that control flow passes exactly as expected when the Stopped event is handled. And it's nothing new, this debug code always worked like that on all HCs.
Got it, thanks for your suggestion.
Regards, Michal
Thanks, Kuangyi Chiang
linux-stable-mirror@lists.linaro.org