There is a possibility that the context descriptor still owned by the DMA even the previous normal descriptor own bit is already cleared. Checking the context descriptor readiness without delay might be not enough time for the DMA to update the descriptor field, which causing failure in getting HW Rx timestamp.
This patch introduces a 1us fsleep() in HW Rx timestamp checking loop to give time for DMA to update/complete the context descriptor.
ptp4l Timestamp log without this patch: ----------------------------------------------------------- $ echo 10000 > /sys/class/net/enp0s30f4/gro_flush_timeout $ echo 10000 > /sys/class/net/enp0s30f4/napi_defer_hard_irqs $ ptp4l -P2Hi enp0s30f4 --step_threshold=1 -m ptp4l: selected /dev/ptp2 as PTP clock ptp4l: port 1: INITIALIZING to LISTENING on INIT_COMPLETE ptp4l: selected local clock 901210.fffe.b57df7 as best master ptp4l: port 1: new foreign master 22bb22.fffe.bb22bb-1 ptp4l: selected best master clock 22bb22.fffe.bb22bb ptp4l: port 1: LISTENING to UNCALIBRATED on RS_SLAVE ptp4l: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED ptp4l: port 1: received SYNC without timestamp ptp4l: rms 49 max 63 freq -9573 +/- 34 delay 71 +/- 1 ptp4l: rms 15 max 25 freq -9553 +/- 20 delay 72 +/- 0 ptp4l: port 1: received SYNC without timestamp ptp4l: rms 9 max 18 freq -9540 +/- 11 delay 70 +/- 0 ptp4l: port 1: received PDELAY_REQ without timestamp ptp4l: rms 16 max 29 freq -9519 +/- 12 delay 72 +/- 0 ptp4l: port 1: received PDELAY_REQ without timestamp ptp4l: rms 9 max 18 freq -9527 +/- 12 delay 72 +/- 0 ptp4l: rms 5 max 9 freq -9530 +/- 7 delay 70 +/- 0 ptp4l: rms 11 max 20 freq -9530 +/- 16 delay 72 +/- 0 ptp4l: rms 5 max 11 freq -9530 +/- 7 delay 74 +/- 0 ptp4l: rms 6 max 9 freq -9522 +/- 7 delay 72 +/- 0 ptp4l: port 1: received PDELAY_REQ without timestamp -----------------------------------------------------------
ptp4l Timestamp log with this patch: ----------------------------------------------------------- $ echo 10000 > /sys/class/net/enp0s30f4/gro_flush_timeout $ echo 10000 > /sys/class/net/enp0s30f4/napi_defer_hard_irqs $ ptp4l -P2Hi enp0s30f4 --step_threshold=1 -m ptp4l: selected /dev/ptp2 as PTP clock ptp4l: port 1: INITIALIZING to LISTENING on INIT_COMPLETE ptp4l: selected local clock 901210.fffe.b57df7 as best master ptp4l: port 1: new foreign master 22bb22.fffe.bb22bb-1 ptp4l: selected best master clock 22bb22.fffe.bb22bb ptp4l: port 1: LISTENING to UNCALIBRATED on RS_SLAVE ptp4l: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED ptp4l: rms 30 max 45 freq -9400 +/- 23 delay 72 +/- 0 ptp4l: rms 7 max 16 freq -9414 +/- 10 delay 70 +/- 0 ptp4l: rms 6 max 9 freq -9422 +/- 6 delay 72 +/- 0 ptp4l: rms 13 max 20 freq -9436 +/- 13 delay 74 +/- 0 ptp4l: rms 12 max 27 freq -9446 +/- 11 delay 72 +/- 0 ptp4l: rms 9 max 12 freq -9453 +/- 6 delay 74 +/- 0 ptp4l: rms 9 max 15 freq -9438 +/- 11 delay 74 +/- 0 ptp4l: rms 10 max 16 freq -9435 +/- 12 delay 74 +/- 0 ptp4l: rms 8 max 18 freq -9428 +/- 8 delay 72 +/- 0 ptp4l: rms 8 max 18 freq -9423 +/- 8 delay 72 +/- 0 ptp4l: rms 9 max 16 freq -9431 +/- 12 delay 70 +/- 0 ptp4l: rms 9 max 18 freq -9441 +/- 9 delay 72 +/- 0 -----------------------------------------------------------
Fixes: ba1ffd74df74 ("stmmac: fix PTP support for GMAC4") Cc: stable@vger.kernel.org # 5.4.x Signed-off-by: Song Yoong Siang yoong.siang.song@intel.com Signed-off-by: Tan Tee Min tee.min.tan@intel.com --- drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c index d3b4765c1a5b..289bf26a6105 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c @@ -279,10 +279,11 @@ static int dwmac4_wrback_get_rx_timestamp_status(void *desc, void *next_desc, /* Check if timestamp is OK from context descriptor */ do { ret = dwmac4_rx_check_timestamp(next_desc); - if (ret < 0) + if (ret <= 0) goto exit; i++;
+ fsleep(1); } while ((ret == 1) && (i < 10));
if (i == 10)
On Wed, Apr 13, 2022 at 12:01:15PM +0800, Tan Tee Min wrote:
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c index d3b4765c1a5b..289bf26a6105 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c @@ -279,10 +279,11 @@ static int dwmac4_wrback_get_rx_timestamp_status(void *desc, void *next_desc, /* Check if timestamp is OK from context descriptor */ do { ret = dwmac4_rx_check_timestamp(next_desc);
if (ret < 0)
if (ret <= 0) goto exit; i++;
fsleep(1);
This is nutty. Why isn't this code using proper deferral mechanisms like work or kthread?
} while ((ret == 1) && (i < 10));
if (i == 10) -- 2.25.1
Thanks, Richard
On Wed, Apr 13, 2022 at 05:59:15AM -0700, Richard Cochran wrote:
On Wed, Apr 13, 2022 at 12:01:15PM +0800, Tan Tee Min wrote:
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c index d3b4765c1a5b..289bf26a6105 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c @@ -279,10 +279,11 @@ static int dwmac4_wrback_get_rx_timestamp_status(void *desc, void *next_desc, /* Check if timestamp is OK from context descriptor */ do { ret = dwmac4_rx_check_timestamp(next_desc);
if (ret < 0)
if (ret <= 0) goto exit; i++;
fsleep(1);
This is nutty. Why isn't this code using proper deferral mechanisms like work or kthread?
Appreciate your comment. The dwmac4_wrback_get_rx_timestamp_status() is called by stmmac_rx() function which is scheduled by NAPI framework. Do we still need to create deferred work inside NAPI work? Would you mind to explain it more in detail?
} while ((ret == 1) && (i < 10));
if (i == 10) -- 2.25.1
Thanks, Richard
Thanks, Tee Min
On Thu, 14 Apr 2022 15:29:34 +0800 Tan Tee Min wrote:
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c @@ -279,10 +279,11 @@ static int dwmac4_wrback_get_rx_timestamp_status(void *desc, void *next_desc, /* Check if timestamp is OK from context descriptor */ do { ret = dwmac4_rx_check_timestamp(next_desc);
if (ret < 0)
if (ret <= 0) goto exit; i++;
fsleep(1);
This is nutty. Why isn't this code using proper deferral mechanisms like work or kthread?
Appreciate your comment. The dwmac4_wrback_get_rx_timestamp_status() is called by stmmac_rx() function which is scheduled by NAPI framework. Do we still need to create deferred work inside NAPI work? Would you mind to explain it more in detail?
fsleep() is a big hammer, can you try cpu_relax() and bumping the max loop count a little?
On Thu, Apr 14, 2022 at 10:42:59AM +0200, Jakub Kicinski wrote:
On Thu, 14 Apr 2022 15:29:34 +0800 Tan Tee Min wrote:
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c @@ -279,10 +279,11 @@ static int dwmac4_wrback_get_rx_timestamp_status(void *desc, void *next_desc, /* Check if timestamp is OK from context descriptor */ do { ret = dwmac4_rx_check_timestamp(next_desc);
if (ret < 0)
if (ret <= 0) goto exit; i++;
fsleep(1);
This is nutty. Why isn't this code using proper deferral mechanisms like work or kthread?
Appreciate your comment. The dwmac4_wrback_get_rx_timestamp_status() is called by stmmac_rx() function which is scheduled by NAPI framework. Do we still need to create deferred work inside NAPI work? Would you mind to explain it more in detail?
fsleep() is a big hammer, can you try cpu_relax() and bumping the max loop count a little?
Thanks for the suggestion! I tried cpu_relax(), unfortunately the issue still happens when the system is in a high-load situation.
I agree that the fsleep(1) (=1us) is a big hammer. Thus in order to improve this, I’ve figured out a smaller delay time that is enough for the context descriptor to be ready which is ndelay(500) (=500ns).
Would you think this is more acceptable?
On Tue, Apr 19, 2022 at 08:52:20AM +0800, Tan Tee Min wrote:
I agree that the fsleep(1) (=1us) is a big hammer. Thus in order to improve this, I’ve figured out a smaller delay time that is enough for the context descriptor to be ready which is ndelay(500) (=500ns).
Why isn't the context descriptor ready?
I mean, the frame already belongs to the CPU, right?
Thanks, Richard
On Tue, Apr 19, 2022 at 06:28:53AM -0700, Richard Cochran wrote:
On Tue, Apr 19, 2022 at 08:52:20AM +0800, Tan Tee Min wrote:
I agree that the fsleep(1) (=1us) is a big hammer. Thus in order to improve this, I’ve figured out a smaller delay time that is enough for the context descriptor to be ready which is ndelay(500) (=500ns).
Why isn't the context descriptor ready?
I mean, the frame already belongs to the CPU, right?
No. The context descriptor (frame) is possibly still owned by the DMA controller in this situation. This is why a looping in the original code to wait for the descriptor to be owned by the application CPU. However, when NAPI is busy polling, the context descriptor might be still owned by the DMA controller even after the looping.
Thus, we are adding an additional nanosecond delay inside the loop, so that the DMA controller can get a short moment to breathe and complete the context descriptor.
Thanks, Tee Min
Thanks, Richard
On Wed, Apr 20, 2022 at 01:15:08PM +0800, Tan Tee Min wrote:
No. The context descriptor (frame) is possibly still owned by the DMA controller in this situation.
So that is a problem. The solution is to postpone this logic until the driver owns the buffer. Doesn't the HW offer some means of notification, like an interrupt for example?
Thanks, Richard
linux-stable-mirror@lists.linaro.org