Previously, receiver buffer auto-tuning started after receiving one advertised window worth of data. After the initial receive buffer was raised by commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), TCP autotuning could take too long to start raising the receive buffer size. Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") tried to decrease the threshold at which TCP auto-tuning starts, but it doesn't work well in some environments where the receiver has a large MTU (9001) configured, especially when the RTT is high. To address this issue, this patch relies on RCV_MSS so auto-tuning can start early regardless of the receiver's configured MTU.
Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 389d1b340248..f0ffac9e937b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_init_buffer_space(struct sock *sk)
 {
 	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int maxwin;

 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);

-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
 	tcp_mstamp_refresh(tp);
 	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;
Hey Team,
I am sending you this e-mail as a follow-up to provide more context about the patch that I proposed in my previous e-mail.
1) We have received a customer complaint [1] about degraded download speed from Google endpoints after they upgraded their Ubuntu kernel from 4.14 to 5.4. These customers were getting around 80MB/s on kernel 4.14, which dropped to 3MB/s after the upgrade to kernel 5.4.
2) We tried to reproduce the issue locally between EC2 instances within the same region but couldn't; however, we were able to reproduce it when fetching data from a Google endpoint.
3) The issue could only be reproduced in regions where we have high RTT (around 12msec or more) to Google endpoints.
4) We have found some workarounds that can be applied on the receiver side which have proven to be effective:
A) Decrease the TCP socket default rmem from 131072 to 87380.
B) Decrease the MTU from 9001 to 1500.
C) Change sysctl_tcp_adv_win_scale from the default 1 to 0 or 2.
D) We have also found that disabling net.ipv4.tcp_moderate_rcvbuf on kernel 4.14 gives exactly the same bad performance.
5) We have done a kernel bisect to understand when this behaviour was introduced and found that commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") [2], which was merged into kernel 4.19.86, is the culprit behind this download performance degradation. The commit made two main changes:
A) Raising the initial TCP receive buffer size and receive window.
B) Changing the way TCP Dynamic Right Sizing (DRS) is kicked off.
6) The above patch introduced a regression: receive window scaling took a long time to start after the initial receiver buffer & receive window were raised. An additional fix for that landed in commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") [3].
7) Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") was trying to decrease the initial rcvq_space.space, which is used in TCP's internal auto-tuning to grow socket buffers based on how much data the kernel estimates the sender can send; it should change over the life of any connection based on the amount of data the sender is actually sending. That patch relies on advmss (the MSS configured on the receiver) to set the initial receive space. While this works very well for receivers with small MTUs like 1500, it doesn't help if the receiver is configured to use jumbo frames (9001 MTU), which is the default MTU on AWS EC2 instances; this, together with the high RTT (>=12msec) required to see the issue, is probably why it hasn't been reported before.
8) After further debugging and testing we have found that the issue can only be reproduced under any of the conditions below:
A) Sender (MTU 1500) using bbr/bbrv2 as the congestion control algorithm --> Receiver (MTU 9001) with default ipv4.sysctl_tcp_rmem[1] = 131072 running kernel 4.19.86 or later, with RTT >= 12msec --> consistently reproducible.
B) Sender (MTU 1500) using cubic as the congestion control algorithm with fq as the qdisc --> Receiver (MTU 9001) with default ipv4.sysctl_tcp_rmem[1] = 131072 running kernel 4.19.86 or later, with RTT >= 30msec --> consistently reproducible.
C) Sender (MTU 1500) using cubic as the congestion control algorithm with pfifo_fast as the qdisc --> Receiver (MTU 9001) with default ipv4.sysctl_tcp_rmem[1] = 131072 running kernel 4.19.86 or later, with RTT >= 30msec --> intermittently reproducible.
D) The sender needs an MTU of 1500. If the sender is using an MTU of 9001 with no MSS clamping, we couldn't reproduce the issue.
E) AWS EC2 instances use 9001 as the MTU by default, hence they are likely more impacted by this.
9) With some kernel hacking & packet capture analysis we found that, under the above conditions, the receive window never scales up; it looks like TCP receiver autotuning never kicks off. I have attached screenshots to this e-mail showing window scaling with and without the proposed patch. We also found that all the workarounds either decrease the initial rcvq_space (this includes decreasing the receiver advertised MSS from 9001 to 1500, or the default receive buffer size from 131072 to 87380) or increase the maximum advertised receive window before TCP autotuning starts scaling (this includes changing net.ipv4.tcp_adv_win_scale from 1 to 0 or 2).
10) It looks like when the issue happens we have a kind of deadlock: the advertised receive window has to exceed rcvq_space for TCP autotuning to kick off, but with the initial default configuration the receive window is never going to exceed rcvq_space, because it can only get half of the initial receive socket buffer size.
11) The current code, after that patch, has one main drawback which should be handled:
A) It relies on the receiver's configured MTU to define the initial receive space (the threshold where TCP autotuning starts). As mentioned above, this works well with a 1500 MTU because it makes sure the initial receive space is lower than the receive window, so TCP autotuning works just fine. It won't work with jumbo frames in use on the receiver, because in that case the receiver won't start TCP autotuning, especially with high RTT, and we hit the regression that commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") was trying to handle.
12) I am proposing the patch above, which relies on RCV_MSS (our guess about the MSS used by the peer, equal to TCP_MSS_DEFAULT, 536 bytes, by default); this should work regardless of the receiver's configured MTU (see the worked example after this list). I am also sharing my iperf test results with and without the patch, and I have verified that the connection won't get stuck mid-transfer in case of packet loss or a latency spike, which I emulated using tc netem on the sender side.
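To put numbers on the deadlock described above, here is a small userspace sketch of the initial rcvq_space.space computation. TCP_INIT_CWND (10) and TCP_MSS_DEFAULT (536) are the kernel's values; the rcv_wnd and advmss figures are assumptions matching our setup, so treat this as an illustration rather than kernel code:

#include <stdio.h>

#define TCP_INIT_CWND   10   /* kernel's initial congestion window, in segments */
#define TCP_MSS_DEFAULT 536  /* kernel's initial RCV_MSS guess for the peer */

static unsigned int min_u32(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int rcv_wnd = 65535; /* ~64KB initial window after a337531b942b */

	/* Old code: min(rcv_wnd, TCP_INIT_CWND * tp->advmss) */
	unsigned int mtu1500 = min_u32(rcv_wnd, TCP_INIT_CWND * 1460); /* advmss for MTU 1500 */
	unsigned int mtu9001 = min_u32(rcv_wnd, TCP_INIT_CWND * 8961); /* advmss for MTU 9001 */

	/* Patched code: min(rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss) */
	unsigned int patched = min_u32(rcv_wnd, TCP_INIT_CWND * TCP_MSS_DEFAULT);

	printf("MTU 1500 receiver: space = %u (< rcv_wnd, autotuning can start)\n", mtu1500);
	printf("MTU 9001 receiver: space = %u (== rcv_wnd, the deadlock)\n", mtu9001);
	printf("patched (RCV_MSS): space = %u (always < rcv_wnd)\n", patched);
	return 0;
}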
Test Results using the same sender & receiver:
-Without our proposed patch:
#iperf3 -c xx.xx.xx.xx -t15 -i1 -R
Connecting to host xx.xx.xx.xx, port 5201
Reverse mode, remote host xx.xx.xx.xx is sending
[  4] local 172.31.37.167 port 52838 connected to xx.xx.xx.xx port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   269 KBytes  2.20 Mbits/sec
[  4]   1.00-2.00   sec   332 KBytes  2.72 Mbits/sec
[  4]   2.00-3.00   sec   334 KBytes  2.73 Mbits/sec
[  4]   3.00-4.00   sec   335 KBytes  2.75 Mbits/sec
[  4]   4.00-5.00   sec   332 KBytes  2.72 Mbits/sec
[  4]   5.00-6.00   sec   283 KBytes  2.32 Mbits/sec
[  4]   6.00-7.00   sec   332 KBytes  2.72 Mbits/sec
[  4]   7.00-8.00   sec   335 KBytes  2.75 Mbits/sec
[  4]   8.00-9.00   sec   335 KBytes  2.75 Mbits/sec
[  4]   9.00-10.00  sec   334 KBytes  2.73 Mbits/sec
[  4]  10.00-11.00  sec   332 KBytes  2.72 Mbits/sec
[  4]  11.00-12.00  sec   332 KBytes  2.72 Mbits/sec
[  4]  12.00-13.00  sec   338 KBytes  2.77 Mbits/sec
[  4]  13.00-14.00  sec   334 KBytes  2.73 Mbits/sec
[  4]  14.00-15.00  sec   332 KBytes  2.72 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-15.00  sec  6.07 MBytes  3.39 Mbits/sec    0   sender
[  4]   0.00-15.00  sec  4.90 MBytes  2.74 Mbits/sec        receiver
iperf Done.
Test downloading from google endpoint:
# wget https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/...
--2020-12-04 16:53:00--  https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/...
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.1.48, 172.217.8.176, 172.217.4.48, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.1.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113320760 (108M) [application/octet-stream]
Saving to: ‘kubelet.45’
100%[===================================================================================================================================================>] 113,320,760 3.04MB/s in 36s
2020-12-04 16:53:36 (3.02 MB/s) - ‘kubelet’ saved [113320760/113320760]
########################################################################################################################
-With the proposed patch:
#iperf3 -c xx.xx.xx.xx -t15 -i1 -R
Connecting to host xx.xx.xx.xx, port 5201
Reverse mode, remote host xx.xx.xx.xx is sending
[  4] local 172.31.37.167 port 44514 connected to xx.xx.xx.xx port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   911 KBytes  7.46 Mbits/sec
[  4]   1.00-2.00   sec  8.95 MBytes  75.1 Mbits/sec
[  4]   2.00-3.00   sec  9.57 MBytes  80.3 Mbits/sec
[  4]   3.00-4.00   sec  9.56 MBytes  80.2 Mbits/sec
[  4]   4.00-5.00   sec  9.58 MBytes  80.3 Mbits/sec
[  4]   5.00-6.00   sec  9.58 MBytes  80.4 Mbits/sec
[  4]   6.00-7.00   sec  9.59 MBytes  80.4 Mbits/sec
[  4]   7.00-8.00   sec  9.59 MBytes  80.5 Mbits/sec
[  4]   8.00-9.00   sec  9.58 MBytes  80.4 Mbits/sec
[  4]   9.00-10.00  sec  9.58 MBytes  80.4 Mbits/sec
[  4]  10.00-11.00  sec  9.59 MBytes  80.4 Mbits/sec
[  4]  11.00-12.00  sec  9.59 MBytes  80.5 Mbits/sec
[  4]  12.00-13.00  sec  8.05 MBytes  67.5 Mbits/sec
[  4]  13.00-14.00  sec  9.57 MBytes  80.3 Mbits/sec
[  4]  14.00-15.00  sec  9.57 MBytes  80.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-15.00  sec   136 MBytes  76.3 Mbits/sec    0   sender
[  4]   0.00-15.00  sec   134 MBytes  75.2 Mbits/sec        receiver
iperf Done.
Test downloading from google endpoint:
# wget https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/...
--2020-12-04 16:54:34--  https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/...
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.0.16, 216.58.192.144, 172.217.6.16, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.0.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113320760 (108M) [application/octet-stream]
Saving to: ‘kubelet’
100%[===================================================================================================================================================>] 113,320,760 80.0MB/s in 1.4s
2020-12-04 16:54:36 (80.0 MB/s) - ‘kubelet.1’ saved [113320760/113320760]
Links:
[1] https://github.com/kubernetes/kops/issues/10206
[2] https://lore.kernel.org/patchwork/patch/1157936/
[3] https://lore.kernel.org/patchwork/patch/1157883/
Thank you.
Hazem
On Fri, Dec 4, 2020 at 7:19 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
Unfortunately, a few things are missing in this report.
What is the RTT between hosts in your test ?
What driver is used at the receiving side ?
Usually, this kind of problem comes when (skb->len / skb->truesize) is pathologically small. This could be caused by a driver lacking scatter-gather support at RX (a 1500-byte incoming packet would use 12KB of memory or so, because the driver MTU was set to 9000).
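A back-of-the-envelope sketch of that ratio, using the assumed figures above (1500-byte packet charged ~12KB of truesize); this is an illustration only, not measured data:

#include <stdio.h>

int main(void)
{
	unsigned int len = 1500;            /* payload bytes actually carried by the skb */
	unsigned int truesize = 12 * 1024;  /* memory charged to the receive buffer */

	/* Receive buffer accounting is based on truesize, not len, so a
	 * pathologically small ratio starves the advertised window. */
	printf("skb->len / skb->truesize = %.3f\n", (double)len / truesize);
	return 0;
}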
Also worth noting that if you set MTU to 9000 (instead of standard 1500), you probably need to tweak a few sysctls.
autotuning is tricky, changing initial values can be good in some cases, bad in others.
It would be nice if you send "ss -temoi" output taken at receiver while transfer is in progress.
On Fri, Dec 4, 2020 at 7:08 PM Hazem Mohamed Abuelfotoh abuehaze@amazon.com wrote:
-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
So are you claiming icsk->icsk_ack.rcv_mss is related to MTU 9000?
RCV_MSS is not known until we receive actual packets... The initial value is something like 536, if I am not mistaken.
I think your patch does not match the changelog.
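For reference, that initial value comes from tcp_initialize_rcv_mss(); the sketch below paraphrases its logic from memory rather than quoting the tree verbatim, so check net/ipv4/tcp_input.c for the exact code:

#include <stdio.h>

#define TCP_MSS_DEFAULT 536  /* RFC 1122 default MSS */
#define TCP_MIN_MSS      88

/* Userspace paraphrase of tcp_initialize_rcv_mss(): the guess is clamped
 * to TCP_MSS_DEFAULT and half the receive window until real data arrives. */
static unsigned int initial_rcv_mss(unsigned int advmss, unsigned int rcv_wnd)
{
	unsigned int hint = advmss < TCP_MSS_DEFAULT ? advmss : TCP_MSS_DEFAULT;

	if (hint > rcv_wnd / 2)
		hint = rcv_wnd / 2;
	if (hint < TCP_MIN_MSS)
		hint = TCP_MIN_MSS;
	return hint;
}

int main(void)
{
	/* Even a jumbo-frame receiver (advmss 8961) starts with rcv_mss 536. */
	printf("initial rcv_mss (advmss 8961) = %u\n", initial_rcv_mss(8961, 65535));
	printf("initial rcv_mss (advmss 1460) = %u\n", initial_rcv_mss(1460, 65535));
	return 0;
}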
On Fri, Dec 4, 2020 at 1:08 PM Hazem Mohamed Abuelfotoh abuehaze@amazon.com wrote:
-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
Thanks for the detailed report and the proposed fix.
AFAICT the core of the bug is related to this part of your follow-up email:
10) It looks like when the issue happens we have a kind of deadlock: the advertised receive window has to exceed rcvq_space for TCP autotuning to kick off, but with the initial default configuration the receive window is never going to exceed rcvq_space
The existing code is:
tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
(1) With a typical case where both sides have an advmss based on an MTU of 1500 bytes, the governing limit here is around 10*1460 or so, or around 15Kbytes. With a tp->rcvq_space.space of that size, it is common for the data sender to send that amount of data in one round trip, which will trigger the receive buffer autotuning code in tcp_rcv_space_adjust(), so autotuning works well.
(2) With a case where the sender has a 1500B NIC and the receiver has an advmss based on an MTU of 9KB, then the expression becomes governed by the receive window of 64KBytes:
tp->rcvq_space.space ~= min(rcv_wnd, 10*9KBytes) ~= min(64KByte,90KByte) = 65536 bytes
But because tp->rcvq_space.space is set to the rcv_wnd, and the sender is not allowed to send more than the receive window, this check in tcp_rcv_space_adjust() will always fail, and the receiver will take the new_measure path:
	/* Number of bytes copied to user in last RTT */
	copied = tp->copied_seq - tp->rcvq_space.seq;
	if (copied <= tp->rcvq_space.space)
		goto new_measure;
This seems to be why receive buffer autotuning is never triggered.
Furthermore, if we try to fix it with:
-	if (copied <= tp->rcvq_space.space)
+	if (copied < tp->rcvq_space.space)
The buggy behavior will still remain, because 65536 is not a multiple of the MSS, so the number of bytes copied to the user in the last RTT will never, in practice, exactly match the tp->rcvq_space.space.
AFAICT the proposed patch fixes this bug by setting the tp->rcvq_space.space to 10*536 = 5360. This is a number of bytes that *can* actually be delivered in a round trip in a mixed 1500B/9KB scenario, so this allows receive window auto-tuning to actually be triggered.
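A tiny sketch of that gate, under the assumption that the sender can deliver at most one full receive window (~64KB) per RTT; the numbers are the ones from the analysis above, not kernel code:

#include <stdio.h>

int main(void)
{
	/* The sender may never have more than one receive window in flight. */
	unsigned int copied_per_rtt = 65535;

	unsigned int space_old = 65535; /* old init, jumbo receiver: space == rcv_wnd */
	unsigned int space_new = 5360;  /* patched init: 10 * 536 */

	/* Mirrors the gate quoted above:
	 * if (copied <= tp->rcvq_space.space) goto new_measure; */
	printf("old init: %s\n", copied_per_rtt <= space_old ?
	       "goto new_measure (autotuning never runs)" : "autotuning runs");
	printf("new init: %s\n", copied_per_rtt <= space_new ?
	       "goto new_measure (autotuning never runs)" : "autotuning runs");
	return 0;
}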
It seems like a reasonable approach to fix this issue, but I agree with Eric that it would be good to improve the commit message a bit.
Also, since this is a bug fix, it seems it should be directed to the "net" branch rather than the "net-next" branch.
Also, FWIW, I think the "for high latency connections" can be dropped from the commit summary/first-line, since the bug can be hit at any RTT, and is simply easier to notice at high RTTs. You might consider something like:
tcp: fix receive buffer autotuning to trigger for any valid advertised MSS
best, neal
Previously, receiver buffer auto-tuning started after receiving one advertised window worth of data. After the initial receive buffer was raised by commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB"), TCP autotuning could take too long to start raising the receive buffer size. Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") tried to decrease the threshold at which TCP auto-tuning starts, but it doesn't work well in some environments where the receiver has a large MTU (9001), especially on high-RTT connections: in these environments rcvq_space.space will be the same as rcv_wnd, so TCP autotuning will never start because the sender can't send more than rcv_wnd in one round trip. To address this issue, this patch decreases the initial rcvq_space.space so TCP autotuning kicks in whenever the sender is able to send more than 5360 bytes in one round trip, regardless of the receiver's configured MTU.
Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 389d1b340248..f0ffac9e937b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_init_buffer_space(struct sock *sk)
 {
 	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int maxwin;

 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);

-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
 	tcp_mstamp_refresh(tp);
 	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;
On Mon, 7 Dec 2020 11:46:25 +0000 Hazem Mohamed Abuelfotoh wrote:
If the discussion concludes in favor of this patch, please un-indent this commit message, remove the empty line after the Fixes tags, and repost.