On Sat, Dec 5, 2020 at 1:03 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
Unfortunately few things are missing in this report.
What is the RTT between hosts in your test ?

>>>>>RTT in my test is 162 msec, but I am able to reproduce it with lower RTTs for example I could see the issue downloading from google endpoint with RTT of 16.7 msec, as mentioned in my previous e-mail the issue is reproducible whenever RTT exceeded 12msec given that the sender is using bbr.

RTT between hosts where I run the iperf test.
# ping 54.199.163.187
PING 54.199.163.187 (54.199.163.187) 56(84) bytes of data.
64 bytes from 54.199.163.187: icmp_seq=1 ttl=33 time=162 ms
64 bytes from 54.199.163.187: icmp_seq=2 ttl=33 time=162 ms
64 bytes from 54.199.163.187: icmp_seq=3 ttl=33 time=162 ms
64 bytes from 54.199.163.187: icmp_seq=4 ttl=33 time=162 ms

RTT between my EC2 instances and google endpoint.
# ping 172.217.4.240
PING 172.217.4.240 (172.217.4.240) 56(84) bytes of data.
64 bytes from 172.217.4.240: icmp_seq=1 ttl=101 time=16.7 ms
64 bytes from 172.217.4.240: icmp_seq=2 ttl=101 time=16.7 ms
64 bytes from 172.217.4.240: icmp_seq=3 ttl=101 time=16.7 ms
64 bytes from 172.217.4.240: icmp_seq=4 ttl=101 time=16.7 ms

What driver is used at the receiving side ?

>>>>>>I am using ENA driver version version: 2.2.10g on the receiver with scatter gathering enabled.

# ethtool -k eth0 | grep scatter-gather
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
This ethtool output refers to TX scatter gather, which is not relevant for this bug.
I see ENA driver might use 16 KB per incoming packet (if ENA_PAGE_SIZE is 16 KB)
Since I can not reproduce this problem with another NIC on x86, I really wonder if this is not an issue with ENA driver on PowerPC perhaps ?
>Since I can not reproduce this problem with another NIC on x86, I really wonder if this is not an issue with ENA driver on PowerPC perhaps ?
I am able to reproduce it on x86-based EC2 instances using the ENA, Xen netfront, or Intel ixgbevf driver on the receiver, so it's not specific to ENA. We were also able to easily reproduce it between 2 VMs running in VirtualBox on the same physical host, given the environment requirements I mentioned in my first e-mail.
What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?
Thank you.
Hazem
On 07/12/2020, 15:26, "Eric Dumazet" edumazet@google.com wrote:
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
On Mon, Dec 7, 2020 at 5:09 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
>Since I can not reproduce this problem with another NIC on x86, I really wonder if this is not an issue with ENA driver on PowerPC perhaps ?
I am able to reproduce it on x86 based EC2 instances using ENA or Xen netfront or Intel ixgbevf driver on the receiver so it's not specific to ENA, we were able to easily reproduce it between 2 VMs running in virtual box on the same physical host considering the environment requirements I mentioned in my first e-mail.
What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?
100ms RTT
Which exact version of linux kernel are you using ?
On Mon, Dec 7, 2020 at 11:23 AM Eric Dumazet edumazet@google.com wrote:
100ms RTT
Which exact version of linux kernel are you using ?
Thanks for testing this, Eric. Would you be able to share the MTU config commands you used, and the tcpdump traces you get? I'm surprised that receive buffer autotuning would work for advmss of around 6500 or higher.
thanks, neal
On Mon, Dec 7, 2020 at 5:34 PM Neal Cardwell ncardwell@google.com wrote:
Thanks for testing this, Eric. Would you be able to share the MTU config commands you used, and the tcpdump traces you get? I'm surprised that receive buffer autotuning would work for advmss of around 6500 or higher.
autotuning might be delayed by one RTT, this does not match numbers given by Mohamed (flows stuck in low speed)
autotuning is a heuristic, and because it has one RTT latency, it is crucial to get proper initial rcvmem values.
People using MTU=9000 should know they have to tune tcp_rmem[1] accordingly, especially when using drivers consuming one page per incoming MSS.
(mlx4 driver only uses one 2048-byte fragment for a 1500 MTU packet, even with MTU set to 9000)
I want to state again that using 536 bytes as a magic value makes no sense to me.
For the record, Google has increased tcp_rmem[1] when switching to a bigger MTU.
The reason is simple: if we intend to receive 10 MSS, we should allow for 90000 bytes of payload, or tcp_rmem[1] set to 180000. Because of autotuning latency, doubling the value is advised: 360000.
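The sizing rule in this message can be written out as a small calculation. This is only a sketch of the arithmetic, not kernel code: the function names are made up for illustration, and the MSS is rounded to 9000 bytes as in the text.

```c
#include <assert.h>

/* Hypothetical helpers: they only restate the arithmetic of this message. */

/* Payload we want to accept before autotuning reacts: n_mss segments. */
static unsigned int initial_payload_bytes(unsigned int n_mss, unsigned int mss)
{
	assert(mss > 0);
	return n_mss * mss;
}

/* Buffer needed when only half of it is counted as payload (factor of two). */
static unsigned int rmem_for_payload(unsigned int payload)
{
	return 2 * payload;
}

/* One extra doubling to cover the one-RTT autotuning latency. */
static unsigned int rmem_with_autotune_margin(unsigned int rmem)
{
	return 2 * rmem;
}
```

With n_mss = 10 and mss = 9000, the three helpers return 90000, 180000 and 360000, which are the figures quoted in this message.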
Another problem with kicking autotuning too fast is that it might allow bigger sk->sk_rcvbuf values even for small flows, opening more surface to malicious attacks.
I _think_ that if we want to allow admins to set high MTU without having to tune tcp_rmem[], we need something different than current proposal.
>I want to state again that using 536 bytes as a magic value makes no sense to me.
>autotuning might be delayed by one RTT, this does not match numbers given by Mohamed (flows stuck in low speed)
>autotuning is an heuristic, and because it has one RTT latency, it is crucial to get proper initial rcvmem values.
>People using MTU=9000 should know they have to tune tcp_rmem[1] accordingly, especially when using drivers consuming one page per incoming MSS.
The magic number would be 10*rcv_mss = 5360, not 536. In my opinion that is a large amount of data to send in an attack; if we are talking about a DDoS attack triggering autotuning at 5360 bytes, I'd say the attacker would also be able to trigger it by sending 64KB. I do agree that it would be easier with a lower rcvq_space.space; it's always a tradeoff between security and performance.
Other options would be to either consider the configured MTU in the rcv_wnd calculation or check the MTU before calculating the initial rcvspace. We have to make sure that the initial receive space is lower than the initial receive window, so that autotuning works regardless of the MTU configured on the receiver. Only people using jumbo frames would pay the price, if we agree that jumbo frame users are expected to have machines with more memory. I'd say something like the below should work:
void tcp_init_buffer_space(struct sock *sk)
{
    int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    int maxwin;

    if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
        tcp_sndbuf_expand(sk);
    if (tp->advmss < 6000)
        tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
    else
        tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
    tcp_mstamp_refresh(tp);
    tp->rcvq_space.time = tp->tcp_mstamp;
    tp->rcvq_space.seq = tp->copied_seq;
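To see what the proposed branch would compute, here is a userspace sketch of just the rcvq_space.space selection. It is not kernel code: `model_rcvq_space` and its parameters are hypothetical stand-ins for the `tp->` and `icsk->` fields, TCP_INIT_CWND is assumed to be 10, and the initial rcv_mss is assumed to be 536, the value discussed in this thread.

```c
#include <assert.h>

#define MODEL_INIT_CWND 10u      /* assumed value of TCP_INIT_CWND */
#define MODEL_ADVMSS_LIMIT 6000u /* threshold used in the proposal */

static unsigned int min_u32(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the proposed selection: a small advmss keeps the
 * existing formula, a large advmss switches to rcv_mss.
 */
static unsigned int model_rcvq_space(unsigned int rcv_wnd,
				     unsigned int advmss,
				     unsigned int rcv_mss)
{
	assert(advmss > 0 && rcv_mss > 0);
	if (advmss < MODEL_ADVMSS_LIMIT)
		return min_u32(rcv_wnd, MODEL_INIT_CWND * advmss);
	return min_u32(rcv_wnd, MODEL_INIT_CWND * rcv_mss);
}
```

For an illustrative rcv_wnd of 65535 bytes, an advmss of 1460 yields 14600, while an advmss of 8960 yields 5360, the 10*rcv_mss figure mentioned in this message.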
I don't think that we should rely on admins manually tuning tcp_rmem[1] when jumbo frames are in use; also, Linux users shouldn't expect performance degradation after a kernel upgrade. Although [1] is the only public report of this issue, I am pretty sure we will see more users reporting it as the main Linux distributions move to kernel 5.4 as their stable version. In summary, we should come up with something, either the proposed patch or something else, to avoid admins doing the manual job.
Links
[1] https://github.com/kubernetes/kops/issues/10206
On Mon, Dec 7, 2020 at 9:09 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
The magic number would be 10*rcv_mss=5360 not 536 and in my opinion it's a big amount of data to be sent in security attack so if we are talking about DDos attack triggering Autotuning at 5360 bytes I'd say he will also be able to trigger it sending 64KB but I totally agree that it would be easier with lower rcvq_space.space, it's always a tradeoff between security and performance.
Other options would be to either consider the configured MTU in the rcv_wnd calculation or probably check the MTU before calculating the initial rcvspace. We have to make sure that initial receive space is lower than initial receive window so Autotuning would work regardless the configured MTU on the receiver and only people using Jumbo frames will be paying the price if we agreed that it's expected for Jumbo frame users to have machines with more memory, I'd say something as below should work:
> void tcp_init_buffer_space(struct sock *sk)
> {
>     int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
>     struct inet_connection_sock *icsk = inet_csk(sk);
>     struct tcp_sock *tp = tcp_sk(sk);
>     int maxwin;
>
>     if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
>         tcp_sndbuf_expand(sk);
>     if (tp->advmss < 6000)
>         tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);

This is just another hack, based on 'magic' numbers.

>     else
>         tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
>     tcp_mstamp_refresh(tp);
>     tp->rcvq_space.time = tp->tcp_mstamp;
>     tp->rcvq_space.seq = tp->copied_seq;
I don't think that we should rely on Admins manually tuning this tcp_rmem[1] with Jumbo frame in use also Linux users shouldn't expect performance degradation after kernel upgrade. although [1] is the only public reporting of this issue, I am pretty sure we will see more users reporting this with Linux Main distributions moving to kernel 5.4 as stable version. In Summary we should come up with something either the proposed patch or something else to avoid admins doing the manual job.
Default MTU is 1500, not 9000.
I hinted in my very first reply to you that MTU 9000 is not easy and needs tuning. We could argue and try to make this less of a pain in future kernel (net-next)
<quote>Also worth noting that if you set MTU to 9000 (instead of standard 1500), you probably need to tweak a few sysctls. </quote>
I think I have asked you multiple times to test appropriate tcp_rmem[1] settings...
I gave the reason why tcp_rmem[1] set to 131072 is not good for MTU 9000. I would prefer a solution that involves no kernel patch and no backports, just a matter of educating sysadmins, for increased TCP performance, especially when really using 9000 MTU...
Your patch would change the behavior of TCP stack for standard MTU=1500 flows which are yet the majority. This is very risky.
Anyway. _if_ we really wanted to change the kernel, ( keeping stupid tcp_rmem[1] value ) :
In the tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss); formula, really the bug is in the tp->rcv_wnd term, not the second one.
This is buggy, because tcp_init_buffer_space() ends up with tp->window_clamp smaller than tp->rcv_wnd, so tcp_grow_window() is not able to change tp->rcv_ssthresh
The only mechanism allowing to change tp->window_clamp later would be DRS, so we better use the proper limit when initializing tp->rcvq_space.space
This issue disappears if tcp_rmem[1] is slightly above 131072, because then the following is not needed.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)

        tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
        tp->snd_cwnd_stamp = tcp_jiffies32;
+       tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
 }

 /* 4. Recalculate window clamp after socket hit its memory bounds. */
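The effect of the one-line change can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel logic: the helper names are hypothetical, TCP_INIT_CWND is assumed to be 10, and the sample numbers in the note below are illustrative rather than measured.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_INIT_CWND 10u /* assumed value of TCP_INIT_CWND */

static unsigned int min_u32(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Existing formula: min(rcv_wnd, TCP_INIT_CWND * advmss). */
static unsigned int space_before(unsigned int rcv_wnd, unsigned int advmss)
{
	assert(advmss > 0);
	return min_u32(rcv_wnd, MODEL_INIT_CWND * advmss);
}

/* With the added line: the result is also bounded by rcv_ssthresh. */
static unsigned int space_after(unsigned int rcv_wnd, unsigned int advmss,
				unsigned int rcv_ssthresh)
{
	return min_u32(rcv_ssthresh, space_before(rcv_wnd, advmss));
}

/*
 * Simplifying assumption of this model: at most rcv_ssthresh bytes can
 * arrive per RTT, so a threshold above rcv_ssthresh cannot be exceeded.
 */
static bool threshold_out_of_reach(unsigned int space, unsigned int rcv_ssthresh)
{
	return space > rcv_ssthresh;
}
```

With illustrative values rcv_wnd = 65535, advmss = 8960 and rcv_ssthresh = 56000, the existing formula yields 65535, which is above rcv_ssthresh; with the extra min() it becomes 56000 and is no longer out of reach in this model.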
>Thanks for testing this, Eric. Would you be able to share the MTU config commands you used, and the tcpdump traces you get? I'm surprised that receive buffer autotuning would work for advmss of around 6500 or higher.
Packet capture before applying the proposed patch
https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-bad-unpatch...
Packet capture after applying the proposed patch
https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-good-patche...
The kernel version, MTU, and configuration from my receiver & sender are attached to this e-mail. Please be aware that EC2 is doing MSS clamping, so you need to configure the MTU as 1500 on the sender side if you don’t have any MSS clamping between sender & receiver.
Thank you.
Hazem
On Mon, Dec 7, 2020 at 6:17 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
>Thanks for testing this, Eric. Would you be able to share the MTU config commands you used, and the tcpdump traces you get? I'm surprised that receive buffer autotuning would work for advmss of around 6500 or higher.
Packet capture before applying the proposed patch
https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-bad-unpatch...
Packet capture after applying the proposed patch
https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-good-patche...
kernel version & MTU and configuration from my receiver & sender is attached to this e-mail, please be aware that EC2 is doing MSS clamping so you need to configure MTU as 1500 on the sender side if you don’t have any MSS clamping between sender & receiver.
Thank you.
Hazem
Please try again, with a fixed tcp_rmem[1] on receiver, taking into account bigger memory requirement for MTU 9000
Rationale : TCP should be ready to receive 10 full frames before autotuning takes place (these 10 MSS are typically in a single GRO packet)
At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
->
echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem
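The rationale above can be checked with a short calculation of how many full frames a given tcp_rmem[1] covers. A minimal sketch, assuming the figures from this message (a 9000-byte MSS and a factor of two for kernel memory accounting); `bytes_accounted_per_mss` and `frames_covered` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Kernel memory budgeted per MSS under the 50% rule: twice the payload. */
static unsigned int bytes_accounted_per_mss(unsigned int mss)
{
	assert(mss > 0);
	return 2 * mss;
}

/* Number of whole MSS-sized frames that fit in tcp_rmem[1]. */
static unsigned int frames_covered(unsigned int rmem1, unsigned int mss)
{
	return rmem1 / bytes_accounted_per_mss(mss);
}
```

For a 9000-byte MSS this budgets 18000 bytes per frame: a tcp_rmem[1] of 131072 covers 7 frames, while 180000 covers the 10 frames named in the rationale.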
>Please try again, with a fixed tcp_rmem[1] on receiver, taking into account bigger memory requirement for MTU 9000
>Rationale : TCP should be ready to receive 10 full frames before autotuning takes place (these 10 MSS are typically in a single GRO packet)
>At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
>TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
>->
>echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem

>diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
>--- a/net/ipv4/tcp_input.c
>+++ b/net/ipv4/tcp_input.c
>@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)
>
>        tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
>        tp->snd_cwnd_stamp = tcp_jiffies32;
>+       tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
> }
Yes, this worked, and it looks like echo "4096 140000 15728640" >/proc/sys/net/ipv4/tcp_rmem is actually enough to trigger TCP autotuning. If the current default tcp_rmem[1] doesn't work well with 9000 MTU, I am curious to know whether there is a specific reason behind having 131072 as tcp_rmem[1]. I think the number itself has to be divisible by the page size (4K) and by 16KB, given what you said about each jumbo frame packet consuming up to 16KB.
If the patch I proposed would be risky for users who have an MTU of 1500 because of its higher memory footprint, then in my opinion we should get the patch you proposed merged instead of asking admins to do the manual work.
Thank you.
Hazem
Feel free to ignore this message as I sent it before seeing your newly submitted patch (
Thank you.
Hazem
On 08/12/2020, 16:28, "Mohamed Abuelfotoh, Hazem" abuehaze@amazon.com wrote:
>Please try again, with a fixed tcp_rmem[1] on receiver, taking into >account bigger memory requirement for MTU 9000
>Rationale : TCP should be ready to receive 10 full frames before >autotuning takes place (these 10 MSS are typically in a single GRO > packet)
>At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
>TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
->
>echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem
>diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab >100644 >--- a/net/ipv4/tcp_input.c >+++ b/net/ipv4/tcp_input.c >@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk) > > tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp); > tp->snd_cwnd_stamp = tcp_jiffies32; >+ tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space); >}
Yes, this worked, and it looks like echo "4096 140000 15728640" >/proc/sys/net/ipv4/tcp_rmem is actually enough to trigger TCP autotuning. If the current default tcp_rmem[1] doesn't work well with 9000 MTU, I am curious to know whether there is a specific reason for having 131072 as tcp_rmem[1]. I think the number itself has to be divisible by the page size (4K) and by 16KB, given what you said about each jumbo frame packet consuming up to 16KB.
If the patch I proposed would be risky for users who have an MTU of 1500 because of its higher memory footprint, then in my opinion we should get the patch you proposed merged instead of asking admins to do the manual work.
Thank you.
Hazem
On 07/12/2020, 17:28, "Eric Dumazet" edumazet@google.com wrote:
On Mon, Dec 7, 2020 at 6:17 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
>
> >Thanks for testing this, Eric. Would you be able to share the MTU config commands you used, and the tcpdump traces you get? I'm surprised that receive buffer autotuning would work for advmss of around 6500 or higher.
>
> Packet capture before applying the proposed patch
>
> https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-bad-unpatch...
>
> Packet capture after applying the proposed patch
>
> https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-good-patche...
>
> kernel version & MTU and configuration from my receiver & sender is attached to this e-mail, please be aware that EC2 is doing MSS clamping so you need to configure MTU as 1500 on the sender side if you don’t have any MSS clamping between sender & receiver.
>
> Thank you.
>
> Hazem
Please try again, with a fixed tcp_rmem[1] on receiver, taking into account bigger memory requirement for MTU 9000
Rationale : TCP should be ready to receive 10 full frames before autotuning takes place (these 10 MSS are typically in a single GRO packet)
At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
->
echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem
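As a rough check of the sizing rule above, the arithmetic can be sketched in C. This is only an illustration: the helper names are made up for this note, the MTU is used as a stand-in for the MSS, and the factor of two is the "50% factor rule" as described in the message, not code taken from the kernel.

```c
/* Illustrative arithmetic only; these helpers are hypothetical, not kernel code. */

/* Kernel memory accounted per full-sized frame under the 50% rule
 * described above: twice the payload size (MTU used as a stand-in for MSS). */
static unsigned long accounted_per_frame(unsigned long mtu)
{
	return 2UL * mtu;
}

/* Receive buffer needed to hold `frames` full frames before autotuning. */
static unsigned long rmem_for_frames(unsigned long mtu, unsigned long frames)
{
	return frames * accounted_per_frame(mtu);
}

/* Whole frames that fit in a given tcp_rmem[1] value. */
static unsigned long frames_covered(unsigned long rmem, unsigned long mtu)
{
	return rmem / accounted_per_frame(mtu);
}
```

Under these assumptions, `rmem_for_frames(9000, 10)` gives 180000, the middle value in the echo command, while `frames_covered(131072, 9000)` gives 7, so a tcp_rmem[1] of 131072 leaves room for fewer than 10 such frames.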
On Tue, Dec 8, 2020 at 5:28 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
>Please try again, with a fixed tcp_rmem[1] on receiver, taking into account bigger memory requirement for MTU 9000
>Rationale : TCP should be ready to receive 10 full frames before autotuning takes place (these 10 MSS are typically in a single GRO packet)
>At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
>TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
>->
>echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem
>diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
>--- a/net/ipv4/tcp_input.c
>+++ b/net/ipv4/tcp_input.c
>@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)
>
> 	tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
> 	tp->snd_cwnd_stamp = tcp_jiffies32;
>+	tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
> }
Yes, this worked, and it looks like echo "4096 140000 15728640" >/proc/sys/net/ipv4/tcp_rmem is actually enough to trigger TCP autotuning. If the current default tcp_rmem[1] doesn't work well with 9000 MTU, I am curious to know whether there is a specific reason for having 131072 as tcp_rmem[1]. I think the number itself has to be divisible by the page size (4K) and by 16KB, given what you said about each jumbo frame packet consuming up to 16KB.
I think the idea behind the value of 131072 was that because TCP RWIN was set to 65535, we had to reserve twice this amount of memory -> 131072 bytes.
Assuming DRS (Dynamic Right-Sizing, the receive buffer autotuning) works well, the exact value should matter only for unresponsive applications (slow to read/drain the receive queue), since DRS is delayed for them.
100ms RTT
>Which exact version of linux kernel are you using ?
On the receiver side I could see the issue with any mainline kernel version >=4.19.86 which is the first kernel version that has patches [1] & [2] included. On the sender I am using kernel 5.4.0-rc6.
Links:
[1] https://lore.kernel.org/patchwork/patch/1157936/
[2] https://lore.kernel.org/patchwork/patch/1157883/
Thank you.
Hazem
On 07/12/2020, 16:24, "Eric Dumazet" edumazet@google.com wrote:
On Mon, Dec 7, 2020 at 5:09 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
>
> >Since I can not reproduce this problem with another NIC on x86, I really wonder if this is not an issue with ENA driver on PowerPC perhaps ?
>
> I am able to reproduce it on x86 based EC2 instances using ENA or Xen netfront or Intel ixgbevf driver on the receiver so it's not specific to ENA, we were able to easily reproduce it between 2 VMs running in virtual box on the same physical host considering the environment requirements I mentioned in my first e-mail.
>
> What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?
100ms RTT
Which exact version of linux kernel are you using ?
> Thank you.
>
> Hazem
>
> On 07/12/2020, 15:26, "Eric Dumazet" edumazet@google.com wrote:
>
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
> On Sat, Dec 5, 2020 at 1:03 PM Mohamed Abuelfotoh, Hazem abuehaze@amazon.com wrote:
> >
> > Unfortunately few things are missing in this report.
> >
> > What is the RTT between hosts in your test ?
> > >>>>>RTT in my test is 162 msec, but I am able to reproduce it with lower RTTs for example I could see the issue downloading from google endpoint with RTT of 16.7 msec, as mentioned in my previous e-mail the issue is reproducible whenever RTT exceeded 12msec given that the sender is using bbr.
> >
> > RTT between hosts where I run the iperf test.
> > # ping 54.199.163.187
> > PING 54.199.163.187 (54.199.163.187) 56(84) bytes of data.
> > 64 bytes from 54.199.163.187: icmp_seq=1 ttl=33 time=162 ms
> > 64 bytes from 54.199.163.187: icmp_seq=2 ttl=33 time=162 ms
> > 64 bytes from 54.199.163.187: icmp_seq=3 ttl=33 time=162 ms
> > 64 bytes from 54.199.163.187: icmp_seq=4 ttl=33 time=162 ms
> >
> > RTT between my EC2 instances and google endpoint.
> > # ping 172.217.4.240
> > PING 172.217.4.240 (172.217.4.240) 56(84) bytes of data.
> > 64 bytes from 172.217.4.240: icmp_seq=1 ttl=101 time=16.7 ms
> > 64 bytes from 172.217.4.240: icmp_seq=2 ttl=101 time=16.7 ms
> > 64 bytes from 172.217.4.240: icmp_seq=3 ttl=101 time=16.7 ms
> > 64 bytes from 172.217.4.240: icmp_seq=4 ttl=101 time=16.7 ms
> >
> > What driver is used at the receiving side ?
> > >>>>>>I am using ENA driver version version: 2.2.10g on the receiver with scatter gathering enabled.
> >
> > # ethtool -k eth0 | grep scatter-gather
> > scatter-gather: on
> > tx-scatter-gather: on
> > tx-scatter-gather-fraglist: off [fixed]
>
> This ethtool output refers to TX scatter gather, which is not relevant for this bug.
>
> I see ENA driver might use 16 KB per incoming packet (if ENA_PAGE_SIZE is 16 KB)
>
> Since I can not reproduce this problem with another NIC on x86, I really wonder if this is not an issue with ENA driver on PowerPC perhaps ?
>
> Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
> Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
On Mon, Dec 07, 2020 at 04:34:57PM +0000, Mohamed Abuelfotoh, Hazem wrote:
> 100ms RTT
> >Which exact version of linux kernel are you using ?
> On the receiver side I could see the issue with any mainline kernel version >=4.19.86 which is the first kernel version that has patches [1] & [2] included. On the sender I am using kernel 5.4.0-rc6.
5.4.0-rc6 is a very old and odd kernel to be doing anything with. Are you sure you don't mean "5.10-rc6" here?
thanks,
greg k-h
>5.4.0-rc6 is a very old and odd kernel to be doing anything with. Are you sure you don't mean "5.10-rc6" here?
I was able to reproduce it on the latest mainline kernel as well so anything newer than 4.19.85 is just broken.
Thank you.
Hazem
On 07/12/2020, 17:45, "Greg KH" gregkh@linuxfoundation.org wrote:
On Mon, Dec 07, 2020 at 04:34:57PM +0000, Mohamed Abuelfotoh, Hazem wrote:
> 100ms RTT
>
> >Which exact version of linux kernel are you using ?
> On the receiver side I could see the issue with any mainline kernel version >=4.19.86 which is the first kernel version that has patches [1] & [2] included. On the sender I am using kernel 5.4.0-rc6.
5.4.0-rc6 is a very old and odd kernel to be doing anything with. Are you sure you don't mean "5.10-rc6" here?
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org