Hi Eric,
On 08/08/2019 21:05, Greg Kroah-Hartman wrote:
commit b617158dc096709d8600c53b6052144d12b89fab upstream.
Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented.
We should allow these flows to make progress.
This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit.
It also adds some room to account for the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15.
Note for < 4.15 backports: tcp_rtx_queue_tail() will probably look like:

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}
Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
 include/net/tcp.h     | 17 +++++++++++++++++
 net/ipv4/tcp_output.c | 11 ++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)
I am sorry to bother you again with the recent modifications in tcp_fragment(), but it seems we hit a new kernel BUG with this patch in v4.14.
Here is the call trace.
[26665.934461] ------------[ cut here ]------------
[26665.936152] kernel BUG at ./include/linux/skbuff.h:1406!
[26665.937941] invalid opcode: 0000 [#1] SMP PTI
[26665.977252] Call Trace:
[26665.978267]  <IRQ>
[26665.979163]  tcp_fragment+0x9c/0x2cf
[26665.980562]  tcp_write_xmit+0x68f/0x988
[26665.982031]  __tcp_push_pending_frames+0x3b/0xa0
[26665.983684]  tcp_data_snd_check+0x2a/0xc8
[26665.985196]  tcp_rcv_established+0x2a8/0x30d
[26665.986736]  tcp_v4_do_rcv+0xb2/0x158
[26665.988140]  tcp_v4_rcv+0x692/0x956
[26665.989533]  ip_local_deliver_finish+0xeb/0x169
[26665.991250]  __netif_receive_skb_core+0x51c/0x582
[26665.993028]  ? inet_gro_receive+0x239/0x247
[26665.994581]  netif_receive_skb_internal+0xab/0xc6
[26665.996340]  napi_gro_receive+0x8a/0xc0
[26665.997790]  receive_buf+0x9a1/0x9cd
[26665.999232]  ? load_balance+0x17a/0x7b7
[26666.000711]  ? vring_unmap_one+0x18/0x61
[26666.002196]  ? detach_buf+0x60/0xfa
[26666.003526]  virtnet_poll+0x128/0x1e1
[26666.004860]  net_rx_action+0x12a/0x2b1
[26666.006309]  __do_softirq+0x11c/0x26b
[26666.007734]  ? handle_irq_event+0x44/0x56
[26666.009275]  irq_exit+0x61/0xa0
[26666.010511]  do_IRQ+0x9d/0xbb
[26666.011685]  common_interrupt+0x85/0x85
We are doing these tests with the MPTCP stack[1], so the error might come from there, but the call trace contains no MPTCP functions. We are still working on a reproducible setup with MPTCP before trying the same without MPTCP, but please see below the analysis we have done so far, along with some questions.
[1] https://github.com/multipath-tcp/mptcp/tree/mptcp_v0.94
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0b477a1e11770..7994e569644e0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1688,6 +1688,23 @@ static inline void tcp_check_send_head(struct sock *sk, struct sk_buff *skb_unli
 	tcp_sk(sk)->highest_sack = NULL;
 }
 
+static inline struct sk_buff *tcp_rtx_queue_head(const struct sock *sk)
+{
+	struct sk_buff *skb = tcp_write_queue_head(sk);
+
+	if (skb == tcp_send_head(sk))
+		skb = NULL;
+
+	return skb;
+}
+
+static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
+{
+	struct sk_buff *skb = tcp_send_head(sk);
+
+	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
+}
+
 static inline void __tcp_add_write_queue_tail(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_queue_tail(&sk->sk_write_queue, skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a5960b9b6741c..a99086bf26eaf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1264,6 +1264,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *buff;
 	int nsize, old_factor;
+	long limit;
 	int nlen;
 	u8 flags;
@@ -1274,7 +1275,15 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	if (nsize < 0)
 		nsize = 0;
 
-	if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf + 0x20000)) {
+	/* tcp_sendmsg() can overshoot sk_wmem_queued by one full size skb.
+	 * We need some allowance to not penalize applications setting small
+	 * SO_SNDBUF values.
+	 * Also allow first and last skb in retransmit queue to be split.
+	 */
+	limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
+	if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
+		     skb != tcp_rtx_queue_head(sk) &&
If the skb returned by tcp_rtx_queue_head() is NULL -- because skb == tcp_send_head(sk), please see above -- should we not do something other than falling through to the next condition?
+		     skb != tcp_rtx_queue_tail(sk))) {
It seems that, behind this, tcp_write_queue_prev() can be called -- please see above -- which directly calls skb_queue_prev(), which does this:
	/* This BUG_ON may seem severe, but if we just return then we
	 * are going to dereference garbage.
	 */
	BUG_ON(skb_queue_is_first(list, skb));
	return skb->prev;
Should we do the same check beforehand to avoid the BUG_ON()?
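If so, a guarded version of the < 4.15 helper might look like the following -- an untested sketch on our side, not a proper fix: when tcp_send_head() is the first skb in the write queue, nothing has been sent yet, so the retransmit queue is empty and there is no "prev" skb to walk to.

```c
static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	/* Retransmit queue is empty (e.g. nothing sent yet): the send
	 * head is the first skb in the write queue, so calling
	 * tcp_write_queue_prev() would hit skb_queue_prev()'s BUG_ON().
	 * Return NULL instead, as tcp_rtx_queue_head() already does.
	 */
	if (skb && skb == tcp_write_queue_head(sk))
		return NULL;

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}
```

With that, the `skb != tcp_rtx_queue_tail(sk)` check in tcp_fragment() would simply compare against NULL in the empty-rtx-queue case, instead of dereferencing garbage or hitting the BUG_ON().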
Could it be normal to hit this BUG_ON() with regular TCP, or is it due to other changes in the MPTCP stack? We hope we are not bothering you with something that is not in the upstream kernel (yet) :)
Cheers, Matt
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
 		return -ENOMEM;
}