From: Geliang Tang tanggeliang@kylinos.cn
v3: - modifications that better address the root causes. - only contains the first two patches for -net.
v2: - add patch 2, a new fix for sk_msg_memcopy_from_iter. - update patch 3, only test "sk->sk_prot->close" as Eric suggested. - update patch 4, use "goto err" instead of "return" as Eduard suggested. - add "fixes" tag for patch 1-3. - change subject prefixes as "bpf-next" to trigger BPF CI. - cc Loongarch maintainers too.
BPF selftests seem to have not been fully tested on Loongarch. When I ran these tests on Loongarch recently, some errors occur. This patch set contains two bugfixes for skmsg.
Geliang Tang (2): skmsg: prevent empty ingress skb from enqueuing skmsg: bugfix for sk_msg sge iteration
net/core/skmsg.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
From: Geliang Tang tanggeliang@kylinos.cn
Run this BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- '''
This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in sk_msg_recvmsg(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
The root cause is an empty skb (skb->len == 0) is put on the queue.
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL-page to copy_page_to_iter(), it passed to kmap_local_page() and page_address(), then kernel panics.
To solve this, we should prevent empty skb from putting on the queue. So in sk_psock_verdict_recv(), if the skb->len is zero, drop this skb.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn --- net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..44952cdd1425 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1184,7 +1184,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
rcu_read_lock(); psock = sk_psock(sk); - if (unlikely(!psock)) { + if (unlikely(!psock || !len)) { len = 0; tcp_eat_skb(sk, skb); sock_drop(sk, skb);
On 6/28/24 1:47 PM, Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Run this BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- '''
This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in sk_msg_recvmsg(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
Maybe Loongarch ? Besides that, LGTM. Reviewed-by: D. Wythe alibuda@linux.alibaba.com
The root cause is an empty skb (skb->len == 0) is put on the queue.
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL-page to copy_page_to_iter(), it passed to kmap_local_page() and page_address(), then kernel panics.
To solve this, we should prevent empty skb from putting on the queue. So in sk_psock_verdict_recv(), if the skb->len is zero, drop this skb.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..44952cdd1425 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1184,7 +1184,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) rcu_read_lock(); psock = sk_psock(sk);
- if (unlikely(!psock)) {
- if (unlikely(!psock || !len)) { len = 0; tcp_eat_skb(sk, skb); sock_drop(sk, skb);
Superseded.
I just sent a v4 ("skmsg: skip empty sge in sk_msg_recvmsg").
Thanks, -Geliang
On Fri, 2024-06-28 at 13:47 +0800, Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Run this BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0- rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson- UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- '''
This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in sk_msg_recvmsg(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
The root cause is an empty skb (skb->len == 0) is put on the queue.
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL-page to copy_page_to_iter(), it passed to kmap_local_page() and page_address(), then kernel panics.
To solve this, we should prevent empty skb from putting on the queue. So in sk_psock_verdict_recv(), if the skb->len is zero, drop this skb.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..44952cdd1425 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1184,7 +1184,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) rcu_read_lock(); psock = sk_psock(sk);
- if (unlikely(!psock)) {
- if (unlikely(!psock || !len)) {
len = 0; tcp_eat_skb(sk, skb); sock_drop(sk, skb);
Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Run this BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- '''
This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in sk_msg_recvmsg(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
The root cause is an empty skb (skb->len == 0) is put on the queue.
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL-page to copy_page_to_iter(), it passed to kmap_local_page() and page_address(), then kernel panics.
To solve this, we should prevent empty skb from putting on the queue. So in sk_psock_verdict_recv(), if the skb->len is zero, drop this skb.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..44952cdd1425 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1184,7 +1184,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) rcu_read_lock(); psock = sk_psock(sk);
- if (unlikely(!psock)) {
- if (unlikely(!psock || !len)) { len = 0; tcp_eat_skb(sk, skb); sock_drop(sk, skb);
The skb->len == 0 here is the FIN pkt right? We are using the EFAULT return triggered by copy_page_to_iter to check for is_fin in tcp_bpf.c.
The concern I have here is if we don't have the skb fin pkt on the recv queue we might go into wait_data and block instead of return to user when rcvmsg() is called from user. I wonder if we can write a test for this if we don't already have one we probably should create one.
Maybe a better fix assuming my assumption about fin being skb->len=0 is correct?
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..bbf40b999713 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -434,7 +434,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, page = sg_page(sge); if (copied + copy > len) copy = len - copied; - copy = copy_page_to_iter(page, sge->offset, copy, iter); + if (copy) + copy = copy_page_to_iter(page, sge->offset, copy, iter); if (!copy) { copied = copied ? copied : -EFAULT; goto out;
Thanks, John
On Tue, 2024-07-02 at 18:03 -0700, John Fastabend wrote:
Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Run this BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0- rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson- UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- '''
This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in sk_msg_recvmsg(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
The root cause is an empty skb (skb->len == 0) is put on the queue.
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL-page to copy_page_to_iter(), it passed to kmap_local_page() and page_address(), then kernel panics.
To solve this, we should prevent empty skb from putting on the queue. So in sk_psock_verdict_recv(), if the skb->len is zero, drop this skb.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..44952cdd1425 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1184,7 +1184,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) rcu_read_lock(); psock = sk_psock(sk);
- if (unlikely(!psock)) {
- if (unlikely(!psock || !len)) {
len = 0; tcp_eat_skb(sk, skb); sock_drop(sk, skb);
The skb->len == 0 here is the FIN pkt right? We are using the EFAULT return triggered by copy_page_to_iter to check for is_fin in tcp_bpf.c.
The concern I have here is if we don't have the skb fin pkt on the recv queue we might go into wait_data and block instead of return to user when rcvmsg() is called from user. I wonder if we can write a test for this if we don't already have one we probably should create one.
Maybe a better fix assuming my assumption about fin being skb->len=0 is correct?
Thanks John. Your fix is much better than mine. I'll use this as v5 and update the commit log. I'll add your "Suggested-by" tag in it.
-Geliang
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..bbf40b999713 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -434,7 +434,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, page = sg_page(sge); if (copied + copy > len) copy = len - copied; - copy = copy_page_to_iter(page, sge->offset, copy, iter); + if (copy) + copy = copy_page_to_iter(page, sge-
offset, copy, iter);
if (!copy) { copied = copied ? copied : -EFAULT; goto out;
Thanks, John
Hi John,
On Wed, 2024-07-03 at 09:54 +0800, Geliang Tang wrote:
On Tue, 2024-07-02 at 18:03 -0700, John Fastabend wrote:
Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Run this BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0- rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson- UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- '''
This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic.
This crash is because a NULL pointer is passed to page_address() in sk_msg_recvmsg(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
The root cause is an empty skb (skb->len == 0) is put on the queue.
In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list.
And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL-page to copy_page_to_iter(), it passed to kmap_local_page() and page_address(), then kernel panics.
To solve this, we should prevent empty skb from putting on the queue. So in sk_psock_verdict_recv(), if the skb->len is zero, drop this skb.
Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..44952cdd1425 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1184,7 +1184,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) rcu_read_lock(); psock = sk_psock(sk);
- if (unlikely(!psock)) {
- if (unlikely(!psock || !len)) {
len = 0; tcp_eat_skb(sk, skb); sock_drop(sk, skb);
The skb->len == 0 here is the FIN pkt right? We are using the EFAULT return triggered by copy_page_to_iter to check for is_fin in tcp_bpf.c.
I added some logs for debugging and found that this FIN packet do hit is_fin check in tcp_bpf.c.
The concern I have here is if we don't have the skb fin pkt on the recv queue we might go into wait_data and block instead of return to user when rcvmsg() is called from user. I wonder if we can write a test for this if we don't already have one we probably should create one.
In test_sockmap_skb_verdict_shutdown(), the FIN packet is sent by
shutdown(p1, SHUT_WR);
and received by
n = recv(c1, &b, 1, SOCK_NONBLOCK); ASSERT_EQ(n, 0, "recv_timeout(fin)");
I think this test has covered the FIN packet scenario already. No need to add a new one. WDYT?
Maybe a better fix assuming my assumption about fin being skb-
len=0
is correct?
Thanks John. Your fix is much better than mine. I'll use this as v5 and update the commit log. I'll add your "Suggested-by" tag in it.
Anyway, this v5 (skmsg: skip zero length skb in sk_msg_recvmsg) seems ready to be merged.
Thanks, -Geliang
-Geliang
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fd20aae30be2..bbf40b999713 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -434,7 +434,8 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, page = sg_page(sge); if (copied + copy > len) copy = len - copied; - copy = copy_page_to_iter(page, sge->offset, copy, iter); + if (copy) + copy = copy_page_to_iter(page, sge-
offset, copy, iter);
if (!copy) { copied = copied ? copied : -EFAULT; goto out;
Thanks, John
From: Geliang Tang tanggeliang@kylinos.cn
Every time run this BPF selftests (./test_sockmap) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 20 PID: 23245 Comm: test_sockmap Tainted: G OE 6.10.0-rc2+ #32 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000043a315c tcp_bpf_sendmsg+0x23c/0x420 ERA: 900000000426cd1c sk_msg_memcopy_from_iter+0xbc/0x220 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: tls xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT Process test_sockmap (pid: 23245, threadinfo=00000000aeb68043, task=...) Stack : ... ... ... Call Trace: [<900000000426cd1c>] sk_msg_memcopy_from_iter+0xbc/0x220 [<90000000043a315c>] tcp_bpf_sendmsg+0x23c/0x420 [<90000000041cafc8>] __sock_sendmsg+0x68/0xe0 [<90000000041cc4bc>] ____sys_sendmsg+0x2bc/0x360 [<90000000041cea18>] ___sys_sendmsg+0xb8/0x120 [<90000000041cf1f8>] __sys_sendmsg+0x98/0x100 [<90000000045b76ec>] do_syscall+0x8c/0xc0 [<90000000030e1da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- '''
This crash is because a NULL pointer is passed to page_address() in sk_msg_memcopy_from_iter(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
This bug is a logic error indeed. In sk_msg_memcopy_from_iter(), an invalid "sge" is always used:
if (msg->sg.copybreak >= sge->length) { msg->sg.copybreak = 0; sk_msg_iter_var_next(i); if (i == msg->sg.end) break; sge = sk_msg_elem(msg, i); }
If the value of i is 2, msg->sg.end is also 2 when entering this if block. sk_msg_iter_var_next() increases i by 1, and now i is 3, which is no longer equal to msg->sg.end. The break will not be triggered, and the next sge obtained by sk_msg_elem(3) will be an invalid one.
The correct approach is to check (i == msg->sg.end) first, and then invoke sk_msg_iter_var_next() if they are not equal.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn --- net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 44952cdd1425..1906d0d0eeac 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -378,9 +378,9 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, /* This is possible if a trim operation shrunk the buffer */ if (msg->sg.copybreak >= sge->length) { msg->sg.copybreak = 0; - sk_msg_iter_var_next(i); if (i == msg->sg.end) break; + sk_msg_iter_var_next(i); sge = sk_msg_elem(msg, i); }
On 6/28/24 1:47 PM, Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Every time run this BPF selftests (./test_sockmap) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 20 PID: 23245 Comm: test_sockmap Tainted: G OE 6.10.0-rc2+ #32 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000043a315c tcp_bpf_sendmsg+0x23c/0x420 ERA: 900000000426cd1c sk_msg_memcopy_from_iter+0xbc/0x220 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: tls xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT Process test_sockmap (pid: 23245, threadinfo=00000000aeb68043, task=...) Stack : ... ... ... Call Trace: [<900000000426cd1c>] sk_msg_memcopy_from_iter+0xbc/0x220 [<90000000043a315c>] tcp_bpf_sendmsg+0x23c/0x420 [<90000000041cafc8>] __sock_sendmsg+0x68/0xe0 [<90000000041cc4bc>] ____sys_sendmsg+0x2bc/0x360 [<90000000041cea18>] ___sys_sendmsg+0xb8/0x120 [<90000000041cf1f8>] __sys_sendmsg+0x98/0x100 [<90000000045b76ec>] do_syscall+0x8c/0xc0 [<90000000030e1da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- '''
This crash is because a NULL pointer is passed to page_address() in sk_msg_memcopy_from_iter(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
This bug is a logic error indeed. In sk_msg_memcopy_from_iter(), an invalid "sge" is always used:
if (msg->sg.copybreak >= sge->length) { msg->sg.copybreak = 0; sk_msg_iter_var_next(i); if (i == msg->sg.end) break; sge = sk_msg_elem(msg, i); }
If the value of i is 2, msg->sg.end is also 2 when entering this if block. sk_msg_iter_var_next() increases i by 1, and now i is 3, which is no longer equal to msg->sg.end. The break will not be triggered, and the next sge obtained by sk_msg_elem(3) will be an invalid one.
The correct approach is to check (i == msg->sg.end) first, and then invoke sk_msg_iter_var_next() if they are not equal.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 44952cdd1425..1906d0d0eeac 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -378,9 +378,9 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, /* This is possible if a trim operation shrunk the buffer */ if (msg->sg.copybreak >= sge->length) { msg->sg.copybreak = 0;
sk_msg_iter_var_next(i); if (i == msg->sg.end) break;
sk_msg_iter_var_next(i);
Reviewed-by: D. Wythe alibuda@linux.alibaba.com
sge = sk_msg_elem(msg, i); }
Hello,
On Mon, 2024-07-01 at 17:00 +0800, D. Wythe wrote:
On 6/28/24 1:47 PM, Geliang Tang wrote:
From: Geliang Tang tanggeliang@kylinos.cn
Every time run this BPF selftests (./test_sockmap) on a Loongarch platform, a Kernel panic occurs:
''' Oops[#1]: CPU: 20 PID: 23245 Comm: test_sockmap Tainted: G OE 6.10.0- rc2+ #32 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000043a315c tcp_bpf_sendmsg+0x23c/0x420 ERA: 900000000426cd1c sk_msg_memcopy_from_iter+0xbc/0x220 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: tls xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT Process test_sockmap (pid: 23245, threadinfo=00000000aeb68043, task=...) Stack : ... ... ... Call Trace: [<900000000426cd1c>] sk_msg_memcopy_from_iter+0xbc/0x220 [<90000000043a315c>] tcp_bpf_sendmsg+0x23c/0x420 [<90000000041cafc8>] __sock_sendmsg+0x68/0xe0 [<90000000041cc4bc>] ____sys_sendmsg+0x2bc/0x360 [<90000000041cea18>] ___sys_sendmsg+0xb8/0x120 [<90000000041cf1f8>] __sys_sendmsg+0x98/0x100 [<90000000045b76ec>] do_syscall+0x8c/0xc0 [<90000000030e1da4>] handle_syscall+0xc4/0x160
Code: ...
---[ end trace 0000000000000000 ]--- '''
This crash is because a NULL pointer is passed to page_address() in sk_msg_memcopy_from_iter(). Due to the difference in architecture, page_address(0) will not trigger a panic on the X86 platform but will panic on the Loogarch platform. So this bug was hidden on the x86 platform, but now it is exposed on the Loogarch platform.
This bug is a logic error indeed. In sk_msg_memcopy_from_iter(), an invalid "sge" is always used:
if (msg->sg.copybreak >= sge->length) { msg->sg.copybreak = 0; sk_msg_iter_var_next(i); if (i == msg->sg.end) break; sge = sk_msg_elem(msg, i); }
If the value of i is 2, msg->sg.end is also 2 when entering this if block. sk_msg_iter_var_next() increases i by 1, and now i is 3, which is no longer equal to msg->sg.end. The break will not be triggered, and the next sge obtained by sk_msg_elem(3) will be an invalid one.
The correct approach is to check (i == msg->sg.end) first, and then invoke sk_msg_iter_var_next() if they are not equal.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Geliang Tang tanggeliang@kylinos.cn
net/core/skmsg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 44952cdd1425..1906d0d0eeac 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -378,9 +378,9 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, /* This is possible if a trim operation shrunk the buffer */ if (msg->sg.copybreak >= sge->length) { msg->sg.copybreak = 0;
sk_msg_iter_var_next(i);
if (i == msg->sg.end) break;
sk_msg_iter_var_next(i);
Reviewed-by: D. Wythe alibuda@linux.alibaba.com
Thanks for your review.
But this change breaks test_sockmap. Will send a v4 to fix this.
Changes Requested.
-Geliang
sge = sk_msg_elem(msg, i); }
Hi John,
On Mon, 2024-07-01 at 18:29 +0800, Geliang Tang wrote:
> > Hello, > > > > On Mon, 2024-07-01 at 17:00 +0800, D. Wythe wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 6/28/24 1:47 PM, Geliang Tang wrote: > > > > > > > > > > > > > > > > > > From: Geliang Tang > > > > > > > > > > > > > > > > > > <tanggeliang@kylinos.cn > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Every time run this BPF > > > > > > > > > > > > > > > > > > selftests > > > > > > > > > > > > > > > > > > (./test_sockmap) on a > > > > > > > > > > > > > > > > > > Loongarch > > > > > > > > > > > > > > > > > > platform,
This issue seems related to "txmsg_cork".
Every subtest of test_sockmap with a txmsg_cork value greater than 1 will trigger this issue. For example, this test_txmsg_cork_hangs test:
static void test_txmsg_cork_hangs(int cgrp, struct sockmap_options *opt) { txmsg_pass = 1; txmsg_redir = 0; txmsg_cork = 4097; txmsg_apply = 4097; test_send_large(opt, cgrp);
txmsg_pass = 0; txmsg_redir = 1; txmsg_apply = 0; txmsg_cork = 4097; test_send_large(opt, cgrp);
txmsg_pass = 0; txmsg_redir = 1; txmsg_apply = 4097; txmsg_cork = 4097; test_send_large(opt, cgrp); }
These tests will break the sk_msg sge iteration in sk_msg_memcopy_from_iter().
I added the following test code:
''' diff --git a/net/core/skmsg.c b/net/core/skmsg.c index bbf40b999713..e1fd40aa8586 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -374,6 +374,7 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, void *to;
do { + pr_info("%s i=%d end=%d\n", __func__, i, msg->sg.end); sge = sk_msg_elem(msg, i); /* This is possible if a trim operation shrunk the buffer */ if (msg->sg.copybreak >= sge->length) { -- '''
And got this logs when running "test hanging corks" test:
''' # 1/ 1 sockmap::txmsg test hanging corks:OK # 2/ 1 sockhash::txmsg test hanging corks:OK # 3/ 1 sockhash:ktls:txmsg test hanging corks:OK
[ 55.751687] sk_msg_memcopy_from_iter i=0 end=8 [ 55.751712] sk_msg_memcopy_from_iter i=1 end=8 [ 55.751726] sk_msg_memcopy_from_iter i=2 end=8 [ 55.751769] sk_msg_memcopy_from_iter i=3 end=8 [ 55.751778] sk_msg_memcopy_from_iter i=4 end=8 [ 55.751787] sk_msg_memcopy_from_iter i=5 end=8 [ 55.751796] sk_msg_memcopy_from_iter i=6 end=8 [ 55.751805] sk_msg_memcopy_from_iter i=7 end=8 [ 55.752979] sk_msg_memcopy_from_iter i=8 end=16 [ 55.752988] sk_msg_memcopy_from_iter i=9 end=16 [ 55.752995] sk_msg_memcopy_from_iter i=10 end=16 [ 55.753002] sk_msg_memcopy_from_iter i=11 end=16 [ 55.753008] sk_msg_memcopy_from_iter i=12 end=16 [ 55.753015] sk_msg_memcopy_from_iter i=13 end=16 [ 55.753022] sk_msg_memcopy_from_iter i=14 end=16 [ 55.753028] sk_msg_memcopy_from_iter i=15 end=16 [ 56.087047] sk_msg_memcopy_from_iter i=0 end=1 [ 56.087075] sk_msg_memcopy_from_iter i=1 end=1 [ 56.087081] sk_msg_memcopy_from_iter i=3 end=1 [ 56.087086] sk_msg_memcopy_from_iter i=5 end=1 [ 56.087091] sk_msg_memcopy_from_iter i=7 end=1 [ 56.087095] sk_msg_memcopy_from_iter i=9 end=1 [ 56.087100] sk_msg_memcopy_from_iter i=11 end=1 [ 56.087105] sk_msg_memcopy_from_iter i=13 end=1 [ 56.087110] sk_msg_memcopy_from_iter i=15 end=1 [ 56.087115] sk_msg_memcopy_from_iter i=17 end=1 '''
When "i" is greater than "end", the sge we get is empty, sge->length is 0. If we access this sge, this issue will occur. Kernel panics on some machines.
To fix this, we can't "break" the loop like I did in this patch itself. It will break test_sockmap with the following error logs:
# 1/ 1 sockmap::txmsg test hanging corks:OK sendpage loop error: No space left on device msg_loop_tx: iov_count 1024 iov_buf 256 cnt 2 err -1 tx thread exited with err 1. # 2/ 1 sockhash::txmsg test hanging corks:FAIL # 3/ 1 sockhash:ktls:txmsg test hanging corks:OK Pass: 2 Fail: 1
We must "continue" the loop to get the next valid sge.
A better fix is here:
''' diff --git a/net/core/skmsg.c b/net/core/skmsg.c index bbf40b999713..bbaf909d0f9c 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -376,13 +376,8 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, do { sge = sk_msg_elem(msg, i); /* This is possible if a trim operation shrunk the buffer */ - if (msg->sg.copybreak >= sge->length) { - msg->sg.copybreak = 0; - sk_msg_iter_var_next(i); - if (i == msg->sg.end) - break; - sge = sk_msg_elem(msg, i); - } + if (msg->sg.copybreak >= sge->length) + goto next;
buf_size = sge->length - msg->sg.copybreak; copy = (buf_size > bytes) ? bytes : buf_size; @@ -399,6 +394,7 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, bytes -= copy; if (!bytes) break; +next: msg->sg.copybreak = 0; sk_msg_iter_var_next(i); } while (i != msg->sg.end);
linux-kselftest-mirror@lists.linaro.org