January 2025 - Linux-kselftest-mirror

[PATCH bpf-next v2 0/7] bpf: Add probe_read_{kernel,user}_dynptr and copy_from_user_dynptr

by Levi Zim via B4 Relay

This series introduces the dynptr counterparts of the bpf_probe_read_{kernel,user} helpers and the bpf_copy_from_user helper. These helpers are useful for reading variable-length data from kernel or user memory into a dynptr without going through an intermediate buffer.

Link: https://lore.kernel.org/bpf/MEYP282MB2312CFCE5F7712FDE313215AC64D2@MEYP282M…
Suggested-by: Andrii Nakryiko <andrii.nakryiko(a)gmail.com>
Signed-off-by: Levi Zim <rsworktech(a)outlook.com>
---
Changes in v2:
- Add missing bpf-next prefix. I forgot it in the initial series. Sorry about that.
- Link to v1: https://lore.kernel.org/r/20250125-bpf_dynptr_probe-v1-0-c3cb121f6951@outlo…

---
Levi Zim (7):
      bpf: Implement bpf_probe_read_kernel_dynptr helper
      bpf: Implement bpf_probe_read_user_dynptr helper
      bpf: Implement bpf_copy_from_user_dynptr helper
      tools headers UAPI: Update tools's copy of bpf.h header
      selftests/bpf: probe_read_kernel_dynptr test
      selftests/bpf: probe_read_user_dynptr test
      selftests/bpf: copy_from_user_dynptr test

 include/linux/bpf.h                                |   3 +
 include/uapi/linux/bpf.h                           |  49 ++++++++++
 kernel/bpf/helpers.c                               |  53 ++++++++++-
 kernel/trace/bpf_trace.c                           |  72 ++++++++++++++
 tools/include/uapi/linux/bpf.h                     |  49 ++++++++++
 tools/testing/selftests/bpf/prog_tests/dynptr.c    |  45 ++++++++-
 tools/testing/selftests/bpf/progs/dynptr_success.c | 106 +++++++++++++++++++++
 7 files changed, 374 insertions(+), 3 deletions(-)
---
base-commit: d0d106a2bd21499901299160744e5fe9f4c83ddb
change-id: 20250124-bpf_dynptr_probe-ab483c554f1a

Best regards,
-- 
Levi Zim <rsworktech(a)outlook.com>


[PATCH v2 net] udp: gso: do not drop small packets when PMTU reduces

by Yan Zhai

Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO for small packets. But the kernel currently dismisses GSO requests only after checking MTU/PMTU on gso_size. This means any packet, regardless of its payload size, could be dropped when the PMTU becomes smaller than the requested gso_size. We encountered this issue in production, where it caused a reliability problem: new QUIC connections could not be established until the PMTU cache expired, while non-GSO sockets kept working at the same time.

Ideally, do not check any GSO related constraints when the payload size is not larger than the requested gso_size, and return EMSGSIZE instead of EINVAL on MTU/PMTU check failure to be more specific about the error cause.

Fixes: 4094871db1d6 ("udp: only do GSO if # of segs > 1")
Signed-off-by: Yan Zhai <yan(a)cloudflare.com>
--
v1->v2: add a missing MTU check when falling back to no GSO mode, suggested by Willem de Bruijn <willemdebruijn.kernel(a)gmail.com>; fixed up commit message to be more precise.

v1: https://lore.kernel.org/all/Z5cgWh%2F6bRQm9vVU@debian.debian/
---
 net/ipv4/udp.c                       | 28 +++++++++++++++++++---------
 net/ipv6/udp.c                       | 28 +++++++++++++++++++---------
 tools/testing/selftests/net/udpgso.c | 14 ++++++++++++++
 3 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c472c9a57cf6..0b5010238d05 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1141,9 +1141,20 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
 		const int hlen = skb_network_header_len(skb) +
 				 sizeof(struct udphdr);
 
+		if (datalen <= cork->gso_size) {
+			/*
+			 * check MTU again: it's skipped previously when
+			 * gso_size != 0
+			 */
+			if (hlen + datalen > cork->fragsize) {
+				kfree_skb(skb);
+				return -EMSGSIZE;
+			}
+			goto no_gso;
+		}
 		if (hlen + cork->gso_size > cork->fragsize) {
 			kfree_skb(skb);
-			return -EINVAL;
+			return -EMSGSIZE;
 		}
 		if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) {
 			kfree_skb(skb);
@@ -1158,17 +1169,16 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
 			return -EIO;
 		}
 
-		if (datalen > cork->gso_size) {
-			skb_shinfo(skb)->gso_size = cork->gso_size;
-			skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
-			skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
-								 cork->gso_size);
+		skb_shinfo(skb)->gso_size = cork->gso_size;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
+							 cork->gso_size);
 
-			/* Don't checksum the payload, skb will get segmented */
-			goto csum_partial;
-		}
+		/* Don't checksum the payload, skb will get segmented */
+		goto csum_partial;
 	}
 
+no_gso:
 	if (is_udplite)  /* UDP-Lite */
 		csum = udplite_csum(skb);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 6671daa67f4f..d97befa7f80d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1389,9 +1389,20 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
 		const int hlen = skb_network_header_len(skb) +
 				 sizeof(struct udphdr);
 
+		if (datalen <= cork->gso_size) {
+			/*
+			 * check MTU again: it's skipped previously when
+			 * gso_size != 0
+			 */
+			if (hlen + datalen > cork->fragsize) {
+				kfree_skb(skb);
+				return -EMSGSIZE;
+			}
+			goto no_gso;
+		}
 		if (hlen + cork->gso_size > cork->fragsize) {
 			kfree_skb(skb);
-			return -EINVAL;
+			return -EMSGSIZE;
 		}
 		if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) {
 			kfree_skb(skb);
@@ -1406,17 +1417,16 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
 			return -EIO;
 		}
 
-		if (datalen > cork->gso_size) {
-			skb_shinfo(skb)->gso_size = cork->gso_size;
-			skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
-			skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
-								 cork->gso_size);
+		skb_shinfo(skb)->gso_size = cork->gso_size;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
+							 cork->gso_size);
 
-			/* Don't checksum the payload, skb will get segmented */
-			goto csum_partial;
-		}
+		/* Don't checksum the payload, skb will get segmented */
+		goto csum_partial;
 	}
 
+no_gso:
 	if (is_udplite)
 		csum = udplite_csum(skb);
 	else if (udp_get_no_check6_tx(sk)) {   /* UDP csum disabled */
diff --git a/tools/testing/selftests/net/udpgso.c b/tools/testing/selftests/net/udpgso.c
index 3f2fca02fec5..fb73f1c331fb 100644
--- a/tools/testing/selftests/net/udpgso.c
+++ b/tools/testing/selftests/net/udpgso.c
@@ -102,6 +102,13 @@ struct testcase testcases_v4[] = {
 		.gso_len = CONST_MSS_V4,
 		.r_num_mss = 1,
 	},
+	{
+		/* datalen <= MSS < gso_len: will fall back to no GSO */
+		.tlen = CONST_MSS_V4,
+		.gso_len = CONST_MSS_V4 + 1,
+		.r_num_mss = 0,
+		.r_len_last = CONST_MSS_V4,
+	},
 	{
 		/* send a single MSS + 1B */
 		.tlen = CONST_MSS_V4 + 1,
@@ -205,6 +212,13 @@ struct testcase testcases_v6[] = {
 		.gso_len = CONST_MSS_V6,
 		.r_num_mss = 1,
 	},
+	{
+		/* datalen <= MSS < gso_len: will fall back to no GSO */
+		.tlen = CONST_MSS_V6,
+		.gso_len = CONST_MSS_V6 + 1,
+		.r_num_mss = 0,
+		.r_len_last = CONST_MSS_V6,
+	},
 	{
 		/* send a single MSS + 1B */
 		.tlen = CONST_MSS_V6 + 1,
-- 
2.30.2
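
The order of the size checks in the patch above can be modelled outside the kernel. The following is only a hypothetical userspace sketch, not kernel code: the names model_udp_gso_check, MODEL_NO_GSO and MODEL_GSO are invented for illustration, the segment limit is passed in as a parameter instead of using UDP_MAX_SEGMENTS, the -EINVAL result for exceeding that limit is an assumption (the hunk is cut off before that return statement), and the checksum-related checks are left out.

```c
#include <errno.h>

/* Hypothetical result codes for this model; negative values are errno codes. */
enum { MODEL_NO_GSO = 0, MODEL_GSO = 1 };

/*
 * Userspace model of the payload-size checks in udp_send_skb() after the
 * patch. All parameters are plain integers standing in for the kernel
 * fields of the same name (cork->gso_size, cork->fragsize, ...).
 * max_segs stands in for UDP_MAX_SEGMENTS and is an assumption here.
 */
int model_udp_gso_check(int datalen, int hlen, int gso_size,
			int fragsize, int max_segs)
{
	/* GSO not requested: the regular (non-GSO) path is taken. */
	if (gso_size == 0)
		return MODEL_NO_GSO;

	/* Payload fits in one segment: fall back to no GSO... */
	if (datalen <= gso_size) {
		/* ...but re-check the MTU against the real payload size. */
		if (hlen + datalen > fragsize)
			return -EMSGSIZE;
		return MODEL_NO_GSO;
	}

	/* Real GSO request: every segment must fit in the MTU. */
	if (hlen + gso_size > fragsize)
		return -EMSGSIZE;

	/* Assumed error code; the diff does not show this return. */
	if (datalen > gso_size * max_segs)
		return -EINVAL;

	return MODEL_GSO;
}
```

For example, with fragsize 1300, hlen 28 and gso_size 1400, a 1000-byte payload yields MODEL_NO_GSO in this model, whereas the pre-patch ordering shown in the removed lines would have rejected it because hlen + gso_size exceeds fragsize.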


[RFC net-next 0/2] netdevgenl: Add an xsk attribute to queues

by Joe Damato

Greetings:

This is an attempt to follow up on something Jakub asked me about [1]: adding an xsk attribute to queues and more clearly documenting which queues are linked to NAPIs...

But:

1. I couldn't pick a good "thing" to expose as "xsk", so I chose 0 or 1. Happy to take suggestions on what might be better to expose for the xsk queue attribute.

2. I created a silly C helper program to create an XDP socket in order to add a new test to queues.py. I'm not particularly good at python programming, so there's probably a better way to do this. Notably, python does not seem to have a socket.AF_XDP, so I needed the C helper to make a socket and bind it to a queue to perform the test.

Tested this on my mlx5 machine and the test seems to pass.

Happy to take any suggestions / feedback on this one; sorry in advance if I missed many obvious better ways to do things.

Thanks,
Joe

[1]: https://lore.kernel.org/netdev/20250113143109.60afa59a@kernel.org/

Joe Damato (2):
  netdev-genl: Add an XSK attribute to queues
  selftests: drv-net: Test queue xsk attribute

 Documentation/netlink/specs/netdev.yaml       | 10 ++-
 include/uapi/linux/netdev.h                   |  1 +
 net/core/netdev-genl.c                        |  6 ++
 tools/include/uapi/linux/netdev.h             |  1 +
 tools/testing/selftests/drivers/.gitignore    |  1 +
 tools/testing/selftests/drivers/net/Makefile  |  3 +
 tools/testing/selftests/drivers/net/queues.py | 32 ++++++-
 .../selftests/drivers/net/xdp_helper.c        | 90 +++++++++++++++++++
 8 files changed, 141 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/drivers/net/xdp_helper.c

base-commit: 0ad9617c78acbc71373fb341a6f75d4012b01d69
-- 
2.25.1


[PATCH v11 00/14] riscv: Add support for xtheadvector

by Charlie Jenkins

xtheadvector is a custom extension that is based upon riscv vector version 0.7.1 [1]. All of the vector routines have been modified to support this alternative vector version based upon whether xtheadvector was determined to be supported at boot.

vlenb is not supported on the existing xtheadvector hardware, so a devicetree property thead,vlenb is added to provide the vlenb to Linux.

There is a new hwprobe key RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 that is used to request which thead vendor extensions are supported on the current platform. This allows future vendors to allocate hwprobe keys for their vendor.

Support for xtheadvector is also added to the vector kselftests.

Signed-off-by: Charlie Jenkins <charlie(a)rivosinc.com>

[1] https://github.com/T-head-Semi/thead-extension-spec/blob/95358cb2cca9489361…

---
This series is a continuation of a different series that was fragmented into two other series in an attempt to get part of it merged in the 6.10 merge window. The split-off series did not get merged due to a NAK on the series that added the generic riscv,vlenb devicetree entry. This series has converted riscv,vlenb to thead,vlenb to remedy this issue.

The original series is titled "riscv: Support vendor extensions and xtheadvector" [3].

I have tested this with an Allwinner Nezha board. I used SkiffOS [1] to manage building the image, but upgraded the U-Boot version to Samuel Holland's more up-to-date version [2] and changed out the device tree used by U-Boot with the device trees that are present in upstream linux and this series. Thank you Samuel for all of the work you did to make this task possible.

[1] https://github.com/skiffos/SkiffOS/tree/master/configs/allwinner/nezha
[2] https://github.com/smaeul/u-boot/commit/2e89b706f5c956a70c989cd31665f1429e9…
[3] https://lore.kernel.org/all/20240503-dev-charlie-support_thead_vector_6_9-v…
[4] https://lore.kernel.org/lkml/20240719-support_vendor_extensions-v3-4-0af758…

---
Changes in v11:
- Fix an issue where the mitigation was not being properly skipped when requested
- Fix vstate_discard issue
- Fix issue when -1 was passed into __riscv_isa_vendor_extension_available()
- Remove some artifacts from being placed in the test directory
- Link to v10: https://lore.kernel.org/r/20240911-xtheadvector-v10-0-8d3930091246@rivosinc…

Changes in v10:
- In DT probing disable vector with new function to clear vendor extension bits for xtheadvector
- Add ghostwrite mitigations for c9xx CPUs. This disables xtheadvector unless mitigations=off is set as a kernel boot arg
- Link to v9: https://lore.kernel.org/r/20240806-xtheadvector-v9-0-62a56d2da5d0@rivosinc.…

Changes in v9:
- Rebase onto palmer's for-next
- Fix sparse error in arch/riscv/kernel/vendor_extensions/thead.c
- Fix maybe-uninitialized warning in arch/riscv/include/asm/vendor_extensions/vendor_hwprobe.h
- Wrap some long lines
- Link to v8: https://lore.kernel.org/r/20240724-xtheadvector-v8-0-cf043168e137@rivosinc.…

Changes in v8:
- Rebase onto palmer's for-next
- Link to v7: https://lore.kernel.org/r/20240724-xtheadvector-v7-0-b741910ada3e@rivosinc.…

Changes in v7:
- Add defs for has_xtheadvector_no_alternatives() and has_xtheadvector() when vector disabled. (Palmer)
- Link to v6: https://lore.kernel.org/r/20240722-xtheadvector-v6-0-c9af0130fa00@rivosinc.…

Changes in v6:
- Fix return type of is_vector_supported()/is_xthead_supported() to be bool
- Link to v5: https://lore.kernel.org/r/20240719-xtheadvector-v5-0-4b485fc7d55f@rivosinc.…

Changes in v5:
- Rebase on for-next
- Link to v4: https://lore.kernel.org/r/20240702-xtheadvector-v4-0-2bad6820db11@rivosinc.…

Changes in v4:
- Replace inline asm with C (Samuel)
- Rename VCSRs to CSRs (Samuel)
- Replace .insn directives with .4byte directives
- Link to v3: https://lore.kernel.org/r/20240619-xtheadvector-v3-0-bff39eb9668e@rivosinc.…

Changes in v3:
- Add back Heiko's signed-off-by (Conor)
- Mark RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 as a bitmask
- Link to v2: https://lore.kernel.org/r/20240610-xtheadvector-v2-0-97a48613ad64@rivosinc.…

Changes in v2:
- Removed extraneous references to "riscv,vlenb" (Jess)
- Moved declaration of "thead,vlenb" into cpus.yaml and added restriction that it's only applicable to thead cores (Conor)
- Check CONFIG_RISCV_ISA_XTHEADVECTOR instead of CONFIG_RISCV_ISA_V for thead,vlenb (Jess)
- Fix naming of hwprobe variables (Evan)
- Link to v1: https://lore.kernel.org/r/20240609-xtheadvector-v1-0-3fe591d7f109@rivosinc.…

---
Charlie Jenkins (13):
      dt-bindings: riscv: Add xtheadvector ISA extension description
      dt-bindings: cpus: add a thead vlen register length property
      riscv: dts: allwinner: Add xtheadvector to the D1/D1s devicetree
      riscv: Add thead and xtheadvector as a vendor extension
      riscv: vector: Use vlenb from DT for thead
      riscv: csr: Add CSR encodings for CSR_VXRM/CSR_VXSAT
      riscv: Add xtheadvector instruction definitions
      riscv: vector: Support xtheadvector save/restore
      riscv: hwprobe: Add thead vendor extension probing
      riscv: hwprobe: Document thead vendor extensions and xtheadvector extension
      selftests: riscv: Fix vector tests
      selftests: riscv: Support xtheadvector in vector tests
      riscv: Add ghostwrite vulnerability

Heiko Stuebner (1):
      RISC-V: define the elements of the VCSR vector CSR

 Documentation/arch/riscv/hwprobe.rst               |  10 +
 Documentation/devicetree/bindings/riscv/cpus.yaml  |  19 ++
 .../devicetree/bindings/riscv/extensions.yaml      |  10 +
 arch/riscv/Kconfig.errata                          |  11 +
 arch/riscv/Kconfig.vendor                          |  26 ++
 arch/riscv/boot/dts/allwinner/sun20i-d1s.dtsi      |   3 +-
 arch/riscv/errata/thead/errata.c                   |  28 ++
 arch/riscv/include/asm/bugs.h                      |  22 ++
 arch/riscv/include/asm/cpufeature.h                |   2 +
 arch/riscv/include/asm/csr.h                       |  15 +
 arch/riscv/include/asm/errata_list.h               |   3 +-
 arch/riscv/include/asm/hwprobe.h                   |   3 +-
 arch/riscv/include/asm/switch_to.h                 |   2 +-
 arch/riscv/include/asm/vector.h                    | 222 +++++++++++----
 arch/riscv/include/asm/vendor_extensions/thead.h   |  47 ++++
 .../include/asm/vendor_extensions/thead_hwprobe.h  |  19 ++
 .../include/asm/vendor_extensions/vendor_hwprobe.h |  37 +++
 arch/riscv/include/uapi/asm/hwprobe.h              |   3 +-
 arch/riscv/include/uapi/asm/vendor/thead.h         |   3 +
 arch/riscv/kernel/Makefile                         |   2 +
 arch/riscv/kernel/bugs.c                           |  60 ++++
 arch/riscv/kernel/cpufeature.c                     |  59 +++-
 arch/riscv/kernel/kernel_mode_vector.c             |   8 +-
 arch/riscv/kernel/process.c                        |   4 +-
 arch/riscv/kernel/signal.c                         |   6 +-
 arch/riscv/kernel/sys_hwprobe.c                    |   5 +
 arch/riscv/kernel/vector.c                         |  24 +-
 arch/riscv/kernel/vendor_extensions.c              |  10 +
 arch/riscv/kernel/vendor_extensions/Makefile       |   2 +
 arch/riscv/kernel/vendor_extensions/thead.c        |  29 ++
 .../riscv/kernel/vendor_extensions/thead_hwprobe.c |  19 ++
 drivers/base/cpu.c                                 |   3 +
 include/linux/cpu.h                                |   1 +
 tools/testing/selftests/riscv/vector/.gitignore    |   3 +-
 tools/testing/selftests/riscv/vector/Makefile      |  17 +-
 .../selftests/riscv/vector/v_exec_initval_nolibc.c |  94 +++++++
 tools/testing/selftests/riscv/vector/v_helpers.c   |  68 +++++
 tools/testing/selftests/riscv/vector/v_helpers.h   |   8 +
 tools/testing/selftests/riscv/vector/v_initval.c   |  22 ++
 .../selftests/riscv/vector/v_initval_nolibc.c      |  68 -----
 .../selftests/riscv/vector/vstate_exec_nolibc.c    |  20 +-
 .../testing/selftests/riscv/vector/vstate_prctl.c  | 305 +++++++++++++---------
 42 files changed, 1051 insertions(+), 271 deletions(-)
---
base-commit: 0eb512779d642b21ced83778287a0f7a3ca8f2a1
change-id: 20240530-xtheadvector-833d3d17b423
-- 
- Charlie


[PATCH bpf-next/net v2 0/7] bpf: Add mptcp_subflow bpf_iter support

by Matthieu Baerts (NGI0)

Here is a series from Geliang, adding mptcp_subflow bpf_iter support.

We are working on extending MPTCP with BPF, e.g. to control the path manager -- in charge of the creation, deletion, and announcements of subflows (paths) -- and the packet scheduler -- in charge of selecting which available path the next data will be sent to. These extensions need to iterate over the list of subflows attached to an MPTCP connection, and do some specific actions via some new kfunc that will be added later on.

This preparation work is split in different patches:

- Patch 1: extend bpf_skc_to_mptcp_sock() to be called with msk.

- Patch 2: allow using skc_to_mptcp_sock() in CGroup sockopt hooks.

- Patch 3: register some "basic" MPTCP kfunc.

- Patch 4: add mptcp_subflow bpf_iter support. Note that previous versions of this single patch have already been shared to the BPF mailing list. The changelog has been kept with a comment, but the version number has been reset to avoid confusion.

- Patch 5: add kfunc to make sure the msk is valid.

- Patch 6: add more MPTCP endpoints in the selftests, in order to create more than 2 subflows.

- Patch 7: add a very simple test validating mptcp_subflow bpf_iter support. This test could be written without the new bpf_iter, but it is there only to make sure this specific feature works as expected.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v2:
- Patches 1-2: new ones.
- Patch 3: remove two kfunc, more restrictions. (Martin)
- Patch 4: add BUILD_BUG_ON(), more restrictions. (Martin)
- Patch 7: adaptations due to modifications in patches 1-4.
- Link to v1: https://lore.kernel.org/r/20241108-bpf-next-net-mptcp-bpf_iter-subflows-v1-…

---
Geliang Tang (7):
      bpf: Extend bpf_skc_to_mptcp_sock to MPTCP sock
      bpf: Allow use of skc_to_mptcp_sock in cg_sockopt
      bpf: Register mptcp common kfunc set
      bpf: Add mptcp_subflow bpf_iter
      bpf: Acquire and release mptcp socket
      selftests/bpf: More endpoints for endpoint_init
      selftests/bpf: Add mptcp_subflow bpf_iter subtest

 include/net/mptcp.h                                |   4 +-
 kernel/bpf/cgroup.c                                |   2 +
 net/core/filter.c                                  |   2 +-
 net/mptcp/bpf.c                                    | 113 +++++++++++++++++-
 tools/testing/selftests/bpf/bpf_experimental.h     |   8 ++
 tools/testing/selftests/bpf/prog_tests/mptcp.c     | 129 ++++++++++++++++++++-
 tools/testing/selftests/bpf/progs/mptcp_bpf.h      |   9 ++
 .../testing/selftests/bpf/progs/mptcp_bpf_iters.c  |  63 ++++++++++
 8 files changed, 318 insertions(+), 12 deletions(-)
---
base-commit: dad704ebe38642cd405e15b9c51263356391355c
change-id: 20241108-bpf-next-net-mptcp-bpf_iter-subflows-027f6d87770e

Best regards,
-- 
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>


[PATCH] udp: gso: fix MTU check for small packets

by Yan Zhai

Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO for small packets. But the kernel currently dismisses GSO requests only after checking MTU on gso_size. This means any packet, regardless of its payload size, would be dropped when the MTU is smaller than the requested gso_size. Meanwhile, EINVAL would be returned in this case, which is very misleading when debugging.

Ideally, do not check any GSO related constraints when the payload size is not larger than the requested gso_size, and return EMSGSIZE on MTU check failure consistently for all packets to ease debugging.

Fixes: 4094871db1d6 ("udp: only do GSO if # of segs > 1")
Signed-off-by: Yan Zhai <yan(a)cloudflare.com>
---
 net/ipv4/udp.c                       | 18 ++++++++----------
 net/ipv6/udp.c                       | 18 ++++++++----------
 tools/testing/selftests/net/udpgso.c | 14 ++++++++++++++
 3 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c472c9a57cf6..9aed1b4a871f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1137,13 +1137,13 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
 	uh->len = htons(len);
 	uh->check = 0;
 
-	if (cork->gso_size) {
+	if (cork->gso_size && datalen > cork->gso_size) {
 		const int hlen = skb_network_header_len(skb) +
 				 sizeof(struct udphdr);
 
 		if (hlen + cork->gso_size > cork->fragsize) {
 			kfree_skb(skb);
-			return -EINVAL;
+			return -EMSGSIZE;
 		}
 		if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) {
 			kfree_skb(skb);
@@ -1158,15 +1158,13 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
 			return -EIO;
 		}
 
-		if (datalen > cork->gso_size) {
-			skb_shinfo(skb)->gso_size = cork->gso_size;
-			skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
-			skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
-								 cork->gso_size);
+		skb_shinfo(skb)->gso_size = cork->gso_size;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
+							 cork->gso_size);
 
-			/* Don't checksum the payload, skb will get segmented */
-			goto csum_partial;
-		}
+		/* Don't checksum the payload, skb will get segmented */
+		goto csum_partial;
 	}
 
 	if (is_udplite)  /* UDP-Lite */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 6671daa67f4f..6cdc8ce4c6f9 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1385,13 +1385,13 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
 	uh->len = htons(len);
 	uh->check = 0;
 
-	if (cork->gso_size) {
+	if (cork->gso_size && datalen > cork->gso_size) {
 		const int hlen = skb_network_header_len(skb) +
 				 sizeof(struct udphdr);
 
 		if (hlen + cork->gso_size > cork->fragsize) {
 			kfree_skb(skb);
-			return -EINVAL;
+			return -EMSGSIZE;
 		}
 		if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) {
 			kfree_skb(skb);
@@ -1406,15 +1406,13 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
 			return -EIO;
 		}
 
-		if (datalen > cork->gso_size) {
-			skb_shinfo(skb)->gso_size = cork->gso_size;
-			skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
-			skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
-								 cork->gso_size);
+		skb_shinfo(skb)->gso_size = cork->gso_size;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
+							 cork->gso_size);
 
-			/* Don't checksum the payload, skb will get segmented */
-			goto csum_partial;
-		}
+		/* Don't checksum the payload, skb will get segmented */
+		goto csum_partial;
 	}
 
 	if (is_udplite)
diff --git a/tools/testing/selftests/net/udpgso.c b/tools/testing/selftests/net/udpgso.c
index 3f2fca02fec5..fb73f1c331fb 100644
--- a/tools/testing/selftests/net/udpgso.c
+++ b/tools/testing/selftests/net/udpgso.c
@@ -102,6 +102,13 @@ struct testcase testcases_v4[] = {
 		.gso_len = CONST_MSS_V4,
 		.r_num_mss = 1,
 	},
+	{
+		/* datalen <= MSS < gso_len: will fall back to no GSO */
+		.tlen = CONST_MSS_V4,
+		.gso_len = CONST_MSS_V4 + 1,
+		.r_num_mss = 0,
+		.r_len_last = CONST_MSS_V4,
+	},
 	{
 		/* send a single MSS + 1B */
 		.tlen = CONST_MSS_V4 + 1,
@@ -205,6 +212,13 @@ struct testcase testcases_v6[] = {
 		.gso_len = CONST_MSS_V6,
 		.r_num_mss = 1,
 	},
+	{
+		/* datalen <= MSS < gso_len: will fall back to no GSO */
+		.tlen = CONST_MSS_V6,
+		.gso_len = CONST_MSS_V6 + 1,
+		.r_num_mss = 0,
+		.r_len_last = CONST_MSS_V6,
+	},
 	{
 		/* send a single MSS + 1B */
 		.tlen = CONST_MSS_V6 + 1,
-- 
2.30.2


[PATCH v4 0/9] mm: workingset reporting

by Yuanchu Xie

This patch series provides workingset reporting of user pages in lruvecs, of which coldness can be tracked by accessed bits and fd references. However, the concept of workingset applies generically to all types of memory, which could be kernel slab caches, discardable userspace caches (databases), or CXL.mem. Therefore, data sources might come from slab shrinkers, device drivers, or the userspace.

Another interesting idea might be hugepage workingset, so that we can measure the proportion of hugepages backing cold memory. However, with architectures like arm, there may be too many hugepage sizes leading to a combinatorial explosion when exporting stats to the userspace.

Nonetheless, the kernel should provide a set of workingset interfaces that is generic enough to accommodate the various use cases, and extensible to potential future use cases.

Use cases
==========
Job scheduling
On overcommitted hosts, workingset information improves efficiency and reliability by allowing the job scheduler to have better stats on the exact memory requirements of each job. This can manifest in efficiency by landing more jobs on the same host or NUMA node. On the other hand, the job scheduler can also ensure each node has a sufficient amount of memory and does not enter direct reclaim or the kernel OOM path. With workingset information and job priority, the userspace OOM killing or proactive reclaim policy can kick in before the system is under memory pressure. If the job shape is very different from the machine shape, knowing the workingset per-node can also help inform page allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively reclaim memory while not impacting a job's performance. While PSI may provide a reactive measure of when a proactive reclaim has reclaimed too much, workingset reporting allows the policy to be more accurate and flexible.

Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to report the guest workingset. Balloon policies benefit from workingset to more precisely determine the size of the memory balloon. On end-user devices where memory is scarce and overcommitted, the balloon sizing in multiple VMs running on the same device can be orchestrated with workingset reports from each one. On the server side, workingset reporting allows the balloon controller to inflate the balloon without causing too much file cache to be reclaimed in the guest.

Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset information can help connect the two and avoid pages being migrated back and forth. For example, given a promotion hot page threshold defined as a reaccess distance of N seconds (promote pages accessed more often than every N seconds), the threshold N should be set so that ~80% (e.g.) of pages on the fast memory node pass the threshold. This calculation can be done with workingset reports. To be directly useful for promotion policies, the workingset report interfaces need to be extended to report hotness and gather hotness information from the devices[1].

[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements…

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main idea here is we break down the workingset per-node per-memcg into time intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

Implementation
==========
The reporting of user pages is based off of MGLRU, and therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more fine-grained workingset report, but we can already gather a lot of data with just four generations. The workingset reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==========
Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux compile and redis benchmarks from openbenchmarking.org. The policy and runner are referred to as WMO (Workload Memory Optimization). The results were based on v3 of the series, but v4 doesn't change the core of the working set reporting and just adds the ballooning counterpart.

The timed Linux kernel compilation benchmark shows improvements in peak memory usage with a policy of "swap out all bytes colder than 10 seconds every 40 seconds". A swapfile is configured on SSD.
--------------------------------------------
peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%
--------------------------------------------
Benchmark                                           | Experimental     | Control        | Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
--------------------------------------------
Seconds, fewer is better

The redis benchmark employs the same policy:
--------------------------------------------
peak memory usage (with WMO): 375.9023 MiB
peak memory usage (control): 509.765 MiB
peak memory reduction: 26%
--------------------------------------------
Benchmark                | Experimental     | Control          | Experimental_Std_Dev | Control_Std_Dev
Redis - LPOP (Reqs/sec)  | 2023130 (98.22%) | 2059849 (100%)   | 1.2%                 | 2%
Redis - SADD (Reqs/sec)  | 2539662 (98.63%) | 2574811 (100%)   | 2.3%                 | 1.4%
Redis - LPUSH (Reqs/sec) | 2024880 (100%)   | 2000884 (98.81%) | 1.1%                 | 0.8%
Redis - GET (Reqs/sec)   | 2835764 (100%)   | 2763722 (97.46%) | 2.7%                 | 1.6%
Redis - SET (Reqs/sec)   | 2340723 (100%)   | 2327372 (99.43%) | 2.4%                 | 1.8%
--------------------------------------------
Reqs/sec, more is better

The detailed report and benchmarking results are in Ghait's repo:
https://github.com/miloudi98/WMO

Changelog
==========
Changes from PATCH v3 -> v4:
- Added documentation for cgroup-v2 (Waiman Long)
- Fixed types in documentation (Randy Dunlap)
- Added implementation for the ballooning use case
- Added detailed description of benchmark results (Andrew Morton)

Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation (Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG

Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case (Muhammad Usama Anjum)
- Included more use cases in the cover letter (Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data to avoid allocating for each lruvec.

[v1] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.co…
[v2] https://lore.kernel.org/linux-mm/20240604020549.1017540-1-yuanchu@google.co…
[v3] https://lore.kernel.org/linux-mm/20240813165619.748102-1-yuanchu@google.com/

Yuanchu Xie (9):
  mm: aggregate workingset information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend workingset reporting to memcgs
  mm: add kernel aging thread for workingset reporting
  selftest: test system-wide workingset reporting
  Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces
  Docs/admin-guide/cgroup-v2: document workingset reporting
  virtio-balloon: add workingset reporting

 Documentation/admin-guide/cgroup-v2.rst       |  35 +
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/workingset_report.rst      | 105 +++
 drivers/base/node.c                           |   6 +
 drivers/virtio/virtio_balloon.c               | 390 ++++++++++-
 include/linux/balloon_compaction.h            |   1 +
 include/linux/memcontrol.h                    |  21 +
 include/linux/mmzone.h                        |  13 +
 include/linux/workingset_report.h             | 167 +++++
 include/uapi/linux/virtio_balloon.h           |  30 +
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  19 +
 mm/memcontrol.c                               | 162 ++++-
 mm/mm_init.c                                  |   2 +
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  56 +-
 mm/workingset_report.c                        | 653 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 ++++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 .../testing/selftests/mm/workingset_report.c  | 306 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  39 ++
 .../selftests/mm/workingset_report_test.c     | 330 +++++++++
 25 files changed, 2482 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

-- 
2.47.0.338.g60cca15819-goog
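
A userspace policy would have to parse the per-interval report shown in the cover letter above. The following is a minimal sketch under stated assumptions, not code from the series: the names ws_bin, ws_parse_line and ws_colder_than are hypothetical, the first field of each line is assumed to be the upper age bound of a bin in milliseconds, and summing every bin whose bound exceeds a threshold is only an approximation of "memory colder than the threshold".

```c
#include <stdio.h>
#include <string.h>

/* One parsed report line (hypothetical type, not from the series). */
struct ws_bin {
	unsigned long long age_ms; /* assumed: upper age bound of the bin */
	unsigned long long anon;   /* anon amount reported for the bin */
	unsigned long long file;   /* file amount reported for the bin */
};

/* Parse "<age_ms> anon=<n> file=<n>"; returns 1 on success, 0 otherwise. */
int ws_parse_line(const char *line, struct ws_bin *bin)
{
	return sscanf(line, "%llu anon=%llu file=%llu",
		      &bin->age_ms, &bin->anon, &bin->file) == 3;
}

/*
 * Walk a newline-separated report and add up anon + file for every bin
 * whose age bound is greater than min_age_ms. Lines that do not parse
 * or do not fit the local buffer are skipped.
 */
unsigned long long ws_colder_than(const char *report,
				  unsigned long long min_age_ms)
{
	unsigned long long total = 0;
	const char *p = report;

	while (*p) {
		struct ws_bin bin;
		char line[128];
		size_t len = strcspn(p, "\n");

		if (len < sizeof(line)) {
			memcpy(line, p, len);
			line[len] = '\0';
			if (ws_parse_line(line, &bin) && bin.age_ms > min_age_ms)
				total += bin.anon + bin.file;
		}
		p += len;
		if (*p == '\n')
			p++;
	}
	return total;
}
```

Fed the five example lines from the cover letter with a threshold of 30000, the function adds up the last two bins only. The unit of the result is whatever unit the report uses; the cover letter does not state it.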

5 months


[PATCH bpf v9 0/5] bpf: fix wrong copied_seq calculation and add tests

by Jiayuan Chen

A previous commit described in this topic
http://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
directly updated 'sk->copied_seq' in the tcp_eat_skb() function when the action of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the update logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser() to ensure the accuracy of the 'fionread' feature.

That commit works for a single stream_verdict scenario, as it also modified 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb' to remove updating 'sk->copied_seq'.

However, for programs where both stream_parser and stream_verdict are active (strparser purpose), tcp_read_sock() was used instead of tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock). tcp_read_sock() still updates 'sk->copied_seq', leading to duplicated updates.

In summary, for strparser + SK_PASS, copied_seq is redundantly calculated in both tcp_read_sock() and tcp_bpf_recvmsg_parser().

The issue causes incorrect copied_seq calculations, which prevent correct data reads from the recv() interface in user-land.

We also added test cases for bpf + strparser and separated them from sockmap_basic, as strparser has more encapsulation and parsing capabilities compared to sockmap.

---
V8 -> V9
https://lore.kernel.org/bpf/20250121050707.55523-1-mrpre@163.com/
Fixed some issues suggested by Jakub Sitnicki.

V7 -> V8
https://lore.kernel.org/bpf/20250116140531.108636-1-mrpre@163.com/
Avoid adding read_sock to psock. (Jakub Sitnicki)
Avoid using a wrapper function to check whether strparser is supported.

V3 -> V7:
https://lore.kernel.org/bpf/20250109094402.50838-1-mrpre@163.com/
https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com/
Avoid introducing new proto_ops. (Jakub Sitnicki)
Add more edge test cases for strparser + bpf.
Fix patchwork fail of test cases code.
Fix psock fetch without rcu lock.
Move code of modifying to tcp_bpf.c.

V1 -> V3:
https://lore.kernel.org/bpf/20241209152740.281125-1-mrpre@163.com/
Fix patchwork fail by adding Fixes tag.
Save skb data offset for ENOMEM. (John Fastabend)
---

Jiayuan Chen (5):
  strparser: add read_sock callback
  bpf: fix wrong copied_seq calculation
  bpf: disable non stream socket for strparser
  selftests/bpf: fix invalid flag of recv()
  selftests/bpf: add strparser test for bpf

 Documentation/networking/strparser.rst | 9 +-
 include/linux/skmsg.h | 2 +
 include/net/strparser.h | 2 +
 include/net/tcp.h | 8 +
 net/core/skmsg.c | 7 +
 net/core/sock_map.c | 5 +-
 net/ipv4/tcp.c | 29 +-
 net/ipv4/tcp_bpf.c | 36 ++
 net/strparser/strparser.c | 11 +-
 .../selftests/bpf/prog_tests/sockmap_basic.c | 59 +--
 .../selftests/bpf/prog_tests/sockmap_strp.c | 454 ++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_strp.c | 53 ++
 12 files changed, 610 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_strp.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_strp.c

--
2.43.5
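The double update described in this cover letter can be pictured with a toy model. The sketch below is not kernel code: `ToySock`, `strparser_pass` and the flag `read_sock_updates_copied_seq` are hypothetical names invented for illustration, and the "unread bytes" estimate is a simplification of what a FIONREAD-style query reports. It only shows why advancing copied_seq in two places for the same bytes breaks the accounting.

```python
from dataclasses import dataclass


@dataclass
class ToySock:
    """Toy stand-in for a TCP socket; not the kernel's struct sock."""
    rcv_nxt: int = 0      # next sequence number expected from the peer
    copied_seq: int = 0   # first sequence number not yet handed to user space

    def receive(self, nbytes: int) -> None:
        # New data arrives from the network.
        self.rcv_nxt += nbytes

    def unread(self) -> int:
        # Simplified FIONREAD-style estimate of bytes still waiting.
        return self.rcv_nxt - self.copied_seq


def strparser_pass(sock: ToySock, nbytes: int,
                   read_sock_updates_copied_seq: bool) -> None:
    """Model strparser + SK_PASS delivering nbytes to the application."""
    # Stage 1: the parser pulls data off the receive queue.
    # In the buggy case this stage already advances copied_seq.
    if read_sock_updates_copied_seq:
        sock.copied_seq += nbytes
    # Stage 2: the recvmsg path hands the data to user space and
    # advances copied_seq (this is the update that should remain).
    sock.copied_seq += nbytes


buggy = ToySock()
buggy.receive(100)
strparser_pass(buggy, 100, read_sock_updates_copied_seq=True)

fixed = ToySock()
fixed.receive(100)
strparser_pass(fixed, 100, read_sock_updates_copied_seq=False)

print(buggy.unread(), fixed.unread())
```

In this model the buggy path ends with copied_seq ahead of rcv_nxt, so the unread estimate goes negative, while the single-update path ends at zero as expected.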

5 months


[PATCH 0/6] Address some issues related to Python version

by Mauro Carvalho Chehab

This series removes compatibility with Python 2.x from scripts that have some backward compatibility logic in them. The rationale is that, since commit 627395716cc3 ("docs: document python version used for compilation"), the minimal Python version was set to 3.x. Also, Python 2.x has been EOL since January 2020.

Patch 1: fix a script that was compatible only with Python 2.x;
Patches 2-4: remove backward-compat code;
Patches 5-6: solve forward-compat with modern Python, which warns about regex escape sequences in strings that lack the "r" prefix.

Mauro Carvalho Chehab (6):
  docs: trace: decode_msr.py: make it compatible with python 3
  tools: perf: exported-sql-viewer: drop support for Python 2
  tools: perf: tools: perf: exported-sql-viewer: drop support for Python 2
  tools: perf: task-analyzer: drop support for Python 2
  tools: selftests/bpf: test_bpftool_synctypes: escape raw symbols
  comedi: convert_csv_to_c.py: use r-string for a regex expression

 Documentation/trace/postprocess/decode_msr.py | 2 +-
 .../ni_routing/tools/convert_csv_to_c.py | 2 +-
 .../scripts/python/exported-sql-viewer.py | 5 ++--
 tools/perf/scripts/python/task-analyzer.py | 23 ++++----------
 tools/perf/tests/shell/lib/attr.py | 6 +---
 .../selftests/bpf/test_bpftool_synctypes.py | 30 +++++++++----------
 6 files changed, 25 insertions(+), 43 deletions(-)

--
2.48.1
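The forward-compat issue behind patches 5-6 can be reproduced in isolation. Python has flagged unrecognized escape sequences such as `\d` in ordinary string literals with a DeprecationWarning since 3.6 and with a SyntaxWarning since 3.12; writing the regex as a raw string avoids the warning. The sketch below is only an illustration: `PATTERN`, `parse_assignment` and `compiles_cleanly` are hypothetical names and are not taken from the patched scripts.

```python
import re
import warnings

# Raw string: the backslashes reach the regex engine untouched and the
# compiler has no escape sequences to complain about.
PATTERN = re.compile(r"^(\w+)\s*=\s*(\d+)$")


def parse_assignment(line):
    """Parse 'name = 123' into ('name', 123); return None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def compiles_cleanly(source):
    """Compile a source expression and report whether no warning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "eval")
    return len(caught) == 0


print(parse_assignment("count = 42"))
# The source text r"\d+" compiles silently; the source text "\d+" does not.
print(compiles_cleanly('r"\\d+"'), compiles_cleanly('"\\d+"'))
```

Both spellings currently produce the same regex, which is why such code keeps working; the raw-string form simply states the intent explicitly and stays quiet on newer interpreters.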

5 months, 1 week


[PATCH v3 0/6] ptrace: introduce PTRACE_SET_SYSCALL_INFO API

by Dmitry V. Levin

PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system calls the tracee is blocked in.

This API allows ptracers to obtain and modify system call details in a straightforward and architecture-agnostic way.

Current implementation supports changing only those bits of system call information that are used by strace, namely, syscall number, syscall arguments, and syscall return value.

Support of changing additional details returned by PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer, could be added later if needed, by using struct ptrace_syscall_info.flags to specify the additional details that should be set. Currently, "flags", "reserved", and "seccomp.reserved2" fields of struct ptrace_syscall_info must be initialized with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are ignored.

PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY, PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations. Other operations could be added later if needed.

Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure to provide an API for changing the first system call argument on riscv architecture [1].

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
       Modify information about the system call that caused the stop.
       The "data" argument is a pointer to struct ptrace_syscall_info
       that specifies the system call information to be set.
       The "addr" argument should be set to sizeof(struct ptrace_syscall_info).

[1] https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/

Notes:
    v3:
    * powerpc: Submit syscall_set_return_value fix for "sc" case separately
    * mips: Do not introduce erroneous argument truncation on mips n32, add a detailed description to the commit message of the mips_get_syscall_arg change
    * ptrace: Add explicit padding to the end of struct ptrace_syscall_info, simplify obtaining of user ptrace_syscall_info, do not introduce PTRACE_SYSCALL_INFO_SIZE_VER0
    * ptrace: Change the return type of ptrace_set_syscall_info_* functions from "unsigned long" to "int"
    * ptrace: Add -ERANGE check to ptrace_set_syscall_info_exit, add comments to -ERANGE checks
    * ptrace: Update comments about supported syscall stops
    * selftests: Extend set_syscall_info test, fix for mips n32
    * Add Tested-by and Reviewed-by

    v2:
    * Add patch to fix syscall_set_return_value() on powerpc
    * Add patch to fix mips_get_syscall_arg() on mips
    * Add syscall_set_return_value() implementation on hexagon
    * Add syscall_set_return_value() invocation to syscall_set_nr() on arm and arm64.
    * Fix syscall_set_nr() and mips_set_syscall_arg() on mips
    * Add a comment to syscall_set_nr() on arc, powerpc, s390, sh, and sparc
    * Remove redundant ptrace_syscall_info.op assignments in ptrace_get_syscall_info_*
    * Minor style tweaks in ptrace_get_syscall_info_op()
    * Remove syscall_set_return_value() invocation from ptrace_set_syscall_info_entry()
    * Skip syscall_set_arguments() invocation in case of syscall number -1 in ptrace_set_syscall_info_entry()
    * Split ptrace_syscall_info.reserved into ptrace_syscall_info.reserved and ptrace_syscall_info.flags
    * Use __kernel_ulong_t instead of unsigned long in set_syscall_info test

Dmitry V. Levin (6):
  mips: fix mips_get_syscall_arg() for o32
  syscall.h: add syscall_set_arguments() and syscall_set_return_value()
  syscall.h: introduce syscall_set_nr()
  ptrace_get_syscall_info: factor out ptrace_get_syscall_info_op
  ptrace: introduce PTRACE_SET_SYSCALL_INFO request
  selftests/ptrace: add a test case for PTRACE_SET_SYSCALL_INFO

 arch/arc/include/asm/syscall.h | 25 +
 arch/arm/include/asm/syscall.h | 37 ++
 arch/arm64/include/asm/syscall.h | 29 +
 arch/csky/include/asm/syscall.h | 13 +
 arch/hexagon/include/asm/syscall.h | 21 +
 arch/loongarch/include/asm/syscall.h | 15 +
 arch/m68k/include/asm/syscall.h | 7 +
 arch/microblaze/include/asm/syscall.h | 7 +
 arch/mips/include/asm/syscall.h | 70 ++-
 arch/nios2/include/asm/syscall.h | 16 +
 arch/openrisc/include/asm/syscall.h | 13 +
 arch/parisc/include/asm/syscall.h | 19 +
 arch/powerpc/include/asm/syscall.h | 20 +
 arch/riscv/include/asm/syscall.h | 16 +
 arch/s390/include/asm/syscall.h | 24 +
 arch/sh/include/asm/syscall_32.h | 24 +
 arch/sparc/include/asm/syscall.h | 22 +
 arch/um/include/asm/syscall-generic.h | 19 +
 arch/x86/include/asm/syscall.h | 43 ++
 arch/xtensa/include/asm/syscall.h | 18 +
 include/asm-generic/syscall.h | 30 +
 include/uapi/linux/ptrace.h | 7 +-
 kernel/ptrace.c | 179 +++++-
 tools/testing/selftests/ptrace/Makefile | 2 +-
 .../selftests/ptrace/set_syscall_info.c | 514 ++++++++++++++++++
 25 files changed, 1143 insertions(+), 47 deletions(-)
 create mode 100644 tools/testing/selftests/ptrace/set_syscall_info.c

--
ldv

5 months, 1 week

