I had cause to look at the vfork() support for GCS and realised that we
don't have any direct test coverage, this series does so by adding
vfork() to nolibc and then using that in basic-gcs to provide some
simple vfork() coverage.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Add replacement of ifdef with if defined() in nolibc since the code
doesn't reflect the coding style.
- Remove check for arch specific vfork().
- Link to v1: https://lore.kernel.org/r/20250609-arm64-gcs-vfork-exit-v1-0-baad0f085747@k…
---
Mark Brown (4):
tools/nolibc: Replace ifdef with if defined() in sys.h
tools/nolibc: Provide vfork()
kselftest/arm64: Add a test for vfork() with GCS
selftests/nolibc: Add coverage of vfork()
tools/include/nolibc/sys.h | 57 +++++++++++++++++-------
tools/testing/selftests/arm64/gcs/basic-gcs.c | 63 +++++++++++++++++++++++++++
tools/testing/selftests/nolibc/nolibc-test.c | 23 ++++++++--
3 files changed, 124 insertions(+), 19 deletions(-)
---
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
change-id: 20250528-arm64-gcs-vfork-exit-4a7daf7652ee
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Of these:
7 "for a while" typos
5 "take a while" typos
1 misreading of "once in a while"?
3 awhiles used correctly remain in the tree
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli(a)nabijaczleweli.xyz>
---
Documentation/trace/histogram.rst | 2 +-
arch/sh/drivers/pci/common.c | 2 +-
arch/sh/drivers/pci/pci-sh7780.c | 2 +-
drivers/atm/lanai.c | 2 +-
drivers/md/bcache/bcache.h | 2 +-
drivers/md/bcache/request.c | 2 +-
drivers/net/ethernet/google/gve/gve_rx_dqo.c | 2 +-
drivers/scsi/hpsa.c | 2 +-
drivers/tty/serial/jsm/jsm_neo.c | 2 +-
fs/ocfs2/dlm/dlmrecovery.c | 2 +-
sound/pci/emu10k1/emu10k1_main.c | 2 +-
sound/pci/emu10k1/emupcm.c | 2 +-
tools/testing/selftests/powerpc/tm/tm-tmspr.c | 2 +-
13 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 0aada18c38c6..2b98c1720a54 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -249,7 +249,7 @@ Extended error information
table, it should keep a running total of the number of bytes
requested by that call_site.
- We'll let it run for awhile and then dump the contents of the 'hist'
+ We'll let it run for a while and then dump the contents of the 'hist'
file in the kmalloc event's subdirectory (for readability, a number
of entries have been omitted)::
diff --git a/arch/sh/drivers/pci/common.c b/arch/sh/drivers/pci/common.c
index 9633b6147a05..f95004c67e6c 100644
--- a/arch/sh/drivers/pci/common.c
+++ b/arch/sh/drivers/pci/common.c
@@ -148,7 +148,7 @@ unsigned int pcibios_handle_status_errors(unsigned long addr,
cmd |= PCI_STATUS_PARITY | PCI_STATUS_DETECTED_PARITY;
- /* Now back off of the IRQ for awhile */
+ /* Now back off of the IRQ for a while */
if (hose->err_irq) {
disable_irq_nosync(hose->err_irq);
hose->err_timer.expires = jiffies + HZ;
diff --git a/arch/sh/drivers/pci/pci-sh7780.c b/arch/sh/drivers/pci/pci-sh7780.c
index 9a624a6ee354..f41d6939a3d9 100644
--- a/arch/sh/drivers/pci/pci-sh7780.c
+++ b/arch/sh/drivers/pci/pci-sh7780.c
@@ -153,7 +153,7 @@ static irqreturn_t sh7780_pci_serr_irq(int irq, void *dev_id)
/* Deassert SERR */
__raw_writel(SH4_PCIINTM_SDIM, hose->reg_base + SH4_PCIINTM);
- /* Back off the IRQ for awhile */
+ /* Back off the IRQ for a while */
disable_irq_nosync(irq);
hose->serr_timer.expires = jiffies + HZ;
add_timer(&hose->serr_timer);
diff --git a/drivers/atm/lanai.c b/drivers/atm/lanai.c
index 2a1fe3080712..0dfa2cdc897c 100644
--- a/drivers/atm/lanai.c
+++ b/drivers/atm/lanai.c
@@ -755,7 +755,7 @@ static void lanai_shutdown_rx_vci(const struct lanai_vcc *lvcc)
/* Shutdown transmitting on card.
* Unfortunately the lanai needs us to wait until all the data
* drains out of the buffer before we can dealloc it, so this
- * can take awhile -- up to 370ms for a full 128KB buffer
+ * can take a while -- up to 370ms for a full 128KB buffer
* assuming everone else is quiet. In theory the time is
* boundless if there's a CBR VCC holding things up.
*/
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1d33e40d26ea..7318d9800370 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -499,7 +499,7 @@ struct gc_stat {
* won't automatically reattach).
*
* CACHE_SET_STOPPING always gets set first when we're closing down a cache set;
- * we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e.
+ * we'll continue to run normally for a while with CACHE_SET_STOPPING set (i.e.
* flushing dirty data).
*
* CACHE_SET_RUNNING means all cache devices have been registered and journal
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index af345dc6fde1..87b4341cb42c 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -257,7 +257,7 @@ static CLOSURE_CALLBACK(bch_data_insert_start)
/*
* But if it's not a writeback write we'd rather just bail out if
- * there aren't any buckets ready to write to - it might take awhile and
+ * there aren't any buckets ready to write to - it might take a while and
* we might be starving btree writes for gc or something.
*/
diff --git a/drivers/net/ethernet/google/gve/gve_rx_dqo.c b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
index dcb0545baa50..6a0be54f1c81 100644
--- a/drivers/net/ethernet/google/gve/gve_rx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_rx_dqo.c
@@ -608,7 +608,7 @@ static int gve_rx_dqo(struct napi_struct *napi, struct gve_rx_ring *rx,
buf_len = compl_desc->packet_len;
hdr_len = compl_desc->header_len;
- /* Page might have not been used for awhile and was likely last written
+ /* Page might have not been used for a while and was likely last written
* by a different thread.
*/
if (rx->dqo.page_pool) {
diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index c73a71ac3c29..0066f15153a7 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7795,7 +7795,7 @@ static int hpsa_wait_for_mode_change_ack(struct ctlr_info *h)
u32 doorbell_value;
unsigned long flags;
- /* under certain very rare conditions, this can take awhile.
+ /* under certain very rare conditions, this can take a while.
* (e.g.: hot replace a failed 144GB drive in a RAID 5 set right
* as we enter this code.)
*/
diff --git a/drivers/tty/serial/jsm/jsm_neo.c b/drivers/tty/serial/jsm/jsm_neo.c
index e8e13bf056e2..2eb9ff26d6e8 100644
--- a/drivers/tty/serial/jsm/jsm_neo.c
+++ b/drivers/tty/serial/jsm/jsm_neo.c
@@ -1189,7 +1189,7 @@ static irqreturn_t neo_intr(int irq, void *voidbrd)
/*
* The UART triggered us with a bogus interrupt type.
* It appears the Exar chip, when REALLY bogged down, will throw
- * these once and awhile.
+ * these periodically.
* Its harmless, just ignore it and move on.
*/
jsm_dbg(INTR, &brd->pci_dev,
diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 67fc62a49a76..00f52812dbb0 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2632,7 +2632,7 @@ static int dlm_pick_recovery_master(struct dlm_ctxt *dlm)
dlm_reco_master_ready(dlm),
msecs_to_jiffies(1000));
if (!dlm_reco_master_ready(dlm)) {
- mlog(0, "%s: reco master taking awhile\n",
+ mlog(0, "%s: reco master taking a while\n",
dlm->name);
goto again;
}
diff --git a/sound/pci/emu10k1/emu10k1_main.c b/sound/pci/emu10k1/emu10k1_main.c
index bbe252b8916c..6050201851b1 100644
--- a/sound/pci/emu10k1/emu10k1_main.c
+++ b/sound/pci/emu10k1/emu10k1_main.c
@@ -606,7 +606,7 @@ static int snd_emu10k1_ecard_init(struct snd_emu10k1 *emu)
/* Step 2: Calibrate the ADC and DAC */
snd_emu10k1_ecard_write(emu, EC_DACCAL | EC_LEDN | EC_TRIM_CSN);
- /* Step 3: Wait for awhile; XXX We can't get away with this
+ /* Step 3: Wait for a while; XXX We can't get away with this
* under a real operating system; we'll need to block and wait that
* way. */
snd_emu10k1_wait(emu, 48000);
diff --git a/sound/pci/emu10k1/emupcm.c b/sound/pci/emu10k1/emupcm.c
index 1bf6e3d652f8..ca4b03317539 100644
--- a/sound/pci/emu10k1/emupcm.c
+++ b/sound/pci/emu10k1/emupcm.c
@@ -991,7 +991,7 @@ static snd_pcm_uframes_t snd_emu10k1_capture_pointer(struct snd_pcm_substream *s
if (!epcm->running)
return 0;
if (epcm->first_ptr) {
- udelay(50); /* hack, it takes awhile until capture is started */
+ udelay(50); /* hack, it takes a while until capture is started */
epcm->first_ptr = 0;
}
ptr = snd_emu10k1_ptr_read(emu, epcm->capture_idx_reg, 0) & 0x0000ffff;
diff --git a/tools/testing/selftests/powerpc/tm/tm-tmspr.c b/tools/testing/selftests/powerpc/tm/tm-tmspr.c
index dd5ddffa28b7..0d64988ffb40 100644
--- a/tools/testing/selftests/powerpc/tm/tm-tmspr.c
+++ b/tools/testing/selftests/powerpc/tm/tm-tmspr.c
@@ -14,7 +14,7 @@
* (1) create more threads than cpus
* (2) in each thread:
* (a) set TFIAR and TFHAR a unique value
- * (b) loop for awhile, continually checking to see if
+ * (b) loop for a while, continually checking to see if
* either register has been corrupted.
*
* (3) Loop:
--
2.39.5
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v10 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v11 (04-Jul-2025)
- Fix compilation issue of some intermediate patches in v10
v10 (03-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (6):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: Add wait_third_ack flag for ECN negotiation in simultaneous
connect
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 13 +
include/linux/tcp.h | 33 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 648 ++++++++++++++++++
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 29 +-
net/ipv4/tcp_input.c | 371 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 297 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1429 insertions(+), 187 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
V12:
- Fixed YAML indentation in devlink.yaml.
- Removed unused total variable from devlink_nl_rate_tc_bw_set().
- Quoted shell variables in devlink.sh and split declarations to fix
shellcheck warnings.
- Added missing DevlinkFamily imports in selftests to fix pylint
warnings.
- Pulled changes from net-next to enable these adjustments:
Inclusion of DevlinkFamily in YNL test libs.
Introduction of nlmsg_for_each_attr_type() macro and its use in nfsd.
V11:
- Refactored the devlink code to accept relative TC bandwidth share
values instead of percentages.
- Updated documentation to clarify that values are interpreted as
relative shares.
- Refactored the logic in mlx5 to support proportional scaling for
tc-bw values.
- Switched to `nlmsg_for_each_attr_type()` for cleaner attribute
parsing.
- Added a hardware selftest to validate TC bandwidth behavior.
- Refactored esw_qos_is_node_empty for readability.
V10:
- Added netdevsim selftest for tc-bw ops.
- Dropped header: field as it’s unnecessary for local constants in
devlink.yaml.
V9:
- Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h.
- Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout
the code.
- Updated devlink-rate-tc-index-max spec to reference the correct UAPI
header.
V8:
- Limit line width to 80 characters in mlx5 changes instead of 100.
- Increase the scheduling node levels to support TC arbitration.
- Ensure parent nodes are set correctly in all code paths that extend
the hierarchy depth for TC arbitration.
- Extended the cover letter with the ongoing discussion on devlink-rate
and net-shapers.
- Extended the cover letter with the Netdev talk link on this series.
V7:
- Fixed disabling tc-bw on leaf nodes that did not have tc-bw
configured.
- Fixed an issue where tc-bw was disabled on a node with assigned
vports, ensuring that vport->qos.sched_node->parent is correctly
updated with the cloned node.
- Declared a constant for the maximum allowed Traffic Class index in
devlink rate.
- Added a range check to validate rate-tc-index.
- Added documentation for the tc-bw argument.
- Add a validation check to ensure that the total bandwidth assigned to
all traffic classes sums to 100.
V6:
- Addressed comments on devlink patch #3.
- Removed first 4 IFC patches, to be pulled from mlx5-next.
V5:
- Fix warning in devlink_nl_rate_tc_bw_set().
- Fix target branch of patch #4.
V4:
- Renamed the nested attribute for traffic class bandwidth to
DEVLINK_ATTR_RATE_TC_BWS.
- Changed the order of the attributes in `devlink.h`.
- Refactored the initialization tc-bw array in
devlink_nl_rate_tc_bw_set().
- Added extack messages to provide clear feedback on issues with tc-bw
arguments.
- Updated `rate-tc-bws` to support a multi-attr set, where each
attribute includes an index and the corresponding bandwidth for that
traffic class.
- Handled the issue where the user could provide
DEVLINK_ATTR_RATE_TC_BWS with duplicate indices.
- Provided ynl exmaples in patch [1/5] commit message.
- Take IFC patches to beginning of the series, targeted for mlx5-next.
V3:
- Dropped rate-tc-index, using tc-bw array index instead.
- Renamed rate-bw to rate-tc-bw.
- Documneted what the rate-tc-bw represents and added a range check for
validation.
- Intorduced devlink_nl_rate_tc_bw_set() to parse and set the TC
bandwidth values.
- Updated the user API in the commit message of patch 1/6 to ensure
bandwidths sum equals 100.
- Fixed missing filling of rate-parent in devlink_nl_rate_fill().
V2:
- Included <linux/dcbnl.h> in devlink.h to resolve missing
IEEE_8021QAZ_MAX_TCS definition.
- Refactored the rate-tc-bw attribute structure to use a separate
rate-tc-index.
- Updated patch 2/6 title.
This patch series extends the devlink-rate API to support traffic class
(TC) bandwidth management, enabling more granular control over traffic
shaping and rate limiting across multiple TCs. The API now allows users
to specify bandwidth proportions for different traffic classes in a
single command. This is particularly useful for managing Enhanced
Transmission Selection (ETS) for groups of Virtual Functions (VFs),
allowing precise bandwidth allocation across traffic classes.
Additionally the series refines the QoS handling in net/mlx5 to support
TC arbitration and bandwidth management on vports and rate nodes.
Discussions on traffic class shaping in net-shapers began in V5 [1],
where we discussed with maintainers whether net-shapers should support
traffic classes and how this could be implemented.
Later, after further conversations with Paolo Abeni and Simon Horman,
Cosmin provided an update [2], confirming that net-shapers' tree-based
hierarchy aligns well with traffic classes when treated as distinct
subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX
queues and traffic classes, this approach seems feasible, though some
open questions remain regarding queue reconfiguration and certain mlx5
scheduling behaviors.
Building on that discussion, Cosmin has now shared a concrete
implementation plan on the netdev mailing list [3]. The plan, developed
in collaboration with Paolo and Simon, outlines how net-shapers can be
extended to support the same use cases currently covered by
devlink-rate, with the eventual goal of aligning both and simplifying
the shaping infrastructure in the kernel.
This work was presented at Netdev 0x19 in Zagreb [4].
There we presented how TC scheduling is enforced in mlx5 hardware,
which led to discussions on the mailing list.
A summary of how things work:
Classification means labeling a packet with a traffic class based on
the packet's DSCP or VLAN PCP field, then treating packets with
different traffic classes differently during transmit processing.
In a virtualized setup, VFs are untrusted and do not control
classification or shaping. Classification is done by the hardware using
a prio-to-TC mapping set by the hypervisor. VFs only select which send
queue to use and are expected to respect the classification logic by
sending each traffic class on its dedicated queue. As stated in the
net-shapers plan [3], each transmit queue should carry only a single
traffic class. Mixing classes in a single queue can lead to HOL
blocking.
In the mlx5 implementation, if the queue used does not match the
classified traffic class, the hardware moves the queue to the correct
TC scheduler. This movement is not a reclassification; it’s a necessary
enforcement step to ensure traffic class isolation is maintained.
Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class
bandwidth management
Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects
Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS
Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for
tc-bw
[1]
https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/
[2]
https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.cam…
[3]
https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.cam…
[4]
https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-…
Carolina Jubran (8):
netlink: introduce type-checking attribute iteration for nlmsg
devlink: Extend devlink rate API with traffic classes bandwidth management
selftest: netdevsim: Add devlink rate tc-bw test
net/mlx5: Add no-op implementation for setting tc-bw on rate objects
net/mlx5: Add support for setting tc-bw on nodes
net/mlx5: Add traffic class scheduling support for vport QoS
net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw
selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution
Documentation/netlink/specs/devlink.yaml | 32 +-
.../networking/devlink/devlink-port.rst | 8 +
.../net/ethernet/mellanox/mlx5/core/devlink.c | 2 +
.../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++-
.../net/ethernet/mellanox/mlx5/core/esw/qos.h | 8 +
.../net/ethernet/mellanox/mlx5/core/eswitch.h | 14 +-
drivers/net/netdevsim/dev.c | 43 +
drivers/net/netdevsim/netdevsim.h | 1 +
drivers/net/vxlan/vxlan_vnifilter.c | 13 +-
fs/nfsd/nfsctl.c | 36 +-
include/net/devlink.h | 8 +
include/net/netlink.h | 14 +
include/uapi/linux/devlink.h | 9 +
net/devlink/netlink_gen.c | 15 +-
net/devlink/netlink_gen.h | 1 +
net/devlink/rate.c | 127 ++
.../drivers/net/hw/devlink_rate_tc_bw.py | 466 ++++++++
.../drivers/net/hw/lib/py/__init__.py | 2 +-
.../selftests/drivers/net/lib/py/__init__.py | 2 +-
.../drivers/net/netdevsim/devlink.sh | 53 +
.../testing/selftests/net/lib/py/__init__.py | 2 +-
tools/testing/selftests/net/lib/py/ynl.py | 5 +
22 files changed, 1825 insertions(+), 73 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py
base-commit: 20a0c20f82acf46d5731a11743e7c7ac4de25db8
--
2.34.1