The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action which enforced time-based access control per stream. A stream as seen by this switch is identified by {MAC DA, VID}.
We use the standard forwarding selftest topology with 2 host interfaces and 2 switch interfaces. The host ports must require timestamping non-IP packets and supporting tc-etf offload, for isochron to work. The isochron program monitors network sync status (ptp4l, phc2sys) and deterministically transmits packets to the switch such that the tc-gate action either (a) always accepts them based on its schedule, or (b) always drops them.
I tried to keep as much of the logic that isn't specific to the NXP LS1028A in a new tsn_lib.sh, for future reuse. This covers synchronization using ptp4l and phc2sys, and isochron.
The cycle-time chosen for this selftest isn't particularly impressive (and the focus is the functionality of the switch), but I didn't really know what to do better, considering that it will mostly be run during debugging sessions, various kernel bloatware would be enabled, like lockdep, KASAN, etc, and we certainly can't run any races with those on.
I tried to look through the kselftest framework for other real time applications and didn't really find any, so I'm not sure how better to prepare the environment in case we want to go for a lower cycle time. At the moment, the only thing the selftest is ensuring is that dynamic frequency scaling is disabled on the CPU that isochron runs on. It would probably be useful to have a blacklist of kernel config options (checked through zcat /proc/config.gz) and some cyclictest scripts to run beforehand, but I saw none of those.
Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com --- v1->v2: - fix an off-by-one bug introduced at the last minute regarding which tc-mqprio queue was used for tc-etf and SO_TXTIME - introduce debugging for packets incorrectly received / incorrectly dropped based on "isochron report" - make the tsn_lib.sh dependency on isochron and linuxptp optional via REQUIRE_ISOCHRON and REQUIRE_LINUXPTP - avoid errors when CONFIG_CPU_FREQ is disabled - consistently use SCHED_FIFO instead of SCHED_RR for the isochron receiver
.../selftests/drivers/net/ocelot/psfp.sh | 327 ++++++++++++++++++ .../selftests/net/forwarding/tsn_lib.sh | 235 +++++++++++++ 2 files changed, 562 insertions(+) create mode 100755 tools/testing/selftests/drivers/net/ocelot/psfp.sh create mode 100644 tools/testing/selftests/net/forwarding/tsn_lib.sh
diff --git a/tools/testing/selftests/drivers/net/ocelot/psfp.sh b/tools/testing/selftests/drivers/net/ocelot/psfp.sh new file mode 100755 index 000000000000..5a5cee92c665 --- /dev/null +++ b/tools/testing/selftests/drivers/net/ocelot/psfp.sh @@ -0,0 +1,327 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright 2021-2022 NXP + +# Note: On LS1028A, in lack of enough user ports, this setup requires patching +# the device tree to use the second CPU port as a user port + +WAIT_TIME=1 +NUM_NETIFS=4 +STABLE_MAC_ADDRS=yes +NETIF_CREATE=no +lib_dir=$(dirname $0)/../../../net/forwarding +source $lib_dir/tc_common.sh +source $lib_dir/lib.sh +source $lib_dir/tsn_lib.sh + +UDS_ADDRESS_H1="/var/run/ptp4l_h1" +UDS_ADDRESS_SWP1="/var/run/ptp4l_swp1" + +# Tunables +NUM_PKTS=1000 +STREAM_VID=100 +STREAM_PRIO=6 +# Use a conservative cycle of 10 ms to allow the test to still pass when the +# kernel has some extra overhead like lockdep etc +CYCLE_TIME_NS=10000000 +# Create two Gate Control List entries, one OPEN and one CLOSE, of equal +# durations +GATE_DURATION_NS=$((${CYCLE_TIME_NS} / 2)) +# Give 2/3 of the cycle time to user space and 1/3 to the kernel +FUDGE_FACTOR=$((${CYCLE_TIME_NS} / 3)) +# Shift the isochron base time by half the gate time, so that packets are +# always received by swp1 close to the middle of the time slot, to minimize +# inaccuracies due to network sync +SHIFT_TIME_NS=$((${GATE_DURATION_NS} / 2)) + +h1=${NETIFS[p1]} +swp1=${NETIFS[p2]} +swp2=${NETIFS[p3]} +h2=${NETIFS[p4]} + +H1_IPV4="192.0.2.1" +H2_IPV4="192.0.2.2" +H1_IPV6="2001:db8:1::1" +H2_IPV6="2001:db8:1::2" + +# Chain number exported by the ocelot driver for +# Per-Stream Filtering and Policing filters +PSFP() +{ + echo 30000 +} + +psfp_chain_create() +{ + local if_name=$1 + + tc qdisc add dev $if_name clsact + + tc filter add dev $if_name ingress chain 0 pref 49152 flower \ + skip_sw action goto chain $(PSFP) +} + +psfp_chain_destroy() +{ + local if_name=$1 + + tc qdisc del dev $if_name clsact +} + +psfp_filter_check() +{ + local expected=$1 + local packets="" + local drops="" + local stats="" + + stats=$(tc -j -s filter show dev ${swp1} ingress chain $(PSFP) pref 1) + packets=$(echo ${stats} | jq ".[1].options.actions[].stats.packets") + drops=$(echo ${stats} | jq ".[1].options.actions[].stats.drops") + + if ! [ "${packets}" = "${expected}" ]; then + printf "Expected filter to match on %d packets but matched on %d instead\n" \ + "${expected}" "${packets}" + fi + + echo "Hardware filter reports ${drops} drops" +} + +h1_create() +{ + simple_if_init $h1 $H1_IPV4/24 $H1_IPV6/64 +} + +h1_destroy() +{ + simple_if_fini $h1 $H1_IPV4/24 $H1_IPV6/64 +} + +h2_create() +{ + simple_if_init $h2 $H2_IPV4/24 $H2_IPV6/64 +} + +h2_destroy() +{ + simple_if_fini $h2 $H2_IPV4/24 $H2_IPV6/64 +} + +switch_create() +{ + local h2_mac_addr=$(mac_get $h2) + + ip link set ${swp1} up + ip link set ${swp2} up + + ip link add br0 type bridge vlan_filtering 1 + ip link set ${swp1} master br0 + ip link set ${swp2} master br0 + ip link set br0 up + + bridge vlan add dev ${swp2} vid ${STREAM_VID} + bridge vlan add dev ${swp1} vid ${STREAM_VID} + # PSFP on Ocelot requires the filter to also be added to the bridge + # FDB, and not be removed + bridge fdb add dev ${swp2} \ + ${h2_mac_addr} vlan ${STREAM_VID} static master + + psfp_chain_create ${swp1} + + tc filter add dev ${swp1} ingress chain $(PSFP) pref 1 \ + protocol 802.1Q flower skip_sw \ + dst_mac ${h2_mac_addr} vlan_id ${STREAM_VID} \ + action gate base-time 0.000000000 \ + sched-entry OPEN ${GATE_DURATION_NS} -1 -1 \ + sched-entry CLOSE ${GATE_DURATION_NS} -1 -1 +} + +switch_destroy() +{ + psfp_chain_destroy ${swp1} + ip link del br0 +} + +txtime_setup() +{ + local if_name=$1 + + tc qdisc add dev ${if_name} clsact + # Classify PTP on TC 7 and isochron on TC 6 + tc filter add dev ${if_name} egress protocol 0x88f7 \ + flower action skbedit priority 7 + tc filter add dev ${if_name} egress protocol 802.1Q \ + flower vlan_ethtype 0xdead action skbedit priority 6 + tc qdisc add dev ${if_name} handle 100: parent root mqprio num_tc 8 \ + queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ + map 0 1 2 3 4 5 6 7 \ + hw 1 + # Set up TC 6 for SO_TXTIME. tc-mqprio queues count from 1. + tc qdisc replace dev ${if_name} parent 100:$((${STREAM_PRIO} + 1)) etf \ + clockid CLOCK_TAI offload delta ${FUDGE_FACTOR} +} + +txtime_cleanup() +{ + local if_name=$1 + + tc qdisc del dev ${if_name} root + tc qdisc del dev ${if_name} clsact +} + +setup_prepare() +{ + vrf_prepare + + h1_create + h2_create + switch_create + + txtime_setup ${h1} + + # Set up swp1 as a master PHC for h1, synchronized to the local + # CLOCK_REALTIME. + phc2sys_start ${swp1} ${UDS_ADDRESS_SWP1} + + # Assumption true for LS1028A: h1 and h2 use the same PHC. So by + # synchronizing h1 to swp1 via PTP, h2 is also implicitly synchronized + # to swp1 (and both to CLOCK_REALTIME). + ptp4l_start ${h1} true ${UDS_ADDRESS_H1} + ptp4l_start ${swp1} false ${UDS_ADDRESS_SWP1} + + # Make sure there are no filter matches at the beginning of the test + psfp_filter_check 0 +} + +cleanup() +{ + pre_cleanup + + ptp4l_stop ${swp1} + ptp4l_stop ${h1} + phc2sys_stop + isochron_recv_stop + + txtime_cleanup ${h1} + + h2_destroy + h1_destroy + switch_destroy + + vrf_cleanup +} + +debug_incorrectly_dropped_packets() +{ + local isochron_dat=$1 + local dropped_seqids + local seqid + + echo "Packets incorrectly dropped:" + + dropped_seqids=$(isochron report \ + --input-file "${isochron_dat}" \ + --printf-format "%u RX hw %T\n" \ + --printf-args "qR" | \ + grep 'RX hw 0.000000000' | \ + awk '{print $1}') + + for seqid in ${dropped_seqids}; do + isochron report \ + --input-file "${isochron_dat}" \ + --start ${seqid} --stop ${seqid} \ + --printf-format "seqid %u scheduled for %T, HW TX timestamp %T\n" \ + --printf-args "qST" + done +} + +debug_incorrectly_received_packets() +{ + local isochron_dat=$1 + + echo "Packets incorrectly received:" + + isochron report \ + --input-file "${isochron_dat}" \ + --printf-format "seqid %u scheduled for %T, HW TX timestamp %T, HW RX timestamp %T\n" \ + --printf-args "qSTR" | + grep -v 'HW RX timestamp 0.000000000' +} + +run_test() +{ + local base_time=$1 + local expected=$2 + local test_name=$3 + local debug=$4 + local isochron_dat="$(mktemp)" + local extra_args="" + local received + + isochron_do \ + "${h1}" \ + "${h2}" \ + "${UDS_ADDRESS_H1}" \ + "" \ + "${base_time}" \ + "${CYCLE_TIME_NS}" \ + "${SHIFT_TIME_NS}" \ + "${NUM_PKTS}" \ + "${STREAM_VID}" \ + "${STREAM_PRIO}" \ + "" \ + "${isochron_dat}" + + # Count all received packets by looking at the non-zero RX timestamps + received=$(isochron report \ + --input-file "${isochron_dat}" \ + --printf-format "%u\n" --printf-args "R" | \ + grep -w -v '0' | wc -l) + + if [ "${received}" = "${expected}" ]; then + RET=0 + else + RET=1 + echo "Expected isochron to receive ${expected} packets but received ${received}" + fi + + log_test "${test_name}" + + if [ "$RET" = "1" ]; then + ${debug} "${isochron_dat}" + fi + + rm ${isochron_dat} 2> /dev/null +} + +test_gate_in_band() +{ + # Send packets in-band with the OPEN gate entry + run_test 0.000000000 ${NUM_PKTS} "In band" \ + debug_incorrectly_dropped_packets + + psfp_filter_check ${NUM_PKTS} +} + +test_gate_out_of_band() +{ + # Send packets in-band with the CLOSE gate entry + run_test 0.005000000 0 "Out of band" \ + debug_incorrectly_received_packets + + psfp_filter_check $((2 * ${NUM_PKTS})) +} + +trap cleanup EXIT + +ALL_TESTS=" + test_gate_in_band + test_gate_out_of_band +" + +setup_prepare +setup_wait + +tests_run + +exit $EXIT_STATUS diff --git a/tools/testing/selftests/net/forwarding/tsn_lib.sh b/tools/testing/selftests/net/forwarding/tsn_lib.sh new file mode 100644 index 000000000000..60a1423e8116 --- /dev/null +++ b/tools/testing/selftests/net/forwarding/tsn_lib.sh @@ -0,0 +1,235 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright 2021-2022 NXP + +REQUIRE_ISOCHRON=${REQUIRE_ISOCHRON:=yes} +REQUIRE_LINUXPTP=${REQUIRE_LINUXPTP:=yes} + +# Tunables +UTC_TAI_OFFSET=37 +ISOCHRON_CPU=1 + +if [[ "$REQUIRE_ISOCHRON" = "yes" ]]; then + # https://github.com/vladimiroltean/tsn-scripts + # WARNING: isochron versions pre-1.0 are unstable, + # always use the latest version + require_command isochron +fi +if [[ "$REQUIRE_LINUXPTP" = "yes" ]]; then + require_command phc2sys + require_command ptp4l +fi + +phc2sys_start() +{ + local if_name=$1 + local uds_address=$2 + local extra_args="" + + if ! [ -z "${uds_address}" ]; then + extra_args="${extra_args} -z ${uds_address}" + fi + + phc2sys_log="$(mktemp)" + + chrt -f 10 phc2sys -m \ + -c ${if_name} \ + -s CLOCK_REALTIME \ + -O ${UTC_TAI_OFFSET} \ + --step_threshold 0.00002 \ + --first_step_threshold 0.00002 \ + ${extra_args} \ + > "${phc2sys_log}" 2>&1 & + phc2sys_pid=$! + + echo "phc2sys logs to ${phc2sys_log} and has pid ${phc2sys_pid}" + + sleep 1 +} + +phc2sys_stop() +{ + { kill ${phc2sys_pid} && wait ${phc2sys_pid}; } 2> /dev/null + rm "${phc2sys_log}" 2> /dev/null +} + +ptp4l_start() +{ + local if_name=$1 + local slave_only=$2 + local uds_address=$3 + local log="ptp4l_log_${if_name}" + local pid="ptp4l_pid_${if_name}" + local extra_args="" + + if [ "${slave_only}" = true ]; then + extra_args="${extra_args} -s" + fi + + # declare dynamic variables ptp4l_log_${if_name} and ptp4l_pid_${if_name} + # as global, so that they can be referenced later + declare -g "${log}=$(mktemp)" + + chrt -f 10 ptp4l -m -2 -P \ + -i ${if_name} \ + --step_threshold 0.00002 \ + --first_step_threshold 0.00002 \ + --tx_timestamp_timeout 100 \ + --uds_address="${uds_address}" \ + ${extra_args} \ + > "${!log}" 2>&1 & + declare -g "${pid}=$!" + + echo "ptp4l for interface ${if_name} logs to ${!log} and has pid ${!pid}" + + sleep 1 +} + +ptp4l_stop() +{ + local if_name=$1 + local log="ptp4l_log_${if_name}" + local pid="ptp4l_pid_${if_name}" + + { kill ${!pid} && wait ${!pid}; } 2> /dev/null + rm "${!log}" 2> /dev/null +} + +cpufreq_max() +{ + local cpu=$1 + local freq="cpu${cpu}_freq" + local governor="cpu${cpu}_governor" + + # Kernel may be compiled with CONFIG_CPU_FREQ disabled + if ! [ -d /sys/bus/cpu/devices/cpu${cpu}/cpufreq ]; then + return + fi + + # declare dynamic variables cpu${cpu}_freq and cpu${cpu}_governor as + # global, so they can be referenced later + declare -g "${freq}=$(cat /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_min_freq)" + declare -g "${governor}=$(cat /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_governor)" + + cat /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_max_freq > \ + /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_min_freq + echo -n "performance" > \ + /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_governor +} + +cpufreq_restore() +{ + local cpu=$1 + local freq="cpu${cpu}_freq" + local governor="cpu${cpu}_governor" + + if ! [ -d /sys/bus/cpu/devices/cpu${cpu}/cpufreq ]; then + return + fi + + echo "${!freq}" > /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_min_freq + echo -n "${!governor}" > \ + /sys/bus/cpu/devices/cpu${cpu}/cpufreq/scaling_governor +} + +isochron_recv_start() +{ + local if_name=$1 + local uds=$2 + local extra_args=$3 + + if ! [ -z "${uds}" ]; then + extra_args="--unix-domain-socket ${uds}" + fi + + isochron rcv \ + --interface ${if_name} \ + --sched-priority 98 \ + --sched-fifo \ + --utc-tai-offset ${UTC_TAI_OFFSET} \ + --quiet \ + ${extra_args} & \ + isochron_pid=$! + + sleep 1 +} + +isochron_recv_stop() +{ + { kill ${isochron_pid} && wait ${isochron_pid}; } 2> /dev/null +} + +isochron_do() +{ + local sender_if_name=$1; shift + local receiver_if_name=$1; shift + local sender_uds=$1; shift + local receiver_uds=$1; shift + local base_time=$1; shift + local cycle_time=$1; shift + local shift_time=$1; shift + local num_pkts=$1; shift + local vid=$1; shift + local priority=$1; shift + local dst_ip=$1; shift + local isochron_dat=$1; shift + local extra_args="" + local receiver_extra_args="" + local vrf="$(master_name_get ${sender_if_name})" + local use_l2="true" + + if ! [ -z "${dst_ip}" ]; then + use_l2="false" + fi + + if ! [ -z "${vrf}" ]; then + dst_ip="${dst_ip}%${vrf}" + fi + + if ! [ -z "${vid}" ]; then + vid="--vid=${vid}" + fi + + if [ -z "${receiver_uds}" ]; then + extra_args="${extra_args} --omit-remote-sync" + fi + + if ! [ -z "${shift_time}" ]; then + extra_args="${extra_args} --shift-time=${shift_time}" + fi + + if [ "${use_l2}" = "true" ]; then + extra_args="${extra_args} --l2 --etype=0xdead ${vid}" + receiver_extra_args="--l2 --etype=0xdead" + else + extra_args="${extra_args} --l4 --ip-destination=${dst_ip}" + receiver_extra_args="--l4" + fi + + cpufreq_max ${ISOCHRON_CPU} + + isochron_recv_start "${h2}" "${receiver_uds}" "${receiver_extra_args}" + + isochron send \ + --interface ${sender_if_name} \ + --unix-domain-socket ${sender_uds} \ + --priority ${priority} \ + --base-time ${base_time} \ + --cycle-time ${cycle_time} \ + --num-frames ${num_pkts} \ + --frame-size 64 \ + --txtime \ + --utc-tai-offset ${UTC_TAI_OFFSET} \ + --cpu-mask $((1 << ${ISOCHRON_CPU})) \ + --sched-fifo \ + --sched-priority 98 \ + --client 127.0.0.1 \ + --sync-threshold 5000 \ + --output-file ${isochron_dat} \ + ${extra_args} \ + --quiet + + isochron_recv_stop + + cpufreq_restore ${ISOCHRON_CPU} +}
On Sun May 01 2022, Vladimir Oltean wrote:
The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action which enforced time-based access control per stream. A stream as seen by this switch is identified by {MAC DA, VID}.
We use the standard forwarding selftest topology with 2 host interfaces and 2 switch interfaces. The host ports must require timestamping non-IP packets and supporting tc-etf offload, for isochron to work. The isochron program monitors network sync status (ptp4l, phc2sys) and deterministically transmits packets to the switch such that the tc-gate action either (a) always accepts them based on its schedule, or (b) always drops them.
I tried to keep as much of the logic that isn't specific to the NXP LS1028A in a new tsn_lib.sh, for future reuse. This covers synchronization using ptp4l and phc2sys, and isochron.
The cycle-time chosen for this selftest isn't particularly impressive (and the focus is the functionality of the switch), but I didn't really know what to do better, considering that it will mostly be run during debugging sessions, various kernel bloatware would be enabled, like lockdep, KASAN, etc, and we certainly can't run any races with those on.
I tried to look through the kselftest framework for other real time applications and didn't really find any, so I'm not sure how better to prepare the environment in case we want to go for a lower cycle time. At the moment, the only thing the selftest is ensuring is that dynamic frequency scaling is disabled on the CPU that isochron runs on. It would probably be useful to have a blacklist of kernel config options (checked through zcat /proc/config.gz) and some cyclictest scripts to run beforehand, but I saw none of those.
Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com
Reviewed-by: Kurt Kanzenbach kurt@linutronix.de
Hello:
This patch was applied to netdev/net-next.git (master) by Jakub Kicinski kuba@kernel.org:
On Sun, 1 May 2022 14:29:53 +0300 you wrote:
The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action which enforced time-based access control per stream. A stream as seen by this switch is identified by {MAC DA, VID}.
We use the standard forwarding selftest topology with 2 host interfaces and 2 switch interfaces. The host ports must require timestamping non-IP packets and supporting tc-etf offload, for isochron to work. The isochron program monitors network sync status (ptp4l, phc2sys) and deterministically transmits packets to the switch such that the tc-gate action either (a) always accepts them based on its schedule, or (b) always drops them.
[...]
Here is the summary with links: - [v2,net-next] selftests: forwarding: add Per-Stream Filtering and Policing test for Ocelot https://git.kernel.org/netdev/net-next/c/954f46d2f0b4
You are awesome, thank you!
Hi!
On 2022. 05. 01. 13:29, Vladimir Oltean wrote:
The Felix VSC9959 switch in NXP LS1028A supports the tc-gate action which enforced time-based access control per stream. A stream as seen by this switch is identified by {MAC DA, VID}.
We use the standard forwarding selftest topology with 2 host interfaces and 2 switch interfaces. The host ports must require timestamping non-IP packets and supporting tc-etf offload, for isochron to work. The isochron program monitors network sync status (ptp4l, phc2sys) and deterministically transmits packets to the switch such that the tc-gate action either (a) always accepts them based on its schedule, or (b) always drops them.
I tried to keep as much of the logic that isn't specific to the NXP LS1028A in a new tsn_lib.sh, for future reuse. This covers synchronization using ptp4l and phc2sys, and isochron.
The cycle-time chosen for this selftest isn't particularly impressive (and the focus is the functionality of the switch), but I didn't really know what to do better, considering that it will mostly be run during debugging sessions, various kernel bloatware would be enabled, like lockdep, KASAN, etc, and we certainly can't run any races with those on.
I tried to look through the kselftest framework for other real time applications and didn't really find any, so I'm not sure how better to prepare the environment in case we want to go for a lower cycle time. At the moment, the only thing the selftest is ensuring is that dynamic frequency scaling is disabled on the CPU that isochron runs on. It would probably be useful to have a blacklist of kernel config options (checked through zcat /proc/config.gz) and some cyclictest scripts to run beforehand, but I saw none of those.
Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com
+switch_create() +{
- local h2_mac_addr=$(mac_get $h2)
- ip link set ${swp1} up
- ip link set ${swp2} up
- ip link add br0 type bridge vlan_filtering 1
- ip link set ${swp1} master br0
- ip link set ${swp2} master br0
- ip link set br0 up
- bridge vlan add dev ${swp2} vid ${STREAM_VID}
- bridge vlan add dev ${swp1} vid ${STREAM_VID}
- # PSFP on Ocelot requires the filter to also be added to the bridge
- # FDB, and not be removed
- bridge fdb add dev ${swp2} \
${h2_mac_addr} vlan ${STREAM_VID} static master
- psfp_chain_create ${swp1}
- tc filter add dev ${swp1} ingress chain $(PSFP) pref 1 \
protocol 802.1Q flower skip_sw \
dst_mac ${h2_mac_addr} vlan_id ${STREAM_VID} \
action gate base-time 0.000000000 \
sched-entry OPEN ${GATE_DURATION_NS} -1 -1 \
sched-entry CLOSE ${GATE_DURATION_NS} -1 -1
I know that might be little bit off-topic here, but the current implementation of the act_gate does nothing with the IPV value [0] even if the user set it to non -1. IMO this IPV value should be carried through in the tcf_gate struct [1] as something like a "current_ipv" member or so. Then this value can be applied in the tcf_gate_act function to the skb->priority.
Background story: I tried to combine gate and taprio (802.1Qci and Qbv) to achieve 802.1Qch operation (which is really just a coordinated config of those two) but without the IPV (should by set by the ingress port) we have no way to carry the gating info to the taprio, and as a result its just sending every packet with the default priority, no matter how we open/close the gate at the ingress.
[0] https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate... [1] https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate... [2] https://elixir.bootlin.com/linux/v5.18-rc5/source/net/sched/act_gate.c#L117
+}
Ferenc
Hi Ferenc,
On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote:
Hi! I know that might be little bit off-topic here, but the current implementation of the act_gate does nothing with the IPV value [0] even if the user set it to non -1. IMO this IPV value should be carried through in the tcf_gate struct [1] as something like a "current_ipv" member or so. Then this value can be applied in the tcf_gate_act function to the skb->priority.
Background story: I tried to combine gate and taprio (802.1Qci and Qbv) to achieve 802.1Qch operation (which is really just a coordinated config of those two) but without the IPV (should by set by the ingress port) we have no way to carry the gating info to the taprio, and as a result its just sending every packet with the default priority, no matter how we open/close the gate at the ingress.
[0] https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate... [1] https://elixir.bootlin.com/linux/v5.18-rc5/source/include/net/tc_act/tc_gate... [2] https://elixir.bootlin.com/linux/v5.18-rc5/source/net/sched/act_gate.c#L117
This is correct. I have been testing only with the offloaded tc-gate action so I did not notice that the software does not act upon the ipv. Your proposal sounds straightforward enough. Care to send a bug fix patch?
Hi Vladimir!
On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
Hi Ferenc,
On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote: This is correct. I have been testing only with the offloaded tc-gate action so I did not notice that the software does not act upon the ipv. Your proposal sounds straightforward enough. Care to send a bug fix patch?
Unfortunately I cant, our company policy does not allow direct open-source contributions :-(
However I would be more than happy if you can fix it.
Thanks, Ferenc
On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
Hi Vladimir!
On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
Hi Ferenc,
On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote: This is correct. I have been testing only with the offloaded tc-gate action so I did not notice that the software does not act upon the ipv. Your proposal sounds straightforward enough. Care to send a bug fix patch?
Unfortunately I cant, our company policy does not allow direct open-source contributions :-(
However I would be more than happy if you can fix it.
That's too bad.
I have a patch which I am still testing, but you've managed to captivate my attention for saying that you are testing 802.1Qch with a software implementation of tc-gate.
Do you have a use case for this? What cycle times are you targeting? How are you ensuring that you are deterministically meeting the deadlines? Do you also have a software tc-taprio on the egress device or is that at least offloaded?
I'm asking these questions because the peristaltic shaper is primarily intended to be used on hardware switches. The patch I'm preparing includes changes to struct sk_buff. I just want to know how well I'll be able to sell these changes to maintainers strongly opposing the growth of this structure for an exceedingly small niche :)
Hi Vladimir!
On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
Hi Vladimir!
On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
Hi Ferenc,
On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote: This is correct. I have been testing only with the offloaded tc- gate action so I did not notice that the software does not act upon the ipv. Your proposal sounds straightforward enough. Care to send a bug fix patch?
Unfortunately I cant, our company policy does not allow direct open-source contributions :-(
However I would be more than happy if you can fix it.
That's too bad.
I have a patch which I am still testing, but you've managed to captivate my attention for saying that you are testing 802.1Qch with a software implementation of tc-gate.
Do you have a use case for this? What cycle times are you targeting? How are you ensuring that you are deterministically meeting the deadlines?
The cycle times I targeted were nowhere near to a realistic TSN scenario: I "generated" ping packets in every 100 msecs and on the ingress port and I marked them with prio 1 for 500ms (gate 1) and prio 2 for another 500ms (gate 2). On the egress port I applied taprio with the same base- time and same 500-500ms cycles but reverse ordered gates (that's the "definition" of the Qch), so while act_gate on the ingress is in gate 1 cycle, the taprio kept open gate 2 and gate 1 closed, etc. For "verification" I simply run a tcpdump on the listener machine what I pinged from the talker and eyeballed wether the 5-5 packets bursts shows up arrival timestamps.
Do you also have a software tc-taprio on the egress device or is that at least offloaded?
No, I experimented with the software version, but that worked with my netns tests and on physical machines too (after the IPV patch).
I'm asking these questions because the peristaltic shaper is primarily intended to be used on hardware switches. The patch I'm preparing includes changes to struct sk_buff. I just want to know how well I'll be able to sell these changes to maintainers strongly opposing the growth of this structure for an exceedingly small niche :)
Can you tell me about the intention behind the sk_buff changes? Does that required because of some offloading scenario? In my case putting the IPV into the skb->priority was good enough because the taprio using that field by default to select the traffic class for the packet.
Thanks, Ferenc
On Thu, May 26, 2022 at 06:40:22AM +0000, Ferenc Fejes wrote:
Hi Vladimir!
On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
Hi Vladimir!
On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
Hi Ferenc,
On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote: This is correct. I have been testing only with the offloaded tc- gate action so I did not notice that the software does not act upon the ipv. Your proposal sounds straightforward enough. Care to send a bug fix patch?
Unfortunately I cant, our company policy does not allow direct open-source contributions :-(
However I would be more than happy if you can fix it.
That's too bad.
I have a patch which I am still testing, but you've managed to captivate my attention for saying that you are testing 802.1Qch with a software implementation of tc-gate.
Do you have a use case for this? What cycle times are you targeting? How are you ensuring that you are deterministically meeting the deadlines?
The cycle times I targeted were nowhere near to a realistic TSN scenario: I "generated" ping packets in every 100 msecs and on the ingress port and I marked them with prio 1 for 500ms (gate 1) and prio 2 for another 500ms (gate 2). On the egress port I applied taprio with the same base- time and same 500-500ms cycles but reverse ordered gates (that's the "definition" of the Qch), so while act_gate on the ingress is in gate 1 cycle, the taprio kept open gate 2 and gate 1 closed, etc. For "verification" I simply run a tcpdump on the listener machine what I pinged from the talker and eyeballed wether the 5-5 packets bursts shows up arrival timestamps.
Do you also have a software tc-taprio on the egress device or is that at least offloaded?
No, I experimented with the software version, but that worked with my netns tests and on physical machines too (after the IPV patch).
I'm asking these questions because the peristaltic shaper is primarily intended to be used on hardware switches. The patch I'm preparing includes changes to struct sk_buff. I just want to know how well I'll be able to sell these changes to maintainers strongly opposing the growth of this structure for an exceedingly small niche :)
Can you tell me about the intention behind the sk_buff changes? Does that required because of some offloading scenario? In my case putting the IPV into the skb->priority was good enough because the taprio using that field by default to select the traffic class for the packet.
Thanks, Ferenc
Hopefully this patch explains it:
-----------------------------[ cut here ]----------------------------- From c8d33e0d65a4a63884a626dc04c7190a2bbe122b Mon Sep 17 00:00:00 2001 From: Vladimir Oltean vladimir.oltean@nxp.com Date: Wed, 25 May 2022 16:27:52 +0300 Subject: [PATCH] net/sched: act_gate: act on Internal Priority Value
IEEE 802.1Q specifies in Annex T ("Cyclic queuing and forwarding") the use case for an Internal Priority Value set by the PSFP function. Essentially, the intended use is to be able to select a packet's egress queue squarely based on its arrival time. Essentially, this means that a packet with the same VLAN PCP 5 can be delivered to TC 7 or 6 depending on whether it was received in an "odd" or "even" cycle. To quote the spec, "The use of the IPV allows this direction of frames to outbound queues to be independent of the received priority, and also does not affect the priority associated with the frame on transmission."
The problem is that the software implementation of act_gate takes the IPV as entry from user space, but does not act on it in its software implementation - only offloading drivers do.
To fix this, we need to keep a current_ipv variable according to the gate entry that's currently executed by act_gate, and use this IPV to overwrite the skb->priority.
In fact, a complication arises due to the following clause from 802.1Q:
| 8.6.6.1 PSFP queuing | If PSFP is supported (8.6.5.1), and the IPV associated with the stream | filter that passed the frame is anything other than the null value, then | that IPV is used to determine the traffic class of the frame, in place | of the frame's priority, via the Traffic Class Table specified in 8.6.6. | In all other respects, the queuing actions specified in 8.6.6 are | unchanged. The IPV is used only to determine the traffic class | associated with a frame, and hence select an outbound queue; for all | other purposes, the received priority is used.
An example of layer that we don't want to see the IPV is the egress-qos-map of a potential VLAN output device. Instead, the VLAN PCP should still be populated based on the original skb->priority.
I'm aware of the existence of a skb_update_prio() function in __dev_queue_xmit(), right before passing the skb to the qdisc layer, for the use case where net_prio cgroups are used to assign processes to network priorities on a per-interface basis. But rewriting the skb->priority with the skb->ipv there simply wouldn't work, exactly because there might be a VLAN in the output path of the skb.
One might suggest: just delay the IPV rewriting in the presence of stacked devices until it is actually needed and likely to be used, like when dev->num_tc != 0 (which VLAN doesn't have). But there are still other uses of skb->priority in the qdisc layer, like skbprio, htb etc. From the spec's wording it isn't clear that we want these functions to look at the proper packet user priority or at the PSFP IPV. Probably the former.
So the only implementation that conforms to these requirements seems to be the invasive one, where we introduce a dedicated helper for the pattern where drivers and the core ask for netdev_get_prio_tc_map(dev, skb->priority). We replace all such instances with a helper that selects the TC based on IPV, if the skb was filtered by PSFP time gates on ingress, and falls back to skb->priority otherwise.
Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Reported-by: Ferenc Fejes ferenc.fejes@ericsson.com Link: https://lore.kernel.org/netdev/87o80h1n65.fsf@kurt/T/#meaec9c24b3add9cb11edd... Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- include/linux/netdevice.h | 13 +++++++++++++ include/linux/skbuff.h | 11 +++++++++++ include/net/tc_act/tc_gate.h | 1 + net/core/dev.c | 4 ++-- net/dsa/tag_ocelot.c | 2 +- net/sched/act_gate.c | 6 ++++++ net/sched/sch_taprio.c | 12 +++--------- 8 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 77c2e70b0860..719259af9aaa 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8531,7 +8531,7 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb, int txq;
if (sb_dev) { - u8 tc = netdev_get_prio_tc_map(dev, skb->priority); + u8 tc = skb_get_prio_tc_map(skb, dev); struct net_device *vdev = sb_dev;
txq = vdev->tc_to_txq[tc].offset; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 47b59f99b037..8b87d017288f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2366,6 +2366,19 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc) return 0; }
+static inline int skb_get_prio_tc_map(const struct sk_buff *skb, + const struct net_device *dev) +{ + __u32 prio = skb->priority; + +#if IS_ENABLED(CONFIG_NET_ACT_GATE) + if (skb->use_ipv) + prio = skb->ipv; +#endif + + return netdev_get_prio_tc_map(dev, prio); +} + int netdev_txq_to_tc(struct net_device *dev, unsigned int txq); void netdev_reset_tc(struct net_device *dev); int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index da96f0d3e753..b0a463c0bc65 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -913,6 +913,9 @@ typedef unsigned char *sk_buff_data_t; * @csum_start: Offset from skb->head where checksumming should start * @csum_offset: Offset from csum_start where checksum should be stored * @priority: Packet queueing priority + * @use_ipv: Packet internal priority was altered by PSFP + * @ipv: Internal Priority Value, overrides priority for traffic class + * selection * @ignore_df: allow local fragmentation * @cloned: Head may be cloned (check refcnt to be sure) * @ip_summed: Driver fed us an IP checksum @@ -1145,6 +1148,10 @@ struct sk_buff { __u8 slow_gro:1; __u8 csum_not_inet:1;
+#ifdef CONFIG_NET_ACT_GATE + __u8 use_ipv:1; +#endif + #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ #endif @@ -1209,6 +1216,10 @@ struct sk_buff { /* only useable after checking ->active_extensions != 0 */ struct skb_ext *extensions; #endif + +#ifdef CONFIG_NET_ACT_GATE + __u32 ipv; +#endif };
/* if you move pkt_type around you also must adapt those constants */ diff --git a/include/net/tc_act/tc_gate.h b/include/net/tc_act/tc_gate.h index c8fa11ebb397..b05c2c7d78e4 100644 --- a/include/net/tc_act/tc_gate.h +++ b/include/net/tc_act/tc_gate.h @@ -44,6 +44,7 @@ struct tcf_gate { ktime_t current_close_time; u32 current_entry_octets; s32 current_max_octets; + s32 current_ipv; struct tcfg_gate_entry *next_entry; struct hrtimer hitimer; enum tk_offsets tk_offset; diff --git a/net/core/dev.c b/net/core/dev.c index 721ba9c26554..956aa227c260 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3194,7 +3194,7 @@ static u16 skb_tx_hash(const struct net_device *dev, u16 qcount = dev->real_num_tx_queues;
if (dev->num_tc) { - u8 tc = netdev_get_prio_tc_map(dev, skb->priority); + u8 tc = skb_get_prio_tc_map(skb, dev);
qoffset = sb_dev->tc_to_txq[tc].offset; qcount = sb_dev->tc_to_txq[tc].count; @@ -4002,7 +4002,7 @@ EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue); static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, struct xps_dev_maps *dev_maps, unsigned int tci) { - int tc = netdev_get_prio_tc_map(dev, skb->priority); + int tc = skb_get_prio_tc_map(skb, dev); struct xps_map *map; int queue_index = -1;
diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c index 0d81f172b7a6..036e746f18eb 100644 --- a/net/dsa/tag_ocelot.c +++ b/net/dsa/tag_ocelot.c @@ -52,7 +52,7 @@ static void ocelot_xmit_common(struct sk_buff *skb, struct net_device *netdev, ocelot_xmit_get_vlan_info(skb, dp, &vlan_tci, &tag_type);
qos_class = netdev_get_num_tc(netdev) ? - netdev_get_prio_tc_map(netdev, skb->priority) : skb->priority; + skb_get_prio_tc_map(skb, netdev) : skb->priority;
injection = skb_push(skb, OCELOT_TAG_LEN); prefix = skb_push(skb, OCELOT_SHORT_PREFIX_LEN); diff --git a/net/sched/act_gate.c b/net/sched/act_gate.c index fd5155274733..9fb248b104f8 100644 --- a/net/sched/act_gate.c +++ b/net/sched/act_gate.c @@ -81,6 +81,7 @@ static enum hrtimer_restart gate_timer_func(struct hrtimer *timer) gact->current_gate_status = next->gate_state ? GATE_ACT_GATE_OPEN : 0; gact->current_entry_octets = 0; gact->current_max_octets = next->maxoctets; + gact->current_ipv = next->ipv;
gact->current_close_time = ktime_add_ns(gact->current_close_time, next->interval); @@ -140,6 +141,11 @@ static int tcf_gate_act(struct sk_buff *skb, const struct tc_action *a, } }
+ if (gact->current_ipv >= 0) { + skb->use_ipv = true; + skb->ipv = gact->current_ipv; + } + spin_unlock(&gact->tcf_lock);
return gact->tcf_action; diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c index b9c71a304d39..fb8bc17e38bb 100644 --- a/net/sched/sch_taprio.c +++ b/net/sched/sch_taprio.c @@ -201,7 +201,7 @@ static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb, s32 cycle_elapsed; int tc, n;
- tc = netdev_get_prio_tc_map(dev, skb->priority); + tc = skb_get_prio_tc_map(skb, dev); packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
*interval_start = 0; @@ -509,7 +509,6 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
for (i = 0; i < dev->num_tx_queues; i++) { struct Qdisc *child = q->qdiscs[i]; - int prio; u8 tc;
if (unlikely(!child)) @@ -522,9 +521,7 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch) if (TXTIME_ASSIST_IS_ENABLED(q->flags)) return skb;
- prio = skb->priority; - tc = netdev_get_prio_tc_map(dev, prio); - + tc = skb_get_prio_tc_map(skb, dev); if (!(gate_mask & BIT(tc))) continue;
@@ -579,7 +576,6 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch) for (i = 0; i < dev->num_tx_queues; i++) { struct Qdisc *child = q->qdiscs[i]; ktime_t guard; - int prio; int len; u8 tc;
@@ -597,9 +593,7 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch) if (!skb) continue;
- prio = skb->priority; - tc = netdev_get_prio_tc_map(dev, prio); - + tc = skb_get_prio_tc_map(skb, dev); if (!(gate_mask & BIT(tc))) { skb = NULL; continue; -----------------------------[ cut here ]-----------------------------
Hi!
Sorry for the delay.
On Thu, 2022-05-26 at 09:30 +0000, Vladimir Oltean wrote:
On Thu, May 26, 2022 at 06:40:22AM +0000, Ferenc Fejes wrote:
Hi Vladimir!
On Thu, 2022-05-26 at 00:50 +0000, Vladimir Oltean wrote:
On Fri, May 06, 2022 at 12:24:37PM +0000, Ferenc Fejes wrote:
Hi Vladimir!
On 2022. 05. 06. 14:01, Vladimir Oltean wrote:
Hi Ferenc,
On Fri, May 06, 2022 at 07:49:40AM +0000, Ferenc Fejes wrote: This is correct. I have been testing only with the offloaded tc- gate action so I did not notice that the software does not act upon the ipv. Your proposal sounds straightforward enough. Care to send a bug fix patch?
Unfortunately I cant, our company policy does not allow direct open-source contributions :-(
However I would be more than happy if you can fix it.
That's too bad.
I have a patch which I am still testing, but you've managed to captivate my attention for saying that you are testing 802.1Qch with a software implementation of tc-gate.
Do you have a use case for this? What cycle times are you targeting? How are you ensuring that you are deterministically meeting the deadlines?
The cycle times I targeted were nowhere near to a realistic TSN scenario: I "generated" ping packets in every 100 msecs and on the ingress port and I marked them with prio 1 for 500ms (gate 1) and prio 2 for another 500ms (gate 2). On the egress port I applied taprio with the same base- time and same 500-500ms cycles but reverse ordered gates (that's the "definition" of the Qch), so while act_gate on the ingress is in gate 1 cycle, the taprio kept open gate 2 and gate 1 closed, etc. For "verification" I simply run a tcpdump on the listener machine what I pinged from the talker and eyeballed wether the 5-5 packets bursts shows up arrival timestamps.
Do you also have a software tc-taprio on the egress device or is that at least offloaded?
No, I experimented with the software version, but that worked with my netns tests and on physical machines too (after the IPV patch).
I'm asking these questions because the peristaltic shaper is primarily intended to be used on hardware switches. The patch I'm preparing includes changes to struct sk_buff. I just want to know how well I'll be able to sell these changes to maintainers strongly opposing the growth of this structure for an exceedingly small niche :)
Can you tell me about the intention behind the sk_buff changes? Does that required because of some offloading scenario? In my case putting the IPV into the skb->priority was good enough because the taprio using that field by default to select the traffic class for the packet.
Thanks, Ferenc
Hopefully this patch explains it:
-----------------------------[ cut here ]----------------------------
From c8d33e0d65a4a63884a626dc04c7190a2bbe122b Mon Sep 17 00:00:00 2001 From: Vladimir Oltean vladimir.oltean@nxp.com Date: Wed, 25 May 2022 16:27:52 +0300 Subject: [PATCH] net/sched: act_gate: act on Internal Priority Value
IEEE 802.1Q specifies in Annex T ("Cyclic queuing and forwarding") the use case for an Internal Priority Value set by the PSFP function. Essentially, the intended use is to be able to select a packet's egress queue squarely based on its arrival time. Essentially, this means that a packet with the same VLAN PCP 5 can be delivered to TC 7 or 6 depending on whether it was received in an "odd" or "even" cycle. To quote the spec, "The use of the IPV allows this direction of frames to outbound queues to be independent of the received priority, and also does not affect the priority associated with the frame on transmission."
The problem is that the software implementation of act_gate takes the IPV as entry from user space, but does not act on it in its software implementation - only offloading drivers do.
To fix this, we need to keep a current_ipv variable according to the gate entry that's currently executed by act_gate, and use this IPV to overwrite the skb->priority.
In fact, a complication arises due to the following clause from 802.1Q:
8.6.6.1 PSFP queuing If PSFP is supported (8.6.5.1), and the IPV associated with the stream filter that passed the frame is anything other than the null value, then that IPV is used to determine the traffic class of the frame, in place of the frame's priority, via the Traffic Class Table specified in 8.6.6. In all other respects, the queuing actions specified in 8.6.6 are unchanged. The IPV is used only to determine the traffic class associated with a frame, and hence select an outbound queue; for all other purposes, the received priority is used.
Interesting. In my understanding this indeed something like "modify the queueing decision while preserve the original PCP value".
An example of layer that we don't want to see the IPV is the egress- qos-map of a potential VLAN output device. Instead, the VLAN PCP should still be populated based on the original skb->priority.
I'm aware of the existence of a skb_update_prio() function in __dev_queue_xmit(), right before passing the skb to the qdisc layer, for the use case where net_prio cgroups are used to assign processes to network priorities on a per-interface basis. But rewriting the skb->priority with the skb->ipv there simply wouldn't work, exactly because there might be a VLAN in the output path of the skb.
So the problem is currently we use skb->priority with two semantics: 1. for egress-qos-mapping which set the PCP of the VLAN header based on skb->priority 2. for selecting the traffic class (and the tx queues with that as well).
One might suggest: just delay the IPV rewriting in the presence of stacked devices until it is actually needed and likely to be used, like when dev->num_tc != 0 (which VLAN doesn't have). But there are still other uses of skb->priority in the qdisc layer, like skbprio, htb etc. From the spec's wording it isn't clear that we want these functions to look at the proper packet user priority or at the PSFP IPV. Probably the former.
So the only implementation that conforms to these requirements seems to be the invasive one, where we introduce a dedicated helper for the pattern where drivers and the core ask for netdev_get_prio_tc_map(dev, skb->priority). We replace all such instances with a helper that selects the TC based on IPV, if the skb was filtered by PSFP time gates on ingress, and falls back to skb->priority otherwise.
Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Reported-by: Ferenc Fejes ferenc.fejes@ericsson.com Link: https://lore.kernel.org/netdev/87o80h1n65.fsf@kurt/T/#meaec9c24b3add9cb11edd... Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +- include/linux/netdevice.h | 13 +++++++++++++ include/linux/skbuff.h | 11 +++++++++++ include/net/tc_act/tc_gate.h | 1 + net/core/dev.c | 4 ++-- net/dsa/tag_ocelot.c | 2 +- net/sched/act_gate.c | 6 ++++++ net/sched/sch_taprio.c | 12 +++--------- 8 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 77c2e70b0860..719259af9aaa 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8531,7 +8531,7 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb, int txq; if (sb_dev) { - u8 tc = netdev_get_prio_tc_map(dev, skb->priority); + u8 tc = skb_get_prio_tc_map(skb, dev); struct net_device *vdev = sb_dev; txq = vdev->tc_to_txq[tc].offset; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 47b59f99b037..8b87d017288f 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2366,6 +2366,19 @@ int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc) return 0; } +static inline int skb_get_prio_tc_map(const struct sk_buff *skb, + const struct net_device *dev) +{ + __u32 prio = skb->priority;
+#if IS_ENABLED(CONFIG_NET_ACT_GATE) + if (skb->use_ipv) + prio = skb->ipv; +#endif
+ return netdev_get_prio_tc_map(dev, prio); +}
int netdev_txq_to_tc(struct net_device *dev, unsigned int txq); void netdev_reset_tc(struct net_device *dev); int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset); diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index da96f0d3e753..b0a463c0bc65 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -913,6 +913,9 @@ typedef unsigned char *sk_buff_data_t; * @csum_start: Offset from skb->head where checksumming should start * @csum_offset: Offset from csum_start where checksum should be stored * @priority: Packet queueing priority
- * @use_ipv: Packet internal priority was altered by PSFP
- * @ipv: Internal Priority Value, overrides priority for traffic
class
- * selection
* @ignore_df: allow local fragmentation * @cloned: Head may be cloned (check refcnt to be sure) * @ip_summed: Driver fed us an IP checksum @@ -1145,6 +1148,10 @@ struct sk_buff { __u8 slow_gro:1; __u8 csum_not_inet:1; +#ifdef CONFIG_NET_ACT_GATE + __u8 use_ipv:1; +#endif
#ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ #endif @@ -1209,6 +1216,10 @@ struct sk_buff { /* only useable after checking ->active_extensions != 0 */ struct skb_ext *extensions; #endif
+#ifdef CONFIG_NET_ACT_GATE + __u32 ipv; +#endif }; /* if you move pkt_type around you also must adapt those constants */ diff --git a/include/net/tc_act/tc_gate.h b/include/net/tc_act/tc_gate.h index c8fa11ebb397..b05c2c7d78e4 100644 --- a/include/net/tc_act/tc_gate.h +++ b/include/net/tc_act/tc_gate.h @@ -44,6 +44,7 @@ struct tcf_gate { ktime_t current_close_time; u32 current_entry_octets; s32 current_max_octets; + s32 current_ipv; struct tcfg_gate_entry *next_entry; struct hrtimer hitimer; enum tk_offsets tk_offset; diff --git a/net/core/dev.c b/net/core/dev.c index 721ba9c26554..956aa227c260 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3194,7 +3194,7 @@ static u16 skb_tx_hash(const struct net_device *dev, u16 qcount = dev->real_num_tx_queues; if (dev->num_tc) { - u8 tc = netdev_get_prio_tc_map(dev, skb->priority); + u8 tc = skb_get_prio_tc_map(skb, dev); qoffset = sb_dev->tc_to_txq[tc].offset; qcount = sb_dev->tc_to_txq[tc].count; @@ -4002,7 +4002,7 @@ EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue); static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, struct xps_dev_maps *dev_maps, unsigned int tci) { - int tc = netdev_get_prio_tc_map(dev, skb->priority); + int tc = skb_get_prio_tc_map(skb, dev); struct xps_map *map; int queue_index = -1; diff --git a/net/dsa/tag_ocelot.c b/net/dsa/tag_ocelot.c index 0d81f172b7a6..036e746f18eb 100644 --- a/net/dsa/tag_ocelot.c +++ b/net/dsa/tag_ocelot.c @@ -52,7 +52,7 @@ static void ocelot_xmit_common(struct sk_buff *skb, struct net_device *netdev, ocelot_xmit_get_vlan_info(skb, dp, &vlan_tci, &tag_type); qos_class = netdev_get_num_tc(netdev) ? - netdev_get_prio_tc_map(netdev, skb->priority) : skb->priority; + skb_get_prio_tc_map(skb, netdev) : skb->priority; injection = skb_push(skb, OCELOT_TAG_LEN); prefix = skb_push(skb, OCELOT_SHORT_PREFIX_LEN); diff --git a/net/sched/act_gate.c b/net/sched/act_gate.c index fd5155274733..9fb248b104f8 100644 --- a/net/sched/act_gate.c +++ b/net/sched/act_gate.c @@ -81,6 +81,7 @@ static enum hrtimer_restart gate_timer_func(struct hrtimer *timer) gact->current_gate_status = next->gate_state ? GATE_ACT_GATE_OPEN : 0; gact->current_entry_octets = 0; gact->current_max_octets = next->maxoctets; + gact->current_ipv = next->ipv; gact->current_close_time = ktime_add_ns(gact-
current_close_time,
next->interval); @@ -140,6 +141,11 @@ static int tcf_gate_act(struct sk_buff *skb, const struct tc_action *a, } } + if (gact->current_ipv >= 0) { + skb->use_ipv = true; + skb->ipv = gact->current_ipv; + }
spin_unlock(&gact->tcf_lock); return gact->tcf_action; diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c index b9c71a304d39..fb8bc17e38bb 100644 --- a/net/sched/sch_taprio.c +++ b/net/sched/sch_taprio.c @@ -201,7 +201,7 @@ static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb, s32 cycle_elapsed; int tc, n; - tc = netdev_get_prio_tc_map(dev, skb->priority); + tc = skb_get_prio_tc_map(skb, dev); packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb)); *interval_start = 0; @@ -509,7 +509,6 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch) for (i = 0; i < dev->num_tx_queues; i++) { struct Qdisc *child = q->qdiscs[i]; - int prio; u8 tc; if (unlikely(!child)) @@ -522,9 +521,7 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch) if (TXTIME_ASSIST_IS_ENABLED(q->flags)) return skb; - prio = skb->priority; - tc = netdev_get_prio_tc_map(dev, prio);
+ tc = skb_get_prio_tc_map(skb, dev); if (!(gate_mask & BIT(tc))) continue; @@ -579,7 +576,6 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch) for (i = 0; i < dev->num_tx_queues; i++) { struct Qdisc *child = q->qdiscs[i]; ktime_t guard; - int prio; int len; u8 tc; @@ -597,9 +593,7 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch) if (!skb) continue; - prio = skb->priority; - tc = netdev_get_prio_tc_map(dev, prio);
+ tc = skb_get_prio_tc_map(skb, dev); if (!(gate_mask & BIT(tc))) { skb = NULL; continue; -----------------------------[ cut here ]----------------------------
Your solution certainly correct and do the differentiation between the cases where we have PSFP at ingress or not. However in my understanding the only purpose of the IPV is the traffic class selection.
Setting skb->queue_mapping to IPV probably wont work, because of two reasons: 1. it brings inconsistency with the mqprio/taprio queues and the actual hw rings the traffic sent 2. some drivers dont check if skb->queue_mapping is bounded, they expect its smaller than the num_tx_queues.
The 2. might be solvable, but the 1. is more problematic. However with a helper, we might check if skb->queue_mapping is already set and use that as a traffic class. Is that possible? I dont really see any other codepath where that value can be other than zero before the qdisc layer. That way one flag (use_ipv) might be enough which tells that we should use the skb->queue_mapping as is (set by act_gate) and preserve skb->priority.
What do you think? Again, sorry for being late here, but I'm following the list and see that recently you did major mqprio/taprio fixes and refactors, so I hope your cache line is hot.
Best, Ferenc
On Thu, Feb 16, 2023 at 02:47:11PM +0100, Ferenc Fejes wrote:
To fix this, we need to keep a current_ipv variable according to the gate entry that's currently executed by act_gate, and use this IPV to overwrite the skb->priority.
In fact, a complication arises due to the following clause from 802.1Q:
8.6.6.1 PSFP queuing If PSFP is supported (8.6.5.1), and the IPV associated with the stream filter that passed the frame is anything other than the null value, then that IPV is used to determine the traffic class of the frame, in place of the frame's priority, via the Traffic Class Table specified in 8.6.6. In all other respects, the queuing actions specified in 8.6.6 are unchanged. The IPV is used only to determine the traffic class associated with a frame, and hence select an outbound queue; for all other purposes, the received priority is used.
Interesting. In my understanding this indeed something like "modify the queueing decision while preserve the original PCP value".
I actually wonder what the interpretation in the spirit of this clause is. I'm known to over-interpret the text of these IEEE standards and finding inconsistencies in the process (being able to prove both A and !A).
For example, if we consider the framePreemptionAdminStatus (express or preemptible). It is expressed per priority. So this means that the Internal Priority Value rewritten by PSFP will not affect the express/ preemptible nature of a frame, but it *will* affect the traffic class selection?
This is insane, because it means that to enforce the requirements of clause 12.30.1.1.1:
| Priorities that all map to the same traffic class should be | constrained to use the same value of preemption status.
it is insufficient to check that the prio_tc_map does not make some traffic classes both express and preemptible. One has to also check the tc-gate configuration, because the literal interpretation of the standard would suggest that a packet is preemptible or express based on skb->priority, but is classified to a traffic class based on skb->ipv. So I guess IPV rewriting can only take place between one preemptible priority and another, or only between one express priority and another, and that we should somehow test this. But PSFP is at ingress, and FP is at egress, so we'd somehow have to test the FP adminStatus of all the other net devices that we can forward to!!!
I'll try to ask around and see if anyone knows more details about what is the expected behavior.
Your solution certainly correct and do the differentiation between the cases where we have PSFP at ingress or not. However in my understanding the only purpose of the IPV is the traffic class selection.
Setting skb->queue_mapping to IPV probably wont work, because of two reasons:
- it brings inconsistency with the mqprio/taprio queues and the actual
hw rings the traffic sent 2. some drivers dont check if skb->queue_mapping is bounded, they expect its smaller than the num_tx_queues.
The 2. might be solvable, but the 1. is more problematic. However with a helper, we might check if skb->queue_mapping is already set and use that as a traffic class. Is that possible? I dont really see any other codepath where that value can be other than zero before the qdisc layer. That way one flag (use_ipv) might be enough which tells that we should use the skb->queue_mapping as is (set by act_gate) and preserve skb->priority.
What do you think? Again, sorry for being late here, but I'm following the list and see that recently you did major mqprio/taprio fixes and refactors, so I hope your cache line is hot.
I'm afraid it's not so easy to reuse this field for the IPV, because skb->queue_mapping is already used between the RX and the TX path for "recording the RX queue", see skb_rx_queue_recorded(), skb_record_rx_queue(), skb_get_rx_queue().
Hi Vladimir!
On Thu, 2023-02-16 at 17:58 +0200, Vladimir Oltean wrote:
On Thu, Feb 16, 2023 at 02:47:11PM +0100, Ferenc Fejes wrote:
To fix this, we need to keep a current_ipv variable according to the gate entry that's currently executed by act_gate, and use this IPV to overwrite the skb->priority.
In fact, a complication arises due to the following clause from 802.1Q:
8.6.6.1 PSFP queuing If PSFP is supported (8.6.5.1), and the IPV associated with the stream filter that passed the frame is anything other than the null value, then that IPV is used to determine the traffic class of the frame, in place of the frame's priority, via the Traffic Class Table specified in 8.6.6. In all other respects, the queuing actions specified in 8.6.6 are unchanged. The IPV is used only to determine the traffic class associated with a frame, and hence select an outbound queue; for all other purposes, the received priority is used.
Interesting. In my understanding this indeed something like "modify the queueing decision while preserve the original PCP value".
I actually wonder what the interpretation in the spirit of this clause is. I'm known to over-interpret the text of these IEEE standards and finding inconsistencies in the process (being able to prove both A and !A).
I agree, it takes time to guess what the intention behind the wording of the standard in some cases. I have the standard in front of me right now and its 2163 pages... Even if I grep to IPV, the context is overwhelmingly dense.
For example, if we consider the framePreemptionAdminStatus (express or preemptible). It is expressed per priority. So this means that the Internal Priority Value rewritten by PSFP will not affect the express/ preemptible nature of a frame, but it *will* affect the traffic class selection?
This is insane, because it means that to enforce the requirements of clause 12.30.1.1.1:
Priorities that all map to the same traffic class should be constrained to use the same value of preemption status.
it is insufficient to check that the prio_tc_map does not make some traffic classes both express and preemptible. One has to also check the tc-gate configuration, because the literal interpretation of the standard would suggest that a packet is preemptible or express based on skb-
priority,
but is classified to a traffic class based on skb->ipv. So I guess IPV rewriting can only take place between one preemptible priority and another, or only between one express priority and another, and that we should somehow test this. But PSFP is at ingress, and FP is at egress, so we'd somehow have to test the FP adminStatus of all the other net devices that we can forward to!!!
I'll try to ask around and see if anyone knows more details about what is the expected behavior.
I'll try to ask around too, thanks for pointing this out. My best understanding from the IPV that the standard treat it as skb->priority. It defines IPV as a 32bit signed value, which clearly imply similar semantics as skb->priority, which can be much larger than the number of the queues or traffic classes.
Your solution certainly correct and do the differentiation between the cases where we have PSFP at ingress or not. However in my understanding the only purpose of the IPV is the traffic class selection.
Setting skb->queue_mapping to IPV probably wont work, because of two reasons:
- it brings inconsistency with the mqprio/taprio queues and the
actual hw rings the traffic sent 2. some drivers dont check if skb->queue_mapping is bounded, they expect its smaller than the num_tx_queues.
The 2. might be solvable, but the 1. is more problematic. However with a helper, we might check if skb->queue_mapping is already set and use that as a traffic class. Is that possible? I dont really see any other codepath where that value can be other than zero before the qdisc layer. That way one flag (use_ipv) might be enough which tells that we should use the skb->queue_mapping as is (set by act_gate) and preserve skb->priority.
What do you think? Again, sorry for being late here, but I'm following the list and see that recently you did major mqprio/taprio fixes and refactors, so I hope your cache line is hot.
I'm afraid it's not so easy to reuse this field for the IPV, because skb->queue_mapping is already used between the RX and the TX path for "recording the RX queue", see skb_rx_queue_recorded(), skb_record_rx_queue(), skb_get_rx_queue().
Oh, alright. I continue to think about alternatives over introducing new members into sk_buff. It would be very nice to have proper act_gate IPV handling without hardware offload. Its great to see the support of frame preemption and PSFP support in more and more hardware but on the other hand it makes the lack of the proper software mode operation more and more awkward.
Best, Ferenc
Hi Ferenc,
On Fri, Feb 17, 2023 at 09:03:30AM +0100, Ferenc Fejes wrote:
I agree, it takes time to guess what the intention behind the wording of the standard in some cases. I have the standard in front of me right now and its 2163 pages... Even if I grep to IPV, the context is overwhelmingly dense.
(...)
I'll try to ask around too, thanks for pointing this out. My best understanding from the IPV that the standard treat it as skb->priority. It defines IPV as a 32bit signed value, which clearly imply similar semantics as skb->priority, which can be much larger than the number of the queues or traffic classes.
What would you say if we made the software act_gate implementation simply alter skb->priority, which would potentially affect more stuff including the egress-qos-map of a VLAN device in the output path of the skb? It would definitely put less pressure on the networking data structures, at the price of leaving an exceedingly unlikely case uncovered.
Oh, alright. I continue to think about alternatives over introducing new members into sk_buff. It would be very nice to have proper act_gate IPV handling without hardware offload. Its great to see the support of frame preemption and PSFP support in more and more hardware but on the other hand it makes the lack of the proper software mode operation more and more awkward.
I'm not sure that cyclic queuing and forwarding done with software forwarding is going to be that practical anyway?
Hi Vladimir!
On Fri, 2023-02-17 at 15:07 +0200, Vladimir Oltean wrote:
Hi Ferenc,
On Fri, Feb 17, 2023 at 09:03:30AM +0100, Ferenc Fejes wrote:
I agree, it takes time to guess what the intention behind the wording of the standard in some cases. I have the standard in front of me right now and its 2163 pages... Even if I grep to IPV, the context is overwhelmingly dense.
(...)
I'll try to ask around too, thanks for pointing this out. My best understanding from the IPV that the standard treat it as skb-
priority.
It defines IPV as a 32bit signed value, which clearly imply similar semantics as skb->priority, which can be much larger than the number of the queues or traffic classes.
What would you say if we made the software act_gate implementation simply alter skb->priority, which would potentially affect more stuff including the egress-qos-map of a VLAN device in the output path of the skb? It would definitely put less pressure on the networking data structures, at the price of leaving an exceedingly unlikely case uncovered.
This is exactly what I just started to write in my reply! Yes, this would be the right choice. The key here is the "exceedingly unlikely case" what you just mentioned.
If you are lucky enough to have the luxury to think about cases where the IPV mapping should only affect the queueing and not the egress-qos- map, it would be nice to have options. Sadly mqprio and taprio classless, so you can't do the following as far as I understand it:
1. Configure tc act_gate with IPV-s like 0, 1, 2, etc. 2. Configure mqprio/taprio with prio to tc mapping for 0, 1, 2, etc. 3. As mqprio/taprio leaf-s you can apply tc skbedit/bpf direct actions which altering the skb->priority to values you would like to see at egress-to-qos mapping 4. The egress-to-qos mapping act on the new skb->priority
However what if mqprio/taprio can read prio:tc map like: tc qdisc add dev eth0 root mqprio num_tc 4 map 1000:0 1001:1 1000:2 1003:3 queues 1@0 1@1 1@2 1@3 hw 0 I think that dont necessarily break the existing script since we can check if ":" if in the map parameters or not. That way skb->priority ---> tc would be flexible (as in egress-qos- map), the tc ---> queue mapping flexible too, and the original skb-
priority can control the egress-qos-map differently than the queueing
that way. I miss something fundamental here?
Oh, alright. I continue to think about alternatives over introducing new members into sk_buff. It would be very nice to have proper act_gate IPV handling without hardware offload. Its great to see the support of frame preemption and PSFP support in more and more hardware but on the other hand it makes the lack of the proper software mode operation more and more awkward.
I'm not sure that cyclic queuing and forwarding done with software forwarding is going to be that practical anyway?
VNFs can perform PSFP as well or at least reprioritize the packets. Also it would be handy for selftests too. Other than that future driver implementers can verify their operation with the software behavior.
Best, Ferenc
linux-kselftest-mirror@lists.linaro.org