The networking stack currently doesn't clearly distinguish between DSCP and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or structure fields), and each part of the stack handles them in their own way, using different macros. This has created several bugs in the past and some uncommon code paths are still unfixed.
Such bugs generally manifest by selecting invalid routes because of ECN bits interfering with FIB routes and rules lookups (more details in the LPC 2021 talk[1] and in the RFC of this series[2]).
This patch series aims at preventing the introduction of such bugs (and detecting existing ones), by introducing a dscp_t type, representing "sanitised" DSCP values (that is, with no ECN information), as opposed to plain u8 values that contain both DSCP and ECN information. dscp_t makes it clear for the reader what we're working on, and Sparse can flag invalid interactions between dscp_t and plain u8.
This series converts only a few variables and structures:
* Patch 1 converts the tclass field of struct fib6_rule. It effectively forbids the use of ECN bits in the tos/dsfield option of ip -6 rule. Rules now match packets solely based on their DSCP bits, so ECN doesn't influence the result any more. This contrasts with the previous behaviour where all 8 bits of the Traffic Class field were used. It is believed that this change is acceptable as matching ECN bits wasn't usable for IPv4, so only IPv6-only deployments could be depending on it. Also the previous behaviour made DSCP-based ip6-rules fail for packets with both a DSCP and an ECN mark, which is another reason why any such deploy is unlikely.
* Patch 2 converts the tos field of struct fib4_rule. This one too effectively forbids defining ECN bits, this time in ip -4 rule. Before that, setting ECN bit 1 was accepted, while ECN bit 0 was rejected. But even when accepted, the rule would never match, as the packets would have their ECN bits cleared before doing the rule lookup.
* Patch 3 converts the fc_tos field of struct fib_config. This is equivalent to patch 2, but for IPv4 routes. Routes using a tos/dsfield option with any ECN bit set is now rejected. Before this patch, they were accepted but, as with ip4 rules, these routes couldn't match any packet, since their ECN bits are cleared before the lookup.
* Patch 4 converts the fa_tos field of struct fib_alias. This one is pure internal u8 to dscp_t conversion. While patches 1-3 had user facing consequences, this patch shouldn't have any side effect and is there to give an overview of what future conversion patches will look like. Conversions are quite mechanical, but imply some code churn, which is the price for the extra clarity a possibility of type checking.
To summarise, all the behaviour changes required for the dscp_t type approach to work should be contained in patches 1-3. These changes are edge cases of ip-route and ip-rule that don't currently work properly. So they should be safe. Also, a kernel selftest is added for each of them.
Finally, this work also paves the way for allowing the usage of the 3 high order DSCP bits in IPv4 (a few call paths already handle them, but in general the stack clears them before IPv4 rule and route lookups).
References: [1] LPC 2021 talk: - https://linuxplumbersconf.org/event/11/contributions/943/ - Direct link to slide deck: https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/178... [2] RFC version of this series: - https://lore.kernel.org/netdev/cover.1638814614.git.gnault@redhat.com/
Changes since RFC: - Use simple mask instead of a bit shift to converting between u8 and dscp_t (Toke). - Reword patch 4 to make it clear that no behaviour change is intended (Toke). - Add kernel selftests. - Rebase on latest net-next.
Guillaume Nault (4): ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules ipv4: Stop taking ECN bits into account in fib4-rules ipv4: Reject routes specifying ECN bits in rtm_tos ipv4: Use dscp_t in struct fib_alias
include/net/inet_dscp.h | 57 ++++++++++++++ include/net/ip_fib.h | 3 +- include/net/ipv6.h | 6 ++ net/ipv4/fib_frontend.c | 11 ++- net/ipv4/fib_lookup.h | 3 +- net/ipv4/fib_rules.c | 18 +++-- net/ipv4/fib_semantics.c | 14 ++-- net/ipv4/fib_trie.c | 58 ++++++++------ net/ipv4/route.c | 3 +- net/ipv6/fib6_rules.c | 19 +++-- tools/testing/selftests/net/fib_rule_tests.sh | 60 ++++++++++++++- tools/testing/selftests/net/fib_tests.sh | 76 +++++++++++++++++++ 12 files changed, 278 insertions(+), 50 deletions(-) create mode 100644 include/net/inet_dscp.h
Define a dscp_t type and its appropriate helpers that ensure ECN bits are not taken into account when handling DSCP.
Use this new type to replace the tclass field of struct fib6_rule, so that fib6-rules don't get influenced by ECN bits anymore.
Before this patch, fib6-rules didn't make any distinction between the DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield options in iproute2) stopped working as soon a packets had at least one of its ECN bits set (as a work around one could create four rules for each DSCP value to match, one for each possible ECN value).
After this patch fib6-rules only compare the DSCP bits. ECN doesn't influence the result anymore. Also, fib6-rules now must have the ECN bits cleared or they will be rejected.
Signed-off-by: Guillaume Nault gnault@redhat.com --- include/net/inet_dscp.h | 57 +++++++++++++++++++ include/net/ipv6.h | 6 ++ net/ipv6/fib6_rules.c | 19 +++++-- tools/testing/selftests/net/fib_rule_tests.sh | 30 +++++++++- 4 files changed, 105 insertions(+), 7 deletions(-) create mode 100644 include/net/inet_dscp.h
diff --git a/include/net/inet_dscp.h b/include/net/inet_dscp.h new file mode 100644 index 000000000000..72f250dffada --- /dev/null +++ b/include/net/inet_dscp.h @@ -0,0 +1,57 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * inet_dscp.h: helpers for handling differentiated services codepoints (DSCP) + * + * DSCP is defined in RFC 2474: + * + * 0 1 2 3 4 5 6 7 + * +---+---+---+---+---+---+---+---+ + * | DSCP | CU | + * +---+---+---+---+---+---+---+---+ + * + * DSCP: differentiated services codepoint + * CU: currently unused + * + * The whole DSCP + CU bits form the DS field. + * The DS field is also commonly called TOS or Traffic Class (for IPv6). + * + * Note: the CU bits are now used for Explicit Congestion Notification + * (RFC 3168). + */ + +#ifndef _INET_DSCP_H +#define _INET_DSCP_H + +#include <linux/types.h> + +/* Special type for storing DSCP values. + * + * A dscp_t variable stores a DS field with the CU (ECN) bits cleared. + * Using dscp_t allows to strictly separate DSCP and ECN bits, thus avoiding + * bugs where ECN bits are erroneously taken into account during FIB lookups + * or policy routing. + * + * Note: to get the real DSCP value contained in a dscp_t variable one would + * have to do a bit shift after calling inet_dscp_to_dsfield(). We could have + * a helper for that, but there's currently no users. + */ +typedef u8 __bitwise dscp_t; + +#define INET_DSCP_MASK 0xfc + +static inline dscp_t inet_dsfield_to_dscp(__u8 dsfield) +{ + return (__force dscp_t)(dsfield & INET_DSCP_MASK); +} + +static inline __u8 inet_dscp_to_dsfield(dscp_t dscp) +{ + return (__force __u8)dscp; +} + +static inline bool inet_validate_dscp(__u8 val) +{ + return !(val & ~INET_DSCP_MASK); +} + +#endif /* _INET_DSCP_H */ diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 082f30256f59..3d898eb6df9c 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -18,6 +18,7 @@ #include <net/ndisc.h> #include <net/flow.h> #include <net/flow_dissector.h> +#include <net/inet_dscp.h> #include <net/snmp.h> #include <net/netns/hash.h>
@@ -975,6 +976,11 @@ static inline u8 ip6_tclass(__be32 flowinfo) return ntohl(flowinfo & IPV6_TCLASS_MASK) >> IPV6_TCLASS_SHIFT; }
+static inline dscp_t ip6_dscp(__be32 flowinfo) +{ + return inet_dsfield_to_dscp(ip6_tclass(flowinfo)); +} + static inline __be32 ip6_make_flowinfo(unsigned int tclass, __be32 flowlabel) { return htonl(tclass << IPV6_TCLASS_SHIFT) | flowlabel; diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c index ec029c86ae06..e2a7b0059669 100644 --- a/net/ipv6/fib6_rules.c +++ b/net/ipv6/fib6_rules.c @@ -16,6 +16,7 @@ #include <linux/indirect_call_wrapper.h>
#include <net/fib_rules.h> +#include <net/inet_dscp.h> #include <net/ipv6.h> #include <net/addrconf.h> #include <net/ip6_route.h> @@ -25,14 +26,14 @@ struct fib6_rule { struct fib_rule common; struct rt6key src; struct rt6key dst; - u8 tclass; + dscp_t dscp; };
static bool fib6_rule_matchall(const struct fib_rule *rule) { struct fib6_rule *r = container_of(rule, struct fib6_rule, common);
- if (r->dst.plen || r->src.plen || r->tclass) + if (r->dst.plen || r->src.plen || r->dscp) return false; return fib_rule_matchall(rule); } @@ -323,7 +324,7 @@ INDIRECT_CALLABLE_SCOPE int fib6_rule_match(struct fib_rule *rule, return 0; }
- if (r->tclass && r->tclass != ip6_tclass(fl6->flowlabel)) + if (r->dscp && r->dscp != ip6_dscp(fl6->flowlabel)) return 0;
if (rule->ip_proto && (rule->ip_proto != fl6->flowi6_proto)) @@ -349,6 +350,13 @@ static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb, struct net *net = sock_net(skb->sk); struct fib6_rule *rule6 = (struct fib6_rule *) rule;
+ if (!inet_validate_dscp(frh->tos)) { + NL_SET_ERR_MSG(extack, + "Invalid dsfield (tos): ECN bits must be 0"); + goto errout; + } + rule6->dscp = inet_dsfield_to_dscp(frh->tos); + if (rule->action == FR_ACT_TO_TBL && !rule->l3mdev) { if (rule->table == RT6_TABLE_UNSPEC) { NL_SET_ERR_MSG(extack, "Invalid table"); @@ -369,7 +377,6 @@ static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
rule6->src.plen = frh->src_len; rule6->dst.plen = frh->dst_len; - rule6->tclass = frh->tos;
if (fib_rule_requires_fldissect(rule)) net->ipv6.fib6_rules_require_fldissect++; @@ -402,7 +409,7 @@ static int fib6_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh, if (frh->dst_len && (rule6->dst.plen != frh->dst_len)) return 0;
- if (frh->tos && (rule6->tclass != frh->tos)) + if (frh->tos && inet_dscp_to_dsfield(rule6->dscp) != frh->tos) return 0;
if (frh->src_len && @@ -423,7 +430,7 @@ static int fib6_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
frh->dst_len = rule6->dst.plen; frh->src_len = rule6->src.plen; - frh->tos = rule6->tclass; + frh->tos = inet_dscp_to_dsfield(rule6->dscp);
if ((rule6->dst.plen && nla_put_in6_addr(skb, FRA_DST, &rule6->dst.addr)) || diff --git a/tools/testing/selftests/net/fib_rule_tests.sh b/tools/testing/selftests/net/fib_rule_tests.sh index 3b0489910422..d7a9ab3be1d3 100755 --- a/tools/testing/selftests/net/fib_rule_tests.sh +++ b/tools/testing/selftests/net/fib_rule_tests.sh @@ -114,10 +114,25 @@ fib_rule6_test_match_n_redirect() log_test $? 0 "rule6 del by pref: $description" }
+fib_rule6_test_reject() +{ + local match="$1" + local rc + + $IP -6 rule add $match table $RTABLE 2>/dev/null + rc=$? + log_test $rc 2 "rule6 check: $match" + + if [ $rc -eq 0 ]; then + $IP -6 rule del $match table $RTABLE + fi +} + fib_rule6_test() { local getmatch local match + local cnt
# setup the fib rule redirect route $IP -6 route add table $RTABLE default via $GW_IP6 dev $DEV onlink @@ -128,8 +143,21 @@ fib_rule6_test() match="from $SRC_IP6 iif $DEV" fib_rule6_test_match_n_redirect "$match" "$match" "iif redirect to table"
+ # Reject dsfield (tos) options which have ECN bits set + for cnt in $(seq 1 3); do + match="dsfield $cnt" + fib_rule6_test_reject "$match" + done + + # Don't take ECN bits into account when matching on dsfield match="tos 0x10" - fib_rule6_test_match_n_redirect "$match" "$match" "tos redirect to table" + for cnt in "0x10" "0x11" "0x12" "0x13"; do + # Using option 'tos' instead of 'dsfield' as old iproute2 + # versions don't support 'dsfield' in ip rule show. + getmatch="tos $cnt" + fib_rule6_test_match_n_redirect "$match" "$getmatch" \ + "$getmatch redirect to table" + done
match="fwmark 0x64" getmatch="mark 0x64"
Use the new dscp_t type to replace the tos field of struct fib4_rule, so that fib4-rules consistently ignore ECN bits.
Before this patch, fib4-rules did accept rules with the high order ECN bit set (but not the low order one). Also, it relied on its callers masking the ECN bits of ->flowi4_tos to prevent those from influencing the result. This was brittle and a few call paths still do the lookup without masking the ECN bits first.
After this patch fib4-rules only compare the DSCP bits. ECN can't influence the result anymore, even if the caller didn't mask these bits. Also, fib4-rules now must have both ECN bits cleared or they will be rejected.
Signed-off-by: Guillaume Nault gnault@redhat.com --- net/ipv4/fib_rules.c | 18 ++++++----- tools/testing/selftests/net/fib_rule_tests.sh | 30 ++++++++++++++++++- 2 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index e0b6c8b6de57..117c48571cf0 100644 --- a/net/ipv4/fib_rules.c +++ b/net/ipv4/fib_rules.c @@ -23,6 +23,7 @@ #include <linux/list.h> #include <linux/rcupdate.h> #include <linux/export.h> +#include <net/inet_dscp.h> #include <net/ip.h> #include <net/route.h> #include <net/tcp.h> @@ -35,7 +36,7 @@ struct fib4_rule { struct fib_rule common; u8 dst_len; u8 src_len; - u8 tos; + dscp_t dscp; __be32 src; __be32 srcmask; __be32 dst; @@ -49,7 +50,7 @@ static bool fib4_rule_matchall(const struct fib_rule *rule) { struct fib4_rule *r = container_of(rule, struct fib4_rule, common);
- if (r->dst_len || r->src_len || r->tos) + if (r->dst_len || r->src_len || r->dscp) return false; return fib_rule_matchall(rule); } @@ -185,7 +186,7 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule, ((daddr ^ r->dst) & r->dstmask)) return 0;
- if (r->tos && (r->tos != fl4->flowi4_tos)) + if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos)) return 0;
if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto)) @@ -225,10 +226,12 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb, int err = -EINVAL; struct fib4_rule *rule4 = (struct fib4_rule *) rule;
- if (frh->tos & ~IPTOS_TOS_MASK) { - NL_SET_ERR_MSG(extack, "Invalid tos"); + if (!inet_validate_dscp(frh->tos)) { + NL_SET_ERR_MSG(extack, + "Invalid dsfield (tos): ECN bits must be 0"); goto errout; } + rule4->dscp = inet_dsfield_to_dscp(frh->tos);
/* split local/main if they are not already split */ err = fib_unmerge(net); @@ -270,7 +273,6 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb, rule4->srcmask = inet_make_mask(rule4->src_len); rule4->dst_len = frh->dst_len; rule4->dstmask = inet_make_mask(rule4->dst_len); - rule4->tos = frh->tos;
net->ipv4.fib_has_custom_rules = true;
@@ -313,7 +315,7 @@ static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh, if (frh->dst_len && (rule4->dst_len != frh->dst_len)) return 0;
- if (frh->tos && (rule4->tos != frh->tos)) + if (frh->tos && inet_dscp_to_dsfield(rule4->dscp) != frh->tos) return 0;
#ifdef CONFIG_IP_ROUTE_CLASSID @@ -337,7 +339,7 @@ static int fib4_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
frh->dst_len = rule4->dst_len; frh->src_len = rule4->src_len; - frh->tos = rule4->tos; + frh->tos = inet_dscp_to_dsfield(rule4->dscp);
if ((rule4->dst_len && nla_put_in_addr(skb, FRA_DST, rule4->dst)) || diff --git a/tools/testing/selftests/net/fib_rule_tests.sh b/tools/testing/selftests/net/fib_rule_tests.sh index d7a9ab3be1d3..4f70baad867d 100755 --- a/tools/testing/selftests/net/fib_rule_tests.sh +++ b/tools/testing/selftests/net/fib_rule_tests.sh @@ -215,10 +215,25 @@ fib_rule4_test_match_n_redirect() log_test $? 0 "rule4 del by pref: $description" }
+fib_rule4_test_reject() +{ + local match="$1" + local rc + + $IP rule add $match table $RTABLE 2>/dev/null + rc=$? + log_test $rc 2 "rule4 check: $match" + + if [ $rc -eq 0 ]; then + $IP rule del $match table $RTABLE + fi +} + fib_rule4_test() { local getmatch local match + local cnt
# setup the fib rule redirect route $IP route add table $RTABLE default via $GW_IP4 dev $DEV onlink @@ -234,8 +249,21 @@ fib_rule4_test() fib_rule4_test_match_n_redirect "$match" "$match" "iif redirect to table" ip netns exec testns sysctl -qw net.ipv4.ip_forward=0
+ # Reject dsfield (tos) options which have ECN bits set + for cnt in $(seq 1 3); do + match="dsfield $cnt" + fib_rule4_test_reject "$match" + done + + # Don't take ECN bits into account when matching on dsfield match="tos 0x10" - fib_rule4_test_match_n_redirect "$match" "$match" "tos redirect to table" + for cnt in "0x10" "0x11" "0x12" "0x13"; do + # Using option 'tos' instead of 'dsfield' as old iproute2 + # versions don't support 'dsfield' in ip rule show. + getmatch="tos $cnt" + fib_rule4_test_match_n_redirect "$match" "$getmatch" \ + "$getmatch redirect to table" + done
match="fwmark 0x64" getmatch="mark 0x64"
Use the new dscp_t type to replace the fc_tos field of fib_config, to ensure IPv4 routes aren't influenced by ECN bits when configured with non-zero rtm_tos.
Before this patch, IPv4 routes specifying an rtm_tos with some of the ECN bits set were accepted. However they wouldn't work (never match) as IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB lookup (although a few buggy code paths don't).
After this patch, IPv4 routes specifying an rtm_tos with any ECN bit set is rejected.
Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted, but treated as if it were 0.
Signed-off-by: Guillaume Nault gnault@redhat.com --- Shuah, FYI, this is the patch I was refering to in our discussion about testing invalid tos values: https://lore.kernel.org/netdev/20220202232555.GC15826@pc-4.home/
include/net/ip_fib.h | 3 +- net/ipv4/fib_frontend.c | 11 +++- net/ipv4/fib_trie.c | 7 ++- tools/testing/selftests/net/fib_tests.sh | 76 ++++++++++++++++++++++++ 4 files changed, 93 insertions(+), 4 deletions(-)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index c4297704bbcb..6a82bcb8813b 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -17,6 +17,7 @@ #include <linux/rcupdate.h> #include <net/fib_notifier.h> #include <net/fib_rules.h> +#include <net/inet_dscp.h> #include <net/inetpeer.h> #include <linux/percpu.h> #include <linux/notifier.h> @@ -24,7 +25,7 @@
struct fib_config { u8 fc_dst_len; - u8 fc_tos; + dscp_t fc_dscp; u8 fc_protocol; u8 fc_scope; u8 fc_type; diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 4d61ddd8a0ec..c60e1d1ed2b0 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -32,6 +32,7 @@ #include <linux/list.h> #include <linux/slab.h>
+#include <net/inet_dscp.h> #include <net/ip.h> #include <net/protocol.h> #include <net/route.h> @@ -735,8 +736,16 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb, memset(cfg, 0, sizeof(*cfg));
rtm = nlmsg_data(nlh); + + if (!inet_validate_dscp(rtm->rtm_tos)) { + NL_SET_ERR_MSG(extack, + "Invalid dsfield (tos): ECN bits must be 0"); + err = -EINVAL; + goto errout; + } + cfg->fc_dscp = inet_dsfield_to_dscp(rtm->rtm_tos); + cfg->fc_dst_len = rtm->rtm_dst_len; - cfg->fc_tos = rtm->rtm_tos; cfg->fc_table = rtm->rtm_table; cfg->fc_protocol = rtm->rtm_protocol; cfg->fc_scope = rtm->rtm_scope; diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 8060524f4256..d937eeebb812 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -61,6 +61,7 @@ #include <linux/vmalloc.h> #include <linux/notifier.h> #include <net/net_namespace.h> +#include <net/inet_dscp.h> #include <net/ip.h> #include <net/protocol.h> #include <net/route.h> @@ -1210,9 +1211,9 @@ int fib_table_insert(struct net *net, struct fib_table *tb, struct fib_info *fi; u8 plen = cfg->fc_dst_len; u8 slen = KEYLENGTH - plen; - u8 tos = cfg->fc_tos; u32 key; int err; + u8 tos;
key = ntohl(cfg->fc_dst);
@@ -1227,6 +1228,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb, goto err; }
+ tos = inet_dscp_to_dsfield(cfg->fc_dscp); l = fib_find_node(t, &tp, key); fa = l ? fib_find_alias(&l->leaf, slen, tos, fi->fib_priority, tb->tb_id, false) : NULL; @@ -1703,8 +1705,8 @@ int fib_table_delete(struct net *net, struct fib_table *tb, struct key_vector *l, *tp; u8 plen = cfg->fc_dst_len; u8 slen = KEYLENGTH - plen; - u8 tos = cfg->fc_tos; u32 key; + u8 tos;
key = ntohl(cfg->fc_dst);
@@ -1715,6 +1717,7 @@ int fib_table_delete(struct net *net, struct fib_table *tb, if (!l) return -ESRCH;
+ tos = inet_dscp_to_dsfield(cfg->fc_dscp); fa = fib_find_alias(&l->leaf, slen, tos, 0, tb->tb_id, false); if (!fa) return -ESRCH; diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh index 996af1ae3d3d..bb73235976b3 100755 --- a/tools/testing/selftests/net/fib_tests.sh +++ b/tools/testing/selftests/net/fib_tests.sh @@ -1447,6 +1447,81 @@ ipv4_local_rt_cache() log_test $? 0 "Cached route removed from VRF port device" }
+ipv4_rt_dsfield() +{ + echo + echo "IPv4 route with dsfield tests" + + run_cmd "$IP route flush 172.16.102.0/24" + + # New routes should reject dsfield options that interfere with ECN + run_cmd "$IP route add 172.16.102.0/24 dsfield 0x01 via 172.16.101.2" + log_test $? 2 "Reject route with dsfield 0x01" + + run_cmd "$IP route add 172.16.102.0/24 dsfield 0x02 via 172.16.101.2" + log_test $? 2 "Reject route with dsfield 0x02" + + run_cmd "$IP route add 172.16.102.0/24 dsfield 0x03 via 172.16.101.2" + log_test $? 2 "Reject route with dsfield 0x03" + + # A generic route that doesn't take DSCP into account + run_cmd "$IP route add 172.16.102.0/24 via 172.16.101.2" + + # A more specific route for DSCP 0x10 + run_cmd "$IP route add 172.16.102.0/24 dsfield 0x10 via 172.16.103.2" + + # DSCP 0x10 should match the specific route, no matter the ECN bits + $IP route get fibmatch 172.16.102.1 dsfield 0x10 | \ + grep -q "via 172.16.103.2" + log_test $? 0 "IPv4 route with DSCP and ECN:Not-ECT" + + $IP route get fibmatch 172.16.102.1 dsfield 0x11 | \ + grep -q "via 172.16.103.2" + log_test $? 0 "IPv4 route with DSCP and ECN:ECT(1)" + + $IP route get fibmatch 172.16.102.1 dsfield 0x12 | \ + grep -q "via 172.16.103.2" + log_test $? 0 "IPv4 route with DSCP and ECN:ECT(0)" + + $IP route get fibmatch 172.16.102.1 dsfield 0x13 | \ + grep -q "via 172.16.103.2" + log_test $? 0 "IPv4 route with DSCP and ECN:CE" + + # Unknown DSCP should match the generic route, no matter the ECN bits + $IP route get fibmatch 172.16.102.1 dsfield 0x14 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with unknown DSCP and ECN:Not-ECT" + + $IP route get fibmatch 172.16.102.1 dsfield 0x15 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with unknown DSCP and ECN:ECT(1)" + + $IP route get fibmatch 172.16.102.1 dsfield 0x16 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with unknown DSCP and ECN:ECT(0)" + + $IP route get fibmatch 172.16.102.1 dsfield 0x17 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with unknown DSCP and ECN:CE" + + # Null DSCP should match the generic route, no matter the ECN bits + $IP route get fibmatch 172.16.102.1 dsfield 0x00 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with no DSCP and ECN:Not-ECT" + + $IP route get fibmatch 172.16.102.1 dsfield 0x01 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with no DSCP and ECN:ECT(1)" + + $IP route get fibmatch 172.16.102.1 dsfield 0x02 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with no DSCP and ECN:ECT(0)" + + $IP route get fibmatch 172.16.102.1 dsfield 0x03 | \ + grep -q "via 172.16.101.2" + log_test $? 0 "IPv4 route with no DSCP and ECN:CE" +} + ipv4_route_test() { route_setup @@ -1454,6 +1529,7 @@ ipv4_route_test() ipv4_rt_add ipv4_rt_replace ipv4_local_rt_cache + ipv4_rt_dsfield
route_cleanup }
On 2/4/22 6:58 AM, Guillaume Nault wrote:
Use the new dscp_t type to replace the fc_tos field of fib_config, to ensure IPv4 routes aren't influenced by ECN bits when configured with non-zero rtm_tos.
Before this patch, IPv4 routes specifying an rtm_tos with some of the ECN bits set were accepted. However they wouldn't work (never match) as IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB lookup (although a few buggy code paths don't).
After this patch, IPv4 routes specifying an rtm_tos with any ECN bit set is rejected.
Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted, but treated as if it were 0.
Signed-off-by: Guillaume Nault gnault@redhat.com
Shuah, FYI, this is the patch I was refering to in our discussion about testing invalid tos values: https://lore.kernel.org/netdev/20220202232555.GC15826@pc-4.home/
This give me context. Thank you.
thanks, -- Shuah
On 2/4/22 5:58 AM, Guillaume Nault wrote:
The networking stack currently doesn't clearly distinguish between DSCP and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or structure fields), and each part of the stack handles them in their own way, using different macros. This has created several bugs in the past and some uncommon code paths are still unfixed.
Such bugs generally manifest by selecting invalid routes because of ECN bits interfering with FIB routes and rules lookups (more details in the LPC 2021 talk[1] and in the RFC of this series[2]).
This patch series aims at preventing the introduction of such bugs (and detecting existing ones), by introducing a dscp_t type, representing "sanitised" DSCP values (that is, with no ECN information), as opposed to plain u8 values that contain both DSCP and ECN information. dscp_t makes it clear for the reader what we're working on, and Sparse can flag invalid interactions between dscp_t and plain u8.
This series converts only a few variables and structures:
Patch 1 converts the tclass field of struct fib6_rule. It effectively forbids the use of ECN bits in the tos/dsfield option of ip -6 rule. Rules now match packets solely based on their DSCP bits, so ECN doesn't influence the result any more. This contrasts with the previous behaviour where all 8 bits of the Traffic Class field were used. It is believed that this change is acceptable as matching ECN bits wasn't usable for IPv4, so only IPv6-only deployments could be depending on it. Also the previous behaviour made DSCP-based ip6-rules fail for packets with both a DSCP and an ECN mark, which is another reason why any such deploy is unlikely.
Patch 2 converts the tos field of struct fib4_rule. This one too effectively forbids defining ECN bits, this time in ip -4 rule. Before that, setting ECN bit 1 was accepted, while ECN bit 0 was rejected. But even when accepted, the rule would never match, as the packets would have their ECN bits cleared before doing the rule lookup.
Patch 3 converts the fc_tos field of struct fib_config. This is equivalent to patch 2, but for IPv4 routes. Routes using a tos/dsfield option with any ECN bit set is now rejected. Before this patch, they were accepted but, as with ip4 rules, these routes couldn't match any packet, since their ECN bits are cleared before the lookup.
Patch 4 converts the fa_tos field of struct fib_alias. This one is pure internal u8 to dscp_t conversion. While patches 1-3 had user facing consequences, this patch shouldn't have any side effect and is there to give an overview of what future conversion patches will look like. Conversions are quite mechanical, but imply some code churn, which is the price for the extra clarity a possibility of type checking.
To summarise, all the behaviour changes required for the dscp_t type approach to work should be contained in patches 1-3. These changes are edge cases of ip-route and ip-rule that don't currently work properly. So they should be safe. Also, a kernel selftest is added for each of them.
seems like the right directions to me.
Acked-by: David Ahern dsahern@kernel.org
Guillaume Nault gnault@redhat.com writes:
The networking stack currently doesn't clearly distinguish between DSCP and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or structure fields), and each part of the stack handles them in their own way, using different macros. This has created several bugs in the past and some uncommon code paths are still unfixed.
Such bugs generally manifest by selecting invalid routes because of ECN bits interfering with FIB routes and rules lookups (more details in the LPC 2021 talk[1] and in the RFC of this series[2]).
This patch series aims at preventing the introduction of such bugs (and detecting existing ones), by introducing a dscp_t type, representing "sanitised" DSCP values (that is, with no ECN information), as opposed to plain u8 values that contain both DSCP and ECN information. dscp_t makes it clear for the reader what we're working on, and Sparse can flag invalid interactions between dscp_t and plain u8.
This series converts only a few variables and structures:
Patch 1 converts the tclass field of struct fib6_rule. It effectively forbids the use of ECN bits in the tos/dsfield option of ip -6 rule. Rules now match packets solely based on their DSCP bits, so ECN doesn't influence the result any more. This contrasts with the previous behaviour where all 8 bits of the Traffic Class field were used. It is believed that this change is acceptable as matching ECN bits wasn't usable for IPv4, so only IPv6-only deployments could be depending on it. Also the previous behaviour made DSCP-based ip6-rules fail for packets with both a DSCP and an ECN mark, which is another reason why any such deploy is unlikely.
Patch 2 converts the tos field of struct fib4_rule. This one too effectively forbids defining ECN bits, this time in ip -4 rule. Before that, setting ECN bit 1 was accepted, while ECN bit 0 was rejected. But even when accepted, the rule would never match, as the packets would have their ECN bits cleared before doing the rule lookup.
Patch 3 converts the fc_tos field of struct fib_config. This is equivalent to patch 2, but for IPv4 routes. Routes using a tos/dsfield option with any ECN bit set is now rejected. Before this patch, they were accepted but, as with ip4 rules, these routes couldn't match any packet, since their ECN bits are cleared before the lookup.
Patch 4 converts the fa_tos field of struct fib_alias. This one is pure internal u8 to dscp_t conversion. While patches 1-3 had user facing consequences, this patch shouldn't have any side effect and is there to give an overview of what future conversion patches will look like. Conversions are quite mechanical, but imply some code churn, which is the price for the extra clarity a possibility of type checking.
To summarise, all the behaviour changes required for the dscp_t type approach to work should be contained in patches 1-3. These changes are edge cases of ip-route and ip-rule that don't currently work properly. So they should be safe. Also, a kernel selftest is added for each of them.
Finally, this work also paves the way for allowing the usage of the 3 high order DSCP bits in IPv4 (a few call paths already handle them, but in general the stack clears them before IPv4 rule and route lookups).
LGTM; thanks again for doing this!
Reviewed-by: Toke Høiland-Jørgensen toke@redhat.com
Hello:
This series was applied to netdev/net-next.git (master) by Jakub Kicinski kuba@kernel.org:
On Fri, 4 Feb 2022 14:58:09 +0100 you wrote:
The networking stack currently doesn't clearly distinguish between DSCP and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or structure fields), and each part of the stack handles them in their own way, using different macros. This has created several bugs in the past and some uncommon code paths are still unfixed.
Such bugs generally manifest by selecting invalid routes because of ECN bits interfering with FIB routes and rules lookups (more details in the LPC 2021 talk[1] and in the RFC of this series[2]).
[...]
Here is the summary with links: - [net-next,1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules (no matching commit) - [net-next,2/4] ipv4: Stop taking ECN bits into account in fib4-rules https://git.kernel.org/netdev/net-next/c/563f8e97e054 - [net-next,3/4] ipv4: Reject routes specifying ECN bits in rtm_tos https://git.kernel.org/netdev/net-next/c/f55fbb6afb8d - [net-next,4/4] ipv4: Use dscp_t in struct fib_alias https://git.kernel.org/netdev/net-next/c/32ccf1107980
You are awesome, thank you!
linux-kselftest-mirror@lists.linaro.org