Intel SIOV allows creating virtual devices of which the vRID is
represented by a pasid of a physical device. It is called as SIOV
virtual device in this series. Such devices can be bound to an iommufd
as physical device does and then later be attached to an IOAS/hwpt
using that pasid. Such PASIDs are called as default pasid.
iommufd has already supported pasid attach[1]. So a simple way to
support SIOV virtual device attachment is to let device driver call
the iommufd_device_pasid_attach() and pass in the default pasid for
the virtual device. This should work for now, but it may have problem
if iommufd core wants to differentiate the default pasids with other
kind of pasids (e.g. pasid given by userspace). In the later forwarding
page request to userspace, the default pasids are not supposed to send
to userspace as default pasids are mainly used by the SIOV device driver.
With above reason, this series chooses to have a new API to bind the
default pasid to iommufd, and extends the iommufd_device_attach() to
convert the attachment to be pasid attach with the default pasid. Device
drivers (e.g. VFIO) that support SIOV shall call the below APIs to
interact with iommufd:
- iommufd_device_bind_pasid(): Bind virtual device (a pasid of a device)
to iommufd;
- iommufd_device_attach(): Attach a SIOV virtual device to IOAS/HWPT;
- iommufd_device_replace(): Replace IOAS/HWPT of a SIOV virtual device;
- iommufd_device_detach(): Detach IOAS/HWPT of a SIOV virtual device;
- iommufd_device_unbind(): Unbind virtual device from iommufd;
For vfio devices, the device drivers that support SIOV should:
- use below API to register vdev for SIOV virtual device
vfio_register_pasid_iommu_dev()
- use below API to bind vdev to iommufd in .bind_iommufd() callback
iommufd_device_bind_pasid()
- allocate pasid by itself before calling iommufd_device_bind_pasid()
Complete code can be found at[2]
[1] https://lore.kernel.org/linux-iommu/20230926092651.17041-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid_siov
Regards,
Yi Liu
Kevin Tian (5):
iommufd: Handle unsafe interrupts in a separate function
iommufd: Introduce iommufd_alloc_device()
iommufd: Add iommufd_device_bind_pasid()
iommufd: Support attach/replace for SIOV virtual device {dev, pasid}
vfio: Add vfio_register_pasid_iommu_dev()
Yi Liu (2):
iommufd/selftest: Extend IOMMU_TEST_OP_MOCK_DOMAIN to pass in pasid
iommufd/selftest: Add test coverage for SIOV virtual device
drivers/iommu/iommufd/device.c | 163 ++++++++++++++----
drivers/iommu/iommufd/iommufd_private.h | 7 +
drivers/iommu/iommufd/iommufd_test.h | 2 +
drivers/iommu/iommufd/selftest.c | 10 +-
drivers/vfio/group.c | 18 ++
drivers/vfio/vfio.h | 8 +
drivers/vfio/vfio_main.c | 10 ++
include/linux/iommufd.h | 3 +
include/linux/vfio.h | 1 +
tools/testing/selftests/iommu/iommufd.c | 75 ++++++--
.../selftests/iommu/iommufd_fail_nth.c | 42 ++++-
tools/testing/selftests/iommu/iommufd_utils.h | 21 ++-
12 files changed, 296 insertions(+), 64 deletions(-)
--
2.34.1
Multiple files/programs in `tools/testing/selftests/bpf/prog_tests/` still
heavily use the `CHECK` macro, even when better `ASSERT_` alternatives are
available.
As it was already pointed out by Yonghong Song [1] in the bpf selftests the use
of the ASSERT_* series of macros is preferred over the CHECK macro.
This patchset replaces the usage of `CHECK(` macros to the equivalent `ASSERT_`
family of macros in the following prog_tests:
- bind_perm.c
- bpf_obj_id.c
- bpf_tcp_ca.c
- vmlinux.c
[1] https://lore.kernel.org/lkml/0a142924-633c-44e6-9a92-2dc019656bf2@linux.dev
Changes in v3:
- Addressed the following points mentioned by Yonghong Song
- Improved `bpf_map_lookup_elem` assertion in bpf_tcp_ca.
- Replaced assertion introduced in v2 with one that checks `thread_ret`
instead of `pthread_join`. This ensures that `server`'s return value
(thread_ret) is the one being checked, as oposed to `pthread_join`'s
return value, since the latter one is less likely to fail.
Changes in v2:
- Fixed pthread_join assertion that broke the previous test
Previous version:
v2 - https://lore.kernel.org/lkml/GV1PR10MB6563AECF8E94798A1E5B36A4E8B6A@GV1PR10…
v1 - https://lore.kernel.org/lkml/GV1PR10MB6563FCFF1C5DEBE84FEA985FE8B0A@GV1PR10…
Yuran Pereira (4):
Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
Replaces the usage of CHECK calls for ASSERTs in bind_perm
Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in
vmlinux
.../selftests/bpf/prog_tests/bind_perm.c | 6 +-
.../selftests/bpf/prog_tests/bpf_obj_id.c | 204 +++++++-----------
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 48 ++---
.../selftests/bpf/prog_tests/vmlinux.c | 16 +-
4 files changed, 105 insertions(+), 169 deletions(-)
--
2.25.1
Changes from v1:
* Dropped some changes that were independently fixed[1]
* No longer separate the f strings to their own patch
* Use r strings when the value is a regular expression
* Updated verification script
In retrospect a script to find the instances and apply fixes isn't that
useful for review, so the attached script this time just looks for
differences in the AST. Apply the series and run the script, with
the two references to compare as arguments.
There are some intentional changes to the AST now though, as the r strings
turn '\t' from a single character tab into a backslash and 't' character
pair (similar for '\n'). This does not affect the correctness of the
regular expression though.
v1: https://lore.kernel.org/all/20230814060704.79655-1-bgray@linux.ibm.com/
[1]: https://lore.kernel.org/all/20230816122133.1231599-1-vishalc@linux.ibm.com/
---
#!/usr/bin/env python3
"""
Verify Python syntax trees are equivalent between two references
"""
import argparse
import ast
from pathlib import Path
import subprocess as sp
def read_file(path: Path, ref: str) -> str:
return sp.run(f"git show {ref}:{path}", stdout=sp.PIPE, shell=True, encoding="utf-8", check=True).stdout
parser = argparse.ArgumentParser("Compare Python ASTs between revisions")
parser.add_argument("ref1", type=str, help="First revision to use")
parser.add_argument("ref2", type=str, help="Second revision to use")
args = parser.parse_args()
for pyfile in Path(".").glob("**/*.py"):
try:
ref1_content = read_file(pyfile, args.ref1)
ref2_content = read_file(pyfile, args.ref2)
except Exception as e:
print(f"ERROR:{pyfile}: Failed to read ({e})")
continue
try:
ref1_syntax = ast.parse(ref1_content, filename=pyfile)
ref2_syntax = ast.parse(ref2_content, filename=pyfile)
except SyntaxError as e:
print(f"ERROR:{pyfile}: Failed to parse, is it Python3? ({e})")
continue
if ast.dump(ref1_syntax) != ast.dump(ref2_syntax):
print(f"ERROR:{pyfile}: Revisions have different AST")
cmd = f"diff <(git show {args.ref1}:{pyfile} | python -m ast) <(git show {args.ref2}:{pyfile} | python -m ast)"
print(cmd)
sp.run(cmd, shell=True)
continue
Benjamin Gray (7):
ia64: fix Python string escapes
Documentation/sphinx: fix Python string escapes
drivers/comedi: fix Python string escapes
scripts: fix Python string escapes
tools/perf: fix Python string escapes
tools/power: fix Python string escapes
selftests/bpf: fix Python string escapes
Documentation/sphinx/cdomain.py | 2 +-
Documentation/sphinx/kernel_abi.py | 2 +-
Documentation/sphinx/kernel_feat.py | 2 +-
Documentation/sphinx/kerneldoc.py | 2 +-
Documentation/sphinx/maintainers_include.py | 8 +++---
arch/ia64/scripts/unwcheck.py | 2 +-
.../ni_routing/tools/convert_csv_to_c.py | 2 +-
scripts/clang-tools/gen_compile_commands.py | 2 +-
scripts/gdb/linux/symbols.py | 2 +-
tools/perf/pmu-events/jevents.py | 2 +-
.../scripts/python/arm-cs-trace-disasm.py | 4 +--
tools/perf/scripts/python/compaction-times.py | 2 +-
.../scripts/python/exported-sql-viewer.py | 4 +--
tools/power/pm-graph/bootgraph.py | 12 ++++-----
.../selftests/bpf/test_bpftool_synctypes.py | 26 +++++++++----------
tools/testing/selftests/bpf/test_offload.py | 2 +-
16 files changed, 38 insertions(+), 38 deletions(-)
--
2.41.0
From: Willem de Bruijn <willemb(a)google.com>
Commit 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR
scheduling") introduces multiple traffic bands, and per-band maximum
packet count.
Per-band limits ensures that packets in one class cannot fill the
entire qdisc and so cause DoS to the traffic in the other classes.
Verify this behavior:
1. set the limit to 10 per band
2. send 20 pkts on band A: verify that 10 are queued, 10 dropped
3. send 20 pkts on band A: verify that 0 are queued, 20 dropped
4. send 20 pkts on band B: verify that 10 are queued, 10 dropped
Packets must remain queued for a period to trigger this behavior.
Use SO_TXTIME to store packets for 100 msec.
The test reuses existing upstream test infra. The script is a fork of
cmsg_time.sh. The scripts call cmsg_sender.
The test extends cmsg_sender with two arguments:
* '-P' SO_PRIORITY
There is a subtle difference between IPv4 and IPv6 stack behavior:
PF_INET/IP_TOS sets IP header bits and sk_priority
PF_INET6/IPV6_TCLASS sets IP header bits BUT NOT sk_priority
* '-n' num pkts
Send multiple packets in quick succession.
I first attempted a for loop in the script, but this is too slow in
virtualized environments, causing flakiness as the 100ms timeout is
reached and packets are dequeued.
Also do not wait for timestamps to be queued unless timestamps are
requested.
Signed-off-by: Willem de Bruijn <willemb(a)google.com>
---
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/cmsg_sender.c | 50 ++++++++++------
.../testing/selftests/net/fq_band_pktlimit.sh | 57 +++++++++++++++++++
3 files changed, 91 insertions(+), 17 deletions(-)
create mode 100755 tools/testing/selftests/net/fq_band_pktlimit.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 5b2aca4c5f10..9274edfb76ff 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -91,6 +91,7 @@ TEST_PROGS += test_bridge_neigh_suppress.sh
TEST_PROGS += test_vxlan_nolocalbypass.sh
TEST_PROGS += test_bridge_backup_port.sh
TEST_PROGS += fdb_flush.sh
+TEST_PROGS += fq_band_pktlimit.sh
TEST_FILES := settings
diff --git a/tools/testing/selftests/net/cmsg_sender.c b/tools/testing/selftests/net/cmsg_sender.c
index 24b21b15ed3f..8d7575389f58 100644
--- a/tools/testing/selftests/net/cmsg_sender.c
+++ b/tools/testing/selftests/net/cmsg_sender.c
@@ -45,11 +45,13 @@ struct options {
const char *host;
const char *service;
unsigned int size;
+ unsigned int num_pkt;
struct {
unsigned int mark;
unsigned int dontfrag;
unsigned int tclass;
unsigned int hlimit;
+ unsigned int priority;
} sockopt;
struct {
unsigned int family;
@@ -72,6 +74,7 @@ struct options {
} v6;
} opt = {
.size = 13,
+ .num_pkt = 1,
.sock = {
.family = AF_UNSPEC,
.type = SOCK_DGRAM,
@@ -112,7 +115,7 @@ static void cs_parse_args(int argc, char *argv[])
{
int o;
- while ((o = getopt(argc, argv, "46sS:p:m:M:d:tf:F:c:C:l:L:H:")) != -1) {
+ while ((o = getopt(argc, argv, "46sS:p:P:m:M:n:d:tf:F:c:C:l:L:H:")) != -1) {
switch (o) {
case 's':
opt.silent_send = true;
@@ -138,7 +141,9 @@ static void cs_parse_args(int argc, char *argv[])
cs_usage(argv[0]);
}
break;
-
+ case 'P':
+ opt.sockopt.priority = atoi(optarg);
+ break;
case 'm':
opt.mark.ena = true;
opt.mark.val = atoi(optarg);
@@ -146,6 +151,9 @@ static void cs_parse_args(int argc, char *argv[])
case 'M':
opt.sockopt.mark = atoi(optarg);
break;
+ case 'n':
+ opt.num_pkt = atoi(optarg);
+ break;
case 'd':
opt.txtime.ena = true;
opt.txtime.delay = atoi(optarg);
@@ -410,6 +418,10 @@ static void ca_set_sockopts(int fd)
setsockopt(fd, SOL_IPV6, IPV6_UNICAST_HOPS,
&opt.sockopt.hlimit, sizeof(opt.sockopt.hlimit)))
error(ERN_SOCKOPT, errno, "setsockopt IPV6_HOPLIMIT");
+ if (opt.sockopt.priority &&
+ setsockopt(fd, SOL_SOCKET, SO_PRIORITY,
+ &opt.sockopt.priority, sizeof(opt.sockopt.priority)))
+ error(ERN_SOCKOPT, errno, "setsockopt SO_PRIORITY");
}
int main(int argc, char *argv[])
@@ -421,6 +433,7 @@ int main(int argc, char *argv[])
char *buf;
int err;
int fd;
+ int i;
cs_parse_args(argc, argv);
@@ -480,24 +493,27 @@ int main(int argc, char *argv[])
cs_write_cmsg(fd, &msg, cbuf, sizeof(cbuf));
- err = sendmsg(fd, &msg, 0);
- if (err < 0) {
- if (!opt.silent_send)
- fprintf(stderr, "send failed: %s\n", strerror(errno));
- err = ERN_SEND;
- goto err_out;
- } else if (err != (int)opt.size) {
- fprintf(stderr, "short send\n");
- err = ERN_SEND_SHORT;
- goto err_out;
- } else {
- err = ERN_SUCCESS;
+ for (i = 0; i < opt.num_pkt; i++) {
+ err = sendmsg(fd, &msg, 0);
+ if (err < 0) {
+ if (!opt.silent_send)
+ fprintf(stderr, "send failed: %s\n", strerror(errno));
+ err = ERN_SEND;
+ goto err_out;
+ } else if (err != (int)opt.size) {
+ fprintf(stderr, "short send\n");
+ err = ERN_SEND_SHORT;
+ goto err_out;
+ }
}
+ err = ERN_SUCCESS;
- /* Make sure all timestamps have time to loop back */
- usleep(opt.txtime.delay);
+ if (opt.ts.ena) {
+ /* Make sure all timestamps have time to loop back */
+ usleep(opt.txtime.delay);
- cs_read_cmsg(fd, &msg, cbuf, sizeof(cbuf));
+ cs_read_cmsg(fd, &msg, cbuf, sizeof(cbuf));
+ }
err_out:
close(fd);
diff --git a/tools/testing/selftests/net/fq_band_pktlimit.sh b/tools/testing/selftests/net/fq_band_pktlimit.sh
new file mode 100755
index 000000000000..24b77bdf41ff
--- /dev/null
+++ b/tools/testing/selftests/net/fq_band_pktlimit.sh
@@ -0,0 +1,57 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify that FQ has a packet limit per band:
+#
+# 1. set the limit to 10 per band
+# 2. send 20 pkts on band A: verify that 10 are queued, 10 dropped
+# 3. send 20 pkts on band A: verify that 0 are queued, 20 dropped
+# 4. send 20 pkts on band B: verify that 10 are queued, 10 dropped
+#
+# Send packets with a 100ms delay to ensure that previously sent
+# packets are still queued when later ones are sent.
+# Use SO_TXTIME for this.
+
+die() {
+ echo "$1"
+ exit 1
+}
+
+# run inside private netns
+if [[ $# -eq 0 ]]; then
+ ./in_netns.sh "$0" __subprocess
+ exit
+fi
+
+ip link add type dummy
+ip link set dev dummy0 up
+ip -6 addr add fdaa::1/128 dev dummy0
+ip -6 route add fdaa::/64 dev dummy0
+tc qdisc replace dev dummy0 root handle 1: fq quantum 1514 initial_quantum 1514 limit 10
+
+./cmsg_sender -6 -p u -d 100000 -n 20 fdaa::2 8000
+OUT1="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
+
+./cmsg_sender -6 -p u -d 100000 -n 20 fdaa::2 8000
+OUT2="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
+
+./cmsg_sender -6 -p u -d 100000 -n 20 -P 7 fdaa::2 8000
+OUT3="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
+
+# Initial stats will report zero sent, as all packets are still
+# queued in FQ. Sleep for the delay period (100ms) and see that
+# twenty are now sent.
+sleep 0.1
+OUT4="$(tc -s qdisc show dev dummy0 | grep '^\ Sent')"
+
+# Log the output after the test
+echo "${OUT1}"
+echo "${OUT2}"
+echo "${OUT3}"
+echo "${OUT4}"
+
+# Test the output for expected values
+echo "${OUT1}" | grep -q '0\ pkt\ (dropped\ 10' || die "unexpected drop count at 1"
+echo "${OUT2}" | grep -q '0\ pkt\ (dropped\ 30' || die "unexpected drop count at 2"
+echo "${OUT3}" | grep -q '0\ pkt\ (dropped\ 40' || die "unexpected drop count at 3"
+echo "${OUT4}" | grep -q '20\ pkt\ (dropped\ 40' || die "unexpected accept count at 4"
--
2.43.0.rc1.413.gea7ed67945-goog
On Wed, Nov 15, 2023 at 01:17:06PM +0800, Liu, Jing2 wrote:
> This is the right way to approach it,
>
> I learned that there was discussion about using io_uring to get the
> page fault without
>
> eventfd notification in [1], and I am new at io_uring and studying the
> man page of
>
> liburing, but there're questions in my mind on how can QEMU get the
> coming page fault
>
> with a good performance.
>
> Since both QEMU and Kernel don't know when comes faults, after QEMU
> submits one
>
> read task to io_uring, we want kernel pending until fault comes. While
> based on
>
> hwpt_fault_fops_read() in [patch v2 4/6], it just returns 0 since
> there's now no fault,
>
> thus this round of read completes to CQ but it's not what we want. So
> I'm wondering
>
> how kernel pending on the read until fault comes. Does fops callback
> need special work to
Implement a fops with poll support that triggers when a new event is
pushed and everything will be fine. There are many examples in the
kernel. The ones in the mlx5 vfio driver spring to mind as a scheme I
recently looked at.
Jason
Multiple files/programs in `tools/testing/selftests/bpf/prog_tests/` still
heavily use the `CHECK` macro, even when better `ASSERT_` alternatives are
available.
As it was already pointed out by Yonghong Song [1] in the bpf selftests the use
of the ASSERT_* series of macros is preferred over the CHECK macro.
This patchset replaces the usage of `CHECK(` macros to the equivalent `ASSERT_`
family of macros in the following prog_tests:
- bind_perm.c
- bpf_obj_id.c
- bpf_tcp_ca.c
- vmlinux.c
[1] https://lore.kernel.org/lkml/0a142924-633c-44e6-9a92-2dc019656bf2@linux.dev
Changes in v2:
- Fixed pthread_join assertion that broke the previous test
Previous version:
v1 - https://lore.kernel.org/lkml/GV1PR10MB6563FCFF1C5DEBE84FEA985FE8B0A@GV1PR10…
Yuran Pereira (4):
Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
Replaces the usage of CHECK calls for ASSERTs in bind_perm
Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in
vmlinux
.../selftests/bpf/prog_tests/bind_perm.c | 6 +-
.../selftests/bpf/prog_tests/bpf_obj_id.c | 204 +++++++-----------
.../selftests/bpf/prog_tests/bpf_tcp_ca.c | 50 ++---
.../selftests/bpf/prog_tests/vmlinux.c | 16 +-
4 files changed, 106 insertions(+), 170 deletions(-)
--
2.25.1
v4:
- Update patch 1 to move apply_wqattrs_lock() and apply_wqattrs_unlock()
down into CONFIG_SYSFS block to avoid compilation warnings.
v3:
- Break out a separate patch to make workqueue_set_unbound_cpumask()
static and move it down to the CONFIG_SYSFS section.
- Remove the "__DEBUG__." prefix and the CFTYPE_DEBUG flag from the
new root only cpuset.cpus.isolated control files and update the
test accordingly.
v2:
- Add 2 read-only workqueue sysfs files to expose the user requested
cpumask as well as the isolated CPUs to be excluded from
wq_unbound_cpumask.
- Ensure that caller of the new workqueue_unbound_exclude_cpumask()
hold cpus_read_lock.
- Update the cpuset code to make sure the cpus_read_lock is held
whenever workqueue_unbound_exclude_cpumask() may be called.
Isolated cpuset partition can currently be created to contain an
exclusive set of CPUs not used in other cgroups and with load balancing
disabled to reduce interference from the scheduler.
The main purpose of this isolated partition type is to dynamically
emulate what can be done via the "isolcpus" boot command line option,
specifically the default domain flag. One effect of the "isolcpus" option
is to remove the isolated CPUs from the cpumasks of unbound workqueues
since running work functions in an isolated CPU can be a major source
of interference. Changing the unbound workqueue cpumasks can be done at
run time by writing an appropriate cpumask without the isolated CPUs to
/sys/devices/virtual/workqueue/cpumask. So one can set up an isolated
cpuset partition and then write to the cpumask sysfs file to achieve
similar level of CPU isolation. However, this manual process can be
error prone.
This patch series implements automatic exclusion of isolated CPUs from
unbound workqueue cpumasks when an isolated cpuset partition is created
and then adds those CPUs back when the isolated partition is destroyed.
There are also other places in the kernel that look at the HK_FLAG_DOMAIN
cpumask or other HK_FLAG_* cpumasks and exclude the isolated CPUs from
certain actions to further reduce interference. CPUs in an isolated
cpuset partition will not be able to avoid those interferences yet. That
may change in the future as the need arises.
Waiman Long (5):
workqueue: Make workqueue_set_unbound_cpumask() static
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs
from wq_unbound_cpumask
selftests/cgroup: Minor code cleanup and reorganization of
test_cpuset_prs.sh
cgroup/cpuset: Keep track of CPUs in isolated partitions
cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
Documentation/admin-guide/cgroup-v2.rst | 10 +-
include/linux/workqueue.h | 2 +-
kernel/cgroup/cpuset.c | 286 +++++++++++++-----
kernel/workqueue.c | 165 +++++++---
.../selftests/cgroup/test_cpuset_prs.sh | 216 ++++++++-----
5 files changed, 475 insertions(+), 204 deletions(-)
--
2.39.3