Hi,
as discussed in [1], this is a manual backport of the remaining two patches to let the io worker threads respect the affinities defined by the cgroup of the process.
In 6.1, one worker is created per NUMA node, while da64d6db3bd3 ("io_uring: One wqe per wq") changed this to a single worker. As that patch is pretty invasive, Jens and I agreed not to backport it.
Instead, we now limit the workers' cpumask to the CPUs in the intersection of what the cgroup allows and what the NUMA node has. This leaves the question of what to do when the intersection is empty: to stay backwards compatible, we allow this case, but restrict the cpumask of the poller to the cpuset defined by the cgroup. We believe this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness anyway.
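In code, the per-node decision boils down to the following (taken verbatim from the patches below); cpumask_and() returns false when the resulting intersection is empty, in which case we fall back to the cpuset alone:

	if (!cpumask_and(wqe->cpu_mask, cpumask_of_node(node), allowed_mask))
		cpumask_copy(wqe->cpu_mask, allowed_mask);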
[1] https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk
Best regards,
Felix Moessbauer
Siemens AG
Felix Moessbauer (2):
  io_uring/io-wq: do not allow pinning outside of cpuset
  io_uring/io-wq: inherit cpuset of cgroup in io worker
 io_uring/io-wq.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)
commit 0997aa5497c714edbb349ca366d28bd550ba3408 upstream.
The io worker threads are userland threads that just never exit to userland. As such, they are also assigned to a cgroup (the cgroup of the creating task).
When changing the affinity of the io_wq thread via syscall, we must only allow cpumasks within the limits defined by the cpuset controller of the cgroup (if enabled).
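For illustration, here is a hedged userspace sketch of the new behavior (not part of the patch; it assumes liburing is available and that CPU 3 lies outside the calling task's cpuset). With this change, the registration fails with -EINVAL instead of silently pinning the workers outside the cpuset:

	#define _GNU_SOURCE
	#include <liburing.h>
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		struct io_uring ring;
		cpu_set_t mask;
		int ret;

		if (io_uring_queue_init(8, &ring, 0))
			return 1;

		CPU_ZERO(&mask);
		CPU_SET(3, &mask);	/* assumed to be outside our cpuset */

		/* wraps IORING_REGISTER_IOWQ_AFF */
		ret = io_uring_register_iowq_aff(&ring, sizeof(mask), &mask);
		if (ret == -EINVAL)
			printf("mask rejected: outside the cgroup cpuset\n");

		io_uring_queue_exit(&ring);
		return 0;
	}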
Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
---
 io_uring/io-wq.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 139cd49b2c27..c74bcc8d2f06 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -13,6 +13,7 @@
 #include <linux/slab.h>
 #include <linux/rculist_nulls.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/task_work.h>
 #include <linux/audit.h>
 #include <uapi/linux/io_uring.h>
@@ -1362,22 +1363,34 @@ static int io_wq_cpu_offline(unsigned int cpu, struct hlist_node *node)
 
 int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask)
 {
+	cpumask_var_t allowed_mask;
+	int ret = 0;
 	int i;
 
 	if (!tctx || !tctx->io_wq)
 		return -EINVAL;
 
+	if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
+		return -ENOMEM;
+	cpuset_cpus_allowed(tctx->io_wq->task, allowed_mask);
+
 	rcu_read_lock();
 	for_each_node(i) {
 		struct io_wqe *wqe = tctx->io_wq->wqes[i];
-
-		if (mask)
-			cpumask_copy(wqe->cpu_mask, mask);
-		else
-			cpumask_copy(wqe->cpu_mask, cpumask_of_node(i));
+		if (mask) {
+			if (cpumask_subset(mask, allowed_mask))
+				cpumask_copy(wqe->cpu_mask, mask);
+			else
+				ret = -EINVAL;
+		} else {
+			if (!cpumask_and(wqe->cpu_mask, cpumask_of_node(i), allowed_mask))
+				cpumask_copy(wqe->cpu_mask, allowed_mask);
+		}
 	}
 	rcu_read_unlock();
-	return 0;
+
+	free_cpumask_var(allowed_mask);
+	return ret;
 }
 
 /*
commit 84eacf177faa605853c58e5b1c0d9544b88c16fd upstream.
The io worker threads are userland threads that just never exit to userland. As such, they are also assigned to a cgroup (the cgroup of the creating task).
When creating a new io worker, this worker should inherit the cpuset of the cgroup.
Fixes: da64d6db3bd3 ("io_uring: One wqe per wq")
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
---
 io_uring/io-wq.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index c74bcc8d2f06..04265bf8d319 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -1157,6 +1157,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 {
 	int ret, node, i;
 	struct io_wq *wq;
+	cpumask_var_t allowed_mask;
 
 	if (WARN_ON_ONCE(!data->free_work || !data->do_work))
 		return ERR_PTR(-EINVAL);
@@ -1176,6 +1177,9 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq->do_work = data->do_work;
 
 	ret = -ENOMEM;
+	if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
+		goto err;
+	cpuset_cpus_allowed(current, allowed_mask);
 	for_each_node(node) {
 		struct io_wqe *wqe;
 		int alloc_node = node;
@@ -1188,7 +1192,8 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		wq->wqes[node] = wqe;
 		if (!alloc_cpumask_var(&wqe->cpu_mask, GFP_KERNEL))
 			goto err;
-		cpumask_copy(wqe->cpu_mask, cpumask_of_node(node));
+		if (!cpumask_and(wqe->cpu_mask, cpumask_of_node(node), allowed_mask))
+			cpumask_copy(wqe->cpu_mask, allowed_mask);
 		wqe->node = alloc_node;
 		wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded;
 		wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers =
@@ -1222,6 +1227,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 		free_cpumask_var(wq->wqes[node]->cpu_mask);
 		kfree(wq->wqes[node]);
 	}
+	free_cpumask_var(allowed_mask);
 err_wq:
 	kfree(wq);
 	return ERR_PTR(ret);
On 9/11/24 10:23 AM, Felix Moessbauer wrote:
> Hi,
> as discussed in [1], this is a manual backport of the remaining two patches to let the io worker threads respect the affinities defined by the cgroup of the process.
> In 6.1, one worker is created per NUMA node, while da64d6db3bd3 ("io_uring: One wqe per wq") changed this to a single worker. As that patch is pretty invasive, Jens and I agreed not to backport it.
> Instead, we now limit the workers' cpumask to the CPUs in the intersection of what the cgroup allows and what the NUMA node has. This leaves the question of what to do when the intersection is empty: to stay backwards compatible, we allow this case, but restrict the cpumask of the poller to the cpuset defined by the cgroup. We believe this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness anyway.
> [1] https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk
The upstream patches are staged for 6.12 and marked for a backport, so they should go upstream next week. Once they are upstream, I'll make sure to check in on these on the stable front.
On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> Hi,
> as discussed in [1], this is a manual backport of the remaining two patches to let the io worker threads respect the affinities defined by the cgroup of the process.
> In 6.1, one worker is created per NUMA node, while da64d6db3bd3 ("io_uring: One wqe per wq") changed this to a single worker. As that patch is pretty invasive, Jens and I agreed not to backport it.
> Instead, we now limit the workers' cpumask to the CPUs in the intersection of what the cgroup allows and what the NUMA node has. This leaves the question of what to do when the intersection is empty: to stay backwards compatible, we allow this case, but restrict the cpumask of the poller to the cpuset defined by the cgroup. We believe this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness anyway.
> [1] https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk
Why was neither of these actually tagged for inclusion in a stable tree?

Why just 6.1.y? Please submit them for all relevant kernel versions.
thanks,
greg k-h
On Mon, 2024-09-30 at 21:15 +0200, Greg KH wrote:
> On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> > Hi,
> > as discussed in [1], this is a manual backport of the remaining two patches to let the io worker threads respect the affinities defined by the cgroup of the process.
> > In 6.1, one worker is created per NUMA node, while da64d6db3bd3 ("io_uring: One wqe per wq") changed this to a single worker. As that patch is pretty invasive, Jens and I agreed not to backport it.
> > Instead, we now limit the workers' cpumask to the CPUs in the intersection of what the cgroup allows and what the NUMA node has. This leaves the question of what to do when the intersection is empty: to stay backwards compatible, we allow this case, but restrict the cpumask of the poller to the cpuset defined by the cgroup. We believe this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness anyway.
> > [1] https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk
> Why was neither of these actually tagged for inclusion in a stable tree?
This is a manual backport of these patches for 6.1, as the subsystem changed significantly between 6.1 and 6.2, making an automated backport impossible. This was agreed with Jens in https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk/
> Why just 6.1.y? Please submit them for all relevant kernel versions.
The original patch was tagged stable and was accepted into 6.6, 6.10 and 6.11.
Felix
> thanks,
> greg k-h
On Tue, Oct 01, 2024 at 07:32:42AM +0000, MOESSBAUER, Felix wrote:
> On Mon, 2024-09-30 at 21:15 +0200, Greg KH wrote:
> > On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> > > Hi,
> > > as discussed in [1], this is a manual backport of the remaining two patches to let the io worker threads respect the affinities defined by the cgroup of the process.
> > > In 6.1, one worker is created per NUMA node, while da64d6db3bd3 ("io_uring: One wqe per wq") changed this to a single worker. As that patch is pretty invasive, Jens and I agreed not to backport it.
> > > Instead, we now limit the workers' cpumask to the CPUs in the intersection of what the cgroup allows and what the NUMA node has. This leaves the question of what to do when the intersection is empty: to stay backwards compatible, we allow this case, but restrict the cpumask of the poller to the cpuset defined by the cgroup. We believe this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness anyway.
> > > [1] https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk
> > Why was neither of these actually tagged for inclusion in a stable tree?
> This is a manual backport of these patches for 6.1, as the subsystem changed significantly between 6.1 and 6.2, making an automated backport impossible. This was agreed with Jens in https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk/
> > Why just 6.1.y? Please submit them for all relevant kernel versions.
> The original patch was tagged stable and was accepted into 6.6, 6.10 and 6.11.
No, they were not at all. Please properly tag them in the future, as per the documentation, if you wish to have things applied to the stable trees: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
thanks,
greg k-h
On 10/1/24 1:50 AM, gregkh@linuxfoundation.org wrote:
> On Tue, Oct 01, 2024 at 07:32:42AM +0000, MOESSBAUER, Felix wrote:
> > On Mon, 2024-09-30 at 21:15 +0200, Greg KH wrote:
> > > On Wed, Sep 11, 2024 at 06:23:14PM +0200, Felix Moessbauer wrote:
> > > > Hi,
> > > > as discussed in [1], this is a manual backport of the remaining two patches to let the io worker threads respect the affinities defined by the cgroup of the process.
> > > > In 6.1, one worker is created per NUMA node, while da64d6db3bd3 ("io_uring: One wqe per wq") changed this to a single worker. As that patch is pretty invasive, Jens and I agreed not to backport it.
> > > > Instead, we now limit the workers' cpumask to the CPUs in the intersection of what the cgroup allows and what the NUMA node has. This leaves the question of what to do when the intersection is empty: to stay backwards compatible, we allow this case, but restrict the cpumask of the poller to the cpuset defined by the cgroup. We believe this is a reasonable decision, as da64d6db3bd3 drops the NUMA awareness anyway.
> > > > [1] https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk
> > > Why was neither of these actually tagged for inclusion in a stable tree?
> > This is a manual backport of these patches for 6.1, as the subsystem changed significantly between 6.1 and 6.2, making an automated backport impossible. This was agreed with Jens in https://lore.kernel.org/lkml/ec01745a-b102-4f6e-abc9-abd636d36319@kernel.dk/
> > > Why just 6.1.y? Please submit them for all relevant kernel versions.
> > The original patch was tagged stable and was accepted into 6.6, 6.10 and 6.11.
> No, they were not at all. Please properly tag them in the future, as per the documentation, if you wish to have things applied to the stable trees: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
That's my bad, I missed that one of them did not get marked for stable; the sqpoll one did.