Hi,
[ Marc, can you help reviewing? Esp. the first patch? ]
This series of backports from upstream to stable 5.15 and 5.10 fixes an issue we're seeing on AWS ARM instances where attaching an EBS volume (which is an NVMe device) to the instance after offlining CPUs causes the device to take several minutes to show up, and eventually nvme kworkers and other threads start getting stuck.
This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it on 5.4. Also, I couldn't reproduce this on x86 even w/ affected kernels.
An easy reproducer is:
1. Start an ARM instance with 32 CPUs
2. Once the instance is booted, offline all CPUs but CPU 0. Eg:
   # for i in $(seq 1 32); do chcpu -d $i; done
3. Once the CPUs are offline, attach an EBS volume
4. Watch lsblk and dmesg in the instance
Eventually, you get this stack trace:
[ 71.842974] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
[ 71.843966] pci 0000:00:1f.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 71.845149] pci 0000:00:1f.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 71.846694] pci 0000:00:1f.0: BAR 0: assigned [mem 0x8011c000-0x8011ffff]
[ 71.848458] ACPI: _SB_.PCI0.GSI3: Enabled at IRQ 38
[ 71.850852] nvme nvme1: pci function 0000:00:1f.0
[ 71.851611] nvme 0000:00:1f.0: enabling device (0000 -> 0002)
[ 135.887787] nvme nvme1: I/O 22 QID 0 timeout, completion polled
[ 197.328276] nvme nvme1: I/O 23 QID 0 timeout, completion polled
[ 197.329221] nvme nvme1: 1/0/0 default/read/poll queues
[ 243.408619] INFO: task kworker/u64:2:275 blocked for more than 122 seconds.
[ 243.409674] Not tainted 5.15.79 #1
[ 243.410270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.411389] task:kworker/u64:2 state:D stack: 0 pid: 275 ppid: 2 flags:0x00000008
[ 243.412602] Workqueue: events_unbound async_run_entry_fn
[ 243.413417] Call trace:
[ 243.413797]  __switch_to+0x15c/0x1a4
[ 243.414335]  __schedule+0x2bc/0x990
[ 243.414849]  schedule+0x68/0xf8
[ 243.415334]  schedule_timeout+0x184/0x340
[ 243.415946]  wait_for_completion+0xc8/0x220
[ 243.416543]  __flush_work.isra.43+0x240/0x2f0
[ 243.417179]  flush_work+0x20/0x2c
[ 243.417666]  nvme_async_probe+0x20/0x3c
[ 243.418228]  async_run_entry_fn+0x3c/0x1e0
[ 243.418858]  process_one_work+0x1bc/0x460
[ 243.419437]  worker_thread+0x164/0x528
[ 243.420030]  kthread+0x118/0x124
[ 243.420517]  ret_from_fork+0x10/0x20
[ 258.768771] nvme nvme1: I/O 20 QID 0 timeout, completion polled
[ 320.209266] nvme nvme1: I/O 21 QID 0 timeout, completion polled
For completeness, I tested the same test case on x86 with this series applied on 5.15.79 and 5.10.155 as well. It works as expected.
Thanks,
Marc Zyngier (4):
  genirq/msi: Shutdown managed interrupts with unsatifiable affinities
  genirq: Always limit the affinity to online CPUs
  irqchip/gic-v3: Always trust the managed affinity provided by the core code
  genirq: Take the proposed affinity at face value if force==true
 drivers/irqchip/irq-gic-v3-its.c |  2 +-
 kernel/irq/manage.c              | 31 +++++++++++++++++++++++--------
 kernel/irq/msi.c                 |  7 +++++++
 3 files changed, 31 insertions(+), 9 deletions(-)
From: Marc Zyngier maz@kernel.org
commit d802057c7c553ad426520a053da9f9fe08e2c35a upstream.
[ This commit is almost a rewrite because it conflicts with Thomas Gleixner's refactoring of this code in v5.17-rc1. I wasn't sure if I should drop all the Signed-off-by tags (including Marc's), but decided to keep them as in the original commit ]
When booting with maxcpus=<small number>, interrupt controllers such as the GICv3 ITS may not be able to satisfy the affinity of some managed interrupts, as some of the HW resources are simply not available.
The same thing happens when loading a driver using managed interrupts while CPUs are offline.
In order to deal with this, do not try to activate such an interrupt if there is no online CPU capable of handling it. Instead, place it in shutdown state. Once a capable CPU shows up, it will be activated.
Reported-by: John Garry john.garry@huawei.com
Reported-by: David Decotigny ddecotig@google.com
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Tested-by: John Garry john.garry@huawei.com
Link: https://lore.kernel.org/r/20220405185040.206297-2-maz@kernel.org
Signed-off-by: Luiz Capitulino luizcap@amazon.com
---
 kernel/irq/msi.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 7f350ae59c5f..d75586dc584f 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -596,6 +596,13 @@ int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 			irqd_clr_can_reserve(irq_data);
 			if (domain->flags & IRQ_DOMAIN_MSI_NOMASK_QUIRK)
 				irqd_set_msi_nomask_quirk(irq_data);
+			if ((info->flags & MSI_FLAG_ACTIVATE_EARLY) &&
+			    irqd_affinity_is_managed(irq_data) &&
+			    !cpumask_intersects(irq_data_get_affinity_mask(irq_data),
+						cpu_online_mask)) {
+				irqd_set_managed_shutdown(irq_data);
+				continue;
+			}
 		}
 		ret = irq_domain_activate_irq(irq_data, can_reserve);
 		if (ret)
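To make the effect of the hunk easier to see outside the kernel, here is a minimal userspace sketch of the same masking decision (the MSI_FLAG_ACTIVATE_EARLY part is omitted and every name below is invented for illustration; only the affinity/online-mask test mirrors the patch):

/*
 * Toy userspace model of the check added above: a managed interrupt
 * whose affinity mask contains no online CPU is left in "shutdown"
 * state instead of being activated.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_irq {
        uint64_t affinity;      /* one bit per CPU */
        bool managed;           /* managed (kernel-chosen) affinity? */
        bool shutdown;          /* left inactive until a CPU shows up */
};

static void toy_activate(struct toy_irq *irq, uint64_t online_mask)
{
        /* Analogue of: managed && !cpumask_intersects(affinity, cpu_online_mask) */
        if (irq->managed && !(irq->affinity & online_mask)) {
                irq->shutdown = true;   /* irqd_set_managed_shutdown() analogue */
                return;
        }
        irq->shutdown = false;          /* would go on to activate the interrupt */
}

int main(void)
{
        uint64_t online = 0x1;                          /* only CPU0 online */
        struct toy_irq vec = { .affinity = 0xf0, .managed = true };

        toy_activate(&vec, online);
        printf("shutdown=%d (no online CPU in affinity)\n", vec.shutdown);

        online |= 0xf0;                                 /* CPUs 4-7 come online */
        toy_activate(&vec, online);
        printf("shutdown=%d (affinity now satisfiable)\n", vec.shutdown);
        return 0;
}

The point is simply that an interrupt whose managed affinity points only at offline CPUs is left alone until one of those CPUs comes up, instead of being force-routed somewhere it doesn't belong.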
From: Marc Zyngier maz@kernel.org
commit 33de0aa4bae982ed6f7c777f86b5af3e627ac937 upstream.
[ Fixed small conflicts due to the HK_FLAG_MANAGED_IRQ flag having been renamed upstream ]
When booting with maxcpus=<small number> (or even loading a driver while most CPUs are offline), it is pretty easy to observe managed affinities containing a mix of online and offline CPUs being passed to the irqchip driver.
This means that the irqchip cannot trust the affinity passed down from the core code, which is a bit annoying and requires (at least in theory) all drivers to implement some sort of affinity narrowing.
In order to address this, always limit the cpumask to the set of online CPUs.
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Link: https://lore.kernel.org/r/20220405185040.206297-3-maz@kernel.org
Signed-off-by: Luiz Capitulino luizcap@amazon.com
---
 kernel/irq/manage.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 0c3c26fb054f..a1727cdaebed 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -222,11 +222,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
+	const struct cpumask *prog_mask;
 	int ret;
 
+	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
+	static struct cpumask tmp_mask;
+
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
+	raw_spin_lock(&tmp_mask_lock);
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -248,24 +253,28 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 */
 	if (irqd_affinity_is_managed(data) &&
 	    housekeeping_enabled(HK_FLAG_MANAGED_IRQ)) {
-		const struct cpumask *hk_mask, *prog_mask;
-
-		static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-		static struct cpumask tmp_mask;
+		const struct cpumask *hk_mask;
 
 		hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
 
-		raw_spin_lock(&tmp_mask_lock);
 		cpumask_and(&tmp_mask, mask, hk_mask);
 		if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
 			prog_mask = mask;
 		else
 			prog_mask = &tmp_mask;
-		ret = chip->irq_set_affinity(data, prog_mask, force);
-		raw_spin_unlock(&tmp_mask_lock);
 	} else {
-		ret = chip->irq_set_affinity(data, mask, force);
+		prog_mask = mask;
 	}
+
+	/* Make sure we only provide online CPUs to the irqchip */
+	cpumask_and(&tmp_mask, prog_mask, cpu_online_mask);
+	if (!cpumask_empty(&tmp_mask))
+		ret = chip->irq_set_affinity(data, &tmp_mask, force);
+	else
+		ret = -EINVAL;
+
+	raw_spin_unlock(&tmp_mask_lock);
+
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
From: Marc Zyngier maz@kernel.org
commit 3f893a5962d31c0164efdbf6174ed0784f1d7603 upstream.
Now that the core code has been fixed to always give us an affinity that only includes online CPUs, directly use this affinity when computing a target CPU.
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Link: https://lore.kernel.org/r/20220405185040.206297-4-maz@kernel.org
Signed-off-by: Luiz Capitulino luizcap@amazon.com
---
 drivers/irqchip/irq-gic-v3-its.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index fc1bfffc468f..59a5d06b2d3e 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1620,7 +1620,7 @@ static int its_select_cpu(struct irq_data *d,
 
 		cpu = cpumask_pick_least_loaded(d, tmpmask);
 	} else {
-		cpumask_and(tmpmask, irq_data_get_affinity_mask(d), cpu_online_mask);
+		cpumask_copy(tmpmask, aff_mask);
 
 		/* If we cannot cross sockets, limit the search to that node */
 		if ((its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_23144) &&
From: Marc Zyngier maz@kernel.org
commit c48c8b829d2b966a6649827426bcdba082ccf922 upstream.
Although setting the affinity of an interrupt to a set of CPUs that doesn't have any online CPU is generally frowned upon, there are a few limited cases where such an affinity is set from a CPUHP notifier, setting the affinity to a CPU that isn't online yet.
The saving grace is that this is always done using the 'force' attribute, which gives a hint that the affinity setting can be outside of the online CPU mask, and that the callsite sets this flag with the knowledge that the underlying interrupt controller knows how to handle it.
This restores the expected behaviour on Marek's system.
Fixes: 33de0aa4bae9 ("genirq: Always limit the affinity to online CPUs")
Reported-by: Marek Szyprowski m.szyprowski@samsung.com
Signed-off-by: Marc Zyngier maz@kernel.org
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Tested-by: Marek Szyprowski m.szyprowski@samsung.com
Link: https://lore.kernel.org/r/4b7fc13c-887b-a664-26e8-45aed13f048a@samsung.com
Link: https://lore.kernel.org/r/20220414140011.541725-1-maz@kernel.org
Signed-off-by: Luiz Capitulino luizcap@amazon.com
---
 kernel/irq/manage.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index a1727cdaebed..9862372e0f01 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -266,10 +266,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		prog_mask = mask;
 	}
 
-	/* Make sure we only provide online CPUs to the irqchip */
+	/*
+	 * Make sure we only provide online CPUs to the irqchip,
+	 * unless we are being asked to force the affinity (in which
+	 * case we do as we are told).
+	 */
 	cpumask_and(&tmp_mask, prog_mask, cpu_online_mask);
-	if (!cpumask_empty(&tmp_mask))
+	if (!force && !cpumask_empty(&tmp_mask))
 		ret = chip->irq_set_affinity(data, &tmp_mask, force);
+	else if (force)
+		ret = chip->irq_set_affinity(data, mask, force);
 	else
 		ret = -EINVAL;
 
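Taken together with patch 2, the decision irq_do_set_affinity() ends up making can be summarized by this small userspace sketch (names and types are invented; it only models the mask selection, not the locking or the housekeeping narrowing):

/*
 * Toy model of the final mask-selection logic: narrow the requested
 * mask to online CPUs unless force is set, in which case the mask is
 * passed through unchanged.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Returns the mask that would be handed to the irqchip, or 0 on error. */
static uint64_t pick_effective_mask(uint64_t requested, uint64_t online,
                                    bool force, int *err)
{
        uint64_t narrowed = requested & online;

        *err = 0;
        if (force)              /* caller knows better, pass through */
                return requested;
        if (narrowed)           /* only offer online CPUs to the chip */
                return narrowed;
        *err = -EINVAL;         /* nothing online to target */
        return 0;
}

int main(void)
{
        int err;
        uint64_t online = 0x1;  /* only CPU0 online */

        printf("normal:  %#llx err=%d\n",
               (unsigned long long)pick_effective_mask(0xf0, online, false, &err), err);
        printf("force:   %#llx err=%d\n",
               (unsigned long long)pick_effective_mask(0xf0, online, true, &err), err);
        printf("overlap: %#llx err=%d\n",
               (unsigned long long)pick_effective_mask(0xf1, online, false, &err), err);
        return 0;
}

With force set, the requested mask is handed to the irqchip untouched, which is exactly what the CPUHP-notifier case described above relies on.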
On Mon, 28 Nov 2022 17:08:31 +0000, Luiz Capitulino luizcap@amazon.com wrote:
Hi,
[ Marc, can you help reviewing? Esp. the first patch? ]
This series of backports from upstream to stable 5.15 and 5.10 fixes an issue we're seeing on AWS ARM instances where attaching an EBS volume (which is an NVMe device) to the instance after offlining CPUs causes the device to take several minutes to show up, and eventually nvme kworkers and other threads start getting stuck.
This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it on 5.4. Also, I couldn't reproduce this on x86 even w/ affected kernels.
That's because x86 has a very different allocation policy compared to what the ITS does. The x86 vector space is tiny, so vectors are only allocated when required. In your case, that's when the CPUs are onlined.
With the ITS, all the vectors are allocated upfront, as this is essentially free. But in the case of managed interrupts, these vectors are now pointing to offline CPUs. The ITS tries to fix that, but doesn't nearly have enough information. And the correct course of action is to keep these interrupts in the shutdown state, which is what the series is doing.
An easy reproducer is:
- Start an ARM instance with 32 CPUs
To satisfy my own curiosity, is that in a guest or bare metal? It shouldn't make any difference, but hey...
Anyway, patch #1 looks OK to me, but I haven't tried to dig further into something that is "oh so last year" ;-). Especially as we're rewriting the whole of the MSI stack! FWIW:
Acked-by: Marc Zyngier maz@kernel.org
M.
On 2022-11-28 12:53, Marc Zyngier wrote:
On Mon, 28 Nov 2022 17:08:31 +0000, Luiz Capitulino luizcap@amazon.com wrote:
Hi,
[ Marc, can you help reviewing? Esp. the first patch? ]
This series of backports from upstream to stable 5.15 and 5.10 fixes an issue we're seeing on AWS ARM instances where attaching an EBS volume (which is an NVMe device) to the instance after offlining CPUs causes the device to take several minutes to show up, and eventually nvme kworkers and other threads start getting stuck.
This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it on 5.4. Also, I couldn't reproduce this on x86 even w/ affected kernels.
That's because x86 has a very different allocation policy compared to what the ITS does. The x86 vector space is tiny, so vectors are only allocated when required. In your case, that's when the CPUs are onlined.
With the ITS, all the vectors are allocated upfront, as this is essentially free. But in the case of managed interrupts, these vectors are now pointing to offline CPUs. The ITS tries to fix that, but doesn't nearly have enough information. And the correct course of action is to keep these interrupts in the shutdown state, which is what the series is doing.
Thank you for the explanation, Marc. I also immensely
appreciate the super fast response! (more below).
An easy reproducer is:
- Start an ARM instance with 32 CPUs
To satisfy my own curiosity, is that in a guest or bare metal? It shouldn't make any difference, but hey...
This is a guest. I'll test on a bare-metal instance, it may
take a few hours. I'll reply here.
Anyway, patch #1 looks OK to me, but I haven't tried to dig further into something that is "oh so last year" ;-). Especially as we're rewriting the whole of the MSI stack! FWIW:
Acked-by: Marc Zyngier maz@kernel.org
Thank you again, Marc!
M.
-- Without deviation from the norm, progress is not possible.
On 2022-11-28 13:27, Luiz Capitulino wrote:
On 2022-11-28 12:53, Marc Zyngier wrote:
On Mon, 28 Nov 2022 17:08:31 +0000, Luiz Capitulino luizcap@amazon.com wrote:
Hi,
[ Marc, can you help reviewing? Esp. the first patch? ]
This series of backports from upstream to stable 5.15 and 5.10 fixes an issue we're seeing on AWS ARM instances where attaching an EBS volume (which is an NVMe device) to the instance after offlining CPUs causes the device to take several minutes to show up, and eventually nvme kworkers and other threads start getting stuck.
This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it on 5.4. Also, I couldn't reproduce this on x86 even w/ affected kernels.
That's because x86 has a very different allocation policy compared to what the ITS does. The x86 vector space is tiny, so vectors are only allocated when required. In your case, that's when the CPUs are onlined.
With the ITS, all the vectors are allocated upfront, as this is essentially free. But in the case of managed interrupts, these vectors are now pointing to offline CPUs. The ITS tries to fix that, but doesn't nearly have enough information. And the correct course of action is to keep these interrupts in the shutdown state, which is what the series is doing.
Thank you for the explanation, Marc. I also immensely
appreciate the super fast response! (more below).
An easy reproducer is:
- Start an ARM instance with 32 CPUs
To satisfy my own curiosity, is that in a guest or bare metal? It shouldn't make any difference, but hey...
This is a guest. I'll test on a bare-metal instance, it may
take a few hours. I'll reply here.
I was able to test this on a bare-metal instance on both arm64 and x86 with and without this series. It all works as expected.
The only difference is that on the arm64 bare-metal instance, I get a PCI error on an unfixed kernel (below) and the system never hangs (whereas on a guest, I get no PCI error and eventually threads start hanging).
This series fixes this case too and the device is added as expected on a fixed kernel.
So, all seems good!
[ 162.618277] pcieport 0000:14:06.0: bridge window [io 0x1000-0x0fff] to [bus 1b] add_size 1000
[ 162.618905] pcieport 0000:14:06.0: BAR 13: no space for [io size 0x1000]
[ 162.619398] pcieport 0000:14:06.0: BAR 13: failed to assign [io size 0x1000]
[ 162.619916] pcieport 0000:14:06.0: BAR 13: no space for [io size 0x1000]
[ 162.620410] pcieport 0000:14:06.0: BAR 13: failed to assign [io size 0x1000]
[ 162.620929] pci 0000:1b:00.0: BAR 0: assigned [mem 0x83200000-0x833fffff 64bit]
[ 162.621472] pcieport 0000:14:06.0: PCI bridge to [bus 1b]
[ 162.621872] pcieport 0000:14:06.0: bridge window [mem 0x83200000-0x833fffff]
[ 162.622398] pcieport 0000:14:06.0: bridge window [mem 0x18019000000-0x18019ffffff 64bit pref]
[ 162.623411] nvme 0000:1b:00.0: Adding to iommu group 56
[ 162.624081] nvme nvme2: pci function 0000:1b:00.0
[ 162.624455] nvme 0000:1b:00.0: enabling device (0000 -> 0002)
[ 162.627776] nvme nvme2: Removing after probe failure status: -5
[ 187.396805] nvme nvme1: I/O 3 QID 0 timeout, reset controller
[ 187.399390] nvme nvme1: Identify namespace failed (-4)
[ 187.429068] nvme nvme1: Removing after probe failure status: -5
Anyway, patch #1 looks OK to me, but I haven't tried to dig further into something that is "oh so last year" ;-). Especially as we're rewriting the whole of the MSI stack! FWIW:
Acked-by: Marc Zyngier maz@kernel.org
Thank you again, Marc!
M.
-- Without deviation from the norm, progress is not possible.
On Mon, Nov 28, 2022 at 05:08:31PM +0000, Luiz Capitulino wrote:
Hi,
[ Marc, can you help reviewing? Esp. the first patch? ]
This series of backports from upstream to stable 5.15 and 5.10 fixes an issue we're seeing on AWS ARM instances where attaching an EBS volume (which is an NVMe device) to the instance after offlining CPUs causes the device to take several minutes to show up, and eventually nvme kworkers and other threads start getting stuck.
This series fixes the issue for 5.15.79 and 5.10.155. I can't reproduce it on 5.4. Also, I couldn't reproduce this on x86 even w/ affected kernels.
An easy reproducer is:
- Start an ARM instance with 32 CPUs
- Once the instance is booted, offline all CPUs but CPU 0. Eg: # for i in $(seq 1 32); do chcpu -d $i; done
- Once the CPUs are offline, attach an EBS volume
- Watch lsblk and dmesg in the instance
Eventually, you get this stack trace:
[ 71.842974] pci 0000:00:1f.0: [1d0f:8061] type 00 class 0x010802
[ 71.843966] pci 0000:00:1f.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 71.845149] pci 0000:00:1f.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 71.846694] pci 0000:00:1f.0: BAR 0: assigned [mem 0x8011c000-0x8011ffff]
[ 71.848458] ACPI: _SB_.PCI0.GSI3: Enabled at IRQ 38
[ 71.850852] nvme nvme1: pci function 0000:00:1f.0
[ 71.851611] nvme 0000:00:1f.0: enabling device (0000 -> 0002)
[ 135.887787] nvme nvme1: I/O 22 QID 0 timeout, completion polled
[ 197.328276] nvme nvme1: I/O 23 QID 0 timeout, completion polled
[ 197.329221] nvme nvme1: 1/0/0 default/read/poll queues
[ 243.408619] INFO: task kworker/u64:2:275 blocked for more than 122 seconds.
[ 243.409674] Not tainted 5.15.79 #1
[ 243.410270] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.411389] task:kworker/u64:2 state:D stack: 0 pid: 275 ppid: 2 flags:0x00000008
[ 243.412602] Workqueue: events_unbound async_run_entry_fn
[ 243.413417] Call trace:
[ 243.413797]  __switch_to+0x15c/0x1a4
[ 243.414335]  __schedule+0x2bc/0x990
[ 243.414849]  schedule+0x68/0xf8
[ 243.415334]  schedule_timeout+0x184/0x340
[ 243.415946]  wait_for_completion+0xc8/0x220
[ 243.416543]  __flush_work.isra.43+0x240/0x2f0
[ 243.417179]  flush_work+0x20/0x2c
[ 243.417666]  nvme_async_probe+0x20/0x3c
[ 243.418228]  async_run_entry_fn+0x3c/0x1e0
[ 243.418858]  process_one_work+0x1bc/0x460
[ 243.419437]  worker_thread+0x164/0x528
[ 243.420030]  kthread+0x118/0x124
[ 243.420517]  ret_from_fork+0x10/0x20
[ 258.768771] nvme nvme1: I/O 20 QID 0 timeout, completion polled
[ 320.209266] nvme nvme1: I/O 21 QID 0 timeout, completion polled
For completeness, I tested the same test case on x86 with this series applied on 5.15.79 and 5.10.155 as well. It works as expected.
All now queued up, thanks.
greg k-h