Depending on the number of online CPUs in the original kernel, it is likely that CPU #0 is offline in a kdump kernel. IRQs whose affinity masks, as provided by irq_create_affinity_masks(), contain only offline CPUs (such as CPU #0) are thus not started by irq_startup(), which is by design for managed IRQs.
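This behavior comes from the generic IRQ core: irq_startup() defers managed IRQs whose affinity mask contains no online CPU. A simplified paraphrase of the relevant logic in __irq_startup_managed() (kernel/irq/chip.c), not a verbatim copy:

	/*
	 * Managed IRQ whose affinity mask contains no online CPU:
	 * put it into managed-shutdown state and abort the startup.
	 * CPU hotplug starts it up once a CPU in the mask comes
	 * online.
	 */
	if (irqd_affinity_is_managed(d) &&
	    cpumask_any_and(aff, cpu_online_mask) >= nr_cpu_ids) {
		irqd_set_managed_shutdown(d);
		return IRQ_STARTUP_ABORT;
	}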
This can be a problem with multi-queue block devices driven by blk-mq: such a non-started IRQ is very likely the one paired with the single queue enforced by blk-mq during kdump (see blk_mq_alloc_tag_set()). This causes the device to remain silent and likely hangs the guest at some point.
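For reference, the kdump clamping in blk_mq_alloc_tag_set() looks roughly like this (block/blk-mq.c, circa v5.11; quoted from memory, so treat it as a sketch):

	/*
	 * If a crashdump is active, then we are potentially in a very
	 * memory constrained environment. Limit us to 1 queue and
	 * 64 tags to prevent using too much memory.
	 */
	if (is_kdump_kernel()) {
		set->nr_hw_queues = 1;
		set->nr_maps = 1;
		set->queue_depth = min(64U, set->queue_depth);
	}

With nr_hw_queues forced to 1, only the first MSI vector is ever used, while its affinity mask may point at the offline CPU #0.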
This is a regression caused by commit 9ea69a55b3b9 ("powerpc/pseries: Pass MSI affinity to irq_create_mapping()"). Note that this only happens with the XIVE interrupt controller because XICS has a workaround to bypass affinity, which is activated during kdump with the "noirqdistrib" kernel parameter.
The issue comes from a combination of factors:

- a discrepancy between the number of queues detected by the multi-queue
  block driver, which was used to create the MSI vectors, and the single
  queue mode enforced later on by blk-mq because of kdump (i.e. keeping
  all queues fixes the issue)

- CPU #0 being offline (i.e. kdump always succeeds when CPU #0 is online)
Given that I couldn't reproduce this on x86, which seems to always have CPU #0 online even during kdump, I'm not sure where this should be fixed. Hence going for another approach: fine-grained affinity is for performance, and we don't really care about that during kdump. Simply revert to the previous working behavior of ignoring the affinity masks in this case only.
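For context, is_kdump_kernel() is declared in include/linux/crash_dump.h and simply checks whether an ELF core header address was passed in by the panicked kernel. Roughly (the exact return type varies across trees):

	/* include/linux/crash_dump.h, roughly */
	static inline bool is_kdump_kernel(void)
	{
		/* elfcorehdr_addr is set from the elfcorehdr= parameter
		 * that kexec passes when booting the crash kernel. */
		return elfcorehdr_addr != ELFCORE_ADDR_MAX;
	}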
Fixes: 9ea69a55b3b9 ("powerpc/pseries: Pass MSI affinity to irq_create_mapping()")
Cc: lvivier@redhat.com
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kurz <groug@kaod.org>
---
 arch/powerpc/platforms/pseries/msi.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index b3ac2455faad..29d04b83288d 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,8 +458,28 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
 			return hwirq;
 		}
-		virq = irq_create_mapping_affinity(NULL, hwirq,
-						   entry->affinity);
+		/*
+		 * Depending on the number of online CPUs in the original
+		 * kernel, it is likely for CPU #0 to be offline in a kdump
+		 * kernel. The associated IRQs in the affinity mappings
+		 * provided by irq_create_affinity_masks() are thus not
+		 * started by irq_startup(), as per-design for managed IRQs.
+		 * This can be a problem with multi-queue block devices driven
+		 * by blk-mq : such a non-started IRQ is very likely paired
+		 * with the single queue enforced by blk-mq during kdump (see
+		 * blk_mq_alloc_tag_set()). This causes the device to remain
+		 * silent and likely hangs the guest at some point.
+		 *
+		 * We don't really care for fine-grained affinity when doing
+		 * kdump actually : simply ignore the pre-computed affinity
+		 * masks in this case and let the default mask with all CPUs
+		 * be used when creating the IRQ mappings.
+		 */
+		if (is_kdump_kernel())
+			virq = irq_create_mapping(NULL, hwirq);
+		else
+			virq = irq_create_mapping_affinity(NULL, hwirq,
+							   entry->affinity);
 		if (!virq) {
 			pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
On 12/02/2021 17:41, Greg Kurz wrote:
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
On 2/12/21 5:41 PM, Greg Kurz wrote:
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Thanks for tracking this issue.
This layer needs a rework. Patches adding an MSI domain should be ready in a couple of releases. Hopefully.
C.
Hi Greg,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on powerpc/next]
[also build test ERROR on linus/master v5.11-rc7 next-20210211]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url:    https://github.com/0day-ci/linux/commits/Greg-Kurz/powerpc-pseries-Don-t-enf...
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/1e5f7523fcfc57ab9437b8c7b29a974b62bd...
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Greg-Kurz/powerpc-pseries-Don-t-enforce-MSI-affinity-with-kdump/20210213-004658
        git checkout 1e5f7523fcfc57ab9437b8c7b29a974b62bde79d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
   arch/powerpc/platforms/pseries/msi.c: In function 'rtas_setup_msi_irqs':
>> arch/powerpc/platforms/pseries/msi.c:478:7: error: implicit declaration of function 'is_kdump_kernel' [-Werror=implicit-function-declaration]
     478 |  if (is_kdump_kernel())
         |      ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
vim +/is_kdump_kernel +478 arch/powerpc/platforms/pseries/msi.c
   369
   370  static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
   371  {
   372          struct pci_dn *pdn;
   373          int hwirq, virq, i, quota, rc;
   374          struct msi_desc *entry;
   375          struct msi_msg msg;
   376          int nvec = nvec_in;
   377          int use_32bit_msi_hack = 0;
   378
   379          if (type == PCI_CAP_ID_MSIX)
   380                  rc = check_req_msix(pdev, nvec);
   381          else
   382                  rc = check_req_msi(pdev, nvec);
   383
   384          if (rc)
   385                  return rc;
   386
   387          quota = msi_quota_for_device(pdev, nvec);
   388
   389          if (quota && quota < nvec)
   390                  return quota;
   391
   392          if (type == PCI_CAP_ID_MSIX && check_msix_entries(pdev))
   393                  return -EINVAL;
   394
   395          /*
   396           * Firmware currently refuse any non power of two allocation
   397           * so we round up if the quota will allow it.
   398           */
   399          if (type == PCI_CAP_ID_MSIX) {
   400                  int m = roundup_pow_of_two(nvec);
   401                  quota = msi_quota_for_device(pdev, m);
   402
   403                  if (quota >= m)
   404                          nvec = m;
   405          }
   406
   407          pdn = pci_get_pdn(pdev);
   408
   409          /*
   410           * Try the new more explicit firmware interface, if that fails fall
   411           * back to the old interface. The old interface is known to never
   412           * return MSI-Xs.
   413           */
   414  again:
   415          if (type == PCI_CAP_ID_MSI) {
   416                  if (pdev->no_64bit_msi) {
   417                          rc = rtas_change_msi(pdn, RTAS_CHANGE_32MSI_FN, nvec);
   418                          if (rc < 0) {
   419                                  /*
   420                                   * We only want to run the 32 bit MSI hack below if
   421                                   * the max bus speed is Gen2 speed
   422                                   */
   423                                  if (pdev->bus->max_bus_speed != PCIE_SPEED_5_0GT)
   424                                          return rc;
   425
   426                                  use_32bit_msi_hack = 1;
   427                          }
   428                  } else
   429                          rc = -1;
   430
   431                  if (rc < 0)
   432                          rc = rtas_change_msi(pdn, RTAS_CHANGE_MSI_FN, nvec);
   433
   434                  if (rc < 0) {
   435                          pr_debug("rtas_msi: trying the old firmware call.\n");
   436                          rc = rtas_change_msi(pdn, RTAS_CHANGE_FN, nvec);
   437                  }
   438
   439                  if (use_32bit_msi_hack && rc > 0)
   440                          rtas_hack_32bit_msi_gen2(pdev);
   441          } else
   442                  rc = rtas_change_msi(pdn, RTAS_CHANGE_MSIX_FN, nvec);
   443
   444          if (rc != nvec) {
   445                  if (nvec != nvec_in) {
   446                          nvec = nvec_in;
   447                          goto again;
   448                  }
   449                  pr_debug("rtas_msi: rtas_change_msi() failed\n");
   450                  return rc;
   451          }
   452
   453          i = 0;
   454          for_each_pci_msi_entry(entry, pdev) {
   455                  hwirq = rtas_query_irq_number(pdn, i++);
   456                  if (hwirq < 0) {
   457                          pr_debug("rtas_msi: error (%d) getting hwirq\n", rc);
   458                          return hwirq;
   459                  }
   460
   461                  /*
   462                   * Depending on the number of online CPUs in the original
   463                   * kernel, it is likely for CPU #0 to be offline in a kdump
   464                   * kernel. The associated IRQs in the affinity mappings
   465                   * provided by irq_create_affinity_masks() are thus not
   466                   * started by irq_startup(), as per-design for managed IRQs.
   467                   * This can be a problem with multi-queue block devices driven
   468                   * by blk-mq : such a non-started IRQ is very likely paired
   469                   * with the single queue enforced by blk-mq during kdump (see
   470                   * blk_mq_alloc_tag_set()). This causes the device to remain
   471                   * silent and likely hangs the guest at some point.
   472                   *
   473                   * We don't really care for fine-grained affinity when doing
   474                   * kdump actually : simply ignore the pre-computed affinity
   475                   * masks in this case and let the default mask with all CPUs
   476                   * be used when creating the IRQ mappings.
   477                   */
 > 478                  if (is_kdump_kernel())
   479                          virq = irq_create_mapping(NULL, hwirq);
   480                  else
   481                          virq = irq_create_mapping_affinity(NULL, hwirq,
   482                                                             entry->affinity);
   483
   484                  if (!virq) {
   485                          pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
   486                          return -ENOSPC;
   487                  }
   488
   489                  dev_dbg(&pdev->dev, "rtas_msi: allocated virq %d\n", virq);
   490                  irq_set_msi_desc(virq, entry);
   491
   492                  /* Read config space back so we can restore after reset */
   493                  __pci_read_msi_msg(entry, &msg);
   494                  entry->msg = msg;
   495          }
   496
   497          return 0;
   498  }
   499
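The build failure above shows the patch calls is_kdump_kernel() without its declaration in scope. Since is_kdump_kernel() lives in include/linux/crash_dump.h, a respin would presumably just add the missing include at the top of the file, along these lines:

	/* arch/powerpc/platforms/pseries/msi.c -- hypothetical fixup */
	#include <linux/crash_dump.h>	/* for is_kdump_kernel() */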
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org