Hi Doug and Jason,
Here are some patches to go to for-next. These include the couple patches that needed rework that were posted before the OFA conf. Well actually those patches that had issues were just dropped with the exception of the one from Alex, to add handling of kernel restart to hfi1 and qib. Patch 8 is his V2.
Nothing else too scary or exciting in here. Well OK so that's not quite right the CQ completion vector patch is rather interesting. This adds support for compeltion vectors for hfi1 and helps improve performance in things like IPoIB.
There is a signifianct patch from Mitko that redoes a lof our fault injection stuff. It's a big patch but I'm not sure it lends itself to being broken up further.
One other thing of note is the "Create common functions" patch from Sebastian depends on one of the patches that I sent for the -rc. It won't apply cleanly without that.
---
Alex Estrin (2): IB/hfi1: Complete check for locally terminated smp IB/{hfi1,qib}: Add handling of kernel restart
Brian Welty (1): IB/{hfi1,qib,rdmavt}: Move logic to allocate receive WQE into rdmavt
Kamenee Arumugam (1): IB/Hfi1: Read CCE Revision register to verify the device is responsive
Michael J. Ruhl (4): IB/hfi1: Return actual error value from program_rcvarray() IB/hfi1: Use after free race condition in send context error path IB/hfi1: Return correct value for device state IB/hfi1: Reorder incorrect send context disable
Mike Marciniszyn (1): IB/hfi1: Fix fault injection init/exit issues
Mitko Haralanov (1): IB/hfi1: Rework fault injection machinery
Sebastian Sanchez (4): IB/hfi1: Prevent LNI hang when LCB can't obtain lanes IB/hfi1: Optimize kthread pointer locking when queuing CQ entries IB/hfi1: Create common functions for affinity CPU mask operations IB/{hfi1,rdmavt,qib}: Implement CQ completion vector support
drivers/infiniband/hw/hfi1/Makefile | 10 - drivers/infiniband/hw/hfi1/affinity.c | 497 +++++++++++++++++++++++++-- drivers/infiniband/hw/hfi1/affinity.h | 10 - drivers/infiniband/hw/hfi1/chip.c | 74 +++- drivers/infiniband/hw/hfi1/chip.h | 15 + drivers/infiniband/hw/hfi1/chip_registers.h | 7 drivers/infiniband/hw/hfi1/debugfs.c | 292 ---------------- drivers/infiniband/hw/hfi1/debugfs.h | 93 +++-- drivers/infiniband/hw/hfi1/driver.c | 20 + drivers/infiniband/hw/hfi1/fault.c | 375 ++++++++++++++++++++ drivers/infiniband/hw/hfi1/fault.h | 109 ++++++ drivers/infiniband/hw/hfi1/file_ops.c | 2 drivers/infiniband/hw/hfi1/hfi.h | 14 + drivers/infiniband/hw/hfi1/init.c | 28 +- drivers/infiniband/hw/hfi1/mad.c | 36 +- drivers/infiniband/hw/hfi1/pcie.c | 8 drivers/infiniband/hw/hfi1/pio.c | 44 ++ drivers/infiniband/hw/hfi1/rc.c | 8 drivers/infiniband/hw/hfi1/ruc.c | 154 -------- drivers/infiniband/hw/hfi1/trace.c | 3 drivers/infiniband/hw/hfi1/trace_dbg.h | 3 drivers/infiniband/hw/hfi1/uc.c | 4 drivers/infiniband/hw/hfi1/ud.c | 4 drivers/infiniband/hw/hfi1/user_exp_rcv.c | 1 drivers/infiniband/hw/hfi1/verbs.c | 20 - drivers/infiniband/hw/hfi1/verbs.h | 8 drivers/infiniband/hw/qib/qib.h | 1 drivers/infiniband/hw/qib/qib_init.c | 13 + drivers/infiniband/hw/qib/qib_rc.c | 8 drivers/infiniband/hw/qib/qib_ruc.c | 154 -------- drivers/infiniband/hw/qib/qib_uc.c | 4 drivers/infiniband/hw/qib/qib_ud.c | 4 drivers/infiniband/hw/qib/qib_verbs.c | 6 drivers/infiniband/hw/qib/qib_verbs.h | 2 drivers/infiniband/sw/rdmavt/cq.c | 74 ++-- drivers/infiniband/sw/rdmavt/cq.h | 6 drivers/infiniband/sw/rdmavt/qp.c | 149 ++++++++ drivers/infiniband/sw/rdmavt/trace_cq.h | 35 ++ drivers/infiniband/sw/rdmavt/vt.c | 35 +- include/rdma/rdma_vt.h | 7 include/rdma/rdmavt_cq.h | 5 include/rdma/rdmavt_qp.h | 1 42 files changed, 1491 insertions(+), 852 deletions(-) create mode 100644 drivers/infiniband/hw/hfi1/fault.c create mode 100644 drivers/infiniband/hw/hfi1/fault.h
-- -Denny
From: Mike Marciniszyn mike.marciniszyn@intel.com
There are config dependent code paths that expose panics in unload paths both in this file and in debugfs_remove_recursive() because CONFIG_FAULT_INJECTION and CONFIG_FAULT_INJECTION_DEBUG_FS can be set independently.
Having CONFIG_FAULT_INJECTION set and CONFIG_FAULT_INJECTION_DEBUG_FS reset causes fault_create_debugfs_attr() to return an error.
The debugfs.c routines tolerate failures, but the module unload panics dereferencing a NULL in the two exit routines. If that is fixed, the dir passed to debugfs_remove_recursive comes from a memory location that was freed and potentially reused causing a segfault or corrupting memory.
Here is an example of the NULL deref panic:
[66866.286829] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088 [66866.295602] IP: hfi1_dbg_ibdev_exit+0x2a/0x80 [hfi1] [66866.301138] PGD 858496067 P4D 858496067 PUD 8433a7067 PMD 0 [66866.307452] Oops: 0000 [#1] SMP [66866.310953] Modules linked in: hfi1(-) rdmavt rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm iw_cm ib_cm ib_core rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache sb_edac x86_pkg_temp_thermal intel_powerclamp vfat fat coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt iTCO_vendor_support crypto_simd mei_me glue_helper cryptd mxm_wmi ipmi_si pcspkr lpc_ich sg mei ioatdma ipmi_devintf i2c_i801 mfd_core shpchp ipmi_msghandler wmi acpi_power_meter acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm ahci ptp crc32c_intel libahci pps_core drm dca libata i2c_algo_bit i2c_core [last unloaded: opa_vnic] [66866.385551] CPU: 8 PID: 7470 Comm: rmmod Not tainted 4.14.0-mam-tid-rdma #2 [66866.393317] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0018.C4.072020161249 07/20/2016 [66866.405252] task: ffff88084f28c380 task.stack: ffffc90008454000 [66866.411866] RIP: 0010:hfi1_dbg_ibdev_exit+0x2a/0x80 [hfi1] [66866.417984] RSP: 0018:ffffc90008457da0 EFLAGS: 00010202 [66866.423812] RAX: 0000000000000000 RBX: ffff880857de0000 RCX: 0000000180040001 [66866.431773] RDX: 0000000180040002 RSI: ffffea0021088200 RDI: 0000000040000000 [66866.439734] RBP: ffffc90008457da8 R08: ffff88084220e000 R09: 0000000180040001 [66866.447696] R10: 000000004220e001 R11: ffff88084220e000 R12: ffff88085a31c000 [66866.455657] R13: ffffffffa07c9820 R14: ffffffffa07c9890 R15: ffff881059d78100 [66866.463618] FS: 00007f6876047740(0000) GS:ffff88085f800000(0000) knlGS:0000000000000000 [66866.472644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [66866.479053] CR2: 0000000000000088 CR3: 0000000856357006 CR4: 00000000001606e0 [66866.487013] Call Trace: [66866.489747] remove_one+0x1f/0x220 [hfi1] [66866.494221] pci_device_remove+0x39/0xc0 [66866.498596] device_release_driver_internal+0x141/0x210 [66866.504424] driver_detach+0x3f/0x80 [66866.508409] bus_remove_driver+0x55/0xd0 [66866.512784] driver_unregister+0x2c/0x50 [66866.517164] pci_unregister_driver+0x2a/0xa0 [66866.521934] hfi1_mod_cleanup+0x10/0xaa2 [hfi1] [66866.526988] SyS_delete_module+0x171/0x250 [66866.531558] do_syscall_64+0x67/0x1b0 [66866.535644] entry_SYSCALL64_slow_path+0x25/0x25 [66866.540792] RIP: 0033:0x7f6875525c27 [66866.544777] RSP: 002b:00007ffd48528e78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [66866.553224] RAX: ffffffffffffffda RBX: 0000000001cc01d0 RCX: 00007f6875525c27 [66866.561185] RDX: 00007f6875596000 RSI: 0000000000000800 RDI: 0000000001cc0238 [66866.569146] RBP: 0000000000000000 R08: 00007f68757e9060 R09: 00007f6875596000 [66866.577120] R10: 00007ffd48528c00 R11: 0000000000000206 R12: 00007ffd48529db4 [66866.585080] R13: 0000000000000000 R14: 0000000001cc01d0 R15: 0000000001cc0010 [66866.593040] Code: 90 0f 1f 44 00 00 48 83 3d a3 8b 03 00 00 55 48 89 e5 53 48 89 fb 74 4e 48 8d bf 18 0c 00 00 e8 9d f2 ff ff 48 8b 83 20 0c 00 00 <48> 8b b8 88 00 00 00 e8 2a 21 b3 e0 48 8b bb 20 0c 00 00 e8 0e [66866.614127] RIP: hfi1_dbg_ibdev_exit+0x2a/0x80 [hfi1] RSP: ffffc90008457da0 [66866.621885] CR2: 0000000000000088 [66866.625618] ---[ end trace c4817425783fb092 ]---
Fix by insuring that upon failure from fault_create_debugfs_attr() the parent pointer for the routines is always set to NULL and guards added in the exit routines to insure that debugfs_remove_recursive() is not called when when the parent pointer is NULL.
Fixes: 0181ce31b260 ("IB/hfi1: Add receive fault injection feature") Cc: stable@vger.kernel.org # 4.14.x Reviewed-by: Michael J. Ruhl michael.j.ruhl@intel.com Signed-off-by: Mike Marciniszyn mike.marciniszyn@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com --- drivers/infiniband/hw/hfi1/debugfs.c | 8 ++++++-- 1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/debugfs.c b/drivers/infiniband/hw/hfi1/debugfs.c index 852173b..5343960 100644 --- a/drivers/infiniband/hw/hfi1/debugfs.c +++ b/drivers/infiniband/hw/hfi1/debugfs.c @@ -1227,7 +1227,8 @@ static int _fault_stats_seq_show(struct seq_file *s, void *v)
static void fault_exit_opcode_debugfs(struct hfi1_ibdev *ibd) { - debugfs_remove_recursive(ibd->fault_opcode->dir); + if (ibd->fault_opcode) + debugfs_remove_recursive(ibd->fault_opcode->dir); kfree(ibd->fault_opcode); ibd->fault_opcode = NULL; } @@ -1255,6 +1256,7 @@ static int fault_init_opcode_debugfs(struct hfi1_ibdev *ibd) &ibd->fault_opcode->attr); if (IS_ERR(ibd->fault_opcode->dir)) { kfree(ibd->fault_opcode); + ibd->fault_opcode = NULL; return -ENOENT; }
@@ -1278,7 +1280,8 @@ static int fault_init_opcode_debugfs(struct hfi1_ibdev *ibd)
static void fault_exit_packet_debugfs(struct hfi1_ibdev *ibd) { - debugfs_remove_recursive(ibd->fault_packet->dir); + if (ibd->fault_packet) + debugfs_remove_recursive(ibd->fault_packet->dir); kfree(ibd->fault_packet); ibd->fault_packet = NULL; } @@ -1304,6 +1307,7 @@ static int fault_init_packet_debugfs(struct hfi1_ibdev *ibd) &ibd->fault_opcode->attr); if (IS_ERR(ibd->fault_packet->dir)) { kfree(ibd->fault_packet); + ibd->fault_packet = NULL; return -ENOENT; }
From: Michael J. Ruhl michael.j.ruhl@intel.com
A pio send egress error can occur when the PSM library attempts to to send a bad packet. That issue is still being investigated.
The pio error interrupt handler then attempts to progress the recovery of the errored pio send context.
Code inspection reveals that the handling lacks the necessary locking if that recovery interleaves with a PSM close of the "context" object contains the pio send context.
The lack of the locking can cause the recovery to access the already freed pio send context object and incorrectly deduce that the pio send context is actually a kernel pio send context as shown by the NULL deref stack below:
[<ffffffff8143d78c>] _dev_info+0x6c/0x90 [<ffffffffc0613230>] sc_restart+0x70/0x1f0 [hfi1] [<ffffffff816ab124>] ? __schedule+0x424/0x9b0 [<ffffffffc06133c5>] sc_halted+0x15/0x20 [hfi1] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440 [<ffffffff810ab086>] worker_thread+0x126/0x3c0 [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0 [<ffffffff810b252f>] kthread+0xcf/0xe0 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40 [<ffffffff816b8798>] ret_from_fork+0x58/0x90 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
This is the best case scenario and other scenarios can corrupt the already freed memory.
Fix by adding the necessary locking in the pio send context error handler.
Cc: stable@vger.kernel.org # 4.9.x Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Reviewed-by: Dennis Dalessandro dennis.dalessandro@intel.com Signed-off-by: Michael J. Ruhl michael.j.ruhl@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com --- drivers/infiniband/hw/hfi1/chip.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c index cb9095d..e5254f8 100644 --- a/drivers/infiniband/hw/hfi1/chip.c +++ b/drivers/infiniband/hw/hfi1/chip.c @@ -5944,6 +5944,7 @@ static void is_sendctxt_err_int(struct hfi1_devdata *dd, u64 status; u32 sw_index; int i = 0; + unsigned long irq_flags;
sw_index = dd->hw_to_sw[hw_context]; if (sw_index >= dd->num_send_contexts) { @@ -5953,10 +5954,12 @@ static void is_sendctxt_err_int(struct hfi1_devdata *dd, return; } sci = &dd->send_contexts[sw_index]; + spin_lock_irqsave(&dd->sc_lock, irq_flags); sc = sci->sc; if (!sc) { dd_dev_err(dd, "%s: context %u(%u): no sc?\n", __func__, sw_index, hw_context); + spin_unlock_irqrestore(&dd->sc_lock, irq_flags); return; }
@@ -5978,6 +5981,7 @@ static void is_sendctxt_err_int(struct hfi1_devdata *dd, */ if (sc->type != SC_USER) queue_work(dd->pport->hfi1_wq, &sc->halt_work); + spin_unlock_irqrestore(&dd->sc_lock, irq_flags);
/* * Update the counters for the corresponding status bits.
On Wed, May 02, 2018 at 06:42:51AM -0700, Dennis Dalessandro wrote:
From: Michael J. Ruhl michael.j.ruhl@intel.com
A pio send egress error can occur when the PSM library attempts to to send a bad packet. That issue is still being investigated.
The pio error interrupt handler then attempts to progress the recovery of the errored pio send context.
Code inspection reveals that the handling lacks the necessary locking if that recovery interleaves with a PSM close of the "context" object contains the pio send context.
The lack of the locking can cause the recovery to access the already freed pio send context object and incorrectly deduce that the pio send context is actually a kernel pio send context as shown by the NULL deref stack below:
[<ffffffff8143d78c>] _dev_info+0x6c/0x90 [<ffffffffc0613230>] sc_restart+0x70/0x1f0 [hfi1] [<ffffffff816ab124>] ? __schedule+0x424/0x9b0 [<ffffffffc06133c5>] sc_halted+0x15/0x20 [hfi1] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440 [<ffffffff810ab086>] worker_thread+0x126/0x3c0 [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0 [<ffffffff810b252f>] kthread+0xcf/0xe0 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40 [<ffffffff816b8798>] ret_from_fork+0x58/0x90 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
This is the best case scenario and other scenarios can corrupt the already freed memory.
Fix by adding the necessary locking in the pio send context error handler.
Cc: stable@vger.kernel.org # 4.9.x Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Reviewed-by: Dennis Dalessandro dennis.dalessandro@intel.com Signed-off-by: Michael J. Ruhl michael.j.ruhl@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com
drivers/infiniband/hw/hfi1/chip.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-)
Why are you sending this to for-next not for-rc?
Jason
On 5/4/2018 2:38 PM, Jason Gunthorpe wrote:
On Wed, May 02, 2018 at 06:42:51AM -0700, Dennis Dalessandro wrote:
From: Michael J. Ruhl michael.j.ruhl@intel.com
A pio send egress error can occur when the PSM library attempts to to send a bad packet. That issue is still being investigated.
The pio error interrupt handler then attempts to progress the recovery of the errored pio send context.
Code inspection reveals that the handling lacks the necessary locking if that recovery interleaves with a PSM close of the "context" object contains the pio send context.
The lack of the locking can cause the recovery to access the already freed pio send context object and incorrectly deduce that the pio send context is actually a kernel pio send context as shown by the NULL deref stack below:
[<ffffffff8143d78c>] _dev_info+0x6c/0x90 [<ffffffffc0613230>] sc_restart+0x70/0x1f0 [hfi1] [<ffffffff816ab124>] ? __schedule+0x424/0x9b0 [<ffffffffc06133c5>] sc_halted+0x15/0x20 [hfi1] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440 [<ffffffff810ab086>] worker_thread+0x126/0x3c0 [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0 [<ffffffff810b252f>] kthread+0xcf/0xe0 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40 [<ffffffff816b8798>] ret_from_fork+0x58/0x90 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
This is the best case scenario and other scenarios can corrupt the already freed memory.
Fix by adding the necessary locking in the pio send context error handler.
Cc: stable@vger.kernel.org # 4.9.x Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Reviewed-by: Dennis Dalessandro dennis.dalessandro@intel.com Signed-off-by: Michael J. Ruhl michael.j.ruhl@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com
drivers/infiniband/hw/hfi1/chip.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-)
Why are you sending this to for-next not for-rc?
I went back and forth on this one. In the end decided against it because we've lived with it for so long, note stable tag is all the way back to 4.9, that and the fact that it's extremely unlikely to occur. I would be fine including it with the -rc though. I think a case could be made either way.
-Denny
On Fri, 2018-05-04 at 16:01 -0400, Dennis Dalessandro wrote:
On 5/4/2018 2:38 PM, Jason Gunthorpe wrote:
On Wed, May 02, 2018 at 06:42:51AM -0700, Dennis Dalessandro wrote:
From: Michael J. Ruhl michael.j.ruhl@intel.com
A pio send egress error can occur when the PSM library attempts to to send a bad packet. That issue is still being investigated.
The pio error interrupt handler then attempts to progress the recovery of the errored pio send context.
Code inspection reveals that the handling lacks the necessary locking if that recovery interleaves with a PSM close of the "context" object contains the pio send context.
The lack of the locking can cause the recovery to access the already freed pio send context object and incorrectly deduce that the pio send context is actually a kernel pio send context as shown by the NULL deref stack below:
[<ffffffff8143d78c>] _dev_info+0x6c/0x90 [<ffffffffc0613230>] sc_restart+0x70/0x1f0 [hfi1] [<ffffffff816ab124>] ? __schedule+0x424/0x9b0 [<ffffffffc06133c5>] sc_halted+0x15/0x20 [hfi1] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440 [<ffffffff810ab086>] worker_thread+0x126/0x3c0 [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0 [<ffffffff810b252f>] kthread+0xcf/0xe0 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40 [<ffffffff816b8798>] ret_from_fork+0x58/0x90 [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
This is the best case scenario and other scenarios can corrupt the already freed memory.
Fix by adding the necessary locking in the pio send context error handler.
Cc: stable@vger.kernel.org # 4.9.x Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Reviewed-by: Dennis Dalessandro dennis.dalessandro@intel.com Signed-off-by: Michael J. Ruhl michael.j.ruhl@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com
drivers/infiniband/hw/hfi1/chip.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-)
Why are you sending this to for-next not for-rc?
I went back and forth on this one. In the end decided against it because we've lived with it for so long, note stable tag is all the way back to 4.9, that and the fact that it's extremely unlikely to occur. I would be fine including it with the -rc though. I think a case could be made either way.
-Denny
I went ahead and pulled this into for-rc instead of for-next. Thanks.
From: Michael J. Ruhl michael.j.ruhl@intel.com
User send context integrity bits are cleared before the context is disabled. If the send context is still processing data, any packets that need those integrity bits will cause an error and halt the send context.
During the disable handling, the driver waits for the context to drain. If the context is halted, the driver will eventually timeout because the context won't drain and then incorrectly bounce the link.
Reorder the bit clearing and the context disable.
Examine the software state and send context status as well as the egress status to determine if a send context is in the halted state.
Promote the check macros to static functions for consistency with the new check and to follow kernel style.
Remove an unused define that refers to the egress timeout.
Cc: stable@vger.kernel.org # 4.9.x Reviewed-by: Mitko Haralanov mitko.haralanov@intel.com Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Signed-off-by: Michael J. Ruhl michael.j.ruhl@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com --- drivers/infiniband/hw/hfi1/file_ops.c | 2 +- drivers/infiniband/hw/hfi1/pio.c | 44 ++++++++++++++++++++++++++------- 2 files changed, 35 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c index 1b778fd..c9d23c3 100644 --- a/drivers/infiniband/hw/hfi1/file_ops.c +++ b/drivers/infiniband/hw/hfi1/file_ops.c @@ -689,8 +689,8 @@ static int hfi1_file_close(struct inode *inode, struct file *fp) * checks to default and disable the send context. */ if (uctxt->sc) { - set_pio_integrity(uctxt->sc); sc_disable(uctxt->sc); + set_pio_integrity(uctxt->sc); }
hfi1_free_ctxt_rcv_groups(uctxt); diff --git a/drivers/infiniband/hw/hfi1/pio.c b/drivers/infiniband/hw/hfi1/pio.c index 40dac4d..9cac15d 100644 --- a/drivers/infiniband/hw/hfi1/pio.c +++ b/drivers/infiniband/hw/hfi1/pio.c @@ -50,8 +50,6 @@ #include "qp.h" #include "trace.h"
-#define SC_CTXT_PACKET_EGRESS_TIMEOUT 350 /* in chip cycles */ - #define SC(name) SEND_CTXT_##name /* * Send Context functions @@ -961,15 +959,40 @@ void sc_disable(struct send_context *sc) }
/* return SendEgressCtxtStatus.PacketOccupancy */ -#define packet_occupancy(r) \ - (((r) & SEND_EGRESS_CTXT_STATUS_CTXT_EGRESS_PACKET_OCCUPANCY_SMASK)\ - >> SEND_EGRESS_CTXT_STATUS_CTXT_EGRESS_PACKET_OCCUPANCY_SHIFT) +static u64 packet_occupancy(u64 reg) +{ + return (reg & + SEND_EGRESS_CTXT_STATUS_CTXT_EGRESS_PACKET_OCCUPANCY_SMASK) + >> SEND_EGRESS_CTXT_STATUS_CTXT_EGRESS_PACKET_OCCUPANCY_SHIFT; +}
/* is egress halted on the context? */ -#define egress_halted(r) \ - ((r) & SEND_EGRESS_CTXT_STATUS_CTXT_EGRESS_HALT_STATUS_SMASK) +static bool egress_halted(u64 reg) +{ + return !!(reg & SEND_EGRESS_CTXT_STATUS_CTXT_EGRESS_HALT_STATUS_SMASK); +}
-/* wait for packet egress, optionally pause for credit return */ +/* is the send context halted? */ +static bool is_sc_halted(struct hfi1_devdata *dd, u32 hw_context) +{ + return !!(read_kctxt_csr(dd, hw_context, SC(STATUS)) & + SC(STATUS_CTXT_HALTED_SMASK)); +} + +/** + * sc_wait_for_packet_egress + * @sc: valid send context + * @pause: wait for credit return + * + * Wait for packet egress, optionally pause for credit return + * + * Egress halt and Context halt are not necessarily the same thing, so + * check for both. + * + * NOTE: The context halt bit may not be set immediately. Because of this, + * it is necessary to check the SW SFC_HALTED bit (set in the IRQ) and the HW + * context bit to determine if the context is halted. + */ static void sc_wait_for_packet_egress(struct send_context *sc, int pause) { struct hfi1_devdata *dd = sc->dd; @@ -981,8 +1004,9 @@ static void sc_wait_for_packet_egress(struct send_context *sc, int pause) reg_prev = reg; reg = read_csr(dd, sc->hw_context * 8 + SEND_EGRESS_CTXT_STATUS); - /* done if egress is stopped */ - if (egress_halted(reg)) + /* done if any halt bits, SW or HW are set */ + if (sc->flags & SCF_HALTED || + is_sc_halted(dd, sc->hw_context) || egress_halted(reg)) break; reg = packet_occupancy(reg); if (reg == 0)
From: Alex Estrin alex.estrin@intel.com
A warm restart will fail to unload the driver, leaving link state potentially flapping up to the point the BIOS resets the adapter. Correct the issue by hooking the shutdown pci method, which will bring port down.
Cc: stable@vger.kernel.org # 4.9.x Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Signed-off-by: Alex Estrin alex.estrin@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com
--- v1->v2 limit shutdown callback function to stop device HW only. --- drivers/infiniband/hw/hfi1/hfi.h | 1 + drivers/infiniband/hw/hfi1/init.c | 13 +++++++++++++ drivers/infiniband/hw/qib/qib.h | 1 + drivers/infiniband/hw/qib/qib_init.c | 13 +++++++++++++ 4 files changed, 28 insertions(+), 0 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h index cac2c62..9c97c18 100644 --- a/drivers/infiniband/hw/hfi1/hfi.h +++ b/drivers/infiniband/hw/hfi1/hfi.h @@ -1856,6 +1856,7 @@ struct cc_state *get_cc_state_protected(struct hfi1_pportdata *ppd) #define HFI1_HAS_SDMA_TIMEOUT 0x8 #define HFI1_HAS_SEND_DMA 0x10 /* Supports Send DMA */ #define HFI1_FORCED_FREEZE 0x80 /* driver forced freeze mode */ +#define HFI1_SHUTDOWN 0x100 /* device is shutting down */
/* IB dword length mask in PBC (lower 11 bits); same for all chips */ #define HFI1_PBC_LENGTH_MASK ((1 << 11) - 1) diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c index 6309edf..790542c 100644 --- a/drivers/infiniband/hw/hfi1/init.c +++ b/drivers/infiniband/hw/hfi1/init.c @@ -1058,6 +1058,10 @@ static void shutdown_device(struct hfi1_devdata *dd) unsigned pidx; int i;
+ if (dd->flags & HFI1_SHUTDOWN) + return; + dd->flags |= HFI1_SHUTDOWN; + for (pidx = 0; pidx < dd->num_pports; ++pidx) { ppd = dd->pport + pidx;
@@ -1391,6 +1395,7 @@ void hfi1_disable_after_error(struct hfi1_devdata *dd)
static void remove_one(struct pci_dev *); static int init_one(struct pci_dev *, const struct pci_device_id *); +static void shutdown_one(struct pci_dev *);
#define DRIVER_LOAD_MSG "Intel " DRIVER_NAME " loaded: " #define PFX DRIVER_NAME ": " @@ -1407,6 +1412,7 @@ void hfi1_disable_after_error(struct hfi1_devdata *dd) .name = DRIVER_NAME, .probe = init_one, .remove = remove_one, + .shutdown = shutdown_one, .id_table = hfi1_pci_tbl, .err_handler = &hfi1_pci_err_handler, }; @@ -1816,6 +1822,13 @@ static void remove_one(struct pci_dev *pdev) postinit_cleanup(dd); }
+static void shutdown_one(struct pci_dev *pdev) +{ + struct hfi1_devdata *dd = pci_get_drvdata(pdev); + + shutdown_device(dd); +} + /** * hfi1_create_rcvhdrq - create a receive header queue * @dd: the hfi1_ib device diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h index 4607245..43a68d7 100644 --- a/drivers/infiniband/hw/qib/qib.h +++ b/drivers/infiniband/hw/qib/qib.h @@ -1228,6 +1228,7 @@ int qib_cdev_init(int minor, const char *name, #define QIB_BADINTR 0x8000 /* severe interrupt problems */ #define QIB_DCA_ENABLED 0x10000 /* Direct Cache Access enabled */ #define QIB_HAS_QSFP 0x20000 /* device (card instance) has QSFP */ +#define QIB_SHUTDOWN 0x40000 /* device is shutting down */
/* * values for ppd->lflags (_ib_port_ related flags) diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c index 6c68f8a..0155202 100644 --- a/drivers/infiniband/hw/qib/qib_init.c +++ b/drivers/infiniband/hw/qib/qib_init.c @@ -841,6 +841,10 @@ static void qib_shutdown_device(struct qib_devdata *dd) struct qib_pportdata *ppd; unsigned pidx;
+ if (dd->flags & QIB_SHUTDOWN) + return; + dd->flags |= QIB_SHUTDOWN; + for (pidx = 0; pidx < dd->num_pports; ++pidx) { ppd = dd->pport + pidx;
@@ -1182,6 +1186,7 @@ void qib_disable_after_error(struct qib_devdata *dd)
static void qib_remove_one(struct pci_dev *); static int qib_init_one(struct pci_dev *, const struct pci_device_id *); +static void qib_shutdown_one(struct pci_dev *);
#define DRIVER_LOAD_MSG "Intel " QIB_DRV_NAME " loaded: " #define PFX QIB_DRV_NAME ": " @@ -1199,6 +1204,7 @@ void qib_disable_after_error(struct qib_devdata *dd) .name = QIB_DRV_NAME, .probe = qib_init_one, .remove = qib_remove_one, + .shutdown = qib_shutdown_one, .id_table = qib_pci_tbl, .err_handler = &qib_pci_err_handler, }; @@ -1549,6 +1555,13 @@ static void qib_remove_one(struct pci_dev *pdev) qib_postinit_cleanup(dd); }
+static void qib_shutdown_one(struct pci_dev *pdev) +{ + struct qib_devdata *dd = pci_get_drvdata(pdev); + + qib_shutdown_device(dd); +} + /** * qib_create_rcvhdrq - create a receive header queue * @dd: the qlogic_ib device
From: Sebastian Sanchez sebastian.sanchez@intel.com
All threads queuing CQ entries on different CQs are unnecessarily synchronized by a spin lock to check if the CQ kthread worker hasn't been destroyed before queuing an CQ entry.
The lock used in 6efaf10f163d ("IB/rdmavt: Avoid queuing work into a destroyed cq kthread worker") is a device global lock and will have poor performance at scale as completions are entered from a large number of CPUs.
Convert to use RCU where the read side of RCU is rvt_cq_enter() to determine that the worker is alive prior to triggering the completion event. Apply write side RCU semantics in rvt_driver_cq_init() and rvt_cq_exit().
Fixes: 6efaf10f163d ("IB/rdmavt: Avoid queuing work into a destroyed cq kthread worker") Cc: stable@vger.kernel.org # 4.14.x Reviewed-by: Mike Marciniszyn mike.marciniszyn@intel.com Signed-off-by: Sebastian Sanchez sebastian.sanchez@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessandro@intel.com --- drivers/infiniband/sw/rdmavt/cq.c | 31 +++++++++++++++++++------------ include/rdma/rdma_vt.h | 2 +- 2 files changed, 20 insertions(+), 13 deletions(-)
diff --git a/drivers/infiniband/sw/rdmavt/cq.c b/drivers/infiniband/sw/rdmavt/cq.c index fb52b66..340c17a 100644 --- a/drivers/infiniband/sw/rdmavt/cq.c +++ b/drivers/infiniband/sw/rdmavt/cq.c @@ -120,17 +120,20 @@ void rvt_cq_enter(struct rvt_cq *cq, struct ib_wc *entry, bool solicited) if (cq->notify == IB_CQ_NEXT_COMP || (cq->notify == IB_CQ_SOLICITED && (solicited || entry->status != IB_WC_SUCCESS))) { + struct kthread_worker *worker; + /* * This will cause send_complete() to be called in * another thread. */ - spin_lock(&cq->rdi->n_cqs_lock); - if (likely(cq->rdi->worker)) { + rcu_read_lock(); + worker = rcu_dereference(cq->rdi->worker); + if (likely(worker)) { cq->notify = RVT_CQ_NONE; cq->triggered++; - kthread_queue_work(cq->rdi->worker, &cq->comptask); + kthread_queue_work(worker, &cq->comptask); } - spin_unlock(&cq->rdi->n_cqs_lock); + rcu_read_unlock(); }
spin_unlock_irqrestore(&cq->lock, flags); @@ -512,7 +515,7 @@ int rvt_driver_cq_init(struct rvt_dev_info *rdi) int cpu; struct kthread_worker *worker;
- if (rdi->worker) + if (rcu_access_pointer(rdi->worker)) return 0;
spin_lock_init(&rdi->n_cqs_lock); @@ -524,7 +527,7 @@ int rvt_driver_cq_init(struct rvt_dev_info *rdi) return PTR_ERR(worker);
set_user_nice(worker->task, MIN_NICE); - rdi->worker = worker; + RCU_INIT_POINTER(rdi->worker, worker); return 0; }
@@ -536,15 +539,19 @@ void rvt_cq_exit(struct rvt_dev_info *rdi) { struct kthread_worker *worker;
- /* block future queuing from send_complete() */ - spin_lock_irq(&rdi->n_cqs_lock); - worker = rdi->worker; + if (!rcu_access_pointer(rdi->worker)) + return; + + spin_lock(&rdi->n_cqs_lock); + worker = rcu_dereference_protected(rdi->worker, + lockdep_is_held(&rdi->n_cqs_lock)); if (!worker) { - spin_unlock_irq(&rdi->n_cqs_lock); + spin_unlock(&rdi->n_cqs_lock); return; } - rdi->worker = NULL; - spin_unlock_irq(&rdi->n_cqs_lock); + RCU_INIT_POINTER(rdi->worker, NULL); + spin_unlock(&rdi->n_cqs_lock); + synchronize_rcu();
kthread_destroy_worker(worker); } diff --git a/include/rdma/rdma_vt.h b/include/rdma/rdma_vt.h index 3f4c187..eec495e 100644 --- a/include/rdma/rdma_vt.h +++ b/include/rdma/rdma_vt.h @@ -402,7 +402,7 @@ struct rvt_dev_info { spinlock_t pending_lock; /* protect pending mmap list */
/* CQ */ - struct kthread_worker *worker; /* per device cq worker */ + struct kthread_worker __rcu *worker; /* per device cq worker */ u32 n_cqs_allocated; /* number of CQs allocated for device */ spinlock_t n_cqs_lock; /* protect count of in use cqs */
On Wed, 2018-05-02 at 06:42 -0700, Dennis Dalessandro wrote:
Hi Doug and Jason,
Here are some patches to go to for-next. These include the couple patches that needed rework that were posted before the OFA conf. Well actually those patches that had issues were just dropped with the exception of the one from Alex, to add handling of kernel restart to hfi1 and qib. Patch 8 is his V2.
Nothing else too scary or exciting in here. Well OK so that's not quite right the CQ completion vector patch is rather interesting. This adds support for compeltion vectors for hfi1 and helps improve performance in things like IPoIB.
There is a signifianct patch from Mitko that redoes a lof our fault injection stuff. It's a big patch but I'm not sure it lends itself to being broken up further.
One other thing of note is the "Create common functions" patch from Sebastian depends on one of the patches that I sent for the -rc. It won't apply cleanly without that.
Alex Estrin (2): IB/hfi1: Complete check for locally terminated smp IB/{hfi1,qib}: Add handling of kernel restart
Brian Welty (1): IB/{hfi1,qib,rdmavt}: Move logic to allocate receive WQE into rdmavt
Kamenee Arumugam (1): IB/Hfi1: Read CCE Revision register to verify the device is responsive
Michael J. Ruhl (4): IB/hfi1: Return actual error value from program_rcvarray() IB/hfi1: Use after free race condition in send context error path IB/hfi1: Return correct value for device state IB/hfi1: Reorder incorrect send context disable
Mike Marciniszyn (1): IB/hfi1: Fix fault injection init/exit issues
Mitko Haralanov (1): IB/hfi1: Rework fault injection machinery
Sebastian Sanchez (4): IB/hfi1: Prevent LNI hang when LCB can't obtain lanes IB/hfi1: Optimize kthread pointer locking when queuing CQ entries IB/hfi1: Create common functions for affinity CPU mask operations IB/{hfi1,rdmavt,qib}: Implement CQ completion vector support
Hi Denny,
As mentioned, I pulled patch 5 of this series into for-rc, and I pulled the rest into for-next last week. Thanks.
linux-stable-mirror@lists.linaro.org