On Jul 29, 2020, at 8:03 AM, Hou Pu houpu@bytedance.com wrote:
The iscsi target login thread might get stuck with the following stack:
cat /proc/`pidof iscsi_np`/stack
[<0>] down_interruptible+0x42/0x50
[<0>] iscsit_access_np+0xe3/0x167
[<0>] iscsi_target_locate_portal+0x695/0x8ac
[<0>] __iscsi_target_login_thread+0x855/0xb82
[<0>] iscsi_target_login_thread+0x2f/0x5a
[<0>] kthread+0xfa/0x130
[<0>] ret_from_fork+0x1f/0x30
This can be reproduced with the following steps:
1. Initiator A tries to log in to iqn1-tpg1 on port 3260. After the PDU
   exchange in the login thread finishes, but before the negotiation is
   finished, the network link goes down. This can happen in a production
   environment. I emulated it by bringing the network card down on the
   initiator node with ifconfig eth0 down. (Now A can never finish this
   login, and tpg->np_login_sem is held by it.)
2. Initiator B tries to log in to iqn2-tpg1 on port 3260. After the PDU
   exchange in the login thread finishes, the target expects to process
   the remaining login PDUs in workqueue context.
3. Initiator A' tries to re-login to iqn1-tpg1 on port 3260 from a new
   socket. It waits for tpg->np_login_sem with np->np_login_timer armed
   to wait for at most 15 seconds. (Because the lock is held by A, and A
   never gets a chance to release tpg->np_login_sem, A' should eventually
   time out.)
4. Before A' times out, initiator B fails negotiation and calls
   iscsi_target_login_drop()->iscsi_target_login_sess_out(). The
   np->np_login_timer is canceled, and initiator A' hangs there forever.
   Because A' is blocked in the login thread, no other login requests
   can be serviced.
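The ordering in steps 3 and 4 can be replayed with a small userspace model. This is only a sketch, not the kernel code: struct np_model, login_start_timer(), login_drop() and replay_hang() are hypothetical names, and it assumes that both logins reach the same np (and therefore the same login timer), which is what the report implies.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct iscsi_np: one login timer per portal. */
struct np_model {
    bool timer_armed;   /* models np->np_login_timer being pending */
};

/* Hypothetical stand-in for one login attempt on that portal. */
struct login_model {
    bool waiting_on_sem; /* models blocking on tpg->np_login_sem */
};

/* Models arming the portal login timer before waiting on the semaphore. */
static void login_start_timer(struct np_model *np)
{
    np->timer_armed = true;
}

/*
 * Models the login failure path. Assumption: it cancels the portal timer
 * without checking which login armed it.
 */
static void login_drop(struct np_model *np)
{
    np->timer_armed = false;
}

/* A blocked waiter can only be released by the timeout if the timer is armed. */
static bool waiter_can_time_out(const struct np_model *np,
                                const struct login_model *login)
{
    return login->waiting_on_sem && np->timer_armed;
}

/* Replays steps 3 and 4; returns true if A' is left waiting with no timer. */
static bool replay_hang(void)
{
    struct np_model np = { .timer_armed = false };
    struct login_model a_prime = { .waiting_on_sem = false };

    /* Step 3: A' arms the timer, then blocks on the semaphore held by A. */
    login_start_timer(&np);
    a_prime.waiting_on_sem = true;

    /* Step 4: B fails negotiation in workqueue context and drops its login. */
    login_drop(&np);

    return a_prime.waiting_on_sem && !waiter_can_time_out(&np, &a_prime);
}
```

In this model replay_hang() returns true: A' is still waiting, but nothing is left to wake it. If the two logins did not share a timer, login_drop() would act on a different np_model and A' would still time out.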
iqn1 and iqn2 are different targets, right? It’s not clear to me how initiator B failing negotiation cancels the timer for the portal under a different iqn/target.
Is iqn2-tpg1->np1 a different struct than iqn1-tpg1->np1? I mean, would iscsit_get_tpg_from_np return a different np struct for initiator B and for A?