On 17. 03. 23, 3:41, juanfengpy@gmail.com wrote:
From: Hui Li caelli@tencent.com
We have met a hang on pty device, the reader was blocking at epoll on master side, the writer was sleeping at wait_woken inside n_tty_write on slave side, and the write buffer on tty_port was full, we found that the reader and writer would never be woken again and blocked forever.
The problem was caused by a race between reader and kworker: n_tty_read(reader): n_tty_receive_buf_common(kworker): copy_from_read_buf()| |room = N_TTY_BUF_SIZE - (ldata->read_head - tail) |room <= 0 n_tty_kick_worker() | |ldata->no_room = true
After writing to slave device, writer wakes up kworker to flush data on tty_port to reader, and the kworker finds that reader has no room to store data so room <= 0 is met. At this moment, reader consumes all the data on reader buffer and calls n_tty_kick_worker to check ldata->no_room which is false and reader quits reading. Then kworker sets ldata->no_room=true and quits too.
...
--- a/drivers/tty/n_tty.c +++ b/drivers/tty/n_tty.c
...
@@ -1729,6 +1729,27 @@ n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp, } else n_tty_check_throttle(tty);
- if (unlikely(ldata->no_room)) {
/*
* Barrier here is to ensure to read the latest read_tail in
* chars_in_buffer() and to make sure that read_tail is not loaded
* before ldata->no_room is set,
I am not sure I would keep the following part of the comment in the code:
otherwise, following race may occur:
* n_tty_receive_buf_common()
* n_tty_read()
* if (!chars_in_buffer(tty))->false
* copy_from_read_buf()
* read_tail=commit_head
* n_tty_kick_worker()
* if (ldata->no_room)->false
* ldata->no_room = 1
* Then both kworker and reader will fail to kick n_tty_kick_worker(),
* smp_mb is paired with smp_mb() in n_tty_read().
I would only let it ^^^ documented in the commit log as you did.
*/
smp_mb();
if (!chars_in_buffer(tty))
n_tty_kick_worker(tty);
- }
- up_read(&tty->termios_rwsem);
return rcvd; @@ -2282,8 +2303,25 @@ static ssize_t n_tty_read(struct tty_struct *tty, struct file *file, if (time) timeout = time; }
- if (old_tail != ldata->read_tail)
- if (old_tail != ldata->read_tail) {
/*
* Make sure no_room is not read in n_tty_kick_worker()
* before setting ldata->read_tail in copy_from_read_buf(),
The same here (it's only repeated). I think the above two lines are enough for the comment. We have git blame after all.
* otherwise, following race may occur:
* n_tty_read()
* n_tty_receive_buf_common()
* n_tty_kick_worker()
* if(ldata->no_room)->false
* ldata->no_room = 1
* if (!chars_in_buffer(tty))->false
* copy_from_read_buf()
* read_tail=commit_head
* Both reader and kworker will fail to kick tty_buffer_restart_work(),
* smp_mb is paired with smp_mb() in n_tty_receive_buf_common().
*/
n_tty_kick_worker(tty);smp_mb();
- } up_read(&tty->termios_rwsem);
remove_wait_queue(&tty->read_wait, &wait);