The patch titled
Subject: epoll: ensure ep_poll() doesn't miss wakeup events
has been added to the -mm tree. Its filename is
epoll-ensure-ep_poll-doesnt-miss-wakeup-events.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/epoll-ensure-ep_poll-doesnt-miss-w…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/epoll-ensure-ep_poll-doesnt-miss-w…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Jason Baron <jbaron(a)akamai.com>
Subject: epoll: ensure ep_poll() doesn't miss wakeup events
Now that the ep_events_available() check is done in a lockless way, and we
no longer perform wakeups from ep_scan_ready_list(), we need to ensure
that either ep->rdllist has items or the overflow list is active. Prior
to: commit 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested
epoll"), we did wake_up(&ep->wq) after manipulating the ep->rdllist and
the overflow list. Thus, any waiters would observe the correct state.
However, with that wake_up() now removed we need to be more careful to
ensure that condition.
Here's an example of what could go wrong:
We have epoll fds: epfd1, epfd2. And epfd1 is added to epfd2 and epfd2 is
added to a socket: epfd1->epfd2->socket. Thread a is doing epoll_wait()
on epfd1, and thread b is doing epoll_wait on epfd2. Then:
1) data comes in on socket
ep_poll_callback() wakes up threads a and b
2) thread a runs
ep_poll()
ep_scan_ready_list()
ep_send_events_proc()
ep_item_poll()
ep_scan_ready_list()
list_splice_init(&ep->rdllist, &txlist);
3) now thread b is running
ep_poll()
ep_events_available()
returns false
schedule_hrtimeout_range()
Thus, thread b has now scheduled and missed the wakeup.
Link: http://lkml.kernel.org/r/1588360533-11828-1-git-send-email-jbaron@akamai.com
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Jason Baron <jbaron(a)akamai.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Heiher <r(a)hev.cc>
Cc: Roman Penyaev <rpenyaev(a)suse.de>
Cc: Khazhismel Kumykov <khazhy(a)google.com>
Cc: Davidlohr Bueso <dbueso(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/eventpoll.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
--- a/fs/eventpoll.c~epoll-ensure-ep_poll-doesnt-miss-wakeup-events
+++ a/fs/eventpoll.c
@@ -704,8 +704,14 @@ static __poll_t ep_scan_ready_list(struc
* in a lockless way.
*/
write_lock_irq(&ep->lock);
- list_splice_init(&ep->rdllist, &txlist);
WRITE_ONCE(ep->ovflist, NULL);
+ /*
+ * In ep_poll() we use ep_events_available() in a lockless way to decide
+ * if events are available. So we need to preserve that either
+ * ep->oflist != EP_UNACTIVE_PTR or there are events on the ep->rdllist.
+ */
+ smp_wmb();
+ list_splice_init(&ep->rdllist, &txlist);
write_unlock_irq(&ep->lock);
/*
@@ -737,16 +743,21 @@ static __poll_t ep_scan_ready_list(struc
}
}
/*
+ * Quickly re-inject items left on "txlist".
+ */
+ list_splice(&txlist, &ep->rdllist);
+ /*
+ * In ep_poll() we use ep_events_available() in a lockless way to decide
+ * if events are available. So we need to preserve that either
+ * ep->oflist != EP_UNACTIVE_PTR or there are events on the ep->rdllist.
+ */
+ smp_wmb();
+ /*
* We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
* releasing the lock, events will be queued in the normal way inside
* ep->rdllist.
*/
WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
-
- /*
- * Quickly re-inject items left on "txlist".
- */
- list_splice(&txlist, &ep->rdllist);
__pm_relax(ep->ws);
write_unlock_irq(&ep->lock);
_
Patches currently in -mm which might be from jbaron(a)akamai.com are
epoll-ensure-ep_poll-doesnt-miss-wakeup-events.patch
This patch does two things:
1. fixes lost wakeup introduced by:
339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
2. improves performance for events delivery.
The description of the problem is the following: if N (>1) threads
are waiting on ep->wq for new events and M (>1) events come, it is
quite likely that >1 wakeups hit the same wait queue entry, because
there is quite a big window between __add_wait_queue_exclusive() and
the following __remove_wait_queue() calls in ep_poll() function. This
can lead to lost wakeups, because thread, which was woken up, can
handle not all the events in ->rdllist. (in better words the problem
is described here: https://lkml.org/lkml/2019/10/7/905)
The idea of the current patch is to use init_wait() instead of
init_waitqueue_entry(). Internally init_wait() sets
autoremove_wake_function as a callback, which removes the wait entry
atomically (under the wq locks) from the list, thus the next coming
wakeup hits the next wait entry in the wait queue, thus preventing
lost wakeups.
Problem is very well reproduced by the epoll60 test case [1].
Wait entry removal on wakeup has also performance benefits, because
there is no need to take a ep->lock and remove wait entry from the
queue after the successful wakeup. Here is the timing output of
the epoll60 test case:
With explicit wakeup from ep_scan_ready_list() (the state of the
code prior 339ddb53d373):
real 0m6.970s
user 0m49.786s
sys 0m0.113s
After this patch:
real 0m5.220s
user 0m36.879s
sys 0m0.019s
The other testcase is the stress-epoll [2], where one thread consumes
all the events and other threads produce many events:
With explicit wakeup from ep_scan_ready_list() (the state of the
code prior 339ddb53d373):
threads events/ms run-time ms
8 5427 1474
16 6163 2596
32 6824 4689
64 7060 9064
128 6991 18309
After this patch:
threads events/ms run-time ms
8 5598 1429
16 7073 2262
32 7502 4265
64 7640 8376
128 7634 16767
(number of "events/ms" represents event bandwidth, thus higher is
better; number of "run-time ms" represents overall time spent
doing the benchmark, thus lower is better)
[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
Signed-off-by: Roman Penyaev <rpenyaev(a)suse.de>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Khazhismel Kumykov <khazhy(a)google.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Heiher <r(a)hev.cc>
Cc: Jason Baron <jbaron(a)akamai.com>
Cc: stable(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
fs/eventpoll.c | 43 ++++++++++++++++++++++++-------------------
1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d6ba0e52439b..aba03ee749f8 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1822,7 +1822,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
{
int res = 0, eavail, timed_out = 0;
u64 slack = 0;
- bool waiter = false;
wait_queue_entry_t wait;
ktime_t expires, *to = NULL;
@@ -1867,21 +1866,23 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
*/
ep_reset_busy_poll_napi_id(ep);
- /*
- * We don't have any available event to return to the caller. We need
- * to sleep here, and we will be woken by ep_poll_callback() when events
- * become available.
- */
- if (!waiter) {
- waiter = true;
- init_waitqueue_entry(&wait, current);
-
+ do {
+ /*
+ * Internally init_wait() uses autoremove_wake_function(),
+ * thus wait entry is removed from the wait queue on each
+ * wakeup. Why it is important? In case of several waiters
+ * each new wakeup will hit the next waiter, giving it the
+ * chance to harvest new event. Otherwise wakeup can be
+ * lost. This is also good performance-wise, because on
+ * normal wakeup path no need to call __remove_wait_queue()
+ * explicitly, thus ep->lock is not taken, which halts the
+ * event delivery.
+ */
+ init_wait(&wait);
write_lock_irq(&ep->lock);
__add_wait_queue_exclusive(&ep->wq, &wait);
write_unlock_irq(&ep->lock);
- }
- for (;;) {
/*
* We don't want to sleep if the ep_poll_callback() sends us
* a wakeup in between. That's why we set the task state
@@ -1911,10 +1912,20 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
break;
}
- }
+
+ /* We were woken up, thus go and try to harvest some events */
+ eavail = 1;
+
+ } while (0);
__set_current_state(TASK_RUNNING);
+ if (!list_empty_careful(&wait.entry)) {
+ write_lock_irq(&ep->lock);
+ __remove_wait_queue(&ep->wq, &wait);
+ write_unlock_irq(&ep->lock);
+ }
+
send_events:
/*
* Try to transfer events to user space. In case we get 0 events and
@@ -1925,12 +1936,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
!(res = ep_send_events(ep, events, maxevents)) && !timed_out)
goto fetch_events;
- if (waiter) {
- write_lock_irq(&ep->lock);
- __remove_wait_queue(&ep->wq, &wait);
- write_unlock_irq(&ep->lock);
- }
-
return res;
}
--
2.24.1