This patch does two things:
1. fixes lost wakeup introduced by:
339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
2. improves performance for events delivery.
The description of the problem is the following: if N (>1) threads
are waiting on ep->wq for new events and M (>1) events come, it is
quite likely that >1 wakeups hit the same wait queue entry, because
there is quite a big window between __add_wait_queue_exclusive() and
the following __remove_wait_queue() calls in ep_poll() function. This
can lead to lost wakeups, because thread, which was woken up, can
handle not all the events in ->rdllist. (in better words the problem
is described here: https://lkml.org/lkml/2019/10/7/905)
The idea of the current patch is to use init_wait() instead of
init_waitqueue_entry(). Internally init_wait() sets
autoremove_wake_function as a callback, which removes the wait entry
atomically (under the wq locks) from the list, thus the next coming
wakeup hits the next wait entry in the wait queue, thus preventing
lost wakeups.
Problem is very well reproduced by the epoll60 test case [1].
Wait entry removal on wakeup has also performance benefits, because
there is no need to take a ep->lock and remove wait entry from the
queue after the successful wakeup. Here is the timing output of
the epoll60 test case:
With explicit wakeup from ep_scan_ready_list() (the state of the
code prior 339ddb53d373):
real 0m6.970s
user 0m49.786s
sys 0m0.113s
After this patch:
real 0m5.220s
user 0m36.879s
sys 0m0.019s
The other testcase is the stress-epoll [2], where one thread consumes
all the events and other threads produce many events:
With explicit wakeup from ep_scan_ready_list() (the state of the
code prior 339ddb53d373):
threads events/ms run-time ms
8 5427 1474
16 6163 2596
32 6824 4689
64 7060 9064
128 6991 18309
After this patch:
threads events/ms run-time ms
8 5598 1429
16 7073 2262
32 7502 4265
64 7640 8376
128 7634 16767
(number of "events/ms" represents event bandwidth, thus higher is
better; number of "run-time ms" represents overall time spent
doing the benchmark, thus lower is better)
[1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
[2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
Signed-off-by: Roman Penyaev <rpenyaev(a)suse.de>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Khazhismel Kumykov <khazhy(a)google.com>
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: Heiher <r(a)hev.cc>
Cc: Jason Baron <jbaron(a)akamai.com>
Cc: stable(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
---
fs/eventpoll.c | 43 ++++++++++++++++++++++++-------------------
1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d6ba0e52439b..aba03ee749f8 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1822,7 +1822,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
{
int res = 0, eavail, timed_out = 0;
u64 slack = 0;
- bool waiter = false;
wait_queue_entry_t wait;
ktime_t expires, *to = NULL;
@@ -1867,21 +1866,23 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
*/
ep_reset_busy_poll_napi_id(ep);
- /*
- * We don't have any available event to return to the caller. We need
- * to sleep here, and we will be woken by ep_poll_callback() when events
- * become available.
- */
- if (!waiter) {
- waiter = true;
- init_waitqueue_entry(&wait, current);
-
+ do {
+ /*
+ * Internally init_wait() uses autoremove_wake_function(),
+ * thus wait entry is removed from the wait queue on each
+ * wakeup. Why it is important? In case of several waiters
+ * each new wakeup will hit the next waiter, giving it the
+ * chance to harvest new event. Otherwise wakeup can be
+ * lost. This is also good performance-wise, because on
+ * normal wakeup path no need to call __remove_wait_queue()
+ * explicitly, thus ep->lock is not taken, which halts the
+ * event delivery.
+ */
+ init_wait(&wait);
write_lock_irq(&ep->lock);
__add_wait_queue_exclusive(&ep->wq, &wait);
write_unlock_irq(&ep->lock);
- }
- for (;;) {
/*
* We don't want to sleep if the ep_poll_callback() sends us
* a wakeup in between. That's why we set the task state
@@ -1911,10 +1912,20 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
timed_out = 1;
break;
}
- }
+
+ /* We were woken up, thus go and try to harvest some events */
+ eavail = 1;
+
+ } while (0);
__set_current_state(TASK_RUNNING);
+ if (!list_empty_careful(&wait.entry)) {
+ write_lock_irq(&ep->lock);
+ __remove_wait_queue(&ep->wq, &wait);
+ write_unlock_irq(&ep->lock);
+ }
+
send_events:
/*
* Try to transfer events to user space. In case we get 0 events and
@@ -1925,12 +1936,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
!(res = ep_send_events(ep, events, maxevents)) && !timed_out)
goto fetch_events;
- if (waiter) {
- write_lock_irq(&ep->lock);
- __remove_wait_queue(&ep->wq, &wait);
- write_unlock_irq(&ep->lock);
- }
-
return res;
}
--
2.24.1
Writing to the devfreq sysfs nodes while the GPU is powered down can
result in a system crash (on a5xx) or a nasty GMU error (on a6xx):
$ /sys/class/devfreq/5000000.gpu# echo 500000000 > min_freq
[ 104.841625] platform 506a000.gmu: [drm:a6xx_gmu_set_oob]
*ERROR* Timeout waiting for GMU OOB set GPU_DCVS: 0x0
Despite the fact that we carefully try to suspend the devfreq device when
the hardware is powered down there are lots of holes in the governors that
don't check for the suspend state and blindly call into the devfreq
callbacks that end up triggering hardware reads in the GPU driver.
Call pm_runtime_get_if_in_use() in the gpu_busy() and gpu_set_freq()
callbacks to skip the hardware access if it isn't active.
v2: Use pm_runtime_get_if_in_use() per Eric Anholt
Cc: stable(a)vger.kernel.org
Signed-off-by: Jordan Crouse <jcrouse(a)codeaurora.org>
---
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 6 ++++++
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 8 ++++++++
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 7 +++++++
3 files changed, 21 insertions(+)
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index 724024a2243a..4d7f269edfcc 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1404,6 +1404,10 @@ static unsigned long a5xx_gpu_busy(struct msm_gpu *gpu)
{
u64 busy_cycles, busy_time;
+ /* Only read the gpu busy if the hardware is already active */
+ if (pm_runtime_get_if_in_use(&gpu->pdev->dev) <= 0)
+ return 0;
+
busy_cycles = gpu_read64(gpu, REG_A5XX_RBBM_PERFCTR_RBBM_0_LO,
REG_A5XX_RBBM_PERFCTR_RBBM_0_HI);
@@ -1412,6 +1416,8 @@ static unsigned long a5xx_gpu_busy(struct msm_gpu *gpu)
gpu->devfreq.busy_cycles = busy_cycles;
+ pm_runtime_put(&gpu->pdev->dev);
+
if (WARN_ON(busy_time > ~0LU))
return ~0LU;
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index c4e71abbdd53..8ace989b11db 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -108,6 +108,13 @@ static void __a6xx_gmu_set_freq(struct a6xx_gmu *gmu, int index)
struct msm_gpu *gpu = &adreno_gpu->base;
int ret;
+ /*
+ * This can get called from devfreq while the hardware is idle. Don't
+ * bring up the power if it isn't already active
+ */
+ if (pm_runtime_get_if_in_use(gmu->dev) <= 0)
+ return;
+
gmu_write(gmu, REG_A6XX_GMU_DCVS_ACK_OPTION, 0);
gmu_write(gmu, REG_A6XX_GMU_DCVS_PERF_SETTING,
@@ -134,6 +141,7 @@ static void __a6xx_gmu_set_freq(struct a6xx_gmu *gmu, int index)
* for now leave it at max so that the performance is nominal.
*/
icc_set_bw(gpu->icc_path, 0, MBps_to_icc(7216));
+ pm_runtime_put(gmu->dev);
}
void a6xx_gmu_set_freq(struct msm_gpu *gpu, unsigned long freq)
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 68af24150de5..bf43eb2fb078 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -810,6 +810,11 @@ static unsigned long a6xx_gpu_busy(struct msm_gpu *gpu)
struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
u64 busy_cycles, busy_time;
+
+ /* Only read the gpu busy if the hardware is already active */
+ if (pm_runtime_get_if_in_use(a6xx_gpu->gmu.dev) <= 0)
+ return 0;
+
busy_cycles = gmu_read64(&a6xx_gpu->gmu,
REG_A6XX_GMU_CX_GMU_POWER_COUNTER_XOCLK_0_L,
REG_A6XX_GMU_CX_GMU_POWER_COUNTER_XOCLK_0_H);
@@ -819,6 +824,8 @@ static unsigned long a6xx_gpu_busy(struct msm_gpu *gpu)
gpu->devfreq.busy_cycles = busy_cycles;
+ pm_runtime_put(a6xx_gpu->gmu.dev);
+
if (WARN_ON(busy_time > ~0LU))
return ~0LU;
--
2.17.1
Writing to the devfreq sysfs nodes while the GPU is powered down can
result in a system crash (on a5xx) or a nasty GMU error (on a6xx):
$ /sys/class/devfreq/5000000.gpu# echo 500000000 > min_freq
[ 104.841625] platform 506a000.gmu: [drm:a6xx_gmu_set_oob]
*ERROR* Timeout waiting for GMU OOB set GPU_DCVS: 0x0
Despite the fact that we carefully try to suspend the devfreq device when
the hardware is powered down there are lots of holes in the governors that
don't check for the suspend state and blindly call into the devfreq
callbacks that end up triggering hardware reads in the GPU driver.
Check the power state in the gpu_busy() and gpu_set_freq() callbacks for
a5xx and a6xx to make sure that the hardware is active before trying to
access it.
Cc: stable(a)vger.kernel.org
Signed-off-by: Jordan Crouse <jcrouse(a)codeaurora.org>
---
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 4 ++++
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 4 ++++
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
3 files changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index 724024a2243a..9d8c7fe6377e 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1404,6 +1404,10 @@ static unsigned long a5xx_gpu_busy(struct msm_gpu *gpu)
{
u64 busy_cycles, busy_time;
+ /* devfreq can get here while we are idle so do a quick check first */
+ if (!pm_runtime_active(&gpu->pdev->dev))
+ return ~0LU;
+
busy_cycles = gpu_read64(gpu, REG_A5XX_RBBM_PERFCTR_RBBM_0_LO,
REG_A5XX_RBBM_PERFCTR_RBBM_0_HI);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index c4e71abbdd53..84618f2ff073 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -108,6 +108,10 @@ static void __a6xx_gmu_set_freq(struct a6xx_gmu *gmu, int index)
struct msm_gpu *gpu = &adreno_gpu->base;
int ret;
+ /* This can be called via devfreq even when the power is off */
+ if (!pm_runtime_active(gmu->dev))
+ return;
+
gmu_write(gmu, REG_A6XX_GMU_DCVS_ACK_OPTION, 0);
gmu_write(gmu, REG_A6XX_GMU_DCVS_PERF_SETTING,
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 68af24150de5..443efc952f13 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -810,6 +810,10 @@ static unsigned long a6xx_gpu_busy(struct msm_gpu *gpu)
struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
u64 busy_cycles, busy_time;
+ /* devfreq can get here while we are idle so do a quick check first */
+ if (!pm_runtime_active(a6xx_gpu->gmu.dev))
+ return ~0LU;
+
busy_cycles = gmu_read64(&a6xx_gpu->gmu,
REG_A6XX_GMU_CX_GMU_POWER_COUNTER_XOCLK_0_L,
REG_A6XX_GMU_CX_GMU_POWER_COUNTER_XOCLK_0_H);
--
2.17.1