During my rwsem testing, I found that after a down_read(), the reader count may occasionally become 0 or even negative. Consequently, a writer may steal the lock at that time and execute in parallel with the reader, breaking the mutual exclusion guarantee of the write lock. In other words, both readers and a writer can become rwsem owners simultaneously.
The current reader wakeup code does everything in one pass: it clears waiter->task and puts the readers into wake_q before the reader count has been fully incremented. Once waiter->task is cleared, the corresponding reader may see it, finish its critical section and do an unlock, decrementing the count before it has been incremented. This is not a problem if there is only one reader to wake up, as the count has been pre-incremented by 1. It is a problem when there is more than one reader to be woken up, as a writer can then steal the lock.
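To make the window concrete, here is a rough interleaving (simplified and purely illustrative, assuming two or more readers are being woken so that only the first reader's bias has been pre-added):

    waker CPU                                to-be-woken reader CPU
    ---------                                ----------------------
    smp_store_release(&waiter->task, NULL);
                                             sees waiter->task == NULL,
                                             returns from down_read(),
                                             finishes its critical section,
                                             up_read() subtracts
                                             RWSEM_ACTIVE_READ_BIAS
    atomic_long_add(woken * RWSEM_ACTIVE_READ_BIAS, &sem->count);

Between the reader's up_read() and the deferred atomic_long_add(), the active reader count can be 0 or negative, which is exactly the window in which a writer can steal the lock.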
The wakeup was actually done in two passes before the v4.9 commit 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once"). To fix this problem, the wakeup is now done in two passes again. In the first pass, we collect the readers and count them. The reader count is then fully incremented. In the second pass, waiter->task is cleared and the readers are put into wake_q to be woken up later.
Fixes: 70800c3c0cc5 ("locking/rwsem: Scan the wait_list for readers only once")
Cc: stable@vger.kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem-xadd.c | 45 +++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 14 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 6b3ee9948bf1..cab5b1f6b2de 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -130,6 +130,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 {
        struct rwsem_waiter *waiter, *tmp;
        long oldcount, woken = 0, adjustment = 0;
+       struct list_head wlist;
 
        /*
         * Take a peek at the queue head waiter such that we can determine
@@ -188,18 +189,44 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
         * of the queue. We know that woken will be at least 1 as we accounted
         * for above. Note we increment the 'active part' of the count by the
         * number of readers before waking any processes up.
+        *
+        * We have to do wakeup in 2 passes to prevent the possibility that
+        * the reader count may be decremented before it is incremented. It
+        * is because the to-be-woken waiter may not have slept yet. So it
+        * may see waiter->task got cleared, finish its critical section and
+        * do an unlock before the reader count increment.
+        *
+        * 1) Collect the read-waiters in a separate list, count them and
+        *    fully increment the reader count in rwsem.
+        * 2) For each waiters in the new list, clear waiter->task and
+        *    put them into wake_q to be woken up later.
         */
+       INIT_LIST_HEAD(&wlist);
        list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
-               struct task_struct *tsk;
-
                if (waiter->type == RWSEM_WAITING_FOR_WRITE)
                        break;
 
                woken++;
-               tsk = waiter->task;
+               list_move_tail(&waiter->list, &wlist);
+       }
+
+       adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+       lockevent_cond_inc(rwsem_wake_reader, woken);
+       if (list_empty(&sem->wait_list)) {
+               /* hit end of list above */
+               adjustment -= RWSEM_WAITING_BIAS;
+       }
+
+       if (adjustment)
+               atomic_long_add(adjustment, &sem->count);
+
+       /* 2nd pass */
+       list_for_each_entry(waiter, &wlist, list) {
+               struct task_struct *tsk;
 
+               tsk = waiter->task;
                get_task_struct(tsk);
-               list_del(&waiter->list);
+
                /*
                 * Ensure calling get_task_struct() before setting the reader
                 * waiter to nil such that rwsem_down_read_failed() cannot
@@ -213,16 +240,6 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
                 */
                wake_q_add_safe(wake_q, tsk);
        }
-
-       adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
-       lockevent_cond_inc(rwsem_wake_reader, woken);
-       if (list_empty(&sem->wait_list)) {
-               /* hit end of list above */
-               adjustment -= RWSEM_WAITING_BIAS;
-       }
-
-       if (adjustment)
-               atomic_long_add(adjustment, &sem->count);
 }
 
 /*
On 4/28/19 11:57 AM, Waiman Long wrote:
During my rwsem testing, I found that after a down_read(), the reader count may occasionally become 0 or even negative. Consequently, a writer may steal the lock at that time and execute in parallel with the reader, breaking the mutual exclusion guarantee of the write lock. In other words, both readers and a writer can become rwsem owners simultaneously.
I would like to have this particular patch merged in the next merge window, if possible, so that I can backport it downstream.
Thanks, Longman
On Sun, Apr 28, 2019 at 8:58 AM Waiman Long <longman@redhat.com> wrote:
/* 2nd pass */
list_for_each_entry(waiter, &wlist, list) {
This is not safe, as far as I can tell.
As this loop walks the list, you do that
smp_store_release(&waiter->task, NULL);
and that very action means that the "waiter" is now released, and you cannot possibly use it.
HOWEVER.
"list_for_each_entry()" will load "waiter->next" to go forward in the list.
So you absolutely *have* to use "list_for_each_entry_safe()" in this loop, I think. You should treat that "smp_store_release()" as if you deleted the list entry, because you cannot trust it after you've done it, because the sleeper may have gone its own merry ways and released the on-stack 'waiter' allocation.
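Something along these lines for the second pass (just a sketch, untested; it reuses the 'tmp' cursor that the function already declares and the same names as the patch):

        /* 2nd pass */
        list_for_each_entry_safe(waiter, tmp, &wlist, list) {
                struct task_struct *tsk;

                tsk = waiter->task;
                get_task_struct(tsk);

                /*
                 * Ensure calling get_task_struct() before setting the reader
                 * waiter to nil such that rwsem_down_read_failed() cannot
                 * race with do_exit() by always holding a reference count
                 * to the task to wakeup.
                 */
                smp_store_release(&waiter->task, NULL);
                /*
                 * 'waiter' is on the sleeper's stack and may be gone as soon
                 * as the store above is visible, so it must not be touched
                 * again; the _safe iterator has already cached the next entry.
                 */
                wake_q_add_safe(wake_q, tsk);
        }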
It's the *first* loop that you could play games with, because you hold the lock, and the list is stable during that loop. So the *first* loop could just walk the list, and then do one list splitting operation instead of doing that "list_move_tail()" thing for each entry.
But as long as you do "list_move_tail()" in the first loop, you'll obviously have to use list_for_each_entry_safe() there too, since right now you change that list as you walk it.
I'm just saying that you *could* optimize that first phase to just walk it and then perhaps split it with list_cut_before() when you find the first entry that isn't going to be woken up.
Linus
On Sun, Apr 28, 2019 at 10:41 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
It's the *first* loop that you could play games with, because you hold the lock, and the list is stable during that loop. So the *first* loop could just walk the list, and then do one list splitting operation instead of doing that "list_move_tail()" thing for each entry.
.. having looked at that, I would suggest against it.
I _think_ this short and sweet code snippet might just work fine for the first loop:
        list_for_each_entry(waiter, &sem->wait_list, list) {
                if (waiter->type == RWSEM_WAITING_FOR_WRITE)
                        break;
                woken++;
        }
        list_cut_before(&wlist, &sem->wait_list, waiter);
and if it *does* work it would be both smaller and more efficient. But it looks a bit too subtle to my taste. Somebody would need to go through that with a fine comb, and double-check that it gets the "whole list" case right, for example.
So the "phase 1" loop could be perhaps simplified to the above cute things.
But the "phase 2" loop absolutely has to be changed to use list_for_each_entry_safe().
Linus
On 4/28/19 1:59 PM, Linus Torvalds wrote:
On Sun, Apr 28, 2019 at 10:41 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
It's the *first* loop that you could play games with, because you hold the lock, and the list is stable during that loop. So the *first* loop could just walk the list, and then do one list splitting operation instead of doing that "list_move_tail()" thing for each entry.
.. having looked at that, I would suggest against it.
I _think_ this short and sweet code snippet might just work fine for the first loop:
        list_for_each_entry(waiter, &sem->wait_list, list) {
                if (waiter->type == RWSEM_WAITING_FOR_WRITE)
                        break;
                woken++;
        }
        list_cut_before(&wlist, &sem->wait_list, waiter);
and if it *does* work it would be both smaller and more efficient. But it looks a bit too subtle to my taste. Somebody would need to go through that with a fine comb, and double-check that it gets the "whole list" case right, for example.
So the "phase 1" loop could be perhaps simplified to the above cute things.
But the "phase 2" loop absolutely has to be changed to use list_for_each_entry_safe().
Linus
You are right. Thanks for checking my code. I will make the necessary change.
Cheers, Longman