This patch series removes reader optimistic spinning in kernel 5.10 to improve MongoDB performance. Performance measurements (average overall throughput in ops/sec across 10 runs) were taken with MongoDB 5.0.11 and the YCSB [1] microbenchmark using workloadA [2] (recordcount=25000000, operationcount=10000000) on AWS EC2 m5.4xlarge/m6g.4xlarge (16-vCPU, 64-GiB-memory) instances with a 512GB EBS IO1 volume provisioned with 5000 IOPS, with MongoDB and the YCSB load generator running on separate instances, to see the impact of these changes:
Before - v5.10.165 kernel in Amazon Linux 2
After  - v5.10.165 kernel with reader spinning disabled in Amazon Linux 2
| Arch    | Instance Type | Before  | After   |
|---------+---------------+---------+---------|
| x86_64  | m5.4xlarge    | 37365.4 | 42373.9 |
|---------+---------------+---------+---------|
| aarch64 | m6g.4xlarge   | 33823.1 | 43113.7 |
|---------+---------------+---------+---------|
It can be seen that MongoDB throughput improves by around 13% on x86_64 and 27% on aarch64 after disabling reader optimistic spinning. These patches apply to 5.10 with no conflict, so we wonder if it's possible to backport them to stable 5.10?
[1] https://github.com/brianfrankcooper/YCSB/releases/download/0.17.0/ycsb-0.17....
[2] https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
Thanks,
Shaoying
Peter Zijlstra (3):
  locking/rwsem: Better collate rwsem_read_trylock()
  locking/rwsem: Introduce rwsem_write_trylock()
  locking/rwsem: Fold __down_{read,write}*()
Waiman Long (4):
  locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath()
  locking/rwsem: Prevent potential lock starvation
  locking/rwsem: Enable reader optimistic lock stealing
  locking/rwsem: Remove reader optimistic spinning
 kernel/locking/lock_events_list.h |   6 +-
 kernel/locking/rwsem.c            | 359 +++++++++---------------------
 2 files changed, 106 insertions(+), 259 deletions(-)
From: Peter Zijlstra <peterz@infradead.org>
commit 3379116a0ca965b00e6522c7ea3f16c9dbd8f9f9 upstream
All users of rwsem_read_trylock() do rwsem_set_reader_owned(sem) on success; move it into rwsem_read_trylock() proper.
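For reference, the resulting helper (assembled from the hunk below, indentation approximated, not a verbatim copy) reads roughly:

    static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
    {
            long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);

            if (WARN_ON_ONCE(cnt < 0))
                    rwsem_set_nonspinnable(sem);

            if (!(cnt & RWSEM_READ_FAILED_MASK)) {
                    rwsem_set_reader_owned(sem);    /* previously done by each caller */
                    return true;
            }

            return false;
    }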
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass....
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/rwsem.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index cc5cc889b5b7..7bf45b0a1b1d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -273,9 +273,16 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
 {
 	long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
+
 	if (WARN_ON_ONCE(cnt < 0))
 		rwsem_set_nonspinnable(sem);
-	return !(cnt & RWSEM_READ_FAILED_MASK);
+
+	if (!(cnt & RWSEM_READ_FAILED_MASK)) {
+		rwsem_set_reader_owned(sem);
+		return true;
+	}
+
+	return false;
 }
 /*
@@ -1340,8 +1347,6 @@ static inline void __down_read(struct rw_semaphore *sem)
 	if (!rwsem_read_trylock(sem)) {
 		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
 	}
 }
@@ -1351,8 +1356,6 @@ static inline int __down_read_interruptible(struct rw_semaphore *sem)
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_INTERRUPTIBLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
 	}
 	return 0;
 }
@@ -1363,8 +1366,6 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
-	} else {
-		rwsem_set_reader_owned(sem);
 	}
 	return 0;
 }
From: Peter Zijlstra <peterz@infradead.org>
commit 285c61aedf6bc5d81b37e4dc48c19012e8ff9836 upstream
One copy of this logic is better than three.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass....
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/rwsem.c | 38 ++++++++++++++++----------------------
 1 file changed, 16 insertions(+), 22 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 7bf45b0a1b1d..6b62654eb0a8 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -285,6 +285,18 @@ static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
 	return false;
 }
+static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
+{
+	long tmp = RWSEM_UNLOCKED_VALUE;
+
+	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
+		rwsem_set_owner(sem);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Return just the real task structure pointer of the owner
  */
@@ -1395,42 +1407,24 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
  */
 static inline void __down_write(struct rw_semaphore *sem)
 {
-	long tmp = RWSEM_UNLOCKED_VALUE;
-
-	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-						      RWSEM_WRITER_LOCKED)))
+	if (unlikely(!rwsem_write_trylock(sem)))
 		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
-	else
-		rwsem_set_owner(sem);
 }
 static inline int __down_write_killable(struct rw_semaphore *sem)
 {
-	long tmp = RWSEM_UNLOCKED_VALUE;
-
-	if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-						      RWSEM_WRITER_LOCKED))) {
+	if (unlikely(!rwsem_write_trylock(sem))) {
 		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
 			return -EINTR;
-	} else {
-		rwsem_set_owner(sem);
 	}
+
 	return 0;
 }
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
-
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
-
-	tmp = RWSEM_UNLOCKED_VALUE;
-	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
-					    RWSEM_WRITER_LOCKED)) {
-		rwsem_set_owner(sem);
-		return true;
-	}
-	return false;
+	return rwsem_write_trylock(sem);
 }
/*
From: Peter Zijlstra <peterz@infradead.org>
commit c995e638ccbbc65a76d1713c4fdcf927e7e2cb83 upstream
There's a lot of needless duplication in __down_{read,write}*(); cure that with a helper.
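Pieced together from the hunks below (indentation approximated), the common helpers end up looking roughly like this, with __down_read()/__down_write() and the *_interruptible()/*_killable() variants reduced to thin wrappers that pass TASK_UNINTERRUPTIBLE, TASK_INTERRUPTIBLE or TASK_KILLABLE:

    static inline int __down_read_common(struct rw_semaphore *sem, int state)
    {
            if (!rwsem_read_trylock(sem)) {
                    if (IS_ERR(rwsem_down_read_slowpath(sem, state)))
                            return -EINTR;
                    DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
            }
            return 0;
    }

    static inline int __down_write_common(struct rw_semaphore *sem, int state)
    {
            if (unlikely(!rwsem_write_trylock(sem))) {
                    if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
                            return -EINTR;
            }
            return 0;
    }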
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass....
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/rwsem.c | 45 +++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 22 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 6b62654eb0a8..46f308a01045 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1354,32 +1354,29 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
 /*
  * lock for reading
  */
-static inline void __down_read(struct rw_semaphore *sem)
+static inline int __down_read_common(struct rw_semaphore *sem, int state)
 {
 	if (!rwsem_read_trylock(sem)) {
-		rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
+		if (IS_ERR(rwsem_down_read_slowpath(sem, state)))
+			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	}
+	return 0;
+}
+
+static inline void __down_read(struct rw_semaphore *sem)
+{
+	__down_read_common(sem, TASK_UNINTERRUPTIBLE);
 }
 static inline int __down_read_interruptible(struct rw_semaphore *sem)
 {
-	if (!rwsem_read_trylock(sem)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_INTERRUPTIBLE)))
-			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
-	}
-	return 0;
+	return __down_read_common(sem, TASK_INTERRUPTIBLE);
 }
 static inline int __down_read_killable(struct rw_semaphore *sem)
 {
-	if (!rwsem_read_trylock(sem)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
-			return -EINTR;
-		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
-	}
-	return 0;
+	return __down_read_common(sem, TASK_KILLABLE);
 }
 static inline int __down_read_trylock(struct rw_semaphore *sem)
@@ -1405,22 +1402,26 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
 /*
  * lock for writing
  */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	if (unlikely(!rwsem_write_trylock(sem)))
-		rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
+static inline int __down_write_common(struct rw_semaphore *sem, int state)
 {
 	if (unlikely(!rwsem_write_trylock(sem))) {
-		if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
+		if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
 			return -EINTR;
 	}
 	return 0;
 }
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	__down_write_common(sem, TASK_UNINTERRUPTIBLE);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+	return __down_write_common(sem, TASK_KILLABLE);
+}
+
 static inline int __down_write_trylock(struct rw_semaphore *sem)
 {
 	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
From: Waiman Long <longman@redhat.com>
commit c8fe8b0564388f41147326f31e4587171aacccd4 upstream
The atomic count value right after reader count increment can be useful to determine the rwsem state at trylock time. So the count value is passed down to rwsem_down_read_slowpath() to be used when appropriate.
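Combined with the previous patch, the read-side caller then looks roughly like this (assembled from the hunk below; not a verbatim copy):

    static inline int __down_read_common(struct rw_semaphore *sem, int state)
    {
            long count;

            if (!rwsem_read_trylock(sem, &count)) {
                    /* 'count' is the atomic count sampled at trylock time */
                    if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
                            return -EINTR;
                    DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
            }
            return 0;
    }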
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-2-longman@redhat.com
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/rwsem.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 46f308a01045..94d0fa715214 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -270,14 +270,14 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
 					  owner | RWSEM_NONSPINNABLE));
 }
-static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
 {
-	long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
+	*cntp = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
-	if (WARN_ON_ONCE(cnt < 0))
+	if (WARN_ON_ONCE(*cntp < 0))
 		rwsem_set_nonspinnable(sem);
-	if (!(cnt & RWSEM_READ_FAILED_MASK)) {
+	if (!(*cntp & RWSEM_READ_FAILED_MASK)) {
 		rwsem_set_reader_owned(sem);
 		return true;
 	}
@@ -1008,9 +1008,9 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
  * Wait for the read lock to be granted
  */
 static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state)
 {
-	long count, adjustment = -RWSEM_READER_BIAS;
+	long adjustment = -RWSEM_READER_BIAS;
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 	bool wake = false;
@@ -1356,8 +1356,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
  */
 static inline int __down_read_common(struct rw_semaphore *sem, int state)
 {
-	if (!rwsem_read_trylock(sem)) {
-		if (IS_ERR(rwsem_down_read_slowpath(sem, state)))
+	long count;
+
+	if (!rwsem_read_trylock(sem, &count)) {
+		if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
 			return -EINTR;
 		DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 	}
From: Waiman Long <longman@redhat.com>
commit 2f06f702925b512a95b95dca3855549c047eef58 upstream
The lock handoff bit is added in commit 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation") to avoid lock starvation. However, allowing readers to do optimistic spinning does introduce an unlikely scenario where lock starvation can happen.
The lock handoff bit may only be set when a waiter is being woken up. In the case of reader unlock, wakeup happens only when the reader count reaches 0. If there is a continuous stream of incoming readers acquiring read lock via optimistic spinning, it is possible that the reader count may never reach 0 and so the handoff bit will never be asserted.
One way to prevent this scenario from happening is to disallow optimistic spinning if the rwsem is currently owned by readers. If the previous or current owner is a writer, optimistic spinning will be allowed.
If the previous owner is a reader but the reader count has reached 0 before, a wakeup should have been issued. So the handoff mechanism will be kicked in to prevent lock starvation. As a result, it should be OK to do optimistic spinning in this case.
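Condensed into a predicate, the check this patch adds to rwsem_down_read_slowpath() (see the hunk below) is roughly the following; the helper name is purely illustrative, the patch open-codes the test:

    /*
     * Sketch: the lock is reader-owned, more than one reader bias is present
     * and no writer holds it, i.e. a continuous reader stream is running that
     * could starve sleeping waiters, so queue instead of spinning.
     */
    static inline bool rwsem_reader_stream_running(long owner, long count, long rcnt)
    {
            return (owner & RWSEM_READER_OWNED) && (rcnt > 1) &&
                   !(count & RWSEM_WRITER_LOCKED);
    }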
This patch may have some impact on reader performance as it reduces reader optimistic spinning, especially if the lock critical sections are short and the number of contending readers is small.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-3-longman@redhat.com
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/rwsem.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 94d0fa715214..029a832f725e 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1010,16 +1010,27 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 static struct rw_semaphore __sched *
 rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state)
 {
-	long adjustment = -RWSEM_READER_BIAS;
+	long owner, adjustment = -RWSEM_READER_BIAS;
+	long rcnt = (count >> RWSEM_READER_SHIFT);
 	struct rwsem_waiter waiter;
 	DEFINE_WAKE_Q(wake_q);
 	bool wake = false;
+	/*
+	 * To prevent a constant stream of readers from starving a sleeping
+	 * waiter, don't attempt optimistic spinning if the lock is currently
+	 * owned by readers.
+	 */
+	owner = atomic_long_read(&sem->owner);
+	if ((owner & RWSEM_READER_OWNED) && (rcnt > 1) &&
+	   !(count & RWSEM_WRITER_LOCKED))
+		goto queue;
+
 	/*
 	 * Save the current read-owner of rwsem, if available, and the
 	 * reader nonspinnable bit.
 	 */
-	waiter.last_rowner = atomic_long_read(&sem->owner);
+	waiter.last_rowner = owner;
 	if (!(waiter.last_rowner & RWSEM_READER_OWNED))
 		waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
From: Waiman Long <longman@redhat.com>
commit 1a728dff855a318bb58bcc1259b1826a7ad9f0bd upstream
If the optimistic spinning queue is empty and the rwsem does not have the handoff or write-lock bits set, it is actually not necessary to call rwsem_optimistic_spin() to spin on it. Instead, it can steal the lock directly as its reader bias is in the count already. If it is the first reader in this state, it will try to wake up other readers in the wait queue.
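Condensed into a predicate, the steal condition (see the hunk below) is roughly the following; the helper name is only for illustration, the patch open-codes the test in rwsem_down_read_slowpath():

    /*
     * Sketch: a reader may keep the lock whose reader bias it has already
     * added to sem->count when nobody is spinning on the OSQ and neither
     * the writer bit nor the handoff bit is set.
     */
    static inline bool rwsem_can_steal_rlock(struct rw_semaphore *sem, long count)
    {
            return !(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF)) &&
                   rwsem_no_spinners(sem);
    }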
With this patch applied, the following were the lock event counts after rebooting a 2-socket system and a "make -j96" kernel rebuild.
  rwsem_opt_rlock=4437
  rwsem_rlock=29
  rwsem_rlock_steal=19
So lock stealing represents about 0.4% of all the read locks acquired in the slow path.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-4-longman@redhat.com
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c            | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 239039d0ce21..270a0d351932 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -63,6 +63,7 @@ LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled optspins */
 LOCK_EVENT(rwsem_opt_norspin)	/* # of disabled reader-only optspins */
 LOCK_EVENT(rwsem_opt_rlock2)	/* # of opt-acquired 2ndary read locks */
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired */
+LOCK_EVENT(rwsem_rlock_steal)	/* # of read locks by lock stealing */
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired */
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions */
 LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 029a832f725e..c2e176e96499 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -976,6 +976,12 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	}
 	return false;
 }
+
+static inline bool rwsem_no_spinners(struct rw_semaphore *sem)
+{
+	return !osq_is_locked(&sem->osq);
+}
+
 #else
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 					   unsigned long nonspinnable)
@@ -996,6 +1002,11 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
 	return false;
 }
+static inline bool rwsem_no_spinners(sem)
+{
+	return false;
+}
+
 static inline int
 rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 {
@@ -1026,6 +1037,22 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state)
 	   !(count & RWSEM_WRITER_LOCKED))
 		goto queue;
+	/*
+	 * Reader optimistic lock stealing
+	 *
+	 * We can take the read lock directly without doing
+	 * rwsem_optimistic_spin() if the conditions are right.
+	 * Also wake up other readers if it is the first reader.
+	 */
+	if (!(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF)) &&
+	    rwsem_no_spinners(sem)) {
+		rwsem_set_reader_owned(sem);
+		lockevent_inc(rwsem_rlock_steal);
+		if (rcnt == 1)
+			goto wake_readers;
+		return sem;
+	}
+
 	/*
 	 * Save the current read-owner of rwsem, if available, and the
 	 * reader nonspinnable bit.
@@ -1048,6 +1075,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state)
 		 * Wake up other readers in the wait list if the front
 		 * waiter is a reader.
 		 */
+wake_readers:
 		if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
 			raw_spin_lock_irq(&sem->wait_lock);
 			if (!list_empty(&sem->wait_list))
From: Waiman Long <longman@redhat.com>
commit 617f3ef95177840c77f59c2aec1029d27d5547d6 upstream
Reader optimistic spinning is helpful when the reader critical section is short and there aren't that many readers around. It also improves the chance that a reader can get the lock as writer optimistic spinning disproportionally favors writers much more than readers.
Since commit d3681e269fff ("locking/rwsem: Wake up almost all readers in wait queue"), all the waiting readers are woken up so that they can all get the read lock and run in parallel. When the number of contending readers is large, allowing reader optimistic spinning will likely cause reader fragmentation where multiple smaller groups of readers can get the read lock in a sequential manner separated by writers. That reduces reader parallelism.
One possible way to address that drawback is to limit the number of readers (preferably one) that can do optimistic spinning. These readers act as representatives of all the waiting readers in the wait queue as they will wake up all those waiting readers once they get the lock.
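A hypothetical sketch of that alternative follows. It is not implemented by this series and is not kernel code; the type and field names are made up purely to illustrate the "at most one spinning reader" idea:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* HYPOTHETICAL toy model: let at most one reader spin for all waiters. */
    struct toy_rwsem {
            atomic_int rspin_owner;         /* 1 while a reader is spinning */
    };

    static bool reader_may_spin(struct toy_rwsem *sem)
    {
            int expected = 0;

            /* only the first reader to flip the flag is allowed to spin */
            return atomic_compare_exchange_strong(&sem->rspin_owner, &expected, 1);
    }

    static void reader_spin_done(struct toy_rwsem *sem)
    {
            atomic_store(&sem->rspin_owner, 0);     /* allow a later reader to spin */
    }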
Alternatively, as reader optimistic lock stealing has already enhanced fairness to readers, it may be easier to just remove reader optimistic spinning and simplify the optimistic spinning code as a result.
Performance measurements (locking throughput in kops/s) using a locking microbenchmark with a 50/50 reader/writer distribution and turbo-boost disabled were done on a 2-socket Cascade Lake system (48-core, 96-thread) to see the impact of these changes:
1) Vanilla     - 5.10-rc3 kernel
2) Before      - 5.10-rc3 kernel with previous patches in this series
3) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch
4) no-rspin    - 5.10-rc3 kernel with reader spinning disabled
  # of threads  CS Load  Vanilla  Before  limit-rspin  no-rspin
  ------------  -------  -------  ------  -----------  --------
        2           1      5,185   5,662        5,214     5,077
        4           1      5,107   4,983        5,188     4,760
        8           1      4,782   4,564        4,720     4,628
       16           1      4,680   4,053        4,567     3,402
       32           1      4,299   1,115        1,118     1,098
       64           1      3,218     983        1,001       957
       96           1      1,938     944          957       930
        2          20      2,008   2,128        2,264     1,665
        4          20      1,390   1,033        1,046     1,101
        8          20      1,472   1,155        1,098     1,213
       16          20      1,332   1,077        1,089     1,122
       32          20        967     914          917       980
       64          20        787     874          891       858
       96          20        730     836          847       844
        2         100        372     356          360       355
        4         100        492     425          434       392
        8         100        533     537          529       538
       16         100        548     572          568       598
       32         100        499     520          527       537
       64         100        466     517          526       512
       96         100        406     497          506       509
The column "CS Load" represents the number of pause instructions issued in the locking critical section. A CS load of 1 is extremely short and is not likey in real situations. A load of 20 (moderate) and 100 (long) are more realistic.
It can be seen that the previous patches in this series have reduced performance in general, except in highly contended cases with moderate or long critical sections, where performance improves a bit. This change is mostly caused by the "Prevent potential lock starvation" patch, which reduces reader optimistic spinning and hence reduces reader fragmentation.
The patch that further limits reader optimistic spinning doesn't seem to have much impact on overall performance, as shown in the benchmark data.
The patch that disables reader optimistic spinning shows reduced performance in lightly loaded cases, but comparable or slightly better performance with heavier contention.
This patch just removes reader optimistic spinning for now. As readers are not going to do optimistic spinning anymore, we don't need to consider if the OSQ is empty or not when doing lock stealing.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lkml.kernel.org/r/20201121041416.12285-6-longman@redhat.com
Cc: stable@vger.kernel.org # 5.10
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
---
 kernel/locking/lock_events_list.h |   5 +-
 kernel/locking/rwsem.c            | 284 +++++-------------------------
 2 files changed, 49 insertions(+), 240 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h index 270a0d351932..97fb6f3f840a 100644 --- a/kernel/locking/lock_events_list.h +++ b/kernel/locking/lock_events_list.h @@ -56,12 +56,9 @@ LOCK_EVENT(rwsem_sleep_reader) /* # of reader sleeps */ LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */ LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */ -LOCK_EVENT(rwsem_opt_rlock) /* # of opt-acquired read locks */ -LOCK_EVENT(rwsem_opt_wlock) /* # of opt-acquired write locks */ +LOCK_EVENT(rwsem_opt_lock) /* # of opt-acquired write locks */ LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */ LOCK_EVENT(rwsem_opt_nospin) /* # of disabled optspins */ -LOCK_EVENT(rwsem_opt_norspin) /* # of disabled reader-only optspins */ -LOCK_EVENT(rwsem_opt_rlock2) /* # of opt-acquired 2ndary read locks */ LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */ LOCK_EVENT(rwsem_rlock_steal) /* # of read locks by lock stealing */ LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index c2e176e96499..abba5df50006 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -31,19 +31,13 @@ #include "lock_events.h"
/* - * The least significant 3 bits of the owner value has the following + * The least significant 2 bits of the owner value has the following * meanings when set. * - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers - * - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock. - * - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock. + * - Bit 1: RWSEM_NONSPINNABLE - Cannot spin on a reader-owned lock * - * When the rwsem is either owned by an anonymous writer, or it is - * reader-owned, but a spinning writer has timed out, both nonspinnable - * bits will be set to disable optimistic spinning by readers and writers. - * In the later case, the last unlocking reader should then check the - * writer nonspinnable bit and clear it only to give writers preference - * to acquire the lock via optimistic spinning, but not readers. Similar - * action is also done in the reader slowpath. + * When the rwsem is reader-owned and a spinning writer has timed out, + * the nonspinnable bit will be set to disable optimistic spinning.
* When a writer acquires a rwsem, it puts its task_struct pointer * into the owner field. It is cleared after an unlock. @@ -59,46 +53,14 @@ * is involved. Ideally we would like to track all the readers that own * a rwsem, but the overhead is simply too big. * - * Reader optimistic spinning is helpful when the reader critical section - * is short and there aren't that many readers around. It makes readers - * relatively more preferred than writers. When a writer times out spinning - * on a reader-owned lock and set the nospinnable bits, there are two main - * reasons for that. - * - * 1) The reader critical section is long, perhaps the task sleeps after - * acquiring the read lock. - * 2) There are just too many readers contending the lock causing it to - * take a while to service all of them. - * - * In the former case, long reader critical section will impede the progress - * of writers which is usually more important for system performance. In - * the later case, reader optimistic spinning tends to make the reader - * groups that contain readers that acquire the lock together smaller - * leading to more of them. That may hurt performance in some cases. In - * other words, the setting of nonspinnable bits indicates that reader - * optimistic spinning may not be helpful for those workloads that cause - * it. - * - * Therefore, any writers that had observed the setting of the writer - * nonspinnable bit for a given rwsem after they fail to acquire the lock - * via optimistic spinning will set the reader nonspinnable bit once they - * acquire the write lock. Similarly, readers that observe the setting - * of reader nonspinnable bit at slowpath entry will set the reader - * nonspinnable bits when they acquire the read lock via the wakeup path. - * - * Once the reader nonspinnable bit is on, it will only be reset when - * a writer is able to acquire the rwsem in the fast path or somehow a - * reader or writer in the slowpath doesn't observe the nonspinable bit. - * - * This is to discourage reader optmistic spinning on that particular - * rwsem and make writers more preferred. This adaptive disabling of reader - * optimistic spinning will alleviate the negative side effect of this - * feature. + * A fast path reader optimistic lock stealing is supported when the rwsem + * is previously owned by a writer and the following conditions are met: + * - OSQ is empty + * - rwsem is not currently writer owned + * - the handoff isn't set. */ #define RWSEM_READER_OWNED (1UL << 0) -#define RWSEM_RD_NONSPINNABLE (1UL << 1) -#define RWSEM_WR_NONSPINNABLE (1UL << 2) -#define RWSEM_NONSPINNABLE (RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE) +#define RWSEM_NONSPINNABLE (1UL << 1) #define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
#ifdef CONFIG_DEBUG_RWSEMS @@ -203,7 +165,7 @@ static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem, struct task_struct *owner) { unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED | - (atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE); + (atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE);
atomic_long_set(&sem->owner, val); } @@ -372,7 +334,6 @@ struct rwsem_waiter { struct task_struct *task; enum rwsem_waiter_type type; unsigned long timeout; - unsigned long last_rowner; }; #define rwsem_first_waiter(sem) \ list_first_entry(&sem->wait_list, struct rwsem_waiter, list) @@ -486,10 +447,6 @@ static void rwsem_mark_wake(struct rw_semaphore *sem, * the reader is copied over. */ owner = waiter->task; - if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) { - owner = (void *)((unsigned long)owner | RWSEM_RD_NONSPINNABLE); - lockevent_inc(rwsem_opt_norspin); - } __rwsem_set_reader_owned(sem, owner); }
@@ -620,30 +577,6 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem, }
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER -/* - * Try to acquire read lock before the reader is put on wait queue. - * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff - * is ongoing. - */ -static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem) -{ - long count = atomic_long_read(&sem->count); - - if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF)) - return false; - - count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count); - if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) { - rwsem_set_reader_owned(sem); - lockevent_inc(rwsem_opt_rlock); - return true; - } - - /* Back out the change */ - atomic_long_add(-RWSEM_READER_BIAS, &sem->count); - return false; -} - /* * Try to acquire write lock before the writer has been put on wait queue. */ @@ -655,7 +588,7 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem) if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, count | RWSEM_WRITER_LOCKED)) { rwsem_set_owner(sem); - lockevent_inc(rwsem_opt_wlock); + lockevent_inc(rwsem_opt_lock); return true; } } @@ -671,8 +604,7 @@ static inline bool owner_on_cpu(struct task_struct *owner) return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); }
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, - unsigned long nonspinnable) +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) { struct task_struct *owner; unsigned long flags; @@ -689,7 +621,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, /* * Don't check the read-owner as the entry may be stale. */ - if ((flags & nonspinnable) || + if ((flags & RWSEM_NONSPINNABLE) || (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner))) ret = false; rcu_read_unlock(); @@ -719,9 +651,9 @@ enum owner_state { #define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER | OWNER_READER)
static inline enum owner_state -rwsem_owner_state(struct task_struct *owner, unsigned long flags, unsigned long nonspinnable) +rwsem_owner_state(struct task_struct *owner, unsigned long flags) { - if (flags & nonspinnable) + if (flags & RWSEM_NONSPINNABLE) return OWNER_NONSPINNABLE;
if (flags & RWSEM_READER_OWNED) @@ -731,14 +663,14 @@ rwsem_owner_state(struct task_struct *owner, unsigned long flags, unsigned long }
static noinline enum owner_state -rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) +rwsem_spin_on_owner(struct rw_semaphore *sem) { struct task_struct *new, *owner; unsigned long flags, new_flags; enum owner_state state;
owner = rwsem_owner_flags(sem, &flags); - state = rwsem_owner_state(owner, flags, nonspinnable); + state = rwsem_owner_state(owner, flags); if (state != OWNER_WRITER) return state;
@@ -752,7 +684,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) */ new = rwsem_owner_flags(sem, &new_flags); if ((new != owner) || (new_flags != flags)) { - state = rwsem_owner_state(new, new_flags, nonspinnable); + state = rwsem_owner_state(new, new_flags); break; }
@@ -801,14 +733,12 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem) return sched_clock() + delta; }
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static bool rwsem_optimistic_spin(struct rw_semaphore *sem) { bool taken = false; int prev_owner_state = OWNER_NULL; int loop = 0; u64 rspin_threshold = 0; - unsigned long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE - : RWSEM_RD_NONSPINNABLE;
preempt_disable();
@@ -825,15 +755,14 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) for (;;) { enum owner_state owner_state;
- owner_state = rwsem_spin_on_owner(sem, nonspinnable); + owner_state = rwsem_spin_on_owner(sem); if (!(owner_state & OWNER_SPINNABLE)) break;
/* * Try to acquire the lock */ - taken = wlock ? rwsem_try_write_lock_unqueued(sem) - : rwsem_try_read_lock_unqueued(sem); + taken = rwsem_try_write_lock_unqueued(sem);
if (taken) break; @@ -841,7 +770,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) /* * Time-based reader-owned rwsem optimistic spinning */ - if (wlock && (owner_state == OWNER_READER)) { + if (owner_state == OWNER_READER) { /* * Re-initialize rspin_threshold every time when * the owner state changes from non-reader to reader. @@ -850,7 +779,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) * the beginning of the 2nd reader phase. */ if (prev_owner_state != OWNER_READER) { - if (rwsem_test_oflags(sem, nonspinnable)) + if (rwsem_test_oflags(sem, RWSEM_NONSPINNABLE)) break; rspin_threshold = rwsem_rspin_threshold(sem); loop = 0; @@ -926,89 +855,30 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) }
/* - * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should + * Clear the owner's RWSEM_NONSPINNABLE bit if it is set. This should * only be called when the reader count reaches 0. - * - * This give writers better chance to acquire the rwsem first before - * readers when the rwsem was being held by readers for a relatively long - * period of time. Race can happen that an optimistic spinner may have - * just stolen the rwsem and set the owner, but just clearing the - * RWSEM_WR_NONSPINNABLE bit will do no harm anyway. - */ -static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) -{ - if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE)) - atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner); -} - -/* - * This function is called when the reader fails to acquire the lock via - * optimistic spinning. In this case we will still attempt to do a trylock - * when comparing the rwsem state right now with the state when entering - * the slowpath indicates that the reader is still in a valid reader phase. - * This happens when the following conditions are true: - * - * 1) The lock is currently reader owned, and - * 2) The lock is previously not reader-owned or the last read owner changes. - * - * In the former case, we have transitioned from a writer phase to a - * reader-phase while spinning. In the latter case, it means the reader - * phase hasn't ended when we entered the optimistic spinning loop. In - * both cases, the reader is eligible to acquire the lock. This is the - * secondary path where a read lock is acquired optimistically. - * - * The reader non-spinnable bit wasn't set at time of entry or it will - * not be here at all. */ -static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem, - unsigned long last_rowner) +static inline void clear_nonspinnable(struct rw_semaphore *sem) { - unsigned long owner = atomic_long_read(&sem->owner); - - if (!(owner & RWSEM_READER_OWNED)) - return false; - - if (((owner ^ last_rowner) & ~RWSEM_OWNER_FLAGS_MASK) && - rwsem_try_read_lock_unqueued(sem)) { - lockevent_inc(rwsem_opt_rlock2); - lockevent_add(rwsem_opt_fail, -1); - return true; - } - return false; -} - -static inline bool rwsem_no_spinners(struct rw_semaphore *sem) -{ - return !osq_is_locked(&sem->osq); + if (rwsem_test_oflags(sem, RWSEM_NONSPINNABLE)) + atomic_long_andnot(RWSEM_NONSPINNABLE, &sem->owner); }
#else -static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, - unsigned long nonspinnable) +static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) { return false; }
-static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock) +static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem) { return false; }
-static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { } - -static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem, - unsigned long last_rowner) -{ - return false; -} - -static inline bool rwsem_no_spinners(sem) -{ - return false; -} +static inline void clear_nonspinnable(struct rw_semaphore *sem) { }
static inline int -rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) +rwsem_spin_on_owner(struct rw_semaphore *sem) { return 0; } @@ -1021,7 +891,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable) static struct rw_semaphore __sched * rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state) { - long owner, adjustment = -RWSEM_READER_BIAS; + long adjustment = -RWSEM_READER_BIAS; long rcnt = (count >> RWSEM_READER_SHIFT); struct rwsem_waiter waiter; DEFINE_WAKE_Q(wake_q); @@ -1029,54 +899,25 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state)
/* * To prevent a constant stream of readers from starving a sleeping - * waiter, don't attempt optimistic spinning if the lock is currently - * owned by readers. + * waiter, don't attempt optimistic lock stealing if the lock is + * currently owned by readers. */ - owner = atomic_long_read(&sem->owner); - if ((owner & RWSEM_READER_OWNED) && (rcnt > 1) && - !(count & RWSEM_WRITER_LOCKED)) + if ((atomic_long_read(&sem->owner) & RWSEM_READER_OWNED) && + (rcnt > 1) && !(count & RWSEM_WRITER_LOCKED)) goto queue;
/* - * Reader optimistic lock stealing - * - * We can take the read lock directly without doing - * rwsem_optimistic_spin() if the conditions are right. - * Also wake up other readers if it is the first reader. + * Reader optimistic lock stealing. */ - if (!(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF)) && - rwsem_no_spinners(sem)) { + if (!(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF))) { rwsem_set_reader_owned(sem); lockevent_inc(rwsem_rlock_steal); - if (rcnt == 1) - goto wake_readers; - return sem; - }
- /* - * Save the current read-owner of rwsem, if available, and the - * reader nonspinnable bit. - */ - waiter.last_rowner = owner; - if (!(waiter.last_rowner & RWSEM_READER_OWNED)) - waiter.last_rowner &= RWSEM_RD_NONSPINNABLE; - - if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE)) - goto queue; - - /* - * Undo read bias from down_read() and do optimistic spinning. - */ - atomic_long_add(-RWSEM_READER_BIAS, &sem->count); - adjustment = 0; - if (rwsem_optimistic_spin(sem, false)) { - /* rwsem_optimistic_spin() implies ACQUIRE on success */ /* - * Wake up other readers in the wait list if the front - * waiter is a reader. + * Wake up other readers in the wait queue if it is + * the first reader. */ -wake_readers: - if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) { + if ((rcnt == 1) && (count & RWSEM_FLAG_WAITERS)) { raw_spin_lock_irq(&sem->wait_lock); if (!list_empty(&sem->wait_list)) rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, @@ -1085,9 +926,6 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state) wake_up_q(&wake_q); } return sem; - } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) { - /* rwsem_reader_phase_trylock() implies ACQUIRE on success */ - return sem; }
queue: @@ -1103,7 +941,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state) * exit the slowpath and return immediately as its * RWSEM_READER_BIAS has already been set in the count. */ - if (adjustment && !(atomic_long_read(&sem->count) & + if (!(atomic_long_read(&sem->count) & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) { /* Provide lock ACQUIRE */ smp_acquire__after_ctrl_dep(); @@ -1117,10 +955,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state) list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock, but no longer actively locking */ - if (adjustment) - count = atomic_long_add_return(adjustment, &sem->count); - else - count = atomic_long_read(&sem->count); + count = atomic_long_add_return(adjustment, &sem->count);
/* * If there are no active locks, wake the front queued process(es). @@ -1129,7 +964,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state) * wake our own waiter to join the existing active readers ! */ if (!(count & RWSEM_LOCK_MASK)) { - clear_wr_nonspinnable(sem); + clear_nonspinnable(sem); wake = true; } if (wake || (!(count & RWSEM_WRITER_MASK) && @@ -1174,19 +1009,6 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, int state) return ERR_PTR(-EINTR); }
-/* - * This function is called by the a write lock owner. So the owner value - * won't get changed by others. - */ -static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem, - bool disable) -{ - if (unlikely(disable)) { - atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner); - lockevent_inc(rwsem_opt_norspin); - } -} - /* * Wait until we successfully acquire the write lock */ @@ -1194,26 +1016,17 @@ static struct rw_semaphore * rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) { long count; - bool disable_rspin; enum writer_wait_state wstate; struct rwsem_waiter waiter; struct rw_semaphore *ret = sem; DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */ - if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) && - rwsem_optimistic_spin(sem, true)) { + if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) { /* rwsem_optimistic_spin() implies ACQUIRE on success */ return sem; }
- /* - * Disable reader optimistic spinning for this rwsem after - * acquiring the write lock when the setting of the nonspinnable - * bits are observed. - */ - disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE; - /* * Optimistic spinning failed, proceed to the slowpath * and block until we can acquire the sem. @@ -1282,7 +1095,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) * without sleeping. */ if (wstate == WRITER_HANDOFF && - rwsem_spin_on_owner(sem, RWSEM_NONSPINNABLE) == OWNER_NULL) + rwsem_spin_on_owner(sem) == OWNER_NULL) goto trylock_again;
/* Block until there are no active lockers. */ @@ -1324,7 +1137,6 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) } __set_current_state(TASK_RUNNING); list_del(&waiter.list); - rwsem_disable_reader_optspin(sem, disable_rspin); raw_spin_unlock_irq(&sem->wait_lock); lockevent_inc(rwsem_wlock);
@@ -1484,7 +1296,7 @@ static inline void __up_read(struct rw_semaphore *sem) DEBUG_RWSEMS_WARN_ON(tmp < 0, sem); if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) == RWSEM_FLAG_WAITERS)) { - clear_wr_nonspinnable(sem); + clear_nonspinnable(sem); rwsem_wake(sem, tmp); } }
On Tue, Feb 07, 2023 at 07:01:28PM +0000, Shaoying Xu wrote:
> This patch series is to remove reader optimistic spinning in kernel 5.10 to improve the MongoDB performance. Performance measurements (10 times running average of overall throughput ops/sec) are using MongoDB 5.0.11 and YCSB [1] microbenchmark with workloadA [2] on AWS EC2 m5.4xlarge/m6g.4xlarge (16-vCPU 64GiB-memory) instances with a 512GB EBS IO1 drive disk with 5000 IOPS and separating MongoDB and YCSB load generator on 2 instances and setting recordcount=25000000 and operationcount=10000000 to see the impacts of these changes:
> Before - v5.10.165 kernel in OS Amazon Linux 2
> After  - v5.10.165 kernel with reader spinning disabled in OS Amazon Linux 2
> | Arch    | Instance Type | Before  | After   |
> |---------+---------------+---------+---------|
> | x86_64  | m5.4xlarge    | 37365.4 | 42373.9 |
> |---------+---------------+---------+---------|
> | aarch64 | m6g.4xlarge   | 33823.1 | 43113.7 |
> |---------+---------------+---------+---------|
> It can be seen that the MongoDB throughput can be improved around 13% in x86_64 and 27% in aarch64 after disabling reader optimistic spinning and these patches can be applied to 5.10 with no conflict so we wonder if it's possible to backport them to stable 5.10?
This is, frankly, crazy. :)
This is great that you did this work and found this out, but really, shouldn't you have done the less work and just moved to 5.15.y instead? You're going to have to do that anyway, what's preventing that from happening now, with the HUGE justification that you get a big workload increase and power savings (i.e. real money)?
So now you just delay the inevitable and spend more work overall (i.e. the backport work now, and the 5.15.y move later?) This feels like a bad management decision somewhere, who do I need to talk to to resolve this?
thanks,
greg k-h
> This is great that you did this work and found this out, but really, shouldn't you have done the less work and just moved to 5.15.y instead? You're going to have to do that anyway, what's preventing that from happening now, with the HUGE justification that you get a big workload increase and power savings (i.e. real money)?
Hey Greg,
We are actually shipping kernel 5.15 as part of Amazon Linux kernel releases, so theoretically moving to 5.15 should be the way to go; however, the relevant teams usually take some time for workload-specific testing and benchmarking before they do a major upgrade like moving from 5.10 to 5.15. We usually ask whoever is reporting a regression/bug/kernel enhancement to run with the latest kernel, as you said, while sometimes we backport fixes if the production migration to the latest kernel is something that will take time for the reasons I mentioned above. We thought that this performance improvement would also be beneficial for Linux 5.10 users, hence we preferred these patches to be merged into stable 5.10 rather than us just consuming them as downstream patches. We are currently working with the relevant team on a plan for the possible 5.15 migration as a long-term solution.
Thank you.
Hazem
On Mon, Feb 13, 2023 at 12:11:55PM +0000, Mohamed Abuelfotoh, Hazem wrote:
> > This is great that you did this work and found this out, but really, shouldn't you have done the less work and just moved to 5.15.y instead? You're going to have to do that anyway, what's preventing that from happening now, with the HUGE justification that you get a big workload increase and power savings (i.e. real money)?
> Hey Greg,
Hi,
Odd whitespace, you should fix your email client :)
> We are actually shipping kernel 5.15 as part of Amazon Linux kernel releases, so theoretically moving to 5.15 should be the way to go; however, the relevant teams usually take some time for workload-specific testing and benchmarking before they do a major upgrade like moving from 5.10 to 5.15. We usually ask whoever is reporting a regression/bug/kernel enhancement to run with the latest kernel, as you said, while sometimes we backport fixes if the production migration to the latest kernel is something that will take time for the reasons I mentioned above. We thought that this performance improvement would also be beneficial for Linux 5.10 users, hence we preferred these patches to be merged into stable 5.10 rather than us just consuming them as downstream patches. We are currently working with the relevant team on a plan for the possible 5.15 migration as a long-term solution.
And use some \n characters :)
But this isn't a regression, it's a speedup. And a good reason to use a newer kernel version. What type of guarantees are there that these changes work properly and will have all needed future bugfixes applied to them? How were they tested?
I am loath to want to replace something as core as this without a lot of testing and verification and agreement from the subsystem authors/maintainers that this is ok to make.
Seriously, just move to 5.15.y or 6.1.y, that will be easier for you in the long run, the amount of testing and validation should have been the same for this small patchset as it would be for a kernel update, so there shouldn't be any issues here.
thanks,
greg k-h