Patch 1: add a selftest app to benchmark the performance of ptr_ring. Patch 2: make __ptr_ring_empty() checking more reliable and use the just added selftest to benchmark the performance impact.
V2: add patch 1 and add performance data for patch 2.
Yunsheng Lin (2): selftests/ptr_ring: add benchmark application for ptr_ring ptr_ring: make __ptr_ring_empty() checking more reliable
MAINTAINERS | 5 + include/linux/ptr_ring.h | 25 ++- tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 5 files changed, 426 insertions(+), 9 deletions(-) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
diff --git a/MAINTAINERS b/MAINTAINERS index cc375fd..1227022 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14847,6 +14847,11 @@ F: drivers/net/phy/dp83640* F: drivers/ptp/* F: include/linux/ptp_cl*
+PTR RING BENCHMARK +M: Yunsheng Lin linyunsheng@huawei.com +L: netdev@vger.kernel.org +F: tools/testing/selftests/ptr_ring/ + PTRACE SUPPORT M: Oleg Nesterov oleg@redhat.com S: Maintained diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile new file mode 100644 index 0000000..346dea9 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +LDLIBS = -lpthread + +TEST_GEN_PROGS := ptr_ring_test + +include ../lib.mk diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c new file mode 100644 index 0000000..4f32d3d --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c @@ -0,0 +1,249 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2021 HiSilicon Limited. + */ + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> +#include <sys/time.h> +#include <malloc.h> +#include <assert.h> +#include <stdbool.h> +#include <pthread.h> + +#include "ptr_ring_test.h" +#include "../../../../include/linux/ptr_ring.h" + +#define MIN_RING_SIZE 2 +#define MAX_RING_SIZE 10000000 + +static struct ptr_ring ring ____cacheline_aligned_in_smp; + +struct worker_info { + pthread_t tid; + int test_count; + bool error; + long duration_us; +}; + +static void *produce_worker(void *arg) +{ + struct worker_info *info = arg; + struct timeval start, end; + unsigned long i = 0; + long sec, us; + int ret; + + gettimeofday(&start, NULL); + + while (++i <= info->test_count) { + while (__ptr_ring_full(&ring)) + cpu_relax(); + + ret = __ptr_ring_produce(&ring, (void *)i); + if (ret) { + fprintf(stderr, "produce failed: %d\n", ret); + info->error = true; + return NULL; + } + } + + gettimeofday(&end, NULL); + + sec = (end.tv_sec - start.tv_sec); + us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec); + info->duration_us = us; + info->error = false; + + return NULL; +} + +static void *consume_worker(void *arg) +{ + struct worker_info *info = arg; + struct timeval start, end; + unsigned long i = 0; + long sec, us; + int *ptr; + + gettimeofday(&start, NULL); + + while (++i <= info->test_count) { + while (__ptr_ring_empty(&ring)) + cpu_relax(); + + ptr = __ptr_ring_consume(&ring); + if ((unsigned long)ptr != i) { + fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n", + (unsigned long)ptr, i); + info->error = true; + return NULL; + } + } + + gettimeofday(&end, NULL); + + if (!__ptr_ring_empty(&ring)) { + fprintf(stderr, "ring should be empty, test failed\n"); + info->error = true; + return NULL; + } + + sec = (end.tv_sec - start.tv_sec); + us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec); + info->duration_us = us; + info->error = false; + return NULL; +} + +/* test case for single producer single consumer */ +static void spsc_test(int size, int count) +{ + struct worker_info producer, consumer; + pthread_attr_t attr; + void *res; + int ret; + + ret = ptr_ring_init(&ring, size, 0); + if (ret) { + fprintf(stderr, "init failed: %d\n", ret); + return; + } + + producer.test_count = count; + consumer.test_count = count; + + ret = pthread_attr_init(&attr); + if (ret) { + fprintf(stderr, "pthread attr init failed: %d\n", ret); + goto out; + } + + ret = pthread_create(&producer.tid, &attr, + produce_worker, &producer); + if (ret) { + fprintf(stderr, "create producer thread failed: %d\n", ret); + goto out; + } + + ret = pthread_create(&consumer.tid, &attr, + consume_worker, &consumer); + if (ret) { + fprintf(stderr, "create consumer thread failed: %d\n", ret); + goto out; + } + + ret = pthread_join(producer.tid, &res); + if (ret) { + fprintf(stderr, "join producer thread failed: %d\n", ret); + goto out; + } + + ret = pthread_join(consumer.tid, &res); + if (ret) { + fprintf(stderr, "join consumer thread failed: %d\n", ret); + goto out; + } + + if (producer.error || consumer.error) { + fprintf(stderr, "spsc test failed\n"); + goto out; + } + + printf("ptr_ring(size:%d) perf spsc test for %d times, took %ld us + %ld us\n", + size, count, producer.duration_us, consumer.duration_us); +out: + ptr_ring_cleanup(&ring, NULL); +} + +static void simple_test(int size, int count) +{ + struct timeval start, end; + long sec, us; + int i = 0; + int *ptr; + int ret; + + ret = ptr_ring_init(&ring, size, 0); + if (ret) { + fprintf(stderr, "init failed: %d\n", ret); + return; + } + + gettimeofday(&start, NULL); + + while (++i <= count) { + ret = __ptr_ring_produce(&ring, &count); + if (ret) { + fprintf(stderr, "produce failed: %d\n", ret); + goto out; + } + + ptr = __ptr_ring_consume(&ring); + if (ptr != &count) { + fprintf(stderr, "consume failed: %p\n", ptr); + goto out; + } + } + + gettimeofday(&end, NULL); + sec = (end.tv_sec - start.tv_sec); + us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec); + printf("ptr_ring(size:%d) perf simple test for %d times, took %ld us\n", + size, count, us); + +out: + ptr_ring_cleanup(&ring, NULL); +} + +int main(int argc, char *argv[]) +{ + int count = 1000000; + int size = 1000; + int mode = 0; + int opt; + + while ((opt = getopt(argc, argv, "N:s:m:")) != -1) { + switch (opt) { + case 'N': + count = atoi(optarg); + break; + case 's': + size = atoi(optarg); + break; + case 'm': + mode = atoi(optarg); + break; + default: + return -1; + } + } + + if (count <= 0) { + fprintf(stderr, "invalid test count, must be > 0\n"); + return -1; + } + + if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) { + fprintf(stderr, "invalid ring size, must be in %d-%d\n", + MIN_RING_SIZE, MAX_RING_SIZE); + return -1; + } + + switch (mode) { + case 0: + simple_test(size, count); + break; + case 1: + spsc_test(size, count); + break; + default: + fprintf(stderr, "invalid test mode\n"); + return -1; + } + + return 0; +} diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h new file mode 100644 index 0000000..6bf2494 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _TEST_PTR_RING_IMPL_H +#define _TEST_PTR_RING_IMPL_H + +#if defined(__x86_64__) || defined(__i386__) +static inline void cpu_relax(void) +{ + asm volatile ("rep; nop" ::: "memory"); +} +#elif defined(__aarch64__) +static inline void cpu_relax(void) +{ + asm volatile("yield" ::: "memory"); +} +#else +#define cpu_relax() assert(0) +#endif + +static inline void barrier(void) +{ + asm volatile("" ::: "memory"); +} + +/* + * This abuses the atomic builtins for thread fences, and + * adds a compiler barrier. + */ +#define smp_release() do { \ + barrier(); \ + __atomic_thread_fence(__ATOMIC_RELEASE); \ +} while (0) + +#define smp_acquire() do { \ + __atomic_thread_fence(__ATOMIC_ACQUIRE); \ + barrier(); \ +} while (0) + +#if defined(__i386__) || defined(__x86_64__) +#define smp_wmb() barrier() +#else +#define smp_wmb() smp_release() +#endif + +#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val)) +#define SMP_CACHE_BYTES 64 +#define cache_line_size SMP_CACHE_BYTES +#define unlikely(x) (__builtin_expect(!!(x), 0)) +#define likely(x) (__builtin_expect(!!(x), 1)) +#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a)) +#define SIZE_MAX (~(size_t)0) +#define KMALLOC_MAX_SIZE SIZE_MAX +#define spinlock_t pthread_spinlock_t +#define gfp_t int +#define __GFP_ZERO 0x1 + +#define ____cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES))) + +static void *kmalloc(unsigned int size, gfp_t gfp) +{ + void *p; + + p = memalign(64, size); + if (!p) + return p; + + if (gfp & __GFP_ZERO) + memset(p, 0, size); + + return p; +} + +static inline void *kzalloc(unsigned int size, gfp_t flags) +{ + return kmalloc(size, flags | __GFP_ZERO); +} + +static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) +{ + if (size != 0 && n > SIZE_MAX / size) + return NULL; + return kmalloc(n * size, flags); +} + +static inline void *kcalloc(size_t n, size_t size, gfp_t flags) +{ + return kmalloc_array(n, size, flags | __GFP_ZERO); +} + +static void kfree(void *p) +{ + free(p); +} + +#define kvmalloc_array kmalloc_array +#define kvfree kfree + +static void spin_lock_init(spinlock_t *lock) +{ + int r = pthread_spin_init(lock, 0); + + assert(!r); +} + +static void spin_lock(spinlock_t *lock) +{ + int ret = pthread_spin_lock(lock); + + assert(!ret); +} + +static void spin_unlock(spinlock_t *lock) +{ + int ret = pthread_spin_unlock(lock); + + assert(!ret); +} + +static void spin_lock_bh(spinlock_t *lock) +{ + spin_lock(lock); +} + +static void spin_unlock_bh(spinlock_t *lock) +{ + spin_unlock(lock); +} + +static void spin_lock_irq(spinlock_t *lock) +{ + spin_lock(lock); +} + +static void spin_unlock_irq(spinlock_t *lock) +{ + spin_unlock(lock); +} + +static void spin_lock_irqsave(spinlock_t *lock, unsigned long f) +{ + spin_lock(lock); +} + +static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f) +{ + spin_unlock(lock); +} + +#endif
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com --- V2: Add performance data. --- include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */ - int consumer_head = r->consumer_head; - int head = consumer_head++; + int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) { + int tail = r->consumer_tail; + + if (unlikely(consumer_head >= r->size)) { + r->consumer_tail = 0; + WRITE_ONCE(r->consumer_head, 0); + } else { + r->consumer_tail = consumer_head; + WRITE_ONCE(r->consumer_head, consumer_head); + } + /* Zero out entries in the reverse order: this way we touch the * cache line that producer might currently be reading the last; * producer won't make progress and touch other cache lines * besides the first one until we write out all entries. */ - while (likely(head >= r->consumer_tail)) - r->queue[head--] = NULL; - r->consumer_tail = consumer_head; - } - if (unlikely(consumer_head >= r->size)) { - consumer_head = 0; - r->consumer_tail = 0; + while (likely(--consumer_head >= tail)) + r->queue[consumer_head] = NULL; + + return; } + /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head); }
在 2021/6/25 上午11:18, Yunsheng Lin 写道:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+)
Why can't you simply reuse tools/virtio/ringtest?
Thanks
On 2021/6/25 11:36, Jason Wang wrote:
在 2021/6/25 上午11:18, Yunsheng Lin 写道:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+)
Why can't you simply reuse tools/virtio/ringtest?
The main reason is stated in the commit log: "Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc."
More specificly in tools/virtio/ringtest/main.c and tools/virtio/ringtest/ptr_ring.c, there are a lot of operation related to virtio usecase, such as start_guest(), start_host(), poll_used(), notify() or kick() ....., so it makes more sense to add a generic selftest for ptr ring as it is not only used by virtio now.
Thanks
.
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Well the documentation for __ptr_ring_empty clearly states is is not guaranteed to be reliable.
* * NB: This is only safe to call if ring is never resized. * * However, if some other CPU consumes ring entries at the same time, the value * returned is not guaranteed to be correct. * * In this case - to avoid incorrectly detecting the ring * as empty - the CPU consuming the ring entries is responsible * for either consuming all ring entries until the ring is empty, * or synchronizing with some other CPU and causing it to * re-test __ptr_ring_empty and/or consume the ring enteries * after the synchronization point. *
Is it then the case that page_pool_refill_alloc_cache violates this requirement? How?
It looks like you are trying to make the guarantee stronger and ensure no false positives.
If yes please document this as such, update the comment so all code can be evaluated with the eye towards whether the new stronger guarantee is maintained. In particular I think I see at least one issue with this immediately.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
- int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
return;
So if now we need this to be reliable then we also need smp_wmb before writing r->queue[consumer_head], there could be other gotchas.
}
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head);
}
2.7.4
On Fri, Jun 25, 2021 at 11:18:55AM +0800, Yunsheng Lin wrote:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
diff --git a/MAINTAINERS b/MAINTAINERS index cc375fd..1227022 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14847,6 +14847,11 @@ F: drivers/net/phy/dp83640* F: drivers/ptp/* F: include/linux/ptp_cl* +PTR RING BENCHMARK +M: Yunsheng Lin linyunsheng@huawei.com +L: netdev@vger.kernel.org +F: tools/testing/selftests/ptr_ring/
PTRACE SUPPORT M: Oleg Nesterov oleg@redhat.com S: Maintained diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile new file mode 100644 index 0000000..346dea9 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +LDLIBS = -lpthread
+TEST_GEN_PROGS := ptr_ring_test
+include ../lib.mk diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c new file mode 100644 index 0000000..4f32d3d --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c @@ -0,0 +1,249 @@ +// SPDX-License-Identifier: GPL-2.0-only
Can we keep this GPL-2.0-or-later same as ptr ring itself? Encourages reuse ...
+/*
- Copyright (C) 2021 HiSilicon Limited.
- */
+#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> +#include <sys/time.h> +#include <malloc.h> +#include <assert.h> +#include <stdbool.h> +#include <pthread.h>
+#include "ptr_ring_test.h" +#include "../../../../include/linux/ptr_ring.h"
+#define MIN_RING_SIZE 2 +#define MAX_RING_SIZE 10000000
+static struct ptr_ring ring ____cacheline_aligned_in_smp;
+struct worker_info {
- pthread_t tid;
- int test_count;
- bool error;
- long duration_us;
+};
+static void *produce_worker(void *arg) +{
- struct worker_info *info = arg;
- struct timeval start, end;
- unsigned long i = 0;
- long sec, us;
- int ret;
- gettimeofday(&start, NULL);
- while (++i <= info->test_count) {
while (__ptr_ring_full(&ring))
cpu_relax();
ret = __ptr_ring_produce(&ring, (void *)i);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
info->error = true;
return NULL;
}
- }
- gettimeofday(&end, NULL);
- sec = (end.tv_sec - start.tv_sec);
- us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
- info->duration_us = us;
- info->error = false;
- return NULL;
+}
perf does all of this and more. Let's not reinvent the wheel - just run the test.
+static void *consume_worker(void *arg) +{
- struct worker_info *info = arg;
- struct timeval start, end;
- unsigned long i = 0;
- long sec, us;
- int *ptr;
- gettimeofday(&start, NULL);
- while (++i <= info->test_count) {
while (__ptr_ring_empty(&ring))
cpu_relax();
ptr = __ptr_ring_consume(&ring);
if ((unsigned long)ptr != i) {
fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
(unsigned long)ptr, i);
info->error = true;
return NULL;
}
- }
- gettimeofday(&end, NULL);
- if (!__ptr_ring_empty(&ring)) {
fprintf(stderr, "ring should be empty, test failed\n");
info->error = true;
return NULL;
- }
- sec = (end.tv_sec - start.tv_sec);
- us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
- info->duration_us = us;
- info->error = false;
- return NULL;
+}
+/* test case for single producer single consumer */ +static void spsc_test(int size, int count) +{
- struct worker_info producer, consumer;
- pthread_attr_t attr;
- void *res;
- int ret;
- ret = ptr_ring_init(&ring, size, 0);
- if (ret) {
fprintf(stderr, "init failed: %d\n", ret);
return;
- }
- producer.test_count = count;
- consumer.test_count = count;
- ret = pthread_attr_init(&attr);
- if (ret) {
fprintf(stderr, "pthread attr init failed: %d\n", ret);
goto out;
- }
- ret = pthread_create(&producer.tid, &attr,
produce_worker, &producer);
- if (ret) {
fprintf(stderr, "create producer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_create(&consumer.tid, &attr,
consume_worker, &consumer);
- if (ret) {
fprintf(stderr, "create consumer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_join(producer.tid, &res);
- if (ret) {
fprintf(stderr, "join producer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_join(consumer.tid, &res);
- if (ret) {
fprintf(stderr, "join consumer thread failed: %d\n", ret);
goto out;
- }
- if (producer.error || consumer.error) {
fprintf(stderr, "spsc test failed\n");
goto out;
- }
- printf("ptr_ring(size:%d) perf spsc test for %d times, took %ld us + %ld us\n",
size, count, producer.duration_us, consumer.duration_us);
+out:
- ptr_ring_cleanup(&ring, NULL);
+}
+static void simple_test(int size, int count) +{
- struct timeval start, end;
- long sec, us;
- int i = 0;
- int *ptr;
- int ret;
- ret = ptr_ring_init(&ring, size, 0);
- if (ret) {
fprintf(stderr, "init failed: %d\n", ret);
return;
- }
- gettimeofday(&start, NULL);
- while (++i <= count) {
ret = __ptr_ring_produce(&ring, &count);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
goto out;
}
ptr = __ptr_ring_consume(&ring);
if (ptr != &count) {
fprintf(stderr, "consume failed: %p\n", ptr);
goto out;
}
- }
- gettimeofday(&end, NULL);
- sec = (end.tv_sec - start.tv_sec);
- us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
- printf("ptr_ring(size:%d) perf simple test for %d times, took %ld us\n",
size, count, us);
+out:
- ptr_ring_cleanup(&ring, NULL);
+}
+int main(int argc, char *argv[]) +{
- int count = 1000000;
- int size = 1000;
- int mode = 0;
- int opt;
- while ((opt = getopt(argc, argv, "N:s:m:")) != -1) {
switch (opt) {
case 'N':
count = atoi(optarg);
break;
case 's':
size = atoi(optarg);
break;
case 'm':
mode = atoi(optarg);
break;
default:
return -1;
}
- }
- if (count <= 0) {
fprintf(stderr, "invalid test count, must be > 0\n");
return -1;
- }
- if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) {
fprintf(stderr, "invalid ring size, must be in %d-%d\n",
MIN_RING_SIZE, MAX_RING_SIZE);
return -1;
- }
- switch (mode) {
- case 0:
simple_test(size, count);
break;
- case 1:
spsc_test(size, count);
break;
- default:
fprintf(stderr, "invalid test mode\n");
return -1;
- }
- return 0;
+} diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h new file mode 100644 index 0000000..6bf2494 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */
We already have hacks like this in the virtio test. Let's refactor not duplicate please.
+#ifndef _TEST_PTR_RING_IMPL_H +#define _TEST_PTR_RING_IMPL_H
+#if defined(__x86_64__) || defined(__i386__) +static inline void cpu_relax(void) +{
- asm volatile ("rep; nop" ::: "memory");
+} +#elif defined(__aarch64__) +static inline void cpu_relax(void) +{
- asm volatile("yield" ::: "memory");
+} +#else +#define cpu_relax() assert(0) +#endif
+static inline void barrier(void) +{
- asm volatile("" ::: "memory");
+}
+/*
- This abuses the atomic builtins for thread fences, and
- adds a compiler barrier.
- */
+#define smp_release() do { \
- barrier(); \
- __atomic_thread_fence(__ATOMIC_RELEASE); \
+} while (0)
+#define smp_acquire() do { \
- __atomic_thread_fence(__ATOMIC_ACQUIRE); \
- barrier(); \
+} while (0)
+#if defined(__i386__) || defined(__x86_64__) +#define smp_wmb() barrier() +#else +#define smp_wmb() smp_release() +#endif
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val)) +#define SMP_CACHE_BYTES 64 +#define cache_line_size SMP_CACHE_BYTES +#define unlikely(x) (__builtin_expect(!!(x), 0)) +#define likely(x) (__builtin_expect(!!(x), 1)) +#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a)) +#define SIZE_MAX (~(size_t)0) +#define KMALLOC_MAX_SIZE SIZE_MAX +#define spinlock_t pthread_spinlock_t +#define gfp_t int +#define __GFP_ZERO 0x1
+#define ____cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES)))
+static void *kmalloc(unsigned int size, gfp_t gfp) +{
- void *p;
- p = memalign(64, size);
- if (!p)
return p;
- if (gfp & __GFP_ZERO)
memset(p, 0, size);
- return p;
+}
+static inline void *kzalloc(unsigned int size, gfp_t flags) +{
- return kmalloc(size, flags | __GFP_ZERO);
+}
+static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) +{
- if (size != 0 && n > SIZE_MAX / size)
return NULL;
- return kmalloc(n * size, flags);
+}
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags) +{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
+}
+static void kfree(void *p) +{
- free(p);
+}
+#define kvmalloc_array kmalloc_array +#define kvfree kfree
+static void spin_lock_init(spinlock_t *lock) +{
- int r = pthread_spin_init(lock, 0);
- assert(!r);
+}
+static void spin_lock(spinlock_t *lock) +{
- int ret = pthread_spin_lock(lock);
- assert(!ret);
+}
+static void spin_unlock(spinlock_t *lock) +{
- int ret = pthread_spin_unlock(lock);
- assert(!ret);
+}
+static void spin_lock_bh(spinlock_t *lock) +{
- spin_lock(lock);
+}
+static void spin_unlock_bh(spinlock_t *lock) +{
- spin_unlock(lock);
+}
+static void spin_lock_irq(spinlock_t *lock) +{
- spin_lock(lock);
+}
+static void spin_unlock_irq(spinlock_t *lock) +{
- spin_unlock(lock);
+}
+static void spin_lock_irqsave(spinlock_t *lock, unsigned long f) +{
- spin_lock(lock);
+}
+static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f) +{
- spin_unlock(lock);
+}
+#endif
2.7.4
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Nice but it's small - could be a fluke. How many tests did you run? What is the variance? Did you try pinning to different CPUs to observe numa effects? Please use perf or some other modern tool for this kind of benchmark. Thanks!
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
- int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
}return;
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head);
}
2.7.4
On Fri, Jun 25, 2021 at 11:18:54AM +0800, Yunsheng Lin wrote:
Patch 1: add a selftest app to benchmark the performance of ptr_ring. Patch 2: make __ptr_ring_empty() checking more reliable and use the just added selftest to benchmark the performance impact.
V2: add patch 1 and add performance data for patch 2.
Thanks for the patches! There are some things to improve there - I sent comments in response to invididual patches.
Yunsheng Lin (2): selftests/ptr_ring: add benchmark application for ptr_ring ptr_ring: make __ptr_ring_empty() checking more reliable
MAINTAINERS | 5 + include/linux/ptr_ring.h | 25 ++- tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 5 files changed, 426 insertions(+), 9 deletions(-) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
-- 2.7.4
On 2021/6/25 14:32, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Well the documentation for __ptr_ring_empty clearly states is is not guaranteed to be reliable.
Yes, this patch does not make __ptr_ring_empty() strictly reliable without taking the r->consumer_lock, as the disscuission in [1].
1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-...
- NB: This is only safe to call if ring is never resized.
- However, if some other CPU consumes ring entries at the same time, the value
- returned is not guaranteed to be correct.
- In this case - to avoid incorrectly detecting the ring
- as empty - the CPU consuming the ring entries is responsible
- for either consuming all ring entries until the ring is empty,
- or synchronizing with some other CPU and causing it to
- re-test __ptr_ring_empty and/or consume the ring enteries
- after the synchronization point.
Is it then the case that page_pool_refill_alloc_cache violates this requirement? How?
As my understanding: page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid taking r->consumer_lock, when the above data race happens, it will exit out and allocate page from the page allocator instead of reusing the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() is more reliable.
It looks like you are trying to make the guarantee stronger and ensure no false positives.
If yes please document this as such, update the comment so all code can be evaluated with the eye towards whether the new stronger guarantee is maintained. In particular I think I see at least one issue with this immediately.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
- int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
return;
So if now we need this to be reliable then we also need smp_wmb before writing r->queue[consumer_head], there could be other gotchas.
Yes, This patch does not make it strictly reliable. T think I could mention that in the commit log?
}
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head);
}
2.7.4
.
On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
On 2021/6/25 14:32, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Well the documentation for __ptr_ring_empty clearly states is is not guaranteed to be reliable.
Yes, this patch does not make __ptr_ring_empty() strictly reliable without taking the r->consumer_lock, as the disscuission in [1].
- NB: This is only safe to call if ring is never resized.
- However, if some other CPU consumes ring entries at the same time, the value
- returned is not guaranteed to be correct.
- In this case - to avoid incorrectly detecting the ring
- as empty - the CPU consuming the ring entries is responsible
- for either consuming all ring entries until the ring is empty,
- or synchronizing with some other CPU and causing it to
- re-test __ptr_ring_empty and/or consume the ring enteries
- after the synchronization point.
Is it then the case that page_pool_refill_alloc_cache violates this requirement? How?
As my understanding: page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid taking r->consumer_lock, when the above data race happens, it will exit out and allocate page from the page allocator instead of reusing the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() is more reliable.
Question is how do we know it's more reliable? It would be nice if we did actually made it more reliable, as it is we are just shifting races around.
It looks like you are trying to make the guarantee stronger and ensure no false positives.
If yes please document this as such, update the comment so all code can be evaluated with the eye towards whether the new stronger guarantee is maintained. In particular I think I see at least one issue with this immediately.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
- int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
return;
So if now we need this to be reliable then we also need smp_wmb before writing r->queue[consumer_head], there could be other gotchas.
Yes, This patch does not make it strictly reliable. T think I could mention that in the commit log?
OK so it's not that it makes it more reliable - this patch simply makes a possible false positive less likely while making a false negative more likely. Our assumption is that a false negative is cheaper then?
How do we know that it is?
And even if we prove the ptr_ring itself is faster now, how do we know what affects callers in a better way a false positive or a false negative?
I would rather we worked on actually making it reliable e.g. if we can guarantee no false positives, that would be a net win.
}
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head);
}
2.7.4
.
On 2021/6/25 14:37, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:55AM +0800, Yunsheng Lin wrote:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+) create mode 100644 tools/testing/selftests/ptr_ring/Makefile create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
diff --git a/MAINTAINERS b/MAINTAINERS index cc375fd..1227022 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14847,6 +14847,11 @@ F: drivers/net/phy/dp83640* F: drivers/ptp/* F: include/linux/ptp_cl* +PTR RING BENCHMARK +M: Yunsheng Lin linyunsheng@huawei.com +L: netdev@vger.kernel.org +F: tools/testing/selftests/ptr_ring/
PTRACE SUPPORT M: Oleg Nesterov oleg@redhat.com S: Maintained diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile new file mode 100644 index 0000000..346dea9 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 +LDLIBS = -lpthread
+TEST_GEN_PROGS := ptr_ring_test
+include ../lib.mk diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c new file mode 100644 index 0000000..4f32d3d --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c @@ -0,0 +1,249 @@ +// SPDX-License-Identifier: GPL-2.0-only
Can we keep this GPL-2.0-or-later same as ptr ring itself? Encourages reuse ...
Ok.
+/*
- Copyright (C) 2021 HiSilicon Limited.
- */
+#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> +#include <sys/time.h> +#include <malloc.h> +#include <assert.h> +#include <stdbool.h> +#include <pthread.h>
+#include "ptr_ring_test.h" +#include "../../../../include/linux/ptr_ring.h"
+#define MIN_RING_SIZE 2 +#define MAX_RING_SIZE 10000000
+static struct ptr_ring ring ____cacheline_aligned_in_smp;
+struct worker_info {
- pthread_t tid;
- int test_count;
- bool error;
- long duration_us;
+};
+static void *produce_worker(void *arg) +{
- struct worker_info *info = arg;
- struct timeval start, end;
- unsigned long i = 0;
- long sec, us;
- int ret;
- gettimeofday(&start, NULL);
- while (++i <= info->test_count) {
while (__ptr_ring_full(&ring))
cpu_relax();
ret = __ptr_ring_produce(&ring, (void *)i);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
info->error = true;
return NULL;
}
- }
- gettimeofday(&end, NULL);
- sec = (end.tv_sec - start.tv_sec);
- us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
- info->duration_us = us;
- info->error = false;
- return NULL;
+}
perf does all of this and more. Let's not reinvent the wheel - just run the test.
You are suggesting to use perf stat + "test cmd" and remove the above timestamp sampling, right?
+static void *consume_worker(void *arg) +{
- struct worker_info *info = arg;
- struct timeval start, end;
- unsigned long i = 0;
- long sec, us;
- int *ptr;
- gettimeofday(&start, NULL);
- while (++i <= info->test_count) {
while (__ptr_ring_empty(&ring))
cpu_relax();
ptr = __ptr_ring_consume(&ring);
if ((unsigned long)ptr != i) {
fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
(unsigned long)ptr, i);
info->error = true;
return NULL;
}
- }
- gettimeofday(&end, NULL);
- if (!__ptr_ring_empty(&ring)) {
fprintf(stderr, "ring should be empty, test failed\n");
info->error = true;
return NULL;
- }
- sec = (end.tv_sec - start.tv_sec);
- us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
- info->duration_us = us;
- info->error = false;
- return NULL;
+}
+/* test case for single producer single consumer */ +static void spsc_test(int size, int count) +{
- struct worker_info producer, consumer;
- pthread_attr_t attr;
- void *res;
- int ret;
- ret = ptr_ring_init(&ring, size, 0);
- if (ret) {
fprintf(stderr, "init failed: %d\n", ret);
return;
- }
- producer.test_count = count;
- consumer.test_count = count;
- ret = pthread_attr_init(&attr);
- if (ret) {
fprintf(stderr, "pthread attr init failed: %d\n", ret);
goto out;
- }
- ret = pthread_create(&producer.tid, &attr,
produce_worker, &producer);
- if (ret) {
fprintf(stderr, "create producer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_create(&consumer.tid, &attr,
consume_worker, &consumer);
- if (ret) {
fprintf(stderr, "create consumer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_join(producer.tid, &res);
- if (ret) {
fprintf(stderr, "join producer thread failed: %d\n", ret);
goto out;
- }
- ret = pthread_join(consumer.tid, &res);
- if (ret) {
fprintf(stderr, "join consumer thread failed: %d\n", ret);
goto out;
- }
- if (producer.error || consumer.error) {
fprintf(stderr, "spsc test failed\n");
goto out;
- }
- printf("ptr_ring(size:%d) perf spsc test for %d times, took %ld us + %ld us\n",
size, count, producer.duration_us, consumer.duration_us);
+out:
- ptr_ring_cleanup(&ring, NULL);
+}
+static void simple_test(int size, int count) +{
- struct timeval start, end;
- long sec, us;
- int i = 0;
- int *ptr;
- int ret;
- ret = ptr_ring_init(&ring, size, 0);
- if (ret) {
fprintf(stderr, "init failed: %d\n", ret);
return;
- }
- gettimeofday(&start, NULL);
- while (++i <= count) {
ret = __ptr_ring_produce(&ring, &count);
if (ret) {
fprintf(stderr, "produce failed: %d\n", ret);
goto out;
}
ptr = __ptr_ring_consume(&ring);
if (ptr != &count) {
fprintf(stderr, "consume failed: %p\n", ptr);
goto out;
}
- }
- gettimeofday(&end, NULL);
- sec = (end.tv_sec - start.tv_sec);
- us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
- printf("ptr_ring(size:%d) perf simple test for %d times, took %ld us\n",
size, count, us);
+out:
- ptr_ring_cleanup(&ring, NULL);
+}
+int main(int argc, char *argv[]) +{
- int count = 1000000;
- int size = 1000;
- int mode = 0;
- int opt;
- while ((opt = getopt(argc, argv, "N:s:m:")) != -1) {
switch (opt) {
case 'N':
count = atoi(optarg);
break;
case 's':
size = atoi(optarg);
break;
case 'm':
mode = atoi(optarg);
break;
default:
return -1;
}
- }
- if (count <= 0) {
fprintf(stderr, "invalid test count, must be > 0\n");
return -1;
- }
- if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) {
fprintf(stderr, "invalid ring size, must be in %d-%d\n",
MIN_RING_SIZE, MAX_RING_SIZE);
return -1;
- }
- switch (mode) {
- case 0:
simple_test(size, count);
break;
- case 1:
spsc_test(size, count);
break;
- default:
fprintf(stderr, "invalid test mode\n");
return -1;
- }
- return 0;
+} diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h new file mode 100644 index 0000000..6bf2494 --- /dev/null +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h @@ -0,0 +1,150 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */
We already have hacks like this in the virtio test. Let's refactor not duplicate please.
Yes, I took most of below from virtio test. But I am not sure I understand what you meant by refactoring. Are you suggesting to use function from standard C library instead of using the below "#if defined" hack?
I am not sure if all of the below function has a similiar one in standard C library.
Would you be more specific about what does refactoring mean?
+#ifndef _TEST_PTR_RING_IMPL_H +#define _TEST_PTR_RING_IMPL_H
+#if defined(__x86_64__) || defined(__i386__) +static inline void cpu_relax(void) +{
- asm volatile ("rep; nop" ::: "memory");
+} +#elif defined(__aarch64__) +static inline void cpu_relax(void) +{
- asm volatile("yield" ::: "memory");
+} +#else +#define cpu_relax() assert(0) +#endif
+static inline void barrier(void) +{
- asm volatile("" ::: "memory");
+}
+/*
- This abuses the atomic builtins for thread fences, and
- adds a compiler barrier.
- */
+#define smp_release() do { \
- barrier(); \
- __atomic_thread_fence(__ATOMIC_RELEASE); \
+} while (0)
+#define smp_acquire() do { \
- __atomic_thread_fence(__ATOMIC_ACQUIRE); \
- barrier(); \
+} while (0)
+#if defined(__i386__) || defined(__x86_64__) +#define smp_wmb() barrier() +#else +#define smp_wmb() smp_release() +#endif
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x)) +#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val)) +#define SMP_CACHE_BYTES 64 +#define cache_line_size SMP_CACHE_BYTES +#define unlikely(x) (__builtin_expect(!!(x), 0)) +#define likely(x) (__builtin_expect(!!(x), 1)) +#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a)) +#define SIZE_MAX (~(size_t)0) +#define KMALLOC_MAX_SIZE SIZE_MAX +#define spinlock_t pthread_spinlock_t +#define gfp_t int +#define __GFP_ZERO 0x1
+#define ____cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES)))
+static void *kmalloc(unsigned int size, gfp_t gfp) +{
- void *p;
- p = memalign(64, size);
- if (!p)
return p;
- if (gfp & __GFP_ZERO)
memset(p, 0, size);
- return p;
+}
+static inline void *kzalloc(unsigned int size, gfp_t flags) +{
- return kmalloc(size, flags | __GFP_ZERO);
+}
+static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags) +{
- if (size != 0 && n > SIZE_MAX / size)
return NULL;
- return kmalloc(n * size, flags);
+}
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags) +{
- return kmalloc_array(n, size, flags | __GFP_ZERO);
+}
+static void kfree(void *p) +{
- free(p);
+}
+#define kvmalloc_array kmalloc_array +#define kvfree kfree
+static void spin_lock_init(spinlock_t *lock) +{
- int r = pthread_spin_init(lock, 0);
- assert(!r);
+}
+static void spin_lock(spinlock_t *lock) +{
- int ret = pthread_spin_lock(lock);
- assert(!ret);
+}
+static void spin_unlock(spinlock_t *lock) +{
- int ret = pthread_spin_unlock(lock);
- assert(!ret);
+}
+static void spin_lock_bh(spinlock_t *lock) +{
- spin_lock(lock);
+}
+static void spin_unlock_bh(spinlock_t *lock) +{
- spin_unlock(lock);
+}
+static void spin_lock_irq(spinlock_t *lock) +{
- spin_lock(lock);
+}
+static void spin_unlock_irq(spinlock_t *lock) +{
- spin_unlock(lock);
+}
+static void spin_lock_irqsave(spinlock_t *lock, unsigned long f) +{
- spin_lock(lock);
+}
+static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f) +{
- spin_unlock(lock);
+}
+#endif
2.7.4
.
On 2021/6/25 15:30, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
On 2021/6/25 14:32, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Well the documentation for __ptr_ring_empty clearly states is is not guaranteed to be reliable.
Yes, this patch does not make __ptr_ring_empty() strictly reliable without taking the r->consumer_lock, as the disscuission in [1].
- NB: This is only safe to call if ring is never resized.
- However, if some other CPU consumes ring entries at the same time, the value
- returned is not guaranteed to be correct.
- In this case - to avoid incorrectly detecting the ring
- as empty - the CPU consuming the ring entries is responsible
- for either consuming all ring entries until the ring is empty,
- or synchronizing with some other CPU and causing it to
- re-test __ptr_ring_empty and/or consume the ring enteries
- after the synchronization point.
Is it then the case that page_pool_refill_alloc_cache violates this requirement? How?
As my understanding: page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid taking r->consumer_lock, when the above data race happens, it will exit out and allocate page from the page allocator instead of reusing the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() is more reliable.
Question is how do we know it's more reliable? It would be nice if we did actually made it more reliable, as it is we are just shifting races around.
It looks like you are trying to make the guarantee stronger and ensure no false positives.
If yes please document this as such, update the comment so all code can be evaluated with the eye towards whether the new stronger guarantee is maintained. In particular I think I see at least one issue with this immediately.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
- int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
return;
So if now we need this to be reliable then we also need smp_wmb before writing r->queue[consumer_head], there could be other gotchas.
Yes, This patch does not make it strictly reliable. T think I could mention that in the commit log?
OK so it's not that it makes it more reliable - this patch simply makes a possible false positive less likely while making a false negative more likely. Our assumption is that a false negative is cheaper then?
How do we know that it is?
And even if we prove the ptr_ring itself is faster now, how do we know what affects callers in a better way a false positive or a false negative?
I would rather we worked on actually making it reliable e.g. if we can guarantee no false positives, that would be a net win.
I thought deeper about the case you mentioned above, it seems for the above to happen, the consumer_head need to be rolled back to zero and incremented to the point when caller of __ptr_ring_empty() is still *not* able to see the r->queue[] which has been set to NULL in __ptr_ring_discard_one().
It seems smp_wmb() only need to be done once when consumer_head is rolled back to zero, and maybe that is enough to make sure the case you mentioned is fixed too?
And the smp_wmb() is only done once in a round of producing/ consuming, so the performance impact should be minimized?(of course we need to test it too).
}
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head);
}
2.7.4
.
.
On 2021/6/25 14:39, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Nice but it's small - could be a fluke. How many tests did you run? What is the variance? Did you try pinning to different CPUs to observe numa effects? Please use perf or some other modern tool for this kind of benchmark. Thanks!
The result is quite stable, and retest using perf stat:
---------------unpatched ptr_ring.c begin----------------------------------
perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2385.49 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6202023521 cycles # 2.600 GHz 17424187640 instructions # 2.81 insn per cycle <not supported> branches 6506477 branch-misses
2.385785170 seconds time elapsed
2.384014000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2383.67 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6197278066 cycles # 2.600 GHz 17424207772 instructions # 2.81 insn per cycle <not supported> branches 6495766 branch-misses
2.383941170 seconds time elapsed
2.382215000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2391.16 msec task-clock # 1.000 CPUs utilized 25 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6216704120 cycles # 2.600 GHz 17424243041 instructions # 2.80 insn per cycle <not supported> branches 6483886 branch-misses
2.391420440 seconds time elapsed
2.389647000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2390.10 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 58 page-faults # 0.024 K/sec 6213995715 cycles # 2.600 GHz 17424227499 instructions # 2.80 insn per cycle <not supported> branches 6474069 branch-misses
2.390367070 seconds time elapsed
2.388644000 seconds user 0.000000000 seconds sys
---------------unpatched ptr_ring.c end----------------------------------
---------------patched ptr_ring.c begin---------------------------------- root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2199.18 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 56 page-faults # 0.025 K/sec 5717671859 cycles # 2.600 GHz 16124164124 instructions # 2.82 insn per cycle <not supported> branches 6564829 branch-misses
2.199445990 seconds time elapsed
2.197859000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2222.63 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.026 K/sec 5778632853 cycles # 2.600 GHz 16124210769 instructions # 2.79 insn per cycle <not supported> branches 6603904 branch-misses
2.222901020 seconds time elapsed
2.221312000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2252.28 msec task-clock # 1.000 CPUs utilized 25 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.025 K/sec 5855668335 cycles # 2.600 GHz 16124310588 instructions # 2.75 insn per cycle <not supported> branches 6777279 branch-misses
2.252543340 seconds time elapsed
2.250897000 seconds user 0.000000000 seconds sys
root@(none):~# root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2209.70 msec task-clock # 1.000 CPUs utilized 24 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 58 page-faults # 0.026 K/sec 5745003772 cycles # 2.600 GHz 16124198886 instructions # 2.81 insn per cycle <not supported> branches 6508414 branch-misses
2.209973960 seconds time elapsed
2.208354000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2211.70 msec task-clock # 1.000 CPUs utilized 24 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.026 K/sec 5750136694 cycles # 2.600 GHz 16124176577 instructions # 2.80 insn per cycle <not supported> branches 6553023 branch-misses
2.211968470 seconds time elapsed
2.210303000 seconds user 0.000000000 seconds sys
---------------patched ptr_ring.c end----------------------------------
On Fri, Jun 25, 2021 at 04:33:40PM +0800, Yunsheng Lin wrote:
On 2021/6/25 15:30, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
On 2021/6/25 14:32, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Well the documentation for __ptr_ring_empty clearly states is is not guaranteed to be reliable.
Yes, this patch does not make __ptr_ring_empty() strictly reliable without taking the r->consumer_lock, as the disscuission in [1].
- NB: This is only safe to call if ring is never resized.
- However, if some other CPU consumes ring entries at the same time, the value
- returned is not guaranteed to be correct.
- In this case - to avoid incorrectly detecting the ring
- as empty - the CPU consuming the ring entries is responsible
- for either consuming all ring entries until the ring is empty,
- or synchronizing with some other CPU and causing it to
- re-test __ptr_ring_empty and/or consume the ring enteries
- after the synchronization point.
Is it then the case that page_pool_refill_alloc_cache violates this requirement? How?
As my understanding: page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid taking r->consumer_lock, when the above data race happens, it will exit out and allocate page from the page allocator instead of reusing the page in ptr_ring, which *may* not be happening if __ptr_ring_empty() is more reliable.
Question is how do we know it's more reliable? It would be nice if we did actually made it more reliable, as it is we are just shifting races around.
It looks like you are trying to make the guarantee stronger and ensure no false positives.
If yes please document this as such, update the comment so all code can be evaluated with the eye towards whether the new stronger guarantee is maintained. In particular I think I see at least one issue with this immediately.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
V2: Add performance data.
include/linux/ptr_ring.h | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 808f9d3..db9c282 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) /* Note: we must keep consumer_head valid at all times for __ptr_ring_empty * to work correctly. */
- int consumer_head = r->consumer_head;
- int head = consumer_head++;
- int consumer_head = r->consumer_head + 1;
/* Once we have processed enough entries invalidate them in * the ring all at once so producer can reuse their space in the ring. @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r) */ if (unlikely(consumer_head - r->consumer_tail >= r->batch || consumer_head >= r->size)) {
int tail = r->consumer_tail;
if (unlikely(consumer_head >= r->size)) {
r->consumer_tail = 0;
WRITE_ONCE(r->consumer_head, 0);
} else {
r->consumer_tail = consumer_head;
WRITE_ONCE(r->consumer_head, consumer_head);
}
- /* Zero out entries in the reverse order: this way we touch the
*/
- cache line that producer might currently be reading the last;
- producer won't make progress and touch other cache lines
- besides the first one until we write out all entries.
while (likely(head >= r->consumer_tail))
r->queue[head--] = NULL;
r->consumer_tail = consumer_head;
- }
- if (unlikely(consumer_head >= r->size)) {
consumer_head = 0;
r->consumer_tail = 0;
while (likely(--consumer_head >= tail))
r->queue[consumer_head] = NULL;
return;
So if now we need this to be reliable then we also need smp_wmb before writing r->queue[consumer_head], there could be other gotchas.
Yes, This patch does not make it strictly reliable. T think I could mention that in the commit log?
OK so it's not that it makes it more reliable - this patch simply makes a possible false positive less likely while making a false negative more likely. Our assumption is that a false negative is cheaper then?
How do we know that it is?
And even if we prove the ptr_ring itself is faster now, how do we know what affects callers in a better way a false positive or a false negative?
I would rather we worked on actually making it reliable e.g. if we can guarantee no false positives, that would be a net win.
I thought deeper about the case you mentioned above, it seems for the above to happen, the consumer_head need to be rolled back to zero and incremented to the point when caller of __ptr_ring_empty() is still *not* able to see the r->queue[] which has been set to NULL in __ptr_ring_discard_one().
It seems smp_wmb() only need to be done once when consumer_head is rolled back to zero, and maybe that is enough to make sure the case you mentioned is fixed too?
And the smp_wmb() is only done once in a round of producing/ consuming, so the performance impact should be minimized?(of course we need to test it too).
Sorry I don't really understand the question here. I think I agree it's enough to do one smp_wmb between the write of r->queue and write of consumer_head to help guarantee no false positives. What other code changes are necessary I can't yet say without more a deeper code review.
}
- /* matching READ_ONCE in __ptr_ring_empty for lockless tests */ WRITE_ONCE(r->consumer_head, consumer_head);
}
2.7.4
.
.
On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
On 2021/6/25 14:39, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Nice but it's small - could be a fluke. How many tests did you run? What is the variance? Did you try pinning to different CPUs to observe numa effects? Please use perf or some other modern tool for this kind of benchmark. Thanks!
The result is quite stable, and retest using perf stat:
How stable exactly? Try with -r so we can find out.
---------------unpatched ptr_ring.c begin----------------------------------
perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2385.49 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6202023521 cycles # 2.600 GHz 17424187640 instructions # 2.81 insn per cycle
<not supported> branches 6506477 branch-misses
2.385785170 seconds time elapsed 2.384014000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2383.67 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6197278066 cycles # 2.600 GHz 17424207772 instructions # 2.81 insn per cycle
<not supported> branches 6495766 branch-misses
2.383941170 seconds time elapsed 2.382215000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2391.16 msec task-clock # 1.000 CPUs utilized 25 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.024 K/sec 6216704120 cycles # 2.600 GHz 17424243041 instructions # 2.80 insn per cycle
<not supported> branches 6483886 branch-misses
2.391420440 seconds time elapsed 2.389647000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us
Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
2390.10 msec task-clock # 1.000 CPUs utilized 26 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 58 page-faults # 0.024 K/sec 6213995715 cycles # 2.600 GHz 17424227499 instructions # 2.80 insn per cycle
<not supported> branches 6474069 branch-misses
2.390367070 seconds time elapsed 2.388644000 seconds user 0.000000000 seconds sys
---------------unpatched ptr_ring.c end----------------------------------
---------------patched ptr_ring.c begin---------------------------------- root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2199.18 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 56 page-faults # 0.025 K/sec 5717671859 cycles # 2.600 GHz 16124164124 instructions # 2.82 insn per cycle
<not supported> branches 6564829 branch-misses
2.199445990 seconds time elapsed 2.197859000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2222.63 msec task-clock # 1.000 CPUs utilized 23 context-switches # 0.010 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.026 K/sec 5778632853 cycles # 2.600 GHz 16124210769 instructions # 2.79 insn per cycle
<not supported> branches 6603904 branch-misses
2.222901020 seconds time elapsed 2.221312000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2252.28 msec task-clock # 1.000 CPUs utilized 25 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.025 K/sec 5855668335 cycles # 2.600 GHz 16124310588 instructions # 2.75 insn per cycle
<not supported> branches 6777279 branch-misses
2.252543340 seconds time elapsed 2.250897000 seconds user 0.000000000 seconds sys
root@(none):~# root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2209.70 msec task-clock # 1.000 CPUs utilized 24 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 58 page-faults # 0.026 K/sec 5745003772 cycles # 2.600 GHz 16124198886 instructions # 2.81 insn per cycle
<not supported> branches 6508414 branch-misses
2.209973960 seconds time elapsed 2.208354000 seconds user 0.000000000 seconds sys
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000 ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
2211.70 msec task-clock # 1.000 CPUs utilized 24 context-switches # 0.011 K/sec 0 cpu-migrations # 0.000 K/sec 57 page-faults # 0.026 K/sec 5750136694 cycles # 2.600 GHz 16124176577 instructions # 2.80 insn per cycle
<not supported> branches 6553023 branch-misses
2.211968470 seconds time elapsed 2.210303000 seconds user 0.000000000 seconds sys
---------------patched ptr_ring.c end----------------------------------
On Fri, Jun 25, 2021 at 11:52:16AM +0800, Yunsheng Lin wrote:
On 2021/6/25 11:36, Jason Wang wrote:
在 2021/6/25 上午11:18, Yunsheng Lin 写道:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+)
Why can't you simply reuse tools/virtio/ringtest?
The main reason is stated in the commit log: "Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc."
More specificly in tools/virtio/ringtest/main.c and tools/virtio/ringtest/ptr_ring.c, there are a lot of operation related to virtio usecase, such as start_guest(), start_host(), poll_used(), notify() or kick() ....., so it makes more sense to add a generic selftest for ptr ring as it is not only used by virtio now.
Okay that answers why you didn't just run main.c but why not add a new test under tools/virtio/ringtest/ reusing the rest of infrastructure that you currently copied?
Thanks
.
On 2021/6/27 14:09, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:52:16AM +0800, Yunsheng Lin wrote:
On 2021/6/25 11:36, Jason Wang wrote:
在 2021/6/25 上午11:18, Yunsheng Lin 写道:
Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc.
So add a simple application to benchmark ptr_ring performance. Currently two test mode is supported: Mode 0: Both enqueuing and dequeuing is done in a single thread, it is called simple test mode in the test app. Mode 1: Enqueuing and dequeuing is done in different thread concurrently, also known as SPSC(single-producer/ single-consumer) test.
The multi-producer/single-consumer test for pfifo_fast case is not added yet, which can be added if using CAS atomic operation to enable lockless multi-producer is proved to be better than using r->producer_lock.
Only supported on x86 and arm64 for now.
Signed-off-by: Yunsheng Lin linyunsheng@huawei.com
MAINTAINERS | 5 + tools/testing/selftests/ptr_ring/Makefile | 6 + tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++ tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++ 4 files changed, 410 insertions(+)
Why can't you simply reuse tools/virtio/ringtest?
The main reason is stated in the commit log: "Currently ptr_ring selftest is embedded within the virtio selftest, which involves some specific virtio operation, such as notifying and kicking.
As ptr_ring has been used by various subsystems, it deserves it's owner's selftest in order to benchmark different usecase of ptr_ring, such as page pool and pfifo_fast qdisc."
More specificly in tools/virtio/ringtest/main.c and tools/virtio/ringtest/ptr_ring.c, there are a lot of operation related to virtio usecase, such as start_guest(), start_host(), poll_used(), notify() or kick() ....., so it makes more sense to add a generic selftest for ptr ring as it is not only used by virtio now.
Okay that answers why you didn't just run main.c but why not add a new test under tools/virtio/ringtest/ reusing the rest of infrastructure that you currently copied?
Actually, my first attempt was to reuse the infrastructure in tools/virtio/ or tools/virtio/ringtest/, and neither of them was able to be compiled in the latest kernel.
And then I read through the code to try fixing the compile error, I found that the testcase under tools/virtio/ is coupled deeply to virtio as explained above, which was difficult to read for someone who is not fimiliar with virtio.
So I searched for how testing is supposed to be added in the kernel, it seems it is more common to add the testing in tools/testing or tools/testing/selftest, and ptr ring is not only used by virtio now, so it seems more appropriate to add a sperate testing for virtio by instinct.
Most of tools/virtio/ is to do testing related to virtio testing, IMHO, most of them are better to be in tools/testing/selftest. Even if most of virtio testing is moved to tools/testing/selftest, I think it makes more sense to decouple the virtio testing to ptr_ring testing too if we can find some mechanism to share the abstract infrastructure in ptr_ring_test.h for both virtio and ptr_ring testing.
Thanks
.
.
On 2021/6/27 14:07, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
On 2021/6/25 14:39, Michael S. Tsirkin wrote:
On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
Currently r->queue[] is cleared after r->consumer_head is moved forward, which makes the __ptr_ring_empty() checking called in page_pool_refill_alloc_cache() unreliable if the checking is done after the r->queue clearing and before the consumer_head moving forward.
Move the r->queue[] clearing after consumer_head moving forward to make __ptr_ring_empty() checking more reliable.
As a side effect of above change, a consumer_head checking is avoided for the likely case, and it has noticeable performance improvement when it is tested using the ptr_ring_test selftest added in the previous patch.
Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000" to test the case of single thread doing both the enqueuing and dequeuing:
arch unpatched patched delta arm64 4648 ms 4464 ms +3.9% X86 2562 ms 2401 ms +6.2%
Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000" to test the case of one thread doing enqueuing and another thread doing dequeuing concurrently, also known as single-producer/single- consumer:
arch unpatched patched delta arm64 3624 ms + 3624 ms 3462 ms + 3462 ms +4.4% x86 2758 ms + 2758 ms 2547 ms + 2547 ms +7.6%
Nice but it's small - could be a fluke. How many tests did you run? What is the variance? Did you try pinning to different CPUs to observe numa effects? Please use perf or some other modern tool for this kind of benchmark. Thanks!
The result is quite stable, and retest using perf stat:
How stable exactly? Try with -r so we can find out.
Retest with "perf stat -r":
For unpatched one: Performance counter stats for './ptr_ring_test -s 1000 -m 1 -N 100000000' (100 runs):
6780.97 msec task-clock # 2.000 CPUs utilized ( +- 5.36% ) 73 context-switches # 0.011 K/sec ( +- 5.07% ) 0 cpu-migrations # 0.000 K/sec ( +-100.00% ) 81 page-faults # 0.012 K/sec ( +- 0.76% ) 17629544748 cycles # 2.600 GHz ( +- 5.36% ) 25496488950 instructions # 1.45 insn per cycle ( +- 0.26% ) <not supported> branches 11489031 branch-misses ( +- 1.69% )
3.391 +- 0.182 seconds time elapsed ( +- 5.35% )
For patched one: Performance counter stats for './ptr_ring_test_opt -s 1000 -m 1 -N 100000000' (100 runs):
6567.83 msec task-clock # 2.000 CPUs utilized ( +- 5.53% ) 71 context-switches # 0.011 K/sec ( +- 5.26% ) 0 cpu-migrations # 0.000 K/sec 82 page-faults # 0.012 K/sec ( +- 0.85% ) 17075489298 cycles # 2.600 GHz ( +- 5.53% ) 23861051578 instructions # 1.40 insn per cycle ( +- 0.07% ) <not supported> branches 10473776 branch-misses ( +- 0.60% )
3.284 +- 0.182 seconds time elapsed ( +- 5.53% )
The result is more stable when using taskset to limit the running cpu, but I suppose the above data is stable enough to justify the performance improvement?
On 2021/6/27 14:03, Michael S. Tsirkin wrote:
So if now we need this to be reliable then we also need smp_wmb before writing r->queue[consumer_head], there could be other gotchas.
Yes, This patch does not make it strictly reliable. T think I could mention that in the commit log?
OK so it's not that it makes it more reliable - this patch simply makes a possible false positive less likely while making a false negative more likely. Our assumption is that a false negative is cheaper then?
How do we know that it is?
And even if we prove the ptr_ring itself is faster now, how do we know what affects callers in a better way a false positive or a false negative?
I would rather we worked on actually making it reliable e.g. if we can guarantee no false positives, that would be a net win.
I thought deeper about the case you mentioned above, it seems for the above to happen, the consumer_head need to be rolled back to zero and incremented to the point when caller of __ptr_ring_empty() is still *not* able to see the r->queue[] which has been set to NULL in __ptr_ring_discard_one().
It seems smp_wmb() only need to be done once when consumer_head is rolled back to zero, and maybe that is enough to make sure the case you mentioned is fixed too?
And the smp_wmb() is only done once in a round of producing/ consuming, so the performance impact should be minimized?(of course we need to test it too).
Sorry I don't really understand the question here. I think I agree it's enough to do one smp_wmb between the write of r->queue and write of consumer_head to help guarantee no false positives. What other code changes are necessary I can't yet say without more a deeper code review.
Ok, thanks for the reviewing. Will add handling the case you mentioned above in V3 if there is no noticable performanc impact for handling the above case.
linux-kselftest-mirror@lists.linaro.org