This series contains a few improvements to fp-stress performance, only noticable on emulated platforms which both run more slowly and are stressed far more by fp-stress due to supporting more VLs for SVE and SME. The bulk of the improvement comes from the first patch which reduces the amount of time the main fp-stress executable is swamped by load from the child processes during startup, the other two patches are much more marginal.
v2: - Rebase onto arm64/for-next/selftests
Mark Brown (3): kselftest/arm64: Hold fp-stress children until they're all spawned kselftest/arm64: Don't drain output while spawning children kselftest/arm64: Allow epoll_wait() to return more than one result
tools/testing/selftests/arm64/fp/fp-stress.c | 74 +++++++++++++++----- 1 file changed, 57 insertions(+), 17 deletions(-)
base-commit: 642978981ec8a79da00828c696c58b3732b993a6
At present fp-stress has a bit of a thundering herd problem since the children it spawns start running immediately, meaning that they can start starving the parent process of CPU before it has even started all the children. This is much more severe on virtual platforms since they tend to support far more SVE and SME vector lengths, be slower in general and for some have issues with performance when simulating multiple CPUs.
We can mitigate this problem by having all the child processes block before starting the test program, meaning that we at least have all the child processes started before we start heavily using CPU. We still have the same load issues while waiting for the actual stress test programs to start up and produce output but they're at least all ready to go before that kicks in, resulting in substantial reductions in overall runtime on some of the severely affected systems. One test was showing about 20% improvement.
Signed-off-by: Mark Brown broonie@kernel.org --- tools/testing/selftests/arm64/fp/fp-stress.c | 41 +++++++++++++++++++- 1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/fp/fp-stress.c b/tools/testing/selftests/arm64/fp/fp-stress.c index 69ca4a5f7e6e..7c04f5001648 100644 --- a/tools/testing/selftests/arm64/fp/fp-stress.c +++ b/tools/testing/selftests/arm64/fp/fp-stress.c @@ -44,6 +44,8 @@ static bool terminate;
static void drain_output(bool flush);
+static int startup_pipe[2]; + static int num_processors(void) { long nproc = sysconf(_SC_NPROCESSORS_CONF); @@ -81,13 +83,37 @@ static void child_start(struct child_data *child, const char *program) exit(EXIT_FAILURE); }
+ /* + * Duplicate the read side of the startup pipe to + * FD 3 so we can close everything else. + */ + ret = dup2(startup_pipe[0], 3); + if (ret == -1) { + fprintf(stderr, "dup2() %d\n", errno); + exit(EXIT_FAILURE); + } + /* * Very dumb mechanism to clean open FDs other than * stdio. We don't want O_CLOEXEC for the pipes... */ - for (i = 3; i < 8192; i++) + for (i = 4; i < 8192; i++) close(i);
+ /* + * Read from the startup pipe, there should be no data + * and we should block until it is closed. We just + * carry on on error since this isn't super critical. + */ + ret = read(3, &i, sizeof(i)); + if (ret < 0) + fprintf(stderr, "read(startp pipe) failed: %s (%d)\n", + strerror(errno), errno); + if (ret > 0) + fprintf(stderr, "%d bytes of data on startup pipe\n", + ret); + close(3); + ret = execl(program, program, NULL); fprintf(stderr, "execl(%s) failed: %d (%s)\n", program, errno, strerror(errno)); @@ -467,6 +493,12 @@ int main(int argc, char **argv) strerror(errno), ret); epoll_fd = ret;
+ /* Create a pipe which children will block on before execing */ + ret = pipe(startup_pipe); + if (ret != 0) + ksft_exit_fail_msg("Failed to create startup pipe: %s (%d)\n", + strerror(errno), errno); + /* Get signal handers ready before we start any children */ memset(&sa, 0, sizeof(sa)); sa.sa_sigaction = handle_exit_signal; @@ -499,6 +531,13 @@ int main(int argc, char **argv) } }
+ /* + * All children started, close the startup pipe and let them + * run. + */ + close(startup_pipe[0]); + close(startup_pipe[1]); + for (;;) { /* Did we get a signal asking us to exit? */ if (terminate)
Now we hold execution of the stress test programs until all children are started there is no need to drain output while that is happening.
Signed-off-by: Mark Brown broonie@kernel.org --- tools/testing/selftests/arm64/fp/fp-stress.c | 8 -------- 1 file changed, 8 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/fp-stress.c b/tools/testing/selftests/arm64/fp/fp-stress.c index 7c04f5001648..b3bbfe8d9f56 100644 --- a/tools/testing/selftests/arm64/fp/fp-stress.c +++ b/tools/testing/selftests/arm64/fp/fp-stress.c @@ -42,8 +42,6 @@ static struct child_data *children; static int num_children; static bool terminate;
-static void drain_output(bool flush); - static int startup_pipe[2];
static int num_processors(void) @@ -138,12 +136,6 @@ static void child_start(struct child_data *child, const char *program) ksft_exit_fail_msg("%s EPOLL_CTL_ADD failed: %s (%d)\n", child->name, strerror(errno), errno); } - - /* - * Keep output flowing during child startup so logs - * are more timely, can help debugging. - */ - drain_output(false); } }
When everything is starting up we are likely to have a lot of child processes producing output at once. This means that we can reduce overhead a bit by allowing epoll_wait() to return more than one descriptor at once, it cuts down on the number of system calls we need to do which on virtual platforms where the syscall overhead is a bit more noticable and we're likely to have a lot more children active can make a small but noticable difference.
On physical platforms the relatively small number of processes being run and vastly improved speeds push the effects of this change into the noise.
Signed-off-by: Mark Brown broonie@kernel.org --- tools/testing/selftests/arm64/fp/fp-stress.c | 27 +++++++++++++------- 1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/fp-stress.c b/tools/testing/selftests/arm64/fp/fp-stress.c index b3bbfe8d9f56..f8b2f41aac36 100644 --- a/tools/testing/selftests/arm64/fp/fp-stress.c +++ b/tools/testing/selftests/arm64/fp/fp-stress.c @@ -39,6 +39,8 @@ struct child_data {
static int epoll_fd; static struct child_data *children; +static struct epoll_event *evs; +static int tests; static int num_children; static bool terminate;
@@ -393,11 +395,11 @@ static void probe_vls(int vls[], int *vl_count, int set_vl) /* Handle any pending output without blocking */ static void drain_output(bool flush) { - struct epoll_event ev; int ret = 1; + int i;
while (ret > 0) { - ret = epoll_wait(epoll_fd, &ev, 1, 0); + ret = epoll_wait(epoll_fd, evs, tests, 0); if (ret < 0) { if (errno == EINTR) continue; @@ -405,8 +407,8 @@ static void drain_output(bool flush) strerror(errno), errno); }
- if (ret == 1) - child_output(ev.data.ptr, ev.events, flush); + for (i = 0; i < ret; i++) + child_output(evs[i].data.ptr, evs[i].events, flush); } }
@@ -419,12 +421,11 @@ int main(int argc, char **argv) { int ret; int timeout = 10; - int cpus, tests, i, j, c; + int cpus, i, j, c; int sve_vl_count, sme_vl_count, fpsimd_per_cpu; bool all_children_started = false; int seen_children; int sve_vls[MAX_VLS], sme_vls[MAX_VLS]; - struct epoll_event ev; struct sigaction sa;
while ((c = getopt_long(argc, argv, "t:", options, NULL)) != -1) { @@ -510,6 +511,11 @@ int main(int argc, char **argv) ksft_print_msg("Failed to install SIGCHLD handler: %s (%d)\n", strerror(errno), errno);
+ evs = calloc(tests, sizeof(*evs)); + if (!evs) + ksft_exit_fail_msg("Failed to allocated %d epoll events\n", + tests); + for (i = 0; i < cpus; i++) { for (j = 0; j < fpsimd_per_cpu; j++) start_fpsimd(&children[num_children++], i, j); @@ -543,7 +549,7 @@ int main(int argc, char **argv) * useful in emulation where we will both be slow and * likely to have a large set of VLs. */ - ret = epoll_wait(epoll_fd, &ev, 1, 1000); + ret = epoll_wait(epoll_fd, evs, tests, 1000); if (ret < 0) { if (errno == EINTR) continue; @@ -552,8 +558,11 @@ int main(int argc, char **argv) }
/* Output? */ - if (ret == 1) { - child_output(ev.data.ptr, ev.events, false); + if (ret > 0) { + for (i = 0; i < ret; i++) { + child_output(evs[i].data.ptr, evs[i].events, + false); + } continue; }
On Tue, 29 Nov 2022 21:59:22 +0000, Mark Brown wrote:
This series contains a few improvements to fp-stress performance, only noticable on emulated platforms which both run more slowly and are stressed far more by fp-stress due to supporting more VLs for SVE and SME. The bulk of the improvement comes from the first patch which reduces the amount of time the main fp-stress executable is swamped by load from the child processes during startup, the other two patches are much more marginal.
[...]
Applied to arm64 (for-next/selftests), thanks!
[1/3] kselftest/arm64: Hold fp-stress children until they're all spawned https://git.kernel.org/arm64/c/98102a2cb786 [2/3] kselftest/arm64: Don't drain output while spawning children https://git.kernel.org/arm64/c/92145d88ce0b [3/3] kselftest/arm64: Allow epoll_wait() to return more than one result https://git.kernel.org/arm64/c/c4e8720f2eb0
Cheers,
linux-kselftest-mirror@lists.linaro.org