The resctrl selftests for Memory Bandwidth Allocation (MBA) and Memory Bandwidth Monitoring (MBM) are failing on some (for example [1]) Emerald Rapids systems. The test failures result from the following two properties of these systems:

1) Emerald Rapids systems can have up to 320MB of L3 cache. The resctrl MBA and MBM selftests measure memory traffic, for which a hardcoded 250MB buffer has been sufficient so far. On platforms with L3 cache larger than the buffer, the buffer fits in the L3 cache and thus no/very little memory traffic is generated during the "memory bandwidth" tests.

2) Some platform features, for example RAS features or memory performance features, generate memory traffic that may be counted differently by the memory controller performance counters and by MBM, for instance "overhead" traffic that is not counted against any specific RMID. Until now these counting differences have always been "in the noise". On Emerald Rapids systems the maximum MBA throttling (10% memory bandwidth) throttles memory bandwidth to the point where memory accesses by these other platform features push the difference between the memory controller performance counters and resctrl (MBM) beyond the tests' hardcoded tolerance.
Make the tests more robust against platform variations:

1) Let the buffer used by the memory bandwidth tests be guided by the size of the L3 cache (a rough sketch of the idea is included below).

2) Larger buffers require longer initialization time before the buffer can be used for measurement. Rework the tests to ensure that buffer initialization is complete before measurements start.

3) Do not compare performance counters and MBM measurements at low bandwidth. The value of "low" is hardcoded to 750MiB based on measurements on Emerald Rapids, Sapphire Rapids, and Ice Lake systems. This limit does not apply to AMD systems since it only affects the MBA and MBM tests, which are Intel-only.
[1] https://ark.intel.com/content/www/us/en/ark/products/237261/intel-xeon-plati...
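For illustration only (this is not part of the series): a minimal sketch of how a buffer size could be derived from the L3 cache size reported by sysfs. The sysfs path, the size-suffix handling, and the "double the cache size" policy are assumptions made for this example, not the selftests' actual implementation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read cpu0's L3 cache size from sysfs; returns 0 if it cannot be read. */
static size_t l3_size_bytes(void)
{
	char text[64];
	size_t size = 0;
	FILE *fp;

	fp = fopen("/sys/devices/system/cpu/cpu0/cache/index3/size", "r");
	if (!fp)
		return 0;
	if (fgets(text, sizeof(text), fp)) {
		size = strtoul(text, NULL, 10);
		if (strchr(text, 'K'))
			size *= 1024;
		else if (strchr(text, 'M'))
			size *= 1024 * 1024;
	}
	fclose(fp);
	return size;
}

int main(void)
{
	size_t cache = l3_size_bytes();
	/* Use a buffer larger than L3 so reads must be served from memory. */
	size_t span = cache ? cache * 2 : 250 * 1024 * 1024UL;

	printf("L3: %zu bytes, benchmark buffer: %zu bytes\n", cache, span);
	return 0;
}
```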
Reinette Chatre (6):
  selftests/resctrl: Fix sparse warnings
  selftests/resctrl: Ensure measurements skip initialization of default benchmark
  selftests/resctrl: Simplify benchmark parameter passing
  selftests/resctrl: Use cache size to determine "fill_buf" buffer size
  selftests/resctrl: Do not compare performance counters and resctrl at low bandwidth
  selftests/resctrl: Keep results from first test run
 tools/testing/selftests/resctrl/cmt_test.c    |  33 +--
 tools/testing/selftests/resctrl/fill_buf.c    |  19 +-
 tools/testing/selftests/resctrl/mba_test.c    |  26 +-
 tools/testing/selftests/resctrl/mbm_test.c    |  25 +-
 tools/testing/selftests/resctrl/resctrl.h     |  57 +++--
 .../testing/selftests/resctrl/resctrl_tests.c |  15 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 223 +++++-------------
 7 files changed, 152 insertions(+), 246 deletions(-)
Fix following sparse warnings:

tools/testing/selftests/resctrl/resctrl_val.c:47:6: warning: symbol 'membw_initialize_perf_event_attr' was not declared. Should it be static?
tools/testing/selftests/resctrl/resctrl_val.c:64:6: warning: symbol 'membw_ioctl_perf_event_ioc_reset_enable' was not declared. Should it be static?
tools/testing/selftests/resctrl/resctrl_val.c:70:6: warning: symbol 'membw_ioctl_perf_event_ioc_disable' was not declared. Should it be static?
tools/testing/selftests/resctrl/resctrl_val.c:81:6: warning: symbol 'get_event_and_umask' was not declared. Should it be static?
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
---
 tools/testing/selftests/resctrl/resctrl_val.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c
index 8c275f6b4dd7..70e8e31f5d1a 100644
--- a/tools/testing/selftests/resctrl/resctrl_val.c
+++ b/tools/testing/selftests/resctrl/resctrl_val.c
@@ -44,7 +44,7 @@ static int imcs;
 static struct imc_counter_config imc_counters_config[MAX_IMCS][2];
 static const struct resctrl_test *current_test;
-void membw_initialize_perf_event_attr(int i, int j)
+static void membw_initialize_perf_event_attr(int i, int j)
 {
 	memset(&imc_counters_config[i][j].pe, 0, sizeof(struct perf_event_attr));
@@ -61,13 +61,13 @@ void membw_initialize_perf_event_attr(int i, int j)
 			PERF_FORMAT_TOTAL_TIME_ENABLED |
 			PERF_FORMAT_TOTAL_TIME_RUNNING;
 }
-void membw_ioctl_perf_event_ioc_reset_enable(int i, int j)
+static void membw_ioctl_perf_event_ioc_reset_enable(int i, int j)
 {
 	ioctl(imc_counters_config[i][j].fd, PERF_EVENT_IOC_RESET, 0);
 	ioctl(imc_counters_config[i][j].fd, PERF_EVENT_IOC_ENABLE, 0);
 }
-void membw_ioctl_perf_event_ioc_disable(int i, int j)
+static void membw_ioctl_perf_event_ioc_disable(int i, int j)
 {
 	ioctl(imc_counters_config[i][j].fd, PERF_EVENT_IOC_DISABLE, 0);
 }
@@ -78,7 +78,7 @@ void membw_ioctl_perf_event_ioc_disable(int i, int j)
  * @count:	iMC number
  * @op:		Operation (read/write)
  */
-void get_event_and_umask(char *cas_count_cfg, int count, bool op)
+static void get_event_and_umask(char *cas_count_cfg, int count, bool op)
 {
 	char *token[MAX_TOKENS];
 	int i = 0;
On Thu, 29 Aug 2024, Reinette Chatre wrote:
Fix following sparse warnings:

tools/testing/selftests/resctrl/resctrl_val.c:47:6: warning: symbol 'membw_initialize_perf_event_attr' was not declared. Should it be static?
tools/testing/selftests/resctrl/resctrl_val.c:64:6: warning: symbol 'membw_ioctl_perf_event_ioc_reset_enable' was not declared. Should it be static?
tools/testing/selftests/resctrl/resctrl_val.c:70:6: warning: symbol 'membw_ioctl_perf_event_ioc_disable' was not declared. Should it be static?
tools/testing/selftests/resctrl/resctrl_val.c:81:6: warning: symbol 'get_event_and_umask' was not declared. Should it be static?
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com
-- i.
The CMT, MBA, and MBM tests rely on the resctrl_val() wrapper to start and run a benchmark while providing test specific flows via callbacks to do test specific configuration and measurements.
At a high level, the resctrl_val() flow is:

a) Start by fork()ing a child process that installs a signal handler for SIGUSR1 that, on receipt of SIGUSR1, will start running a benchmark.

b) Assign the child process created in (a) to the resctrl control and monitoring group that dictates the memory and cache allocations with which the process can run and will contain all resctrl monitoring data of that process.

c) Once parent and child are considered "ready" (determined via a message over a pipe) the parent signals the child (via SIGUSR1) to start the benchmark, waits one second for the benchmark to run, and then starts collecting monitoring data for the tests, potentially also changing allocation configuration depending on the various test callbacks.
A problem with the above flow is the "black box" view of the benchmark that is combined with an arbitrarily chosen "wait one second" before measurements start. No matter what the benchmark does, it is given one second to initialize before measurements start.
The default benchmark "fill_buf" consists of two parts: first it prepares a buffer (allocate, initialize, then flush), then it reads from the buffer (in unpredictable ways) until terminated. Depending on the system and the size of the buffer, the first "prepare" part may not be complete by the time the one second delay expires. Test measurements may thus start before the work that needs to be measured runs.
Split the default benchmark into its "prepare" and "runtime" parts and simplify the resctrl_val() wrapper while doing so. The same split cannot be done for a user provided benchmark (without a user interface change), so the current behavior is maintained for user provided benchmarks.
Assign the test itself to the control and monitoring group and run the "prepare" part of the benchmark in this context, ensuring it runs with the required cache and memory bandwidth allocations. With the benchmark preparation complete, only the "runtime" part of the benchmark (or the entire user provided benchmark) needs to be fork()ed.
Keep the "wait one second" delay before measurements start. For the default "fill_buf" benchmark this time now covers only the "runtime" portion that needs to be measured. For the user provided benchmark this delay maintains current behavior.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
---
 tools/testing/selftests/resctrl/fill_buf.c    |  19 +-
 tools/testing/selftests/resctrl/resctrl.h     |   2 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 225 ++++++------------
 3 files changed, 70 insertions(+), 176 deletions(-)
diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c index ae120f1735c0..12c71bb44cb6 100644 --- a/tools/testing/selftests/resctrl/fill_buf.c +++ b/tools/testing/selftests/resctrl/fill_buf.c @@ -114,7 +114,7 @@ void fill_cache_read(unsigned char *buf, size_t buf_size, bool once) *value_sink = ret; }
-static void fill_cache_write(unsigned char *buf, size_t buf_size, bool once) +void fill_cache_write(unsigned char *buf, size_t buf_size, bool once) { while (1) { fill_one_span_write(buf, buf_size); @@ -150,20 +150,3 @@ unsigned char *alloc_buffer(size_t buf_size, int memflush)
return buf; } - -int run_fill_buf(size_t buf_size, int memflush, int op, bool once) -{ - unsigned char *buf; - - buf = alloc_buffer(buf_size, memflush); - if (!buf) - return -1; - - if (op == 0) - fill_cache_read(buf, buf_size, once); - else - fill_cache_write(buf, buf_size, once); - free(buf); - - return 0; -} diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 2dda56084588..0afbc4dd18e4 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -142,7 +142,7 @@ int perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, unsigned char *alloc_buffer(size_t buf_size, int memflush); void mem_flush(unsigned char *buf, size_t buf_size); void fill_cache_read(unsigned char *buf, size_t buf_size, bool once); -int run_fill_buf(size_t buf_size, int memflush, int op, bool once); +void fill_cache_write(unsigned char *buf, size_t buf_size, bool once); int initialize_mem_bw_imc(void); int measure_mem_bw(const struct user_params *uparams, struct resctrl_val_param *param, pid_t bm_pid, diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index 70e8e31f5d1a..574b72604f95 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -448,7 +448,7 @@ static int get_mem_bw_resctrl(FILE *fp, unsigned long *mbm_total) return 0; }
-static pid_t bm_pid, ppid; +static pid_t bm_pid;
void ctrlc_handler(int signum, siginfo_t *info, void *ptr) { @@ -506,13 +506,6 @@ void signal_handler_unregister(void) } }
-static void parent_exit(pid_t ppid) -{ - kill(ppid, SIGKILL); - umount_resctrlfs(); - exit(EXIT_FAILURE); -} - /* * print_results_bw: the memory bandwidth results are stored in a file * @filename: file that stores the results @@ -614,61 +607,6 @@ int measure_mem_bw(const struct user_params *uparams, return ret; }
-/* - * run_benchmark - Run a specified benchmark or fill_buf (default benchmark) - * in specified signal. Direct benchmark stdio to /dev/null. - * @signum: signal number - * @info: signal info - * @ucontext: user context in signal handling - */ -static void run_benchmark(int signum, siginfo_t *info, void *ucontext) -{ - int operation, ret, memflush; - char **benchmark_cmd; - size_t span; - bool once; - FILE *fp; - - benchmark_cmd = info->si_ptr; - - /* - * Direct stdio of child to /dev/null, so that only parent writes to - * stdio (console) - */ - fp = freopen("/dev/null", "w", stdout); - if (!fp) { - ksft_perror("Unable to direct benchmark status to /dev/null"); - parent_exit(ppid); - } - - if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { - /* Execute default fill_buf benchmark */ - span = strtoul(benchmark_cmd[1], NULL, 10); - memflush = atoi(benchmark_cmd[2]); - operation = atoi(benchmark_cmd[3]); - if (!strcmp(benchmark_cmd[4], "true")) { - once = true; - } else if (!strcmp(benchmark_cmd[4], "false")) { - once = false; - } else { - ksft_print_msg("Invalid once parameter\n"); - parent_exit(ppid); - } - - if (run_fill_buf(span, memflush, operation, once)) - fprintf(stderr, "Error in running fill buffer\n"); - } else { - /* Execute specified benchmark */ - ret = execvp(benchmark_cmd[0], benchmark_cmd); - if (ret) - ksft_perror("execvp"); - } - - fclose(stdout); - ksft_print_msg("Unable to run specified benchmark\n"); - parent_exit(ppid); -} - /* * resctrl_val: execute benchmark and measure memory bandwidth on * the benchmark @@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, const char * const *benchmark_cmd, struct resctrl_val_param *param) { - struct sigaction sigact; - int ret = 0, pipefd[2]; - char pipe_message = 0; - union sigval value; - int domain_id; + int domain_id, operation = 0, memflush = 1; + size_t span = DEFAULT_SPAN; + unsigned char *buf = NULL; + cpu_set_t old_affinity; + bool once = false; + int ret = 0; + pid_t ppid;
if (strcmp(param->filename, "") == 0) sprintf(param->filename, "stdio"); @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; }
- /* - * If benchmark wasn't successfully started by child, then child should - * kill parent, so save parent's pid - */ ppid = getpid();
- if (pipe(pipefd)) { - ksft_perror("Unable to create pipe"); + /* Taskset test to specified CPU. */ + ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity); + if (ret) + return ret;
- return -1; + /* Write test to specified control & monitoring group in resctrl FS. */ + ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp); + if (ret) + goto reset_affinity; + + if (param->init) { + ret = param->init(param, domain_id); + if (ret) + goto reset_affinity; }
/* - * Fork to start benchmark, save child's pid so that it can be killed - * when needed + * If not running user provided benchmark, run the default + * "fill_buf". First phase of "fill_buf" is to prepare the + * buffer that the benchmark will operate on. No measurements + * are needed during this phase and prepared memory will be + * passed to next part of benchmark via copy-on-write. TBD + * how this impacts "write" benchmark, but no test currently + * uses this. */ - fflush(stdout); + if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { + span = strtoul(benchmark_cmd[1], NULL, 10); + memflush = atoi(benchmark_cmd[2]); + operation = atoi(benchmark_cmd[3]); + if (!strcmp(benchmark_cmd[4], "true")) { + once = true; + } else if (!strcmp(benchmark_cmd[4], "false")) { + once = false; + } else { + ksft_print_msg("Invalid once parameter\n"); + ret = -EINVAL; + goto reset_affinity; + } + + buf = alloc_buffer(span, memflush); + if (!buf) { + ret = -ENOMEM; + goto reset_affinity; + } + } + bm_pid = fork(); if (bm_pid == -1) { + ret = -errno; ksft_perror("Unable to fork"); - - return -1; + goto free_buf; }
+ /* + * What needs to be measured runs in separate process until + * terminated. + */ if (bm_pid == 0) { - /* - * Mask all signals except SIGUSR1, parent uses SIGUSR1 to - * start benchmark - */ - sigfillset(&sigact.sa_mask); - sigdelset(&sigact.sa_mask, SIGUSR1); - - sigact.sa_sigaction = run_benchmark; - sigact.sa_flags = SA_SIGINFO; - - /* Register for "SIGUSR1" signal from parent */ - if (sigaction(SIGUSR1, &sigact, NULL)) { - ksft_perror("Can't register child for signal"); - parent_exit(ppid); + if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { + if (operation == 0) + fill_cache_read(buf, span, once); + else + fill_cache_write(buf, span, once); + } else { + execvp(benchmark_cmd[0], (char **)benchmark_cmd); } - - /* Tell parent that child is ready */ - close(pipefd[0]); - pipe_message = 1; - if (write(pipefd[1], &pipe_message, sizeof(pipe_message)) < - sizeof(pipe_message)) { - ksft_perror("Failed signaling parent process"); - close(pipefd[1]); - return -1; - } - close(pipefd[1]); - - /* Suspend child until delivery of "SIGUSR1" from parent */ - sigsuspend(&sigact.sa_mask); - - ksft_perror("Child is done"); - parent_exit(ppid); + exit(EXIT_SUCCESS); }
ksft_print_msg("Benchmark PID: %d\n", (int)bm_pid);
- /* - * The cast removes constness but nothing mutates benchmark_cmd within - * the context of this process. At the receiving process, it becomes - * argv, which is mutable, on exec() but that's after fork() so it - * doesn't matter for the process running the tests. - */ - value.sival_ptr = (void *)benchmark_cmd; - - /* Taskset benchmark to specified cpu */ - ret = taskset_benchmark(bm_pid, uparams->cpu, NULL); - if (ret) - goto out; - - /* Write benchmark to specified control&monitoring grp in resctrl FS */ - ret = write_bm_pid_to_resctrl(bm_pid, param->ctrlgrp, param->mongrp); - if (ret) - goto out; - - if (param->init) { - ret = param->init(param, domain_id); - if (ret) - goto out; - } - - /* Parent waits for child to be ready. */ - close(pipefd[1]); - while (pipe_message != 1) { - if (read(pipefd[0], &pipe_message, sizeof(pipe_message)) < - sizeof(pipe_message)) { - ksft_perror("Failed reading message from child process"); - close(pipefd[0]); - goto out; - } - } - close(pipefd[0]); - - /* Signal child to start benchmark */ - if (sigqueue(bm_pid, SIGUSR1, value) == -1) { - ksft_perror("sigqueue SIGUSR1 to child"); - ret = -1; - goto out; - } - - /* Give benchmark enough time to fully run */ + /* Give benchmark enough time to fully run. */ sleep(1);
/* Test runs until the callback setup() tells the test to stop. */ @@ -821,8 +730,10 @@ int resctrl_val(const struct resctrl_test *test, break; }
-out: kill(bm_pid, SIGKILL); - +free_buf: + free(buf); +reset_affinity: + taskset_restore(ppid, &old_affinity); return ret; }
On Thu, 29 Aug 2024, Reinette Chatre wrote:
The CMT, MBA, and MBM tests rely on the resctrl_val() wrapper to start and run a benchmark while providing test specific flows via callbacks to do test specific configuration and measurements.
At a high level, the resctrl_val() flow is:

a) Start by fork()ing a child process that installs a signal handler for SIGUSR1 that, on receipt of SIGUSR1, will start running a benchmark.

b) Assign the child process created in (a) to the resctrl control and monitoring group that dictates the memory and cache allocations with which the process can run and will contain all resctrl monitoring data of that process.

c) Once parent and child are considered "ready" (determined via a message over a pipe) the parent signals the child (via SIGUSR1) to start the benchmark, waits one second for the benchmark to run, and then starts collecting monitoring data for the tests, potentially also changing allocation configuration depending on the various test callbacks.
A problem with the above flow is the "black box" view of the benchmark that is combined with an arbitrarily chosen "wait one second" before measurements start. No matter what the benchmark does, it is given one second to initialize before measurements start.
The default benchmark "fill_buf" consists of two parts, first it prepares a buffer (allocate, initialize, then flush), then it reads from the buffer (in unpredictable ways) until terminated. Depending on the system and the size of the buffer, the first "prepare" part may not be complete by the time the one second delay expires. Test measurements may thus start before the work needing to be measured runs.
Split the default benchmark into its "prepare" and "runtime" parts and simplify the resctrl_val() wrapper while doing so. This same split cannot be done for the user provided benchmark (without a user interface change), so the current behavior is maintained for user provided benchmark.
Assign the test itself to the control and monitoring group and run the "prepare" part of the benchmark in this context, ensuring it runs with required cache and memory bandwidth allocations. With the benchmark preparation complete it is only needed to fork() the "runtime" part of the benchmark (or entire user provided benchmark).
Keep the "wait one second" delay before measurements start. For the default "fill_buf" benchmark this time now covers only the "runtime" portion that needs to be measured. For the user provided benchmark this delay maintains current behavior.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
 tools/testing/selftests/resctrl/fill_buf.c    |  19 +-
 tools/testing/selftests/resctrl/resctrl.h     |   2 +-
 tools/testing/selftests/resctrl/resctrl_val.c | 225 ++++++------------
 3 files changed, 70 insertions(+), 176 deletions(-)
diff --git a/tools/testing/selftests/resctrl/fill_buf.c b/tools/testing/selftests/resctrl/fill_buf.c index ae120f1735c0..12c71bb44cb6 100644 --- a/tools/testing/selftests/resctrl/fill_buf.c +++ b/tools/testing/selftests/resctrl/fill_buf.c @@ -114,7 +114,7 @@ void fill_cache_read(unsigned char *buf, size_t buf_size, bool once) *value_sink = ret; } -static void fill_cache_write(unsigned char *buf, size_t buf_size, bool once) +void fill_cache_write(unsigned char *buf, size_t buf_size, bool once) { while (1) { fill_one_span_write(buf, buf_size); @@ -150,20 +150,3 @@ unsigned char *alloc_buffer(size_t buf_size, int memflush) return buf; }
-int run_fill_buf(size_t buf_size, int memflush, int op, bool once) -{
- unsigned char *buf;
- buf = alloc_buffer(buf_size, memflush);
- if (!buf)
return -1;
- if (op == 0)
fill_cache_read(buf, buf_size, once);
- else
fill_cache_write(buf, buf_size, once);
- free(buf);
- return 0;
-} diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 2dda56084588..0afbc4dd18e4 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -142,7 +142,7 @@ int perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu, unsigned char *alloc_buffer(size_t buf_size, int memflush); void mem_flush(unsigned char *buf, size_t buf_size); void fill_cache_read(unsigned char *buf, size_t buf_size, bool once); -int run_fill_buf(size_t buf_size, int memflush, int op, bool once); +void fill_cache_write(unsigned char *buf, size_t buf_size, bool once); int initialize_mem_bw_imc(void); int measure_mem_bw(const struct user_params *uparams, struct resctrl_val_param *param, pid_t bm_pid, diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index 70e8e31f5d1a..574b72604f95 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -448,7 +448,7 @@ static int get_mem_bw_resctrl(FILE *fp, unsigned long *mbm_total) return 0; } -static pid_t bm_pid, ppid; +static pid_t bm_pid; void ctrlc_handler(int signum, siginfo_t *info, void *ptr) { @@ -506,13 +506,6 @@ void signal_handler_unregister(void) } } -static void parent_exit(pid_t ppid) -{
- kill(ppid, SIGKILL);
- umount_resctrlfs();
- exit(EXIT_FAILURE);
-}
/*
- print_results_bw: the memory bandwidth results are stored in a file
- @filename: file that stores the results
@@ -614,61 +607,6 @@ int measure_mem_bw(const struct user_params *uparams, return ret; } -/*
- run_benchmark - Run a specified benchmark or fill_buf (default benchmark)
in specified signal. Direct benchmark stdio to /dev/null.
- @signum: signal number
- @info: signal info
- @ucontext: user context in signal handling
- */
-static void run_benchmark(int signum, siginfo_t *info, void *ucontext) -{
- int operation, ret, memflush;
- char **benchmark_cmd;
- size_t span;
- bool once;
- FILE *fp;
- benchmark_cmd = info->si_ptr;
- /*
* Direct stdio of child to /dev/null, so that only parent writes to
* stdio (console)
*/
- fp = freopen("/dev/null", "w", stdout);
- if (!fp) {
ksft_perror("Unable to direct benchmark status to /dev/null");
parent_exit(ppid);
- }
- if (strcmp(benchmark_cmd[0], "fill_buf") == 0) {
/* Execute default fill_buf benchmark */
span = strtoul(benchmark_cmd[1], NULL, 10);
memflush = atoi(benchmark_cmd[2]);
operation = atoi(benchmark_cmd[3]);
if (!strcmp(benchmark_cmd[4], "true")) {
once = true;
} else if (!strcmp(benchmark_cmd[4], "false")) {
once = false;
} else {
ksft_print_msg("Invalid once parameter\n");
parent_exit(ppid);
}
if (run_fill_buf(span, memflush, operation, once))
fprintf(stderr, "Error in running fill buffer\n");
- } else {
/* Execute specified benchmark */
ret = execvp(benchmark_cmd[0], benchmark_cmd);
if (ret)
ksft_perror("execvp");
- }
- fclose(stdout);
- ksft_print_msg("Unable to run specified benchmark\n");
- parent_exit(ppid);
-}
/*
- resctrl_val: execute benchmark and measure memory bandwidth on
the benchmark
@@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, const char * const *benchmark_cmd, struct resctrl_val_param *param) {
- struct sigaction sigact;
- int ret = 0, pipefd[2];
- char pipe_message = 0;
- union sigval value;
- int domain_id;
- int domain_id, operation = 0, memflush = 1;
- size_t span = DEFAULT_SPAN;
- unsigned char *buf = NULL;
- cpu_set_t old_affinity;
- bool once = false;
- int ret = 0;
- pid_t ppid;
if (strcmp(param->filename, "") == 0) sprintf(param->filename, "stdio"); @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; }
- /*
* If benchmark wasn't successfully started by child, then child should
* kill parent, so save parent's pid
*/
ppid = getpid();
- if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
- /* Taskset test to specified CPU. */
- ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process interfere with the measurement.
- if (ret)
return ret;
return -1;
- /* Write test to specified control & monitoring group in resctrl FS. */
- ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp);
Previously, this was done for bm_pid but now it's done for the parent. I'm not sure how inheritance goes with resctrl on fork(), will the forked PID get added to the list of PIDs or not? You probably know the answer :-). Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process.
- if (ret)
goto reset_affinity;
- if (param->init) {
ret = param->init(param, domain_id);
if (ret)
goto reset_affinity;
}
/*
* Fork to start benchmark, save child's pid so that it can be killed
* when needed
* If not running user provided benchmark, run the default
* "fill_buf". First phase of "fill_buf" is to prepare the
* buffer that the benchmark will operate on. No measurements
* are needed during this phase and prepared memory will be
* passed to next part of benchmark via copy-on-write. TBD
* how this impacts "write" benchmark, but no test currently
* uses this.
*/
- fflush(stdout);
Please don't remove fflush() in front of fork() as it leads to duplicating messages.
Hi Ilpo,
Thank you for taking a look.
On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
@@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, const char * const *benchmark_cmd, struct resctrl_val_param *param) {
- struct sigaction sigact;
- int ret = 0, pipefd[2];
- char pipe_message = 0;
- union sigval value;
- int domain_id;
- int domain_id, operation = 0, memflush = 1;
- size_t span = DEFAULT_SPAN;
- unsigned char *buf = NULL;
- cpu_set_t old_affinity;
- bool once = false;
- int ret = 0;
- pid_t ppid;
if (strcmp(param->filename, "") == 0) sprintf(param->filename, "stdio"); @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; }
- /*
* If benchmark wasn't successfully started by child, then child should
* kill parent, so save parent's pid
*/
ppid = getpid();
- if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
- /* Taskset test to specified CPU. */
- ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
- if (ret)
return ret;
return -1;
- /* Write test to specified control & monitoring group in resctrl FS. */
- ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp);
Previously, this was done for bm_pid but now it's done for the parent. I'm not sure how inheritance goes with resctrl on fork(), will the forked PID get added to the list of PIDs or not? You probably know the answer :-).
Yes. A process fork()ed will land in the same resctrl group as its parent.
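A quick way to see this is a standalone sketch like the one below (not selftest code); the group path "/sys/fs/resctrl/test_group" is an assumption and must already exist with the parent listed in its "tasks" file:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	const char *tasks = "/sys/fs/resctrl/test_group/tasks";
	char line[64];
	pid_t pid;
	FILE *fp;

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	}
	if (pid == 0) {
		sleep(2);	/* stay alive while the parent scans "tasks" */
		_exit(0);
	}

	fp = fopen(tasks, "r");
	if (!fp) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		if (atoi(line) == pid)
			printf("child %d inherited the resctrl group\n", (int)pid);
	}
	fclose(fp);
	waitpid(pid, NULL, 0);
	return 0;
}
```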
Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process.
There are two processes to consider in the resource group, the parent (that sets up the buffer and does the measurements) and the child (that runs the workload to be measured). Thanks to your commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only") the parent will be sleeping while the child runs its workload and there is no other interference I am aware of. The only additional measurements that I can see would be the work needed to actually start and stop the measurements and from what I can tell this falls into the noise.
Please do keep in mind that the performance counters used, iMC, cannot actually be bound to a single CPU since it is a per-socket PMU. The measurements have thus never been as fine grained as the code pretends it to be.
- if (ret)
goto reset_affinity;
- if (param->init) {
ret = param->init(param, domain_id);
if (ret)
goto reset_affinity;
}
/*
* Fork to start benchmark, save child's pid so that it can be killed
* when needed
* If not running user provided benchmark, run the default
* "fill_buf". First phase of "fill_buf" is to prepare the
* buffer that the benchmark will operate on. No measurements
* are needed during this phase and prepared memory will be
* passed to next part of benchmark via copy-on-write. TBD
* how this impacts "write" benchmark, but no test currently
* uses this.
*/
- fflush(stdout);
Please don't remove fflush() in front of fork() as it leads to duplicating messages.
Indeed. Shaopeng just fixed this for us. Thank you!
Reinette
On Fri, 30 Aug 2024, Reinette Chatre wrote:
Thank you for taking a look.
On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
@@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, const char * const *benchmark_cmd, struct resctrl_val_param *param) {
- struct sigaction sigact;
- int ret = 0, pipefd[2];
- char pipe_message = 0;
- union sigval value;
- int domain_id;
- int domain_id, operation = 0, memflush = 1;
- size_t span = DEFAULT_SPAN;
- unsigned char *buf = NULL;
- cpu_set_t old_affinity;
- bool once = false;
- int ret = 0;
- pid_t ppid; if (strcmp(param->filename, "") == 0) sprintf(param->filename, "stdio");
@@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; }
- /*
* If benchmark wasn't successfully started by child, then child
should
* kill parent, so save parent's pid
*/
ppid = getpid();
- if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
- /* Taskset test to specified CPU. */
- ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
Okay, that's useful to retain but thinking this further, now we're also going to do a non-trivial amount of work in between the setup and the test by forking. I guess that doesn't matter for the memflush = true case but might be meaningful for the memflush = false case that seems to be there to allow keeping caches hot (I personally haven't thought about how to use a "caches hot" test but we do have that capability by the fact that the memflush parameter exists).
- if (ret)
return ret;
return -1;
- /* Write test to specified control & monitoring group in resctrl FS.
*/
- ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp);
Previously, this was done for bm_pid but now it's done for the parent. I'm not sure how inheritance goes with resctrl on fork(), will the forked PID get added to the list of PIDs or not? You probably know the answer :-).
Yes. A process fork()ed will land in the same resctrl group as its parent.
Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process.
There are two processes to consider in the resource group, the parent (that sets up the buffer and does the measurements) and the child (that runs the workload to be measured). Thanks to your commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only") the parent will be sleeping while the child runs its workload and there is no other interference I am aware of. The only additional measurements that I can see would be the work needed to actually start and stop the measurements and from what I can tell this falls into the noise.
Please do keep in mind that the performance counters used, iMC, cannot actually be bound to a single CPU since it is a per-socket PMU. The measurements have thus never been as fine grained as the code pretends it to be.
I was wondering if I should note that the amount of work is small. Maybe it's fine to leave that noise there and I'm just overly cautious :-). When I used to do networking research in a past life, I wanted to eliminate as many noise sources as possible, so I guess it comes from that background.
Hi Ilpo,
On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
Thank you for taking a look.
On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
@@ -684,11 +622,13 @@ int resctrl_val(const struct resctrl_test *test, const char * const *benchmark_cmd, struct resctrl_val_param *param) {
- struct sigaction sigact;
- int ret = 0, pipefd[2];
- char pipe_message = 0;
- union sigval value;
- int domain_id;
- int domain_id, operation = 0, memflush = 1;
- size_t span = DEFAULT_SPAN;
- unsigned char *buf = NULL;
- cpu_set_t old_affinity;
- bool once = false;
- int ret = 0;
- pid_t ppid; if (strcmp(param->filename, "") == 0) sprintf(param->filename, "stdio");
@@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; }
- /*
* If benchmark wasn't successfully started by child, then child
should
* kill parent, so save parent's pid
*/
ppid = getpid();
- if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
- /* Taskset test to specified CPU. */
- ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
Okay, that's useful to retain but thinking this further, now we're also going do non-trivial amount of work in between the setup and the test by
Could you please elaborate how the amount of work during setup can be an issue? I have been focused on the measurements that are done afterwards that do have clear boundaries from what I can tell.
forking. I guess that doesn't matter for memflush = true case but might be meaningful for the memflush = false case that seems to be there to allow keeping caches hot (I personally haven't thought how to use "caches hot" test but we do have that capability by the fact that memflush paremeter exists).
I believe that memflush = true will always be needed/used by the tests relying on memory bandwidth measurement since that reduces cache hits during measurement phase and avoids the additional guessing on how long the workload should be run before reliable/consistent measurements can start.
Thinking about the memflush = false case I now think that we should use that for the CMT test. The buffer is allocated and initialized while the task is configured with appropriate allocation limits so there should not be a reason to flush the buffer from the cache. In fact, flushing the cache introduces the requirement to guess the workload's "settle" time (time to allocate the buffer into the cache again) before its occupancy can be measured. As a quick test I set memflush = false on one system and it brought down the average diff between the cache portion size and the occupancy counts. I'll try it out on a few more systems to confirm.
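To make that concrete, here is a rough sketch based on the alloc_buffer() signature visible in the diff earlier in this thread; the two wrapper functions are hypothetical names for illustration, not the tests' actual call sites:

```c
#include "resctrl.h"

static unsigned char *cmt_prepare_buffer(size_t span)
{
	/*
	 * memflush = 0: the buffer is allocated and initialized while the
	 * task already runs under the test's allocation limits, so leave it
	 * cache resident and measure occupancy without a "settle" delay.
	 */
	return alloc_buffer(span, 0);
}

static unsigned char *mbm_prepare_buffer(size_t span)
{
	/*
	 * memflush = 1: flush the buffer so the measured phase of the
	 * bandwidth tests must fetch from memory rather than from the cache.
	 */
	return alloc_buffer(span, 1);
}
```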
- if (ret)
return ret;
return -1;
- /* Write test to specified control & monitoring group in resctrl FS.
*/
- ret = write_bm_pid_to_resctrl(ppid, param->ctrlgrp, param->mongrp);
Previously, this was done for bm_pid but now it's done for the parent. I'm not sure how inheritance goes with resctrl on fork(), will the forked PID get added to the list of PIDs or not? You probably know the answer :-).
Yes. A process fork()ed will land in the same resctrl group as its parent.
Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process.
There are two processes to consider in the resource group, the parent (that sets up the buffer and does the measurements) and the child (that runs the workload to be measured). Thanks to your commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only") the parent will be sleeping while the child runs its workload and there is no other interference I am aware of. The only additional measurements that I can see would be the work needed to actually start and stop the measurements and from what I can tell this falls into the noise.
Please do keep in mind that the performance counters used, iMC, cannot actually be bound to a single CPU since it is a per-socket PMU. The measurements have thus never been as fine grained as the code pretends it to be.
I was thinking if I should note the amount of work is small. Maybe it's fine to leave that noise there and I'm just overly cautious :-), when I used to do networking research in the past life, I wanted to eliminate as much noise sources so I guess it comes from that background.
The goal of these tests are to verify *resctrl*, these are not intended to be hardware validation tests. I think it would be better for resctrl if more time is spent on functional tests of resctrl than these performance tests.
Reinette
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
@@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; }
- /*
* If benchmark wasn't successfully started by child, then
child should
* kill parent, so save parent's pid
*/
ppid = getpid();
- if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
- /* Taskset test to specified CPU. */
- ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
Okay, that's useful to retain but thinking this further, now we're also going do non-trivial amount of work in between the setup and the test by
Could you please elaborate how the amount of work during setup can be an issue? I have been focused on the measurements that are done afterwards that do have clear boundaries from what I can tell.
Well, you used it as a justification: "Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run." So I assumed you had some expectations about "preparations" done outside of those "clear boundaries" but now you seem to take entirely opposite stance?
fork() is quite a heavy operation as it has to copy various things including the address space, which you just made to contain a huge mem blob. :-)
BTW, perhaps we could use some lighter-weight fork variant in the resctrl selftests; the processes don't really need to be that separated to justify using a full fork() (and custom benchmarks will do execvp()).
forking. I guess that doesn't matter for memflush = true case but might be meaningful for the memflush = false case that seems to be there to allow keeping caches hot (I personally haven't thought how to use "caches hot" test but we do have that capability by the fact that memflush paremeter exists).
I believe that memflush = true will always be needed/used by the tests relying on memory bandwidth measurement since that reduces cache hits during measurement phase and avoids the additional guessing on how long the workload should be run before reliable/consistent measurements can start.
Thinking about the memflush = false case I now think that we should use that for the CMT test. The buffer is allocated and initialized while the task is configured with appropriate allocation limits so there should not be a reason to flush the buffer from the cache. In fact, flushing the cache introduces the requirement to guess the workload's "settle" time (time to allocate the buffer into the cache again) before its occupancy can be measured. As a quick test I set memflush = false on one system and it brought down the average diff between the cache portion size and the occupancy counts. I'll try it out on a few more systems to confirm.
Oh great!
I've not really figured out the logic used in the old CMT test because there was the rewrite for it in the series I started to upstream some of these improvements from. But I was unable to rebase successfully that rewrite either because somebody had used a non-publically available tree as a basis for it so I never did even have time to understand what even the rewritten test did thanks to the very complex diff.
Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process.
There are two processes to consider in the resource group, the parent (that sets up the buffer and does the measurements) and the child (that runs the workload to be measured). Thanks to your commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only") the parent will be sleeping while the child runs its workload and there is no other interference I am aware of. The only additional measurements that I can see would be the work needed to actually start and stop the measurements and from what I can tell this falls into the noise.
Please do keep in mind that the performance counters used, iMC, cannot actually be bound to a single CPU since it is a per-socket PMU. The measurements have thus never been as fine grained as the code pretends it to be.
I was thinking if I should note the amount of work is small. Maybe it's fine to leave that noise there and I'm just overly cautious :-), when I used to do networking research in the past life, I wanted to eliminate as much noise sources so I guess it comes from that background.
The goal of these tests are to verify *resctrl*, these are not intended to be hardware validation tests. I think it would be better for resctrl if more time is spent on functional tests of resctrl than these performance tests.
This sounds so easy... (no offence) :-) If only there wouldn't be the black boxes and we'd have good and fine-grained ways to instrument it, it would be so much easier to realize non-statistical means to do functional tests.
Hi Ilpo,
On 9/5/24 5:10 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
@@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test, return ret; } - /*
* If benchmark wasn't successfully started by child, then
child should
* kill parent, so save parent's pid
*/
ppid = getpid();
- if (pipe(pipefd)) {
ksft_perror("Unable to create pipe");
- /* Taskset test to specified CPU. */
- ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
Okay, that's useful to retain but thinking this further, now we're also going do non-trivial amount of work in between the setup and the test by
Could you please elaborate how the amount of work during setup can be an issue? I have been focused on the measurements that are done afterwards that do have clear boundaries from what I can tell.
Well, you used it as a justification: "Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run." So I assumed you had some expectations about "preparations" done outside of those "clear boundaries" but now you seem to take entirely opposite stance?
I do not follow you here. With the "clear boundaries" I meant the measurements and associated variables that have a clear start/init and stop/read while the other task sleeps. Yes, preparations are done outside of this since that should not be measured. You stated "now we're also going do non-trivial amount of work in between the setup and the test" ... could you clarify what the problem is with this? Before this work the "non-trivial amount of work" (for "fill_buf") was done as part of the test with (wrong) guesses about how long it takes. This work aims to improve this.
fork() quite heavy operation as it has to copy various things including the address space which you just made to contain a huge mem blob. :-)
As highlighted in a comment found in the patch, the kernel uses copy-on-write and all tests only read from the buffer after fork().
BTW, perhaps we could use some lighter weighted fork variant in the resctrl selftests, the processes don't really need to be that separated to justify using full fork() (and custom benchmarks will do execvp()).
Are you thinking about pthreads? It is not clear to me that this is necessary. It is also not clear to me what problem you are describing that needs this solution. When I understand that better it will be easier to discuss solutions.
forking. I guess that doesn't matter for memflush = true case but might be meaningful for the memflush = false case that seems to be there to allow keeping caches hot (I personally haven't thought how to use "caches hot" test but we do have that capability by the fact that memflush paremeter exists).
I believe that memflush = true will always be needed/used by the tests relying on memory bandwidth measurement since that reduces cache hits during measurement phase and avoids the additional guessing on how long the workload should be run before reliable/consistent measurements can start.
Thinking about the memflush = false case I now think that we should use that for the CMT test. The buffer is allocated and initialized while the task is configured with appropriate allocation limits so there should not be a reason to flush the buffer from the cache. In fact, flushing the cache introduces the requirement to guess the workload's "settle" time (time to allocate the buffer into the cache again) before its occupancy can be measured. As a quick test I set memflush = false on one system and it brought down the average diff between the cache portion size and the occupancy counts. I'll try it out on a few more systems to confirm.
Oh great!
I've not really figured out the logic used in the old CMT test because there was the rewrite for it in the series I started to upstream some of these improvements from. But I was unable to rebase successfully that rewrite either because somebody had used a non-publically available tree as a basis for it so I never did even have time to understand what even the rewritten test did thanks to the very complex diff.
Neither behavior, however, seems to result in the intended behavior as we either get interfering processes (if inherited) or no desired resctrl setup for the benchmark process.
There are two processes to consider in the resource group, the parent (that sets up the buffer and does the measurements) and the child (that runs the workload to be measured). Thanks to your commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only") the parent will be sleeping while the child runs its workload and there is no other interference I am aware of. The only additional measurements that I can see would be the work needed to actually start and stop the measurements and from what I can tell this falls into the noise.
Please do keep in mind that the performance counters used, iMC, cannot actually be bound to a single CPU since it is a per-socket PMU. The measurements have thus never been as fine grained as the code pretends it to be.
I was thinking if I should note the amount of work is small. Maybe it's fine to leave that noise there and I'm just overly cautious :-), when I used to do networking research in the past life, I wanted to eliminate as much noise sources so I guess it comes from that background.
The goal of these tests are to verify *resctrl*, these are not intended to be hardware validation tests. I think it would be better for resctrl if more time is spent on functional tests of resctrl than these performance tests.
This sounds so easy... (no offence) :-) If only there wouldn't be the black boxes and we'd have good and fine-grained ways to instrument it, it would be so much easier to realize non-statistical means to do functional tests.
To me functional tests of resctrl indeed sounds easier. Tests can interact with the resctrl interface to ensure it works as expected ... test the boundaries of user input to the various files, test task movement between groups, test moving of monitor groups, test assigning CPUs to resource groups, ensure the "mode" of a resource group can be changed and behaves as expected (for example, shared vs exclusive), ensure changes to one file are reflected in others, like changing schemata is reflected in "size" and "bit_usage", etc. etc. These are all tests of *resctrl* that supports development and can be used to verify all is well as major changes (that we are seeing more and more of) are made to the kernel subsystem. None of this is "black box" and is all deterministic with obvious pass/fail. This can be made even more reliable with not just checking of resctrl files but see if changes via resctrl is reflected in MSRs as expected.
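As one concrete (hypothetical) sketch of such a functional test: write a new schemata to a group and verify that the change is reflected in its "size" file. The group path and the "L3:0=ff" mask are illustrative assumptions only:

```c
#include <stdio.h>
#include <string.h>

static int write_file(const char *path, const char *val)
{
	FILE *fp = fopen(path, "w");

	if (!fp)
		return -1;
	if (fputs(val, fp) == EOF) {
		fclose(fp);
		return -1;
	}
	return fclose(fp);
}

static void read_line(const char *path, char *buf, int len)
{
	FILE *fp = fopen(path, "r");

	buf[0] = '\0';
	if (!fp)
		return;
	if (!fgets(buf, len, fp))
		buf[0] = '\0';
	fclose(fp);
}

int main(void)
{
	char before[256], after[256];

	read_line("/sys/fs/resctrl/test_group/size", before, sizeof(before));

	if (write_file("/sys/fs/resctrl/test_group/schemata", "L3:0=ff\n"))
		return 1;

	read_line("/sys/fs/resctrl/test_group/size", after, sizeof(after));

	/* Pass (exit 0) only if the reported size changed with the new mask. */
	return strcmp(before, after) == 0;
}
```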
Reinette
On Thu, 5 Sep 2024, Reinette Chatre wrote:
On 9/5/24 5:10 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test,
>  	return ret;
>  }
> -	/*
> -	 * If benchmark wasn't successfully started by child, then child should
> -	 * kill parent, so save parent's pid
> -	 */
>  	ppid = getpid();
> -	if (pipe(pipefd)) {
> -		ksft_perror("Unable to create pipe");
> +	/* Taskset test to specified CPU. */
> +	ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
Previously only CPU affinity for bm_pid was set but now it's set before fork(). Quickly checking the Internet, it seems that CPU affinity gets inherited on fork() so now both processes will have the same affinity which might make the other process to interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
Okay, that's useful to retain but thinking this further, now we're also going do non-trivial amount of work in between the setup and the test by
Could you please elaborate how the amount of work during setup can be an issue? I have been focused on the measurements that are done afterwards that do have clear boundaries from what I can tell.
Well, you used it as a justification: "Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run." So I assumed you had some expectations about "preparations" done outside of those "clear boundaries" but now you seem to take entirely opposite stance?
I do not follow you here. With the "clear boundaries" I meant the measurements and associated variables that have a clear start/init and stop/read while the other task sleeps. Yes, preparations are done outside of this since that should not be measured.
Of course the preparations are not measured (at least not after this patch :-)).
But that's not what I meant. You said the preparations are to be done using the same topology as the test but if it's outside of the measurement period, the topology during preparations only matters if you make some hidden assumption that some state carries from preparations to the actual test. If no state carry-over is assumed, the topology during preparations is not important. So which way it is? Is the topology during preparations important or not?
You used the topology argument to justify why both parent and child are now in the same resctrl group unlike before when only the actual test was.
You stated "now we're also going do non-trivial amount of work in between the setup and the test" ... could you clarify what the problem is with this? Before this work the "non-trivial amount of work" (for "fill_buf") was done as part of the test with (wrong) guesses about how long it takes. This work aims to improve this.
I understand what you're trying to do with this change.
However, I was trying to say that, before, the preparations occurred right before the test, which is no longer the case because there's fork() in between.
Does that extra work impact the state carry-over from preparations to the test?
fork() quite heavy operation as it has to copy various things including the address space which you just made to contain a huge mem blob. :-)
As highlighted in a comment found in the patch, the kernel uses copy-on-write and all tests only read from the buffer after fork().
A write test is possible using -b fill_buf ... as mentioned in the comments on the other patch.
BTW, perhaps we could use some lighter weighted fork variant in the resctrl selftests, the processes don't really need to be that separated to justify using full fork() (and custom benchmarks will do execvp()).
Are you thinking about pthreads? It is not clear to me that this is necessary. It is also not clear to me what problem you are describing that needs this solution. When I understand that better it will be easier to discuss solutions.
I was trying to say that I see an advantage in retaining the address space, which is not possible with fork(), so perhaps using clone() with CLONE_VM would be useful, and maybe some other flags too. I think this would solve the write test case.
I was wondering if I should note that the amount of work is small. Maybe it's fine to leave that noise there and I'm just being overly cautious :-). When I did networking research in a past life I wanted to eliminate as many noise sources as possible, so I guess it comes from that background.
The goal of these tests is to verify *resctrl*; these are not intended to be hardware validation tests. I think it would be better for resctrl if more time is spent on functional tests of resctrl than on these performance tests.
This sounds so easy... (no offence) :-) If only there weren't black boxes and we had good, fine-grained ways to instrument them, it would be much easier to realize non-statistical means of doing functional tests.
To me functional tests of resctrl indeed sound easier. Tests can interact with the resctrl interface to ensure it works as expected ... test the boundaries of user input to the various files, test task movement between groups, test moving of monitor groups, test assigning CPUs to resource groups, ensure the "mode" of a resource group can be changed and behaves as expected (for example, shared vs exclusive), ensure changes to one file are reflected in others, for example that changing the schemata is reflected in "size" and "bit_usage", etc. etc. These are all tests of *resctrl* that support development and can be used to verify all is well as major changes (which we are seeing more and more of) are made to the kernel subsystem. None of this is "black box" and it is all deterministic with an obvious pass/fail. This can be made even more reliable by not just checking resctrl files but also seeing whether changes made via resctrl are reflected in MSRs as expected.
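As a rough illustration of the kind of deterministic interface check described above (a sketch only, not code from this series; the mount point and the "test_grp" group name are assumptions, and the kernel may zero-pad the mask it reports back):

	#include <stdio.h>
	#include <string.h>

	/*
	 * Illustrative sketch: write a schemata line to a resctrl group and
	 * verify the kernel reflects it back. Assumes resctrl is mounted at
	 * /sys/fs/resctrl and that "test_grp" was created beforehand.
	 */
	static int check_schemata_roundtrip(void)
	{
		const char *path = "/sys/fs/resctrl/test_grp/schemata";
		char line[256];
		int found = 0;
		FILE *fp;

		fp = fopen(path, "w");
		if (!fp)
			return -1;
		/* Request an 8-bit L3 cache mask on cache domain 0. */
		if (fprintf(fp, "L3:0=ff\n") < 0) {
			fclose(fp);
			return -1;
		}
		fclose(fp);

		fp = fopen(path, "r");
		if (!fp)
			return -1;
		/* The kernel lists every resource; look for the L3 mask we wrote. */
		while (fgets(line, sizeof(line), fp)) {
			if (strstr(line, "L3:") && strstr(line, "ff"))
				found = 1;
		}
		fclose(fp);

		return found ? 0 : 1;	/* 0 = pass, 1 = mask not reflected */
	}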
Okay, I get it now: you meant testing the kernel-userspace interfaces, with MSRs used as a further validation step to test the kernel-HW interface too.
I'll probably take a look at what I can come up with when I have some spare time.
Hi Ilpo,
On 9/6/24 3:00 AM, Ilpo Järvinen wrote:
On Thu, 5 Sep 2024, Reinette Chatre wrote:
On 9/5/24 5:10 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 3:56 AM, Ilpo Järvinen wrote: > On Thu, 29 Aug 2024, Reinette Chatre wrote:
>> @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test,
>> 		return ret;
>> 	}
>> -	/*
>> -	 * If benchmark wasn't successfully started by child, then child should
>> -	 * kill parent, so save parent's pid
>> -	 */
>> 	ppid = getpid();
>> -	if (pipe(pipefd)) {
>> -		ksft_perror("Unable to create pipe");
>> +	/* Taskset test to specified CPU. */
>> +	ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
>
> Previously only CPU affinity for bm_pid was set but now it's set before
> fork(). Quickly checking the Internet, it seems that CPU affinity gets
> inherited on fork() so now both processes will have the same affinity,
> which might make the other process interfere with the measurement.
Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run. This preparation is done before the work to be measured starts.
This does tie in with the association with the resctrl group and I will elaborate more below ...
Okay, that's useful to retain, but thinking about this further, now we're also going to do a non-trivial amount of work in between the setup and the test.
Could you please elaborate how the amount of work during setup can be an issue? I have been focused on the measurements that are done afterwards that do have clear boundaries from what I can tell.
Well, you used it as a justification: "Setting the affinity is intended to ensure that the buffer preparation occurs in the same topology as where the runtime portion will run." So I assumed you had some expectations about "preparations" done outside of those "clear boundaries", but now you seem to take an entirely opposite stance?
I do not follow you here. With the "clear boundaries" I meant the measurements and associated variables that have a clear start/init and stop/read while the other task sleeps. Yes, preparations are done outside of this since that should not be measured.
Of course the preparations are not measured (at least not after this patch :-)).
But that's not what I meant. You said the preparations are to be done using the same topology as the test, but if it's outside of the measurement period, the topology during preparations only matters if you make some hidden assumption that some state carries over from preparations to the actual test. If no state carry-over is assumed, the topology during preparations is not important. So which way is it? Is the topology during preparations important or not?
Topology during preparations is important.
In the CMT test this is more relevant with the transition to using memflush = false. The preparation phase and measure phase use the same cache alloc configuration and, just as importantly, the same monitoring configuration. During preparation the cache portion that will be used during measurement is filled with the contents of the buffer. During measurement it will be the same cache portion into which any new reads will be allocated and measured. In fact, the preparation phase will thus form part of the measurement. If preparation was done with a different configuration, then I see a problem with memflush = true as well as with memflush = false. In the case of memflush = true it will have the familiar issue of the test needing to "guess" the workload settle time. In the case of memflush = false the buffer will remain allocated in the cache portion used during the preparation phase; when the workload runs it will read the data from a cache portion that does not belong to it, and since it does not need to allocate into its own cache portion its LLC occupancy counts will not be accurate (the LLC occupancy associated with the buffer will be attributed to the prepare portion).
I am not familiar with the details of memory allocation, but as I understand it Linux attempts to satisfy memory requests from the node to which the requesting CPU is assigned. For the MBM and MBA tests I thus believe it is important to allocate the memory from where it will be used. I have encountered systems where CPU0 and CPU1 are on different sockets and by default the workload is set to run on CPU1. If the preparation phase runs on CPU0 then memory could be allocated from a different NUMA node than where the workload will be running. By doing preparation within the same topology as where the workload will be running I believe that memory will be allocated appropriately for the workload and thus result in more reliable measurements.
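A rough sketch of the allocation behaviour being relied on here (illustrative only; alloc_local_buffer is a hypothetical helper, not part of the selftests): pinning to the measurement CPU before allocating and touching the buffer means the default first-touch NUMA policy places the pages on the node local to that CPU.

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdlib.h>
	#include <string.h>

	/*
	 * Hypothetical helper: pin the current task to @cpu, then allocate
	 * and initialize a buffer so its pages are faulted in ("first
	 * touch") while running on that CPU, keeping them on the local
	 * NUMA node.
	 */
	static void *alloc_local_buffer(int cpu, size_t size)
	{
		cpu_set_t set;
		void *buf;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		if (sched_setaffinity(0, sizeof(set), &set))
			return NULL;

		buf = malloc(size);
		if (buf)
			memset(buf, 0, size);	/* first touch allocates the pages */

		return buf;
	}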
You used the topology argument to justify why both parent and child are now in the same resctrl group unlike before when only the actual test was.
You stated "now we're also going to do a non-trivial amount of work in between the setup and the test" ... could you clarify what the problem is with this? Before this work the "non-trivial amount of work" (for "fill_buf") was done as part of the test with (wrong) guesses about how long it takes. This work aims to improve this.
I understand what you're trying to do with this change.
However, I was trying to say that before, the preparations occurred right before the test, which is no longer the case because there's a fork() in between.
If by "test" you mean the measurement phase then in the case of "fill_buf" preparations only now reliably occur before the measurement. Original behavior is maintained with user provided benchmark.
Does that extra work impact the state carry-over from preparations to the test?
It is not clear to me what extra work or state you are referring to.
fork() is quite a heavy operation as it has to copy various things, including the address space, which you just made contain a huge mem blob. :-)
As highlighted in a comment found in the patch, the kernel uses copy-on-write and all tests only read from the buffer after fork().
A write test is possible using -b fill_buf ... as mentioned in the comments to the other patch.
Yes, it is theoretically possible, but looking closer it is not supported. Note how measure_mem_bw() is always called with hardcoded "reads". Reading through the history of commits I do not believe modifying fill_buf parameters was ever intended to be a use case for the "-b" parameter. It seems intended to provide an alternate benchmark to fill_buf.
I am not interested in adding support for the write test because I do not see how it helps us to improve resctrl subsystem health. Considering that nobody has ever complained about the write test being broken I think all that dead code should just be removed instead.
I prefer the focus be on the health of the resctrl subsystem instead of additional hardware performance testing. I do not think it is energy well spent to further tune these performance tests unless it benefits resctrl subsystem health. These performance tests are difficult to maintain and I have not yet seen how they benefit the resctrl subsystem.
BTW, perhaps we could use some lighter-weight fork variant in the resctrl selftests; the processes don't really need to be that separated to justify using a full fork() (and custom benchmarks will do execvp()).
Are you thinking about pthreads? It is not clear to me that this is necessary. It is also not clear to me what problem you are describing that needs this solution. When I understand that better it will be easier to discuss solutions.
I was trying to say that I see an advantage in retaining the address space, which is not possible with fork(), so perhaps using clone() with CLONE_VM would be useful, and maybe some other flags too. I think this would solve the write test case.
clone() brings with it the complexity of needing to manage the child's stack. This again aims for a performance improvement. What does this fix? What is the benefit to resctrl health? I would prefer that our energy instead be focused on improving resctrl subsystem health.
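For illustration of the trade-off being discussed (a hedged sketch, not a proposal from this series): clone() with CLONE_VM shares the address space, but the caller must allocate and manage the child's stack explicitly, roughly like this.

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdlib.h>

	#define CHILD_STACK_SIZE	(1024 * 1024)

	/* Hypothetical benchmark entry point run in the CLONE_VM child. */
	static int benchmark_fn(void *arg)
	{
		/* e.g. read the shared buffer here */
		return 0;
	}

	static pid_t run_benchmark_shared_vm(void *arg)
	{
		char *stack = malloc(CHILD_STACK_SIZE);

		if (!stack)
			return -1;

		/*
		 * CLONE_VM shares the parent's address space; the stack must
		 * be supplied explicitly and, on common architectures, grows
		 * down, so the top of the allocation is passed to clone().
		 */
		return clone(benchmark_fn, stack + CHILD_STACK_SIZE,
			     CLONE_VM | SIGCHLD, arg);
	}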
Reinette
On Fri, 6 Sep 2024, Reinette Chatre wrote:
...
Fair enough.
The benchmark used during the CMT, MBM, and MBA tests can be provided by the user via (-b) parameter to the tests, if not provided the default "fill_buf" benchmark is used.
The "fill_buf" benchmark requires parameters and these are managed as an array of strings.
Using an array of strings to manage the "fill_buf" parameters is complex because it requires transformations to/from strings at every producer and consumer. This is made worse for the individual tests where the default benchmark parameter values may not be appropriate and additional data wrangling is required. For example, the CMT test duplicates the entire array of strings in order to replace one of the parameters.
Replace the "array of strings" parameters used for "fill_buf" with a struct that contains the "fill_buf" parameters that can be used directly without transformations to/from strings. Make these parameters part of the parameters associated with each test so that each test can set benchmark parameters that are appropriate for it.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com --- tools/testing/selftests/resctrl/cmt_test.c | 28 +++-------- tools/testing/selftests/resctrl/mba_test.c | 7 ++- tools/testing/selftests/resctrl/mbm_test.c | 9 +++- tools/testing/selftests/resctrl/resctrl.h | 49 +++++++++++++------ .../testing/selftests/resctrl/resctrl_tests.c | 15 +----- tools/testing/selftests/resctrl/resctrl_val.c | 38 +++++--------- 6 files changed, 69 insertions(+), 77 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index 0c045080d808..f09d5dfab38c 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -116,15 +116,12 @@ static void cmt_test_cleanup(void)
static int cmt_run_test(const struct resctrl_test *test, const struct user_params *uparams) { - const char * const *cmd = uparams->benchmark_cmd; - const char *new_cmd[BENCHMARK_ARGS]; unsigned long cache_total_size = 0; int n = uparams->bits ? : 5; unsigned long long_mask; - char *span_str = NULL; int count_of_bits; size_t span; - int ret, i; + int ret;
ret = get_full_cbm("L3", &long_mask); if (ret) @@ -155,32 +152,21 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param
span = cache_portion_size(cache_total_size, param.mask, long_mask);
- if (strcmp(cmd[0], "fill_buf") == 0) { - /* Duplicate the command to be able to replace span in it */ - for (i = 0; uparams->benchmark_cmd[i]; i++) - new_cmd[i] = uparams->benchmark_cmd[i]; - new_cmd[i] = NULL; - - ret = asprintf(&span_str, "%zu", span); - if (ret < 0) - return -1; - new_cmd[1] = span_str; - cmd = new_cmd; - } + param.fill_buf.buf_size = span; + param.fill_buf.memflush = 1; + param.fill_buf.operation = 0; + param.fill_buf.once = false;
remove(RESULT_FILE_NAME);
- ret = resctrl_val(test, uparams, cmd, ¶m); + ret = resctrl_val(test, uparams, ¶m); if (ret) - goto out; + return ret;
ret = check_results(¶m, span, n); if (ret && (get_vendor() == ARCH_INTEL)) ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
-out: - free(span_str); - return ret; }
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index ab8496a4925b..8ad433495f61 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -174,7 +174,12 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
remove(RESULT_FILE_NAME);
- ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); + param.fill_buf.buf_size = DEFAULT_SPAN; + param.fill_buf.memflush = 1; + param.fill_buf.operation = 0; + param.fill_buf.once = false; + + ret = resctrl_val(test, uparams, ¶m); if (ret) return ret;
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c index 6b5a3b52d861..b6883f274c74 100644 --- a/tools/testing/selftests/resctrl/mbm_test.c +++ b/tools/testing/selftests/resctrl/mbm_test.c @@ -142,11 +142,16 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
remove(RESULT_FILE_NAME);
- ret = resctrl_val(test, uparams, uparams->benchmark_cmd, ¶m); + param.fill_buf.buf_size = DEFAULT_SPAN; + param.fill_buf.memflush = 1; + param.fill_buf.operation = 0; + param.fill_buf.once = false; + + ret = resctrl_val(test, uparams, ¶m); if (ret) return ret;
- ret = check_results(DEFAULT_SPAN); + ret = check_results(param.fill_buf.buf_size); if (ret && (get_vendor() == ARCH_INTEL)) ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n");
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 0afbc4dd18e4..0e5456165a6a 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -79,6 +79,26 @@ struct resctrl_test { void (*cleanup)(void); };
+/* + * fill_buf_param: "fill_buf" benchmark parameters + * @buf_size: Size (in bytes) of buffer used in benchmark. + * "fill_buf" allocates and initializes buffer of + * @buf_size. + * @operation: If 0, then only read operations are performed on + * the buffer, if 1 then only write operations are + * performed on the buffer. + * @memflush: 1 if buffer should be flushed after + * allocation and initialization. + * @once: Benchmark will perform @operation once if true, + * infinitely (until terminated) if false. + */ +struct fill_buf_param { + size_t buf_size; + int operation; + int memflush; + int once; +}; + /* * resctrl_val_param: resctrl test parameters * @ctrlgrp: Name of the control monitor group (con_mon grp) @@ -87,21 +107,23 @@ struct resctrl_test { * @init: Callback function to initialize test environment * @setup: Callback function to setup per test run environment * @measure: Callback that performs the measurement (a single test) + * @fill_buf: Parameters for default "fill_buf" benchmark */ struct resctrl_val_param { - const char *ctrlgrp; - const char *mongrp; - char filename[64]; - unsigned long mask; - int num_of_runs; - int (*init)(const struct resctrl_val_param *param, - int domain_id); - int (*setup)(const struct resctrl_test *test, - const struct user_params *uparams, - struct resctrl_val_param *param); - int (*measure)(const struct user_params *uparams, - struct resctrl_val_param *param, - pid_t bm_pid); + const char *ctrlgrp; + const char *mongrp; + char filename[64]; + unsigned long mask; + int num_of_runs; + int (*init)(const struct resctrl_val_param *param, + int domain_id); + int (*setup)(const struct resctrl_test *test, + const struct user_params *uparams, + struct resctrl_val_param *param); + int (*measure)(const struct user_params *uparams, + struct resctrl_val_param *param, + pid_t bm_pid); + struct fill_buf_param fill_buf; };
struct perf_event_read { @@ -151,7 +173,6 @@ void initialize_mem_bw_resctrl(const struct resctrl_val_param *param, int domain_id); int resctrl_val(const struct resctrl_test *test, const struct user_params *uparams, - const char * const *benchmark_cmd, struct resctrl_val_param *param); unsigned long create_bit_mask(unsigned int start, unsigned int len); unsigned int count_contiguous_bits(unsigned long val, unsigned int *start); diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c b/tools/testing/selftests/resctrl/resctrl_tests.c index ecbb7605a981..ce8fcc769d57 100644 --- a/tools/testing/selftests/resctrl/resctrl_tests.c +++ b/tools/testing/selftests/resctrl/resctrl_tests.c @@ -162,7 +162,7 @@ int main(int argc, char **argv) bool test_param_seen = false; struct user_params uparams; char *span_str = NULL; - int ret, c, i; + int c, i;
init_user_params(&uparams);
@@ -257,19 +257,6 @@ int main(int argc, char **argv)
filter_dmesg();
- if (!uparams.benchmark_cmd[0]) { - /* If no benchmark is given by "-b" argument, use fill_buf. */ - uparams.benchmark_cmd[0] = "fill_buf"; - ret = asprintf(&span_str, "%u", DEFAULT_SPAN); - if (ret < 0) - ksft_exit_fail_msg("Out of memory!\n"); - uparams.benchmark_cmd[1] = span_str; - uparams.benchmark_cmd[2] = "1"; - uparams.benchmark_cmd[3] = "0"; - uparams.benchmark_cmd[4] = "false"; - uparams.benchmark_cmd[5] = NULL; - } - ksft_set_plan(tests);
for (i = 0; i < ARRAY_SIZE(resctrl_tests); i++) diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index 574b72604f95..9a5a9a67e059 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -612,21 +612,17 @@ int measure_mem_bw(const struct user_params *uparams, * the benchmark * @test: test information structure * @uparams: user supplied parameters - * @benchmark_cmd: benchmark command and its arguments * @param: parameters passed to resctrl_val() * * Return: 0 when the test was run, < 0 on error. */ int resctrl_val(const struct resctrl_test *test, const struct user_params *uparams, - const char * const *benchmark_cmd, struct resctrl_val_param *param) { - int domain_id, operation = 0, memflush = 1; - size_t span = DEFAULT_SPAN; unsigned char *buf = NULL; cpu_set_t old_affinity; - bool once = false; + int domain_id; int ret = 0; pid_t ppid;
@@ -666,21 +662,9 @@ int resctrl_val(const struct resctrl_test *test, * how this impacts "write" benchmark, but no test currently * uses this. */ - if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { - span = strtoul(benchmark_cmd[1], NULL, 10); - memflush = atoi(benchmark_cmd[2]); - operation = atoi(benchmark_cmd[3]); - if (!strcmp(benchmark_cmd[4], "true")) { - once = true; - } else if (!strcmp(benchmark_cmd[4], "false")) { - once = false; - } else { - ksft_print_msg("Invalid once parameter\n"); - ret = -EINVAL; - goto reset_affinity; - } - - buf = alloc_buffer(span, memflush); + if (!uparams->benchmark_cmd[0]) { + buf = alloc_buffer(param->fill_buf.buf_size, + param->fill_buf.memflush); if (!buf) { ret = -ENOMEM; goto reset_affinity; @@ -699,13 +683,17 @@ int resctrl_val(const struct resctrl_test *test, * terminated. */ if (bm_pid == 0) { - if (strcmp(benchmark_cmd[0], "fill_buf") == 0) { - if (operation == 0) - fill_cache_read(buf, span, once); + if (!uparams->benchmark_cmd[0]) { + if (param->fill_buf.operation == 0) + fill_cache_read(buf, + param->fill_buf.buf_size, + param->fill_buf.once); else - fill_cache_write(buf, span, once); + fill_cache_write(buf, + param->fill_buf.buf_size, + param->fill_buf.once); } else { - execvp(benchmark_cmd[0], (char **)benchmark_cmd); + execvp(uparams->benchmark_cmd[0], (char **)uparams->benchmark_cmd); } exit(EXIT_SUCCESS); }
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
If I didn't miss anything important, this change takes away the ability to alter fill_buf's parameters using the -b option, which to me felt like the most useful way to use that parameter. The current code of course lacks many safeguards for that case, but it still felt like a useful feature.
I suggest that while parsing the -b parameter, check if it starts with "fill_buf", and if it does, parse the arguments into a fill_buf_param in user_params which will override the default fill_buf parameters.
While parsing, adding new sanity checks wouldn't be a bad idea.
It might be that some parameters are better always overridden by the tests, e.g. "once", but specifying "operation" (W instead of R) or "buf_size" seem like okay use cases to me.
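A minimal sketch of how such parsing might look (assumptions: a struct fill_buf_param member named fill_buf added to struct user_params, and the helper name parse_fill_buf_args; neither exists in the series as posted):

	#include <errno.h>
	#include <stdlib.h>
	#include <string.h>
	#include "resctrl.h"	/* selftest header: struct user_params, fill_buf_param */

	/*
	 * Hypothetical helper: if "-b fill_buf ..." was given, override the
	 * default fill_buf parameters with whatever the user supplied,
	 * keeping the defaults for anything omitted.
	 */
	static int parse_fill_buf_args(int argc, const char **argv,
				       struct user_params *uparams)
	{
		if (argc < 1 || strcmp(argv[0], "fill_buf"))
			return 0;	/* not fill_buf: an external benchmark */

		if (argc > 1)
			uparams->fill_buf.buf_size = strtoul(argv[1], NULL, 10);
		if (argc > 2)
			uparams->fill_buf.memflush = atoi(argv[2]);
		if (argc > 3)
			uparams->fill_buf.operation = atoi(argv[3]);

		/* Sanity check: only read (0) and write (1) are meaningful. */
		if (uparams->fill_buf.operation != 0 &&
		    uparams->fill_buf.operation != 1)
			return -EINVAL;

		return 1;	/* fill_buf defaults overridden */
	}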
Hi Ilpo,
On 8/30/24 4:13 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
If I didn't miss anything important, this change takes away the ability to alter fill_buf's parameters using the -b option, which to me felt like the most useful way to use that parameter. The current code of course lacks many safeguards for that case, but it still felt like a useful feature.
hmmm ... thank you for pointing this out. I did not consider this use case. I have never received feedback on how these tests are used and whether folks even use "-b" for "fill_buf" parameter changes or for something else.
I suggest that while parsing the -b parameter, check if it starts with "fill_buf", and if it does, parse the arguments into a fill_buf_param in user_params which will override the default fill_buf parameters.
While parsing, adding new sanity checks wouldn't be a bad idea.
It might be that some parameters are better always overridden by the tests, e.g. "once", but specifying "operation" (W instead of R) or "buf_size" seem like okay use cases to me.
Will do. Thank you for the suggestion.
Reinette
By default the MBM and MBA tests use the "fill_buf" benchmark to read from a buffer with the goal to measure the memory bandwidth generated by this buffer access.
Care should be taken when sizing the buffer used by the "fill_buf" benchmark. If the buffer is small enough to fit in the cache then it cannot be expected that the benchmark will generate much memory bandwidth. For example, on a system with 320MB L3 cache the existing hardcoded default of 250MB is insufficient.
Use the measured cache size to determine a buffer size that can be expected to trigger memory accesses, while keeping the existing default, which has been appropriate for testing so far, as the minimum.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com --- tools/testing/selftests/resctrl/mba_test.c | 8 +++++++- tools/testing/selftests/resctrl/mbm_test.c | 8 +++++++- 2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index 8ad433495f61..cad473b81a64 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -170,11 +170,17 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param
 		.setup		= mba_setup,
 		.measure	= mba_measure,
 	};
+	unsigned long cache_total_size = 0;
 	int ret;

 	remove(RESULT_FILE_NAME);

-	param.fill_buf.buf_size = DEFAULT_SPAN;
+	ret = get_cache_size(uparams->cpu, "L3", &cache_total_size);
+	if (ret)
+		return ret;
+
+	param.fill_buf.buf_size = cache_total_size > DEFAULT_SPAN ?
+				  cache_total_size * 2 : DEFAULT_SPAN;
 	param.fill_buf.memflush = 1;
 	param.fill_buf.operation = 0;
 	param.fill_buf.once = false;
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index 6b5a3b52d861..734bfa4f42b3 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -138,11 +138,17 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param
 		.setup		= mbm_setup,
 		.measure	= mbm_measure,
 	};
+	unsigned long cache_total_size = 0;
 	int ret;

 	remove(RESULT_FILE_NAME);

-	param.fill_buf.buf_size = DEFAULT_SPAN;
+	ret = get_cache_size(uparams->cpu, "L3", &cache_total_size);
+	if (ret)
+		return ret;
+
+	param.fill_buf.buf_size = cache_total_size > DEFAULT_SPAN ?
+				  cache_total_size * 2 : DEFAULT_SPAN;
 	param.fill_buf.memflush = 1;
 	param.fill_buf.operation = 0;
 	param.fill_buf.once = false;
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
Should the check leave a bit of a safeguard so that buf_size is always at least 2x the cache size (or 1.25x or some other factor)?
Here buf_size immediately jumps from 1x -> 2x when cache_total_size goes from DEFAULT_SPAN to DEFAULT_SPAN+1 (obviously the L3 size won't be an odd value like that, but I think you get my point).
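One possible way to express such a safeguard (illustrative only, not what the patch does; pick_fill_buf_size is a hypothetical helper, DEFAULT_SPAN comes from resctrl.h): size the buffer relative to the cache first, then apply the historical default as a floor, so it is always at least twice the L3 size.

	/*
	 * Illustrative: buffer is at least twice the L3 size and never
	 * below the historical DEFAULT_SPAN floor.
	 */
	static size_t pick_fill_buf_size(unsigned long cache_total_size)
	{
		size_t buf_size = (size_t)cache_total_size * 2;

		if (buf_size < DEFAULT_SPAN)
			buf_size = DEFAULT_SPAN;

		return buf_size;
	}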
Also, user might want to override this as mentioned in my reply to the previous patch.
Hi Ilpo,
On 8/30/24 4:25 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
Should the check leave a bit of a safeguard so that buf_size is always at least 2x the cache size (or 1.25x or some other factor)?
Here buf_size immediately jumps from 1x -> 2x when cache_total_size goes from DEFAULT_SPAN to DEFAULT_SPAN+1 (obviously the L3 size won't be an odd value like that, but I think you get my point).
Good catch. Will fix.
Also, user might want to override this as mentioned in my reply to the previous patch.
ack.
Reinette
The MBA test incrementally throttles memory bandwidth, each time followed by a comparison between the memory bandwidth observed by the performance counters and resctrl respectively.
While a comparison between performance counters and resctrl is generally appropriate, they do not have an identical view of memory bandwidth. For example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. As a ratio, this different view of memory bandwidth becomes more apparent at low memory bandwidths.
It is not practical to enable/disable the various features that may generate memory bandwidth to give performance counters and resctrl an identical view. Instead, do not compare performance counters and resctrl view of memory bandwidth when the memory bandwidth is low.
Bandwidth throttling behaves differently across platforms so it is not appropriate to drop measurement data simply based on the throttling level. Instead, use a threshold of 750MiB that has been observed to support adequate comparison between performance counters and resctrl.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com --- tools/testing/selftests/resctrl/mba_test.c | 7 +++++++ tools/testing/selftests/resctrl/resctrl.h | 6 ++++++ 2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index cad473b81a64..204b9ac4b108 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
 	avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1);
 	avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
+	if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
+		ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
+			       THROTTLE_THRESHOLD,
+			       ALLOCATION_MAX - ALLOCATION_STEP * allocation);
+		break;
+	}
+
 	avg_diff = (float)labs(avg_bw_resc - avg_bw_imc) / avg_bw_imc;
 	avg_diff_per = (int)(avg_diff * 100);

diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h
index 0e5456165a6a..e65c5fb76b17 100644
--- a/tools/testing/selftests/resctrl/resctrl.h
+++ b/tools/testing/selftests/resctrl/resctrl.h
@@ -43,6 +43,12 @@

 #define DEFAULT_SPAN		(250 * MB)

+/*
+ * Memory bandwidth (in MiB) below which the bandwidth comparisons
+ * between iMC and resctrl are considered unreliable.
+ */
+#define THROTTLE_THRESHOLD	750
+
 /*
  * user_params:  User supplied parameters
  * @cpu:	CPU number to which the benchmark will be bound to
On Thu, 29 Aug 2024, Reinette Chatre wrote:
The MBA test incrementally throttles memory bandwidth, each time followed by a comparison between the memory bandwidth observed by the performance counters and resctrl respectively.
While a comparison between performance counters and resctrl is generally appropriate, they do not have an identical view of memory bandwidth. For example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. As a ratio, this different view of memory bandwidth becomes more apparent at low memory bandwidths.
Interesting.
I did a prototype some time back with a change to the MBM test such that, instead of using once=false, I changed fill_buf to be able to run N passes through the buffer, which allowed me to know how many reads were performed by the benchmark. This yielded numerical differences between all those 3 values (# of reads, MBM, perf), which also varied from one arch to another, so it didn't end up making a usable test.
I guess I now have an explanation for at least a part of the differences.
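A rough sketch of the prototype idea described above (illustrative only; the helper is hypothetical and not part of the selftests): run a fixed number of passes over the buffer so the benchmark's read volume is known, then compare that figure against the MBM and perf/iMC counts.

	/*
	 * Hypothetical variant of the fill_buf read loop: perform exactly
	 * @passes passes over the buffer, touching one byte per cache line.
	 * If every line misses, memory traffic is roughly passes * buf_size
	 * bytes, a figure that can be compared against MBM and perf counts.
	 */
	static void fill_cache_read_passes(const unsigned char *buf,
					   size_t buf_size, unsigned int passes)
	{
		volatile unsigned char sink = 0;
		size_t i;

		while (passes--) {
			for (i = 0; i < buf_size; i += 64)
				sink = buf[i];
		}
		(void)sink;
	}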
It is not practical to enable/disable the various features that may generate memory bandwidth to give performance counters and resctrl an identical view. Instead, do not compare performance counters and resctrl view of memory bandwidth when the memory bandwidth is low.
Bandwidth throttling behaves differently across platforms so it is not appropriate to drop measurement data simply based on the throttling level. Instead, use a threshold of 750MiB that has been observed to support adequate comparison between performance counters and resctrl.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
tools/testing/selftests/resctrl/mba_test.c | 7 +++++++ tools/testing/selftests/resctrl/resctrl.h | 6 ++++++ 2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index cad473b81a64..204b9ac4b108 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc) avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1); avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
THROTTLE_THRESHOLD,
ALLOCATION_MAX - ALLOCATION_STEP * allocation);
The second one too should be %d.
Hi Ilpo,
On 8/30/24 4:42 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
...
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index cad473b81a64..204b9ac4b108 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc) avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1); avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
THROTTLE_THRESHOLD,
ALLOCATION_MAX - ALLOCATION_STEP * allocation);
The second one too should be %d.
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
Reinette
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 4:42 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
The MBA test incrementally throttles memory bandwidth, each time followed by a comparison between the memory bandwidth observed by the performance counters and resctrl respectively.
While a comparison between performance counters and resctrl is generally appropriate, they do not have an identical view of memory bandwidth. For example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. As a ratio, this different view of memory bandwidth becomes more apparent at low memory bandwidths.
Interesting.
Some time back I prototyped a change to the MBM test: instead of using once=false I changed fill_buf to be able to run N passes through the buffer, which allowed me to know how many reads were performed by the benchmark. This yielded numerical differences between all those 3 values (# of reads, MBM, perf) which also varied from one arch to another, so it didn't end up making a usable test.
I guess I now have an explanation for at least a part of the differences.
It is not practical to enable/disable the various features that may generate memory bandwidth to give performance counters and resctrl an identical view. Instead, do not compare performance counters and resctrl view of memory bandwidth when the memory bandwidth is low.
Bandwidth throttling behaves differently across platforms so it is not appropriate to drop measurement data simply based on the throttling level. Instead, use a threshold of 750MiB that has been observed to support adequate comparison between performance counters and resctrl.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
tools/testing/selftests/resctrl/mba_test.c | 7 +++++++ tools/testing/selftests/resctrl/resctrl.h | 6 ++++++ 2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index cad473b81a64..204b9ac4b108 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc) avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1); avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
	if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
		ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
			       THROTTLE_THRESHOLD,
			       ALLOCATION_MAX - ALLOCATION_STEP * allocation);
The second one too should be %d.
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Hi Ilpo,
On 9/4/24 4:43 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 4:42 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
The MBA test incrementally throttles memory bandwidth, each time followed by a comparison between the memory bandwidth observed by the performance counters and resctrl respectively.
While a comparison between performance counters and resctrl is generally appropriate, they do not have an identical view of memory bandwidth. For example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. As a ratio, this different view of memory bandwidth becomes more apparent at low memory bandwidths.
Interesting.
Some time back I prototyped a change to the MBM test: instead of using once=false I changed fill_buf to be able to run N passes through the buffer, which allowed me to know how many reads were performed by the benchmark. This yielded numerical differences between all those 3 values (# of reads, MBM, perf) which also varied from one arch to another, so it didn't end up making a usable test.
I guess I now have an explanation for at least a part of the differences.
It is not practical to enable/disable the various features that may generate memory bandwidth to give performance counters and resctrl an identical view. Instead, do not compare performance counters and resctrl view of memory bandwidth when the memory bandwidth is low.
Bandwidth throttling behaves differently across platforms so it is not appropriate to drop measurement data simply based on the throttling level. Instead, use a threshold of 750MiB that has been observed to support adequate comparison between performance counters and resctrl.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
tools/testing/selftests/resctrl/mba_test.c | 7 +++++++ tools/testing/selftests/resctrl/resctrl.h | 6 ++++++ 2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index cad473b81a64..204b9ac4b108 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc) avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1); avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
	if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
		ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
			       THROTTLE_THRESHOLD,
			       ALLOCATION_MAX - ALLOCATION_STEP * allocation);
The second one too should be %d.
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary
Sure, will look into that.
because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Is this not already covered in the following check: if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX) return END_OF_TESTS;
We are talking about hardcoded constants though.
Reinette
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:43 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 4:42 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
The MBA test incrementally throttles memory bandwidth, each time followed by a comparison between the memory bandwidth observed by the performance counters and resctrl respectively.
While a comparison between performance counters and resctrl is generally appropriate, they do not have an identical view of memory bandwidth. For example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. As a ratio, this different view of memory bandwidth becomes more apparent at low memory bandwidths.
Interesting.
Some time back I prototyped a change to the MBM test: instead of using once=false I changed fill_buf to be able to run N passes through the buffer, which allowed me to know how many reads were performed by the benchmark. This yielded numerical differences between all those 3 values (# of reads, MBM, perf) which also varied from one arch to another, so it didn't end up making a usable test.
I guess I now have an explanation for at least a part of the differences.
It is not practical to enable/disable the various features that may generate memory bandwidth to give performance counters and resctrl an identical view. Instead, do not compare performance counters and resctrl view of memory bandwidth when the memory bandwidth is low.
Bandwidth throttling behaves differently across platforms so it is not appropriate to drop measurement data simply based on the throttling level. Instead, use a threshold of 750MiB that has been observed to support adequate comparison between performance counters and resctrl.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
tools/testing/selftests/resctrl/mba_test.c | 7 +++++++ tools/testing/selftests/resctrl/resctrl.h | 6 ++++++ 2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index cad473b81a64..204b9ac4b108 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc) avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1); avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
	if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
		ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
			       THROTTLE_THRESHOLD,
			       ALLOCATION_MAX - ALLOCATION_STEP * allocation);
The second one too should be %d.
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary
Sure, will look into that.
because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Is this not already covered in the following check: if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX) return END_OF_TESTS;
We are talking about hardcoded constants though.
Borderline yes. It is "covered" by the allocation > ALLOCATION_MAX but it's also very non-intuitive to let the value underflow and then check for that with the > operator.
You're correct that they're constants, so one would need to tweak the source to end up in this condition in the first place.
Perhaps I'm being overly pedantic here, but in general I don't like leaving trappy and non-obvious logic like that lying around, because one day something like that will come back to bite us.
So, if a variable is unsigned and we ever do subtraction (or adding negative numbers to it), I'd prefer additional check to prevent ever underflowing it unexpectedly. Or leave them signed.
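To make that alternative concrete, here is a standalone toy illustrating a guard at the subtraction itself so the unsigned value can never wrap. Only the ALLOCATION_* values mirror the test's defines; the loop around them is illustrative and not how mba_setup() is actually structured:

#include <stdbool.h>
#include <stdio.h>

#define ALLOCATION_MIN	10
#define ALLOCATION_MAX	100
#define ALLOCATION_STEP	10

int main(void)
{
	unsigned int allocation = ALLOCATION_MAX;
	bool more = true;

	while (more) {
		printf("run test at %u%% MBA\n", allocation);

		/*
		 * Guard the subtraction: stop stepping once another full
		 * step would go below ALLOCATION_MIN, instead of letting
		 * the value wrap and detecting that afterwards.
		 */
		if (allocation < ALLOCATION_MIN + ALLOCATION_STEP)
			more = false;
		else
			allocation -= ALLOCATION_STEP;
	}

	return 0;
}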
Hi Ilpo,
On 9/5/24 4:45 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:43 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 4:42 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
The MBA test incrementally throttles memory bandwidth, each time followed by a comparison between the memory bandwidth observed by the performance counters and resctrl respectively.
While a comparison between performance counters and resctrl is generally appropriate, they do not have an identical view of memory bandwidth. For example RAS features or memory performance features that generate memory traffic may drive accesses that are counted differently by performance counters and MBM respectively, for instance generating "overhead" traffic which is not counted against any specific RMID. As a ratio, this different view of memory bandwidth becomes more apparent at low memory bandwidths.
Interesting.
Some time back I prototyped a change to the MBM test: instead of using once=false I changed fill_buf to be able to run N passes through the buffer, which allowed me to know how many reads were performed by the benchmark. This yielded numerical differences between all those 3 values (# of reads, MBM, perf) which also varied from one arch to another, so it didn't end up making a usable test.
I guess I now have an explanation for at least a part of the differences.
It is not practical to enable/disable the various features that may generate memory bandwidth to give performance counters and resctrl an identical view. Instead, do not compare performance counters and resctrl view of memory bandwidth when the memory bandwidth is low.
Bandwidth throttling behaves differently across platforms so it is not appropriate to drop measurement data simply based on the throttling level. Instead, use a threshold of 750MiB that has been observed to support adequate comparison between performance counters and resctrl.
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
tools/testing/selftests/resctrl/mba_test.c | 7 +++++++ tools/testing/selftests/resctrl/resctrl.h | 6 ++++++ 2 files changed, 13 insertions(+)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index cad473b81a64..204b9ac4b108 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -96,6 +96,13 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc) avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1); avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
	if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
		ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
			       THROTTLE_THRESHOLD,
			       ALLOCATION_MAX - ALLOCATION_STEP * allocation);
The second one too should be %d.
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary
Sure, will look into that.
because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Is this not already covered in the following check: if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX) return END_OF_TESTS;
We are talking about hardcoded constants though.
Borderline yes. It is "covered" by the allocation > ALLOCATION_MAX but it's also very non-intuitive to let the value underflow and then check for that with the > operator.
My understanding is that this is the traditional way of checking overflow (or more accurately wraparound). There are several examples of this pattern in the kernel and it is also the pattern that I understand Linus referred to as "traditional" in [1]. Even the compiler's intrinsic overflow checkers do checking in this way (perform the subtraction and then check if it overflowed) [2].
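As a standalone illustration of that pattern (values arbitrary): the subtraction is performed first, and the wrapped, out-of-range result is what signals the problem.

#include <stdio.h>

int main(void)
{
	unsigned int allocation = 10;

	allocation -= 25;	/* wraps around to a huge value */

	/* The "traditional" check: detect the wraparound after the fact. */
	if (allocation > 100)
		printf("wrapped around, out of range: %u\n", allocation);

	return 0;
}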
You're correct that they're constants, so one would need to tweak the source to end up in this condition in the first place.
Perhaps I'm being overly pedantic here, but in general I don't like leaving trappy and non-obvious logic like that lying around, because one day something like that will come back to bite us.
It is not clear to me what is "trappy" about this. The current checks will catch the wraparound if somebody changes ALLOCATION_STEP in your scenario, no?
So, if a variable is unsigned and we ever do subtraction (or adding negative numbers to it), I'd prefer additional check to prevent ever underflowing it unexpectedly. Or leave them signed.
To support checking at the time of subtraction we either need to change the flow of that function, since it cannot exit with failure when that subtraction overflows/wraps, or we need to introduce more state for an additional check that the existing "if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)" would have caught anyway.
In either case, to do this checking at the time of subtraction the ideal way would be to use check_sub_overflow() ... but it again does exactly what you find to be non-intuitive, since the implementation in tools/include/linux/overflow.h uses the gcc intrinsics that do the subtraction first and then check whether overflow occurred.
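Concretely, the builtin behind that helper can be exercised on its own (standalone example, values arbitrary); the subtraction is still performed, the wrap is just reported to the caller:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned int allocation = 10, next;

	/*
	 * check_sub_overflow() in tools/include/linux/overflow.h wraps
	 * this gcc/clang builtin: it does the subtraction and reports
	 * whether the result wrapped, rather than preventing it up front.
	 */
	if (__builtin_sub_overflow(allocation, 25u, &next))
		printf("subtraction wrapped, result would be %u\n", next);
	else
		allocation = next;

	return 0;
}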
It is not clear to me what problem you are aiming to solve here. If the major concern is that the current logic is not obvious, perhaps it can be clarified with a comment as below:
	if (runs_per_allocation++ != 0)
		return 0;

+	/*
+	 * Do not attempt allocation outside valid range. Safeguard
+	 * against any potential wraparound caused by subtraction.
+	 */
	if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)
		return END_OF_TESTS;
Reinette
[1] https://lwn.net/ml/linux-kernel/CAHk-=whS7FSbBoo1gxe+83twO2JeGNsUKMhAcfWymw9... [2] https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
On Thu, 5 Sep 2024, Reinette Chatre wrote:
On 9/5/24 4:45 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:43 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
On 8/30/24 4:42 AM, Ilpo Järvinen wrote:
On Thu, 29 Aug 2024, Reinette Chatre wrote:
Interesting.
I did some time back prototype with a change to MBM test such that instead of using once=false I changed fill_buf to be able to run N passes through the buffer which allowed me to know how many reads were performed by the benchmark. This yielded numerical difference between all those 3 values (# of reads, MBM, perf) which also varied from arch to another so it didn't end up making an usable test.
I guess I now have an explanation for at least a part of the differences.
The second one too should be %d.
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary
Sure, will look into that.
because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Is this not already covered in the following check: if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX) return END_OF_TESTS;
We are talking about hardcoded constants though.
Borderline yes. It is "covered" by the allocation > ALLOCATION_MAX but it's also very non-intuitive to let the value underflow and then check for that with the > operator.
My understanding is that this is the traditional way of checking overflow (or more accurately wraparound). There are several examples of this pattern in the kernel and it is also the pattern that I understand Linus referred to as "traditional" in [1]. Even the compiler's intrinsic overflow checkers do checking in this way (perform the subtraction and then check if it overflowed) [2].
Fair enough. I've never come across that kind of claim before.
You're correct that they're constants, so one would need to tweak the source to end up in this condition in the first place.
Perhaps I'm being overly pedantic here, but in general I don't like leaving trappy and non-obvious logic like that lying around, because one day something like that will come back to bite us.
It is not clear to me what is "trappy" about this. The current checks will catch the wraparound if somebody changes ALLOCATION_STEP in your scenario, no?
So, if a variable is unsigned and we ever do subtraction (or adding negative numbers to it), I'd prefer additional check to prevent ever underflowing it unexpectedly. Or leave them signed.
To support checking at the time of subtraction we either need to change the flow of that function, since it cannot exit with failure when that subtraction overflows/wraps, or we need to introduce more state for an additional check that the existing "if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)" would have caught anyway.
In either case, to do this checking at the time of subtraction the ideal way would be to use check_sub_overflow() ... but it again does exactly what you find to be non-intuitive, since the implementation in tools/include/linux/overflow.h uses the gcc intrinsics that do the subtraction first and then check whether overflow occurred.
It's trappy because at a glance that check looks unnecessary, since allocation starts from the max and goes downwards. Also worth noting: the check is not immediately after the decrement but is done on the next iteration.
The risk here is that somebody removes the allocation > ALLOCATION_MAX check.
Something called check_sub_overflow() is not subject to a similar risk regardless of what operations occur inside it.
It is not clear to me what problem you are aiming to solve here. If the major concern is that the current logic is not obvious, perhaps it can be clarified with a comment as below:
	if (runs_per_allocation++ != 0)
		return 0;

+	/*
+	 * Do not attempt allocation outside valid range. Safeguard
+	 * against any potential wraparound caused by subtraction.
+	 */
	if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)
		return END_OF_TESTS;
That would probably help but then it seems Linus is against such attempts and considers this hole in the cheese (i.e., representing something that is clearly a negative number with a positive number) "traditional".
Hi Ilpo,
On 9/6/24 1:44 AM, Ilpo Järvinen wrote:
On Thu, 5 Sep 2024, Reinette Chatre wrote:
On 9/5/24 4:45 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:43 AM, Ilpo Järvinen wrote:
On Fri, 30 Aug 2024, Reinette Chatre wrote:
hmmm ... I intended to have it be consistent with the ksft_print_msg() that follows. Perhaps allocation can be made unsigned instead?
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary
Sure, will look into that.
because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Is this not already covered in the following check: if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX) return END_OF_TESTS;
We are talking about hardcoded constants though.
Borderline yes. It is "covered" by the allocation > ALLOCATION_MAX but it's also very non-intuitive to let the value underflow and then check for that with the > operator.
My understanding is that this is the traditional way of checking overflow (or more accurately wraparound). There are several examples of this pattern in the kernel and it is also the pattern that I understand Linus referred to as "traditional" in [1]. Even the compiler's intrinsic overflow checkers do checking in this way (perform the subtraction and then check if it overflowed) [2].
Fair enough. I've never come across that kind of claim before.
You're correct that they're constants, so one would need to tweak the source to end up in this condition in the first place.
Perhaps I'm being overly pedantic here, but in general I don't like leaving trappy and non-obvious logic like that lying around, because one day something like that will come back to bite us.
It is not clear to me what is "trappy" about this. The current checks will catch the wraparound if somebody changes ALLOCATION_STEP in your scenario, no?
So, if a variable is unsigned and we ever do subtraction (or adding negative numbers to it), I'd prefer additional check to prevent ever underflowing it unexpectedly. Or leave them signed.
To support checking at the time of subtraction we either need to change the flow of that function, since it cannot exit with failure when that subtraction overflows/wraps, or we need to introduce more state for an additional check that the existing "if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)" would have caught anyway.
In either case, to do this checking at the time of subtraction the ideal way would be to use check_sub_overflow() ... but it again does exactly what you find to be non-intuitive, since the implementation in tools/include/linux/overflow.h uses the gcc intrinsics that do the subtraction first and then check whether overflow occurred.
It's trappy because at a glance that check looks unnecessary, since allocation starts from the max and goes downwards. Also worth noting: the check is not immediately after the decrement but is done on the next iteration.
Right. This is probably what causes most confusion.
Considering that, what do you think of the below, which avoids wraparound entirely:
---8<---
From: Reinette Chatre reinette.chatre@intel.com
Subject: [PATCH] selftests/resctrl: Make wraparound handling obvious
Within mba_setup() the programmed bandwidth delay value starts at the maximum (100, or rather ALLOCATION_MAX) and progresses towards ALLOCATION_MIN by decrementing with ALLOCATION_STEP.
The programmed bandwidth delay should never be negative, so representing it with an unsigned int is most appropriate. This may introduce confusion because of the "allocation > ALLOCATION_MAX" check used to check wraparound of the subtraction.
Modify the mba_setup() flow to start at the minimum, ALLOCATION_MIN, and incrementally, with ALLOCATION_STEP steps, adjust the bandwidth delay value. This avoids wraparound while making the purpose of "allocation > ALLOCATION_MAX" clear and eliminates the need for the "allocation < ALLOCATION_MIN" check.
Reported-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
---
Changes since V1:
- New patch
---
 tools/testing/selftests/resctrl/mba_test.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index ab8496a4925b..947d5699f0c8 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -39,7 +39,8 @@ static int mba_setup(const struct resctrl_test *test,
 		     const struct user_params *uparams,
 		     struct resctrl_val_param *p)
 {
-	static int runs_per_allocation, allocation = 100;
+	static unsigned int allocation = ALLOCATION_MIN;
+	static int runs_per_allocation = 0;
 	char allocation_str[64];
 	int ret;

@@ -50,7 +51,7 @@ static int mba_setup(const struct resctrl_test *test,
 	if (runs_per_allocation++ != 0)
 		return 0;

-	if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)
+	if (allocation > ALLOCATION_MAX)
 		return END_OF_TESTS;

 	sprintf(allocation_str, "%d", allocation);
@@ -59,7 +60,7 @@ static int mba_setup(const struct resctrl_test *test,
 	if (ret < 0)
 		return ret;

-	allocation -= ALLOCATION_STEP;
+	allocation += ALLOCATION_STEP;

 	return 0;
 }
@@ -72,8 +73,9 @@ static int mba_measure(const struct user_params *uparams,

 static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
 {
-	int allocation, runs;
+	unsigned int allocation;
 	bool ret = false;
+	int runs;

 	ksft_print_msg("Results are displayed in (MB)\n");
 	/* Memory bandwidth from 100% down to 10% */
@@ -103,7 +105,7 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
 			       avg_diff_per > MAX_DIFF_PERCENT ?
 			       "Fail:" : "Pass:",
 			       MAX_DIFF_PERCENT,
-			       ALLOCATION_MAX - ALLOCATION_STEP * allocation);
+			       ALLOCATION_MIN + ALLOCATION_STEP * allocation);

 		ksft_print_msg("avg_diff_per: %d%%\n", avg_diff_per);
 		ksft_print_msg("avg_bw_imc: %lu\n", avg_bw_imc);
The risk here is that somebody removes the allocation > ALLOCATION_MAX check.
Something called check_sub_overflow() is not subject to a similar risk regardless of what operations occur inside it.
Reinette
On Fri, 6 Sep 2024, Reinette Chatre wrote:
On 9/6/24 1:44 AM, Ilpo Järvinen wrote:
On Thu, 5 Sep 2024, Reinette Chatre wrote:
On 9/5/24 4:45 AM, Ilpo Järvinen wrote:
On Wed, 4 Sep 2024, Reinette Chatre wrote:
On 9/4/24 4:43 AM, Ilpo Järvinen wrote:
If you go that way, then it would be good to make the related defines and allocation in mba_setup() unsigned too, although the latter is a bit scary
Sure, will look into that.
because it does allocation -= ALLOCATION_STEP which could underflow if the defines are ever changed.
Is this not already covered in the following check: if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX) return END_OF_TESTS;
We are talking about hardcoded constants though.
Borderline yes. It is "covered" by the allocation > ALLOCATION_MAX but it's also very non-intuitive to let the value underflow and then check for that with the > operator.
My understanding is that this is the traditional way of checking overflow (or more accurately wraparound). There are several examples of this pattern in the kernel and it is also the pattern that I understand Linus referred to as "traditional" in [1]. Even the compiler's intrinsic overflow checkers do checking in this way (perform the subtraction and then check if it overflowed) [2].
Fair enough. I've never come across that kind of claim before.
You're correct that they're constants, so one would need to tweak the source to end up in this condition in the first place.
Perhaps I'm being overly pedantic here, but in general I don't like leaving trappy and non-obvious logic like that lying around, because one day something like that will come back to bite us.
It is not clear to me what is "trappy" about this. The current checks will catch the wraparound if somebody changes ALLOCATION_STEP in your scenario, no?
So, if a variable is unsigned and we ever do subtraction (or adding negative numbers to it), I'd prefer additional check to prevent ever underflowing it unexpectedly. Or leave them signed.
To support checking at the time of subtraction we either need to change the flow of that function, since it cannot exit with failure when that subtraction overflows/wraps, or we need to introduce more state for an additional check that the existing "if (allocation < ALLOCATION_MIN || allocation > ALLOCATION_MAX)" would have caught anyway.
In either case, to do this checking at the time of subtraction the ideal way would be to use check_sub_overflow() ... but it again does exactly what you find to be non-intuitive, since the implementation in tools/include/linux/overflow.h uses the gcc intrinsics that do the subtraction first and then check whether overflow occurred.
It's trappy because at a glance that check looks unnecessary, since allocation starts from the max and goes downwards. Also worth noting: the check is not immediately after the decrement but is done on the next iteration.
Right. This is probably what causes most confusion.
Considering that, what do you think of the below, which avoids wraparound entirely:
---8<---
From: Reinette Chatre reinette.chatre@intel.com
Subject: [PATCH] selftests/resctrl: Make wraparound handling obvious
Within mba_setup() the programmed bandwidth delay value starts at the maximum (100, or rather ALLOCATION_MAX) and progresses towards ALLOCATION_MIN by decrementing with ALLOCATION_STEP.
The programmed bandwidth delay should never be negative, so representing it with an unsigned int is most appropriate. This may introduce confusion because of the "allocation > ALLOCATION_MAX" check used to check wraparound of the subtraction.
Modify the mba_setup() flow to start at the minimum, ALLOCATION_MIN, and incrementally, with ALLOCATION_STEP steps, adjust the bandwidth delay value. This avoids wraparound while making the purpose of "allocation > ALLOCATION_MAX" clear and eliminates the need for the "allocation < ALLOCATION_MIN" check.
Reported-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com Signed-off-by: Reinette Chatre reinette.chatre@intel.com
Changes since V1:
- New patch
tools/testing/selftests/resctrl/mba_test.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
Looks fine.
Reviewed-by: Ilpo Järvinen ilpo.jarvinen@linux.intel.com
The resctrl selftests drop the results from every first test run to avoid (per comment) "inaccurate due to monitoring setup transition phase" data. Previously, inaccurate data resulted from workloads needing some time to "settle" and from the measurements themselves needing to account for earlier measurements in order to cover the needed time frame.
commit da50de0a92f3 ("selftests/resctrl: Calculate resctrl FS derived mem bw over sleep(1) only")
ensured that measurements accurately cover just the time frame of interest. The default "fill_buf" benchmark has since separated the buffer prepare phase from the benchmark run phase, reducing the need for the tests themselves to accommodate the benchmark's "settle" time.
With these enhancements there are no remaining portions needing to "settle" and the first test run can contribute to measurements.
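As a rough illustration of that prepare/run split (the function names and the memset pattern are hypothetical, not the selftests' actual code): the buffer is faulted in and initialized in full before any measurement interval starts, so the first measured run no longer includes the initialization.

#include <stdlib.h>
#include <string.h>

/* Prepare phase: allocate and touch every page before measurements begin. */
static unsigned char *prepare_buf(size_t buf_size)
{
	unsigned char *buf = malloc(buf_size);

	if (!buf)
		return NULL;
	memset(buf, 0x5a, buf_size);
	return buf;
}

/* Run phase: the part that the bandwidth measurements should cover. */
static void measured_run(const unsigned char *buf, size_t buf_size)
{
	volatile unsigned char sink;
	size_t i;

	for (i = 0; i < buf_size; i += 64)	/* read one byte per cache line */
		sink = buf[i];
	(void)sink;
}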
Signed-off-by: Reinette Chatre reinette.chatre@intel.com
---
 tools/testing/selftests/resctrl/cmt_test.c |  5 ++---
 tools/testing/selftests/resctrl/mba_test.c |  6 +++---
 tools/testing/selftests/resctrl/mbm_test.c | 10 +++-------
 3 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c
index f09d5dfab38c..85cb93d525b8 100644
--- a/tools/testing/selftests/resctrl/cmt_test.c
+++ b/tools/testing/selftests/resctrl/cmt_test.c
@@ -99,14 +99,13 @@ static int check_results(struct resctrl_val_param *param, size_t span, int no_of_bits)
 		}

 		/* Field 3 is llc occ resc value */
-		if (runs > 0)
-			sum_llc_occu_resc += strtoul(token_array[3], NULL, 0);
+		sum_llc_occu_resc += strtoul(token_array[3], NULL, 0);
 		runs++;
 	}
 	fclose(fp);

 	return show_results_info(sum_llc_occu_resc, no_of_bits, span,
-				 MAX_DIFF, MAX_DIFF_PERCENT, runs - 1, true);
+				 MAX_DIFF, MAX_DIFF_PERCENT, runs, true);
 }

 static void cmt_test_cleanup(void)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c
index 204b9ac4b108..dddf9bc04cfa 100644
--- a/tools/testing/selftests/resctrl/mba_test.c
+++ b/tools/testing/selftests/resctrl/mba_test.c
@@ -88,14 +88,14 @@ static bool show_mba_info(unsigned long *bw_imc, unsigned long *bw_resc)
 		 * The first run is discarded due to inaccurate value from
 		 * phase transition.
 		 */
-		for (runs = NUM_OF_RUNS * allocation + 1;
+		for (runs = NUM_OF_RUNS * allocation;
 		     runs < NUM_OF_RUNS * allocation + NUM_OF_RUNS ; runs++) {
 			sum_bw_imc += bw_imc[runs];
 			sum_bw_resc += bw_resc[runs];
 		}

-		avg_bw_imc = sum_bw_imc / (NUM_OF_RUNS - 1);
-		avg_bw_resc = sum_bw_resc / (NUM_OF_RUNS - 1);
+		avg_bw_imc = sum_bw_imc / NUM_OF_RUNS;
+		avg_bw_resc = sum_bw_resc / NUM_OF_RUNS;
 		if (avg_bw_imc < THROTTLE_THRESHOLD || avg_bw_resc < THROTTLE_THRESHOLD) {
 			ksft_print_msg("Bandwidth below threshold (%d MiB). Dropping results from MBA schemata %u.\n",
 				       THROTTLE_THRESHOLD,
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index 734bfa4f42b3..bbacba4ec195 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -22,17 +22,13 @@ show_bw_info(unsigned long *bw_imc, unsigned long *bw_resc, size_t span)
 	int runs, ret, avg_diff_per;
 	float avg_diff = 0;

-	/*
-	 * Discard the first value which is inaccurate due to monitoring setup
-	 * transition phase.
-	 */
-	for (runs = 1; runs < NUM_OF_RUNS ; runs++) {
+	for (runs = 0; runs < NUM_OF_RUNS ; runs++) {
 		sum_bw_imc += bw_imc[runs];
 		sum_bw_resc += bw_resc[runs];
 	}

-	avg_bw_imc = sum_bw_imc / 4;
-	avg_bw_resc = sum_bw_resc / 4;
+	avg_bw_imc = sum_bw_imc / NUM_OF_RUNS;
+	avg_bw_resc = sum_bw_resc / NUM_OF_RUNS;
 	avg_diff = (float)labs(avg_bw_resc - avg_bw_imc) / avg_bw_imc;
 	avg_diff_per = (int)(avg_diff * 100);