Sub-NUMA Clustering (SNC) allows splitting the CPU cores, caches and memory of a socket into multiple NUMA nodes. When enabled, NUMA-aware applications can achieve better performance on larger server platforms.
The series adding SNC support to the kernel is currently in review [1], but the resctrl selftests need NUMA-aware adjustments to work with these changes. Issues with the resctrl selftests not working properly with SNC enabled were originally reported by Shaopeng Tan [2][3], and the following series resolves them.
The main concept currently missing from the resctrl selftests is that, while resctrl tracks memory accesses on a single NUMA node (which normally is the same as the CPU socket), on machines with SNC enabled memory accesses can leak outside of the local NUMA node into other NUMA nodes on the same socket. In that case resctrl could report a diminished value in one of its monitoring technologies: Cache Monitoring Technology (CMT) or Memory Bandwidth Monitoring (MBM).
The implemented solutions for both CMT and MBM follow the same idea, which is to simply sum the values reported by the different NUMA nodes for a single Resource Monitoring ID (RMID).
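Conceptually, the correction in both tests boils down to the loop below (a minimal sketch with hypothetical helper names, not the actual selftest code):

/*
 * read_node_counter() stands in for reading one node's llc_occupancy or
 * MBM byte count; first_node is the first SNC node on the socket under
 * test. Both helpers are illustrative only.
 */
static unsigned long sum_snc_nodes(unsigned long (*read_node_counter)(int node),
				   int first_node, int snc_nodes)
{
	unsigned long sum = 0;

	for (int i = 0; i < snc_nodes; i++)
		sum += read_node_counter(first_node + i);

	return sum;
}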
The series was tested on Ice Lake server platforms with SNC disabled, SNC-2 and SNC-4. The tests were also run with and without kernel support for SNC.
Series applies cleanly on kselftest/next.
[1] https://lore.kernel.org/all/20240228112215.8044-tony.luck@intel.com/
[2] https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB...
[3] https://lore.kernel.org/lkml/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01M...
Maciej Wieczor-Retman (4):
  selftests/resctrl: Adjust effective L3 cache size with SNC enabled
  selftests/resctrl: SNC support for CMT
  selftests/resctrl: SNC support for MBM
  selftests/resctrl: Adjust SNC support messages
 tools/testing/selftests/resctrl/cache.c       |  17 ++-
 tools/testing/selftests/resctrl/cat_test.c    |   2 +-
 tools/testing/selftests/resctrl/cmt_test.c    |   6 +-
 tools/testing/selftests/resctrl/mba_test.c    |   3 +-
 tools/testing/selftests/resctrl/mbm_test.c    |   4 +-
 tools/testing/selftests/resctrl/resctrl.h     |  13 +-
 tools/testing/selftests/resctrl/resctrl_val.c |  46 ++++---
 tools/testing/selftests/resctrl/resctrlfs.c   | 128 +++++++++++++++++-
 8 files changed, 185 insertions(+), 34 deletions(-)
Sub-NUMA Cluster divides CPUs sharing an L3 cache into separate NUMA nodes. Systems may support splitting into either two or four nodes.
When SNC mode is enabled the effective amount of L3 cache available for allocation is divided by the number of nodes per L3.
Detect which SNC mode is active by comparing the number of CPUs that share a cache with CPU0, with the number of CPUs on node0.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Co-developed-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
 tools/testing/selftests/resctrl/resctrl.h   |  4 ++
 tools/testing/selftests/resctrl/resctrlfs.c | 59 +++++++++++++++++++++
 2 files changed, 63 insertions(+)
diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 2051bd135e0d..41811e87f81c 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -11,6 +11,7 @@ #include <signal.h> #include <dirent.h> #include <stdbool.h> +#include <ctype.h> #include <sys/stat.h> #include <sys/ioctl.h> #include <sys/mount.h> @@ -43,6 +44,8 @@
#define DEFAULT_SPAN (250 * MB)
+#define MAX_SNC 4 + #define PARENT_EXIT() \ do { \ kill(ppid, SIGKILL); \ @@ -129,6 +132,7 @@ extern pid_t bm_pid, ppid;
extern char llc_occup_path[1024];
+int snc_ways(void); int get_vendor(void); bool check_resctrlfs_support(void); int filter_dmesg(void); diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c index 1cade75176eb..e4d3624a8817 100644 --- a/tools/testing/selftests/resctrl/resctrlfs.c +++ b/tools/testing/selftests/resctrl/resctrlfs.c @@ -156,6 +156,63 @@ int get_domain_id(const char *resource, int cpu_no, int *domain_id) return 0; }
+/*
+ * Count number of CPUs in a /sys bit map
+ */
+static unsigned int count_sys_bitmap_bits(char *name)
+{
+	FILE *fp = fopen(name, "r");
+	int count = 0, c;
+
+	if (!fp)
+		return 0;
+
+	while ((c = fgetc(fp)) != EOF) {
+		if (!isxdigit(c))
+			continue;
+		switch (c) {
+		case 'f':
+			count++;
+		case '7': case 'b': case 'd': case 'e':
+			count++;
+		case '3': case '5': case '6': case '9': case 'a': case 'c':
+			count++;
+		case '1': case '2': case '4': case '8':
+			count++;
+		}
+	}
+	fclose(fp);
+
+	return count;
+}
+
+/*
+ * Detect SNC by comparing #CPUs in node0 with #CPUs sharing LLC with CPU0.
+ * If some CPUs are offline the numbers may not be exact multiples of each
+ * other. Any offline CPUs on node0 will be also gone from shared_cpu_map of
+ * CPU0 but offline CPUs from other nodes will only make the cache_cpus value
+ * lower. Still try to get the ratio right by preventing the second possibility.
+ */
+int snc_ways(void)
+{
+	int node_cpus, cache_cpus, i;
+
+	node_cpus = count_sys_bitmap_bits("/sys/devices/system/node/node0/cpumap");
+	cache_cpus = count_sys_bitmap_bits("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map");
+
+	if (!node_cpus || !cache_cpus) {
+		fprintf(stderr, "Warning could not determine Sub-NUMA Cluster mode\n");
+		return 1;
+	}
+
+	for (i = 1; i <= MAX_SNC ; i++) {
+		if (i * node_cpus >= cache_cpus)
+			return i;
+	}
+
+	return 1;
+}
+
 /*
  * get_cache_size - Get cache size for a specified CPU
  * @cpu_no: CPU number
@@ -211,6 +268,8 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size
 			break;
 	}
+ if (cache_num == 3) + *cache_size /= snc_ways(); return 0; }
Cache Monitoring Technology (CMT) works by measuring how much data in L3 cache is occupied by a given process identified by its Resource Monitoring ID (RMID).
On systems with Sub-NUMA Clusters (SNC) enabled, a process can occupy not only the cache that belongs to its own NUMA node but also pieces of other NUMA nodes' caches that lie on the same socket.
A simple correction to make the CMT selftest NUMA-aware is to sum values reported by all nodes on the same socket for a given RMID.
Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB...
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
 tools/testing/selftests/resctrl/cache.c       | 17 +++++++++++------
 tools/testing/selftests/resctrl/resctrl.h     |  4 +++-
 tools/testing/selftests/resctrl/resctrl_val.c |  9 ++++++---
 3 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c index 1b339d6bbff1..dab81920033b 100644 --- a/tools/testing/selftests/resctrl/cache.c +++ b/tools/testing/selftests/resctrl/cache.c @@ -161,16 +161,21 @@ int perf_event_measure(int pe_fd, struct perf_event_read *pe_read, * * Return: =0 on success. <0 on failure. */ -int measure_llc_resctrl(const char *filename, int bm_pid) +int measure_llc_resctrl(const char *filename, int bm_pid, const char *ctrlgrp, + const char *mongrp, int res_id) { - unsigned long llc_occu_resc = 0; + unsigned long sum = 0, llc_occu_resc = 0; int ret;
- ret = get_llc_occu_resctrl(&llc_occu_resc); - if (ret < 0) - return ret; + for (int i = 0 ; i < snc_ways() ; i++) { + set_cmt_path(ctrlgrp, mongrp, res_id + i); + ret = get_llc_occu_resctrl(&llc_occu_resc); + if (ret < 0) + return ret; + sum += llc_occu_resc; + }
- return print_results_cache(filename, bm_pid, llc_occu_resc); + return print_results_cache(filename, bm_pid, sum); }
/* diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 41811e87f81c..178fb2eab13a 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -133,6 +133,7 @@ extern pid_t bm_pid, ppid; extern char llc_occup_path[1024];
int snc_ways(void); +void set_cmt_path(const char *ctrlgrp, const char *mongrp, char sock_num); int get_vendor(void); bool check_resctrlfs_support(void); int filter_dmesg(void); @@ -182,7 +183,8 @@ int perf_open(struct perf_event_attr *pea, pid_t pid, int cpu_no); int perf_event_reset_enable(int pe_fd); int perf_event_measure(int pe_fd, struct perf_event_read *pe_read, const char *filename, int bm_pid); -int measure_llc_resctrl(const char *filename, int bm_pid); +int measure_llc_resctrl(const char *filename, int bm_pid, const char *ctrlgrp, + const char *mongrp, int res_id); void show_cache_info(int no_of_bits, __u64 avg_llc_val, size_t cache_span, bool lines);
/* diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index 5a49f07a6c85..e75e3923ebe2 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -557,7 +557,7 @@ static int print_results_bw(char *filename, int bm_pid, float bw_imc, return 0; }
-static void set_cmt_path(const char *ctrlgrp, const char *mongrp, char sock_num) +void set_cmt_path(const char *ctrlgrp, const char *mongrp, char sock_num) { if (strlen(ctrlgrp) && strlen(mongrp)) sprintf(llc_occup_path, CON_MON_LCC_OCCUP_PATH, RESCTRL_PATH, @@ -698,8 +698,8 @@ int resctrl_val(const struct resctrl_test *test, { char *resctrl_val = param->resctrl_val; unsigned long bw_resc_start = 0; + int res_id, ret = 0, pipefd[2]; struct sigaction sigact; - int ret = 0, pipefd[2]; char pipe_message = 0; union sigval value;
@@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test, sleep(1);
/* Test runs until the callback setup() tells the test to stop. */ + get_domain_id("L3", uparams->cpu, &res_id); + res_id *= snc_ways(); while (1) { ret = param->setup(test, uparams, param); if (ret == END_OF_TESTS) { @@ -844,7 +846,8 @@ int resctrl_val(const struct resctrl_test *test, break; } else if (!strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR))) { sleep(1); - ret = measure_llc_resctrl(param->filename, bm_pid); + ret = measure_llc_resctrl(param->filename, bm_pid, param->ctrlgrp, + param->mongrp, res_id); if (ret) break; }
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
Cache Monitoring Technology (CMT) works by measuring how much data in L3 cache is occupied by a given process identified by its Resource Monitoring ID (RMID).
On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy not only the cache that belongs to its own NUMA node but also pieces of other NUMA nodes' caches that lie on the same socket.
A simple correction to make the CMT selftest NUMA-aware is to sum values reported by all nodes on the same socket for a given RMID.
Reported-by: "Shaopeng Tan (Fujitsu)" tan.shaopeng@fujitsu.com Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB... Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
tools/testing/selftests/resctrl/cache.c | 17 +++++++++++------ tools/testing/selftests/resctrl/resctrl.h | 4 +++- tools/testing/selftests/resctrl/resctrl_val.c | 9 ++++++--- 3 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cache.c b/tools/testing/selftests/resctrl/cache.c index 1b339d6bbff1..dab81920033b 100644 --- a/tools/testing/selftests/resctrl/cache.c +++ b/tools/testing/selftests/resctrl/cache.c @@ -161,16 +161,21 @@ int perf_event_measure(int pe_fd, struct perf_event_read *pe_read,
- Return: =0 on success. <0 on failure.
*/ -int measure_llc_resctrl(const char *filename, int bm_pid) +int measure_llc_resctrl(const char *filename, int bm_pid, const char *ctrlgrp,
const char *mongrp, int res_id)
{
- unsigned long llc_occu_resc = 0;
- unsigned long sum = 0, llc_occu_resc = 0; int ret;
- ret = get_llc_occu_resctrl(&llc_occu_resc);
- if (ret < 0)
return ret;
- for (int i = 0 ; i < snc_ways() ; i++) {
Spaces as per usual coding style:
for (int i = 0; i < snc_ways(); i++) {
set_cmt_path(ctrlgrp, mongrp, res_id + i);
ret = get_llc_occu_resctrl(&llc_occu_resc);
if (ret < 0)
return ret;
sum += llc_occu_resc;
- }
- return print_results_cache(filename, bm_pid, llc_occu_resc);
- return print_results_cache(filename, bm_pid, sum);
} /*
@@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test, sleep(1); /* Test runs until the callback setup() tells the test to stop. */
- get_domain_id("L3", uparams->cpu, &res_id);
Hardcoding L3 here limits the genericness of this function. You don't even need to do it, get_domain_id() does "MB" -> "L3" transformation implicitly for you so you can just pass test->resource instead.
Also, I don't understand why you now again make the naming inconsistent with "res_id".
If you based this on top of the patches I just posted, resctrl_val() already has the domain_id variable.
On Fri, 8 Mar 2024, Ilpo Järvinen wrote:
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
Cache Monitoring Technology (CMT) works by measuring how much data in L3 cache is occupied by a given process identified by its Resource Monitoring ID (RMID).
On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy not only the cache that belongs to its own NUMA node but also pieces of other NUMA nodes' caches that lie on the same socket.
A simple correction to make the CMT selftest NUMA-aware is to sum values reported by all nodes on the same socket for a given RMID.
Reported-by: "Shaopeng Tan (Fujitsu)" tan.shaopeng@fujitsu.com Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB... Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
@@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test, sleep(1); /* Test runs until the callback setup() tells the test to stop. */
- get_domain_id("L3", uparams->cpu, &res_id);
Hardcoding L3 here limits the genericness of this function. You don't even need to do it, get_domain_id() does "MB" -> "L3" transformation implicitly for you so you can just pass test->resource instead.
Also, I don't understand why you now again make the naming inconsistent with "res_id".
If you based this on top of the patches I just posted, resctrl_val() already has the domain_id variable.
Ah, I retract what I said. I see you actually want it only from L3.
res_id *= snc_ways();
I don't understand what this is trying to achieve and how.
On 2024-03-08 at 15:59:02 +0200, Ilpo Järvinen wrote:
On Fri, 8 Mar 2024, Ilpo Järvinen wrote:
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
Cache Monitoring Technology (CMT) works by measuring how much data in L3 cache is occupied by a given process identified by its Resource Monitoring ID (RMID).
On systems with Sub-Numa Clusters (SNC) enabled, a process can occupy not only the cache that belongs to its own NUMA node but also pieces of other NUMA nodes' caches that lie on the same socket.
A simple correction to make the CMT selftest NUMA-aware is to sum values reported by all nodes on the same socket for a given RMID.
Reported-by: "Shaopeng Tan (Fujitsu)" tan.shaopeng@fujitsu.com Closes: https://lore.kernel.org/all/TYAPR01MB6330B9B17686EF426D2C3F308B25A@TYAPR01MB... Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
@@ -828,6 +828,8 @@ int resctrl_val(const struct resctrl_test *test, sleep(1); /* Test runs until the callback setup() tells the test to stop. */
- get_domain_id("L3", uparams->cpu, &res_id);
Hardcoding L3 here limits the genericness of this function. You don't even need to do it, get_domain_id() does "MB" -> "L3" transformation implicitly for you so you can just pass test->resource instead.
Also, I don't understand why you now again make the naming inconsistent with "res_id".
If you based this on top of the patches I just posted, resctrl_val() already has the domain_id variable.
Ah, I retract what I said. I see you actually want it only from L3.
res_id *= snc_ways();
I don't understand what this is trying to achieve and how.
We exchanged some private messages on this but I'll post the explanation here too if anyone else was looking for it.
get_domain_id("L3"...) essentially gives us the number of the socket (on platforms that have one L3 cache per socket - I still have to look into other ones). The problem here is that, to get an accurate reading with SNC enabled, we need to collect values from all nodes on the socket that has the CPU our test is running on. So we need to find the first node on that socket, and then we can loop through all the nodes on that socket. To do that we multiply res_id by the number of SNC nodes per socket (res_id *= snc_ways()) and that's it.
I'll add some helper with an explaining comment on what it does in the next version.
-- i.
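For reference, the index calculation described above amounts to something like this (sketch only, not the actual selftest code; it assumes one L3 per socket and that resctrl numbers SNC monitoring domains consecutively per socket):

/*
 * With N SNC nodes per socket, the monitoring domains of socket S are
 * assumed to be S*N .. S*N + N - 1, so the first one is S * N.
 */
static int first_snc_node_id(int cpu)
{
	int socket_id;

	if (get_domain_id("L3", cpu, &socket_id) < 0)
		return -1;

	return socket_id * snc_ways();
}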
Memory Bandwidth Monitoring (MBM) measures how much data flows between cache levels. Only the flow for a process specified with a Resource Monitoring ID (RMID) is measured.
Sub-NUMA Clustering (SNC) causes the MBM selftest to fail because the increased number of NUMA nodes per socket is not taken into account. That in turn makes the test use incorrect values for the start and end of the measurement by tracking the wrong node.
For the MBM selftest to be NUMA-aware it needs to track the start and end values of a measurement not for a single node but for all nodes on the same socket. Then summing all measured values comes out as the real bandwidth used by the process.
Reported-by: "Shaopeng Tan (Fujitsu)" <tan.shaopeng@fujitsu.com>
Closes: https://lore.kernel.org/lkml/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01M...
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
 tools/testing/selftests/resctrl/mba_test.c    |  1 -
 tools/testing/selftests/resctrl/resctrl_val.c | 37 ++++++++++++-------
 2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index 7946e32e85c8..fc31a61dab0c 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -147,7 +147,6 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param struct resctrl_val_param param = { .resctrl_val = MBA_STR, .ctrlgrp = "c1", - .mongrp = "m1", .filename = RESULT_FILE_NAME, .bw_report = "reads", .setup = mba_setup diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index e75e3923ebe2..2fe9f8bb4a45 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -595,9 +595,10 @@ static void initialize_llc_occu_resctrl(const char *ctrlgrp, const char *mongrp,
static int measure_vals(const struct user_params *uparams, struct resctrl_val_param *param, - unsigned long *bw_resc_start) + unsigned long *bw_resc_start, + int res_id) { - unsigned long bw_resc, bw_resc_end; + unsigned long bw_resc = 0, bw_resc_sum = 0, bw_resc_end; float bw_imc; int ret;
@@ -612,17 +613,19 @@ static int measure_vals(const struct user_params *uparams, if (ret < 0) return ret;
- ret = get_mem_bw_resctrl(&bw_resc_end); - if (ret < 0) - return ret; + for (int i = 0 ; i < snc_ways() ; i++) { + set_mbm_path(param->ctrlgrp, strlen(param->mongrp) > 0 ? param->mongrp : NULL, + res_id + i); + ret = get_mem_bw_resctrl(&bw_resc_end); + bw_resc = (bw_resc_end - bw_resc_start[i]) / MB; + bw_resc_sum += bw_resc; + bw_resc_start[i] = bw_resc_end; + }
- bw_resc = (bw_resc_end - *bw_resc_start) / MB; - ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc); + ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc_sum); if (ret) return ret;
- *bw_resc_start = bw_resc_end; - return 0; }
@@ -697,12 +700,16 @@ int resctrl_val(const struct resctrl_test *test, struct resctrl_val_param *param) { char *resctrl_val = param->resctrl_val; - unsigned long bw_resc_start = 0; int res_id, ret = 0, pipefd[2]; + unsigned long *bw_resc_start; struct sigaction sigact; char pipe_message = 0; union sigval value;
+ bw_resc_start = calloc(snc_ways(), sizeof(unsigned long)); + if (!bw_resc_start) + return -1; + if (strcmp(param->filename, "") == 0) sprintf(param->filename, "stdio");
@@ -710,7 +717,7 @@ int resctrl_val(const struct resctrl_test *test, !strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR))) { ret = validate_bw_report_request(param->bw_report); if (ret) - return ret; + goto out_free; }
/* @@ -721,7 +728,7 @@ int resctrl_val(const struct resctrl_test *test,
if (pipe(pipefd)) { ksft_perror("Unable to create pipe"); - + free(bw_resc_start); return -1; }
@@ -733,7 +740,7 @@ int resctrl_val(const struct resctrl_test *test, bm_pid = fork(); if (bm_pid == -1) { ksft_perror("Unable to fork"); - + free(bw_resc_start); return -1; }
@@ -841,7 +848,7 @@ int resctrl_val(const struct resctrl_test *test,
if (!strncmp(resctrl_val, MBM_STR, sizeof(MBM_STR)) || !strncmp(resctrl_val, MBA_STR, sizeof(MBA_STR))) { - ret = measure_vals(uparams, param, &bw_resc_start); + ret = measure_vals(uparams, param, bw_resc_start, res_id); if (ret) break; } else if (!strncmp(resctrl_val, CMT_STR, sizeof(CMT_STR))) { @@ -855,6 +862,8 @@ int resctrl_val(const struct resctrl_test *test,
out: kill(bm_pid, SIGKILL); +out_free: + free(bw_resc_start);
return ret; }
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
Memory Bandwidth Monitoring (MBM) measures how much data flows between cache levels. Only the flow for a process specified with a Resource Monitoring ID (RMID) is measured.
Sub-Numa Clustering (SNC) causes MBM selftest to fail because the increased amount of NUMA nodes per socket is not taken into account. That in turn makes the test use incorrect values for the start and end of the measurement by tracking the wrong node.
For the MBM selftest to be NUMA-aware it needs to track the start and end values of a measurement not for a single node but for all nodes on the same socket. Then summing all measured values comes out as the real bandwidth used by the process.
Reported-by: "Shaopeng Tan (Fujitsu)" tan.shaopeng@fujitsu.com Closes: https://lore.kernel.org/lkml/TYAPR01MB6330A4EB3633B791939EA45E8B39A@TYAPR01M... Signed-off-by: Maciej Wieczor-Retman maciej.wieczor-retman@intel.com
tools/testing/selftests/resctrl/mba_test.c | 1 - tools/testing/selftests/resctrl/resctrl_val.c | 37 ++++++++++++------- 2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index 7946e32e85c8..fc31a61dab0c 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -147,7 +147,6 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param struct resctrl_val_param param = { .resctrl_val = MBA_STR, .ctrlgrp = "c1",
.filename = RESULT_FILE_NAME, .bw_report = "reads", .setup = mba_setup.mongrp = "m1",
diff --git a/tools/testing/selftests/resctrl/resctrl_val.c b/tools/testing/selftests/resctrl/resctrl_val.c index e75e3923ebe2..2fe9f8bb4a45 100644 --- a/tools/testing/selftests/resctrl/resctrl_val.c +++ b/tools/testing/selftests/resctrl/resctrl_val.c @@ -595,9 +595,10 @@ static void initialize_llc_occu_resctrl(const char *ctrlgrp, const char *mongrp, static int measure_vals(const struct user_params *uparams, struct resctrl_val_param *param,
unsigned long *bw_resc_start)
unsigned long *bw_resc_start,
int res_id)
{
- unsigned long bw_resc, bw_resc_end;
- unsigned long bw_resc = 0, bw_resc_sum = 0, bw_resc_end; float bw_imc; int ret;
@@ -612,17 +613,19 @@ static int measure_vals(const struct user_params *uparams, if (ret < 0) return ret;
- ret = get_mem_bw_resctrl(&bw_resc_end);
- if (ret < 0)
return ret;
- for (int i = 0 ; i < snc_ways() ; i++) {
set_mbm_path(param->ctrlgrp, strlen(param->mongrp) > 0 ? param->mongrp : NULL,
res_id + i);
ret = get_mem_bw_resctrl(&bw_resc_end);
bw_resc = (bw_resc_end - bw_resc_start[i]) / MB;
bw_resc_sum += bw_resc;
bw_resc_start[i] = bw_resc_end;
- }
- bw_resc = (bw_resc_end - *bw_resc_start) / MB;
- ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc);
- ret = print_results_bw(param->filename, bm_pid, bw_imc, bw_resc_sum); if (ret) return ret;
- *bw_resc_start = bw_resc_end;
- return 0;
} @@ -697,12 +700,16 @@ int resctrl_val(const struct resctrl_test *test, struct resctrl_val_param *param) { char *resctrl_val = param->resctrl_val;
- unsigned long bw_resc_start = 0; int res_id, ret = 0, pipefd[2];
- unsigned long *bw_resc_start; struct sigaction sigact; char pipe_message = 0; union sigval value;
- bw_resc_start = calloc(snc_ways(), sizeof(unsigned long));
While correct, this seems a bit overkill given that MAX_SNC is 4, not something large or unbounded.
This patch would be much simpler on top of my resctrl_val() cleanup patches because bw_resc_start is only local to the measurement function.
On 2024-03-08 at 16:07:05 +0200, Ilpo Järvinen wrote:
On Wed, 6 Mar 2024, Maciej Wieczor-Retman wrote:
@@ -697,12 +700,16 @@ int resctrl_val(const struct resctrl_test *test, struct resctrl_val_param *param) { char *resctrl_val = param->resctrl_val;
- unsigned long bw_resc_start = 0; int res_id, ret = 0, pipefd[2];
- unsigned long *bw_resc_start; struct sigaction sigact; char pipe_message = 0; union sigval value;
- bw_resc_start = calloc(snc_ways(), sizeof(unsigned long));
While correct, this seems a bit overkill given that MAX_SNC is 4, not something large or unbounded.
This patch would be much simpler on top of my resctrl_val() cleanup patches because bw_resc_start is only local to the measurement function.
You're right, the series will get a lot simpler rebased onto yours. I'll try out some different approaches and comment any relevant points under your thread [1].
[1] https://lore.kernel.org/all/20240311135230.7007-1-ilpo.jarvinen@linux.intel....
-- i.
The resctrl selftests print a message on test failure saying that Sub-NUMA Clustering (SNC) could be enabled and point the user to check their BIOS settings. No actual check is performed before printing that message, so it is not very accurate in pinpointing the problem.
Figuring out if SNC is enabled is only one part of the problem, the other being whether the kernel supports it. As there is no easy interface that simply states SNC support in the kernel, one can find that information by comparing L3 cache sizes from different sources. The cache size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size will always show the full cache size even if it is split by enabled SNC. On the other hand, /sys/fs/resctrl/size reports an L3 size that, with kernel support, is adjusted for enabled SNC.
Add a function to find a cache size from /sys/fs/resctrl/size since finding that information from the other source is already implemented.
Add a function that compares the two cache sizes and use it to make the SNC support message more meaningful.
Add the SNC support message just after MBA's check_results() since MBA shares code with MBM and also can suffer from enabled SNC if there is no support in the kernel.
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
---
 tools/testing/selftests/resctrl/cat_test.c  |  2 +-
 tools/testing/selftests/resctrl/cmt_test.c  |  6 +-
 tools/testing/selftests/resctrl/mba_test.c  |  2 +
 tools/testing/selftests/resctrl/mbm_test.c  |  4 +-
 tools/testing/selftests/resctrl/resctrl.h   |  5 +-
 tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++++++++++++++++-
 6 files changed, 79 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c index 4cb991be8e31..1cdaadf35f03 100644 --- a/tools/testing/selftests/resctrl/cat_test.c +++ b/tools/testing/selftests/resctrl/cat_test.c @@ -253,7 +253,7 @@ static int cat_run_test(const struct resctrl_test *test, const struct user_param return ret;
/* Get L3/L2 cache size */ - ret = get_cache_size(uparams->cpu, test->resource, &cache_total_size); + ret = get_sys_cache_size(uparams->cpu, test->resource, &cache_total_size); if (ret) return ret; ksft_print_msg("Cache size :%lu\n", cache_total_size); diff --git a/tools/testing/selftests/resctrl/cmt_test.c b/tools/testing/selftests/resctrl/cmt_test.c index a81f91222a89..b7cada602484 100644 --- a/tools/testing/selftests/resctrl/cmt_test.c +++ b/tools/testing/selftests/resctrl/cmt_test.c @@ -112,7 +112,7 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param if (ret) return ret;
- ret = get_cache_size(uparams->cpu, "L3", &cache_total_size); + ret = get_sys_cache_size(uparams->cpu, "L3", &cache_total_size); if (ret) return ret; ksft_print_msg("Cache size :%lu\n", cache_total_size); @@ -157,8 +157,8 @@ static int cmt_run_test(const struct resctrl_test *test, const struct user_param goto out;
ret = check_results(¶m, span, n); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel CMT may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); + if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out: cmt_test_cleanup(); diff --git a/tools/testing/selftests/resctrl/mba_test.c b/tools/testing/selftests/resctrl/mba_test.c index fc31a61dab0c..89fe3ecbf497 100644 --- a/tools/testing/selftests/resctrl/mba_test.c +++ b/tools/testing/selftests/resctrl/mba_test.c @@ -160,6 +160,8 @@ static int mba_run_test(const struct resctrl_test *test, const struct user_param goto out;
ret = check_results(); + if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out: mba_test_cleanup(); diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c index d67ffa3ec63a..e12b4b06f6d5 100644 --- a/tools/testing/selftests/resctrl/mbm_test.c +++ b/tools/testing/selftests/resctrl/mbm_test.c @@ -129,8 +129,8 @@ static int mbm_run_test(const struct resctrl_test *test, const struct user_param goto out;
ret = check_results(DEFAULT_SPAN); - if (ret && (get_vendor() == ARCH_INTEL)) - ksft_print_msg("Intel MBM may be inaccurate when Sub-NUMA Clustering is enabled. Check BIOS configuration.\n"); + if (ret && (get_vendor() == ARCH_INTEL) && snc_ways() > 1 && !snc_kernel_support()) + ksft_print_msg("Kernel doesn't support Sub-NUMA Clustering but it is enabled. Check BIOS configuration.\n");
out: mbm_test_cleanup(); diff --git a/tools/testing/selftests/resctrl/resctrl.h b/tools/testing/selftests/resctrl/resctrl.h index 178fb2eab13a..038e1269a3fc 100644 --- a/tools/testing/selftests/resctrl/resctrl.h +++ b/tools/testing/selftests/resctrl/resctrl.h @@ -28,6 +28,7 @@ #define RESCTRL_PATH "/sys/fs/resctrl" #define PHYS_ID_PATH "/sys/devices/system/cpu/cpu" #define INFO_PATH "/sys/fs/resctrl/info" +#define SIZE_PATH "/sys/fs/resctrl/size"
/* * CPU vendor IDs @@ -168,14 +169,16 @@ unsigned long create_bit_mask(unsigned int start, unsigned int len); unsigned int count_contiguous_bits(unsigned long val, unsigned int *start); int get_full_cbm(const char *cache_type, unsigned long *mask); int get_mask_no_shareable(const char *cache_type, unsigned long *mask); -int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size); int resource_info_unsigned_get(const char *resource, const char *filename, unsigned int *val); +int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size); +int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size); void ctrlc_handler(int signum, siginfo_t *info, void *ptr); int signal_handler_register(void); void signal_handler_unregister(void); void cat_test_cleanup(void); unsigned int count_bits(unsigned long n); void cmt_test_cleanup(void); +int snc_kernel_support(void);
void perf_event_attr_initialize(struct perf_event_attr *pea, __u64 config); void perf_event_initialize_read_format(struct perf_event_read *pe_read); diff --git a/tools/testing/selftests/resctrl/resctrlfs.c b/tools/testing/selftests/resctrl/resctrlfs.c index e4d3624a8817..dbd10cb7abf5 100644 --- a/tools/testing/selftests/resctrl/resctrlfs.c +++ b/tools/testing/selftests/resctrl/resctrlfs.c @@ -214,14 +214,14 @@ int snc_ways(void) }
/* - * get_cache_size - Get cache size for a specified CPU + * get_sys_cache_size - Get cache size for a specified CPU * @cpu_no: CPU number * @cache_type: Cache level L2/L3 * @cache_size: pointer to cache_size * * Return: = 0 on success, < 0 on failure. */ -int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size) +int get_sys_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size) { char cache_path[1024], cache_str[64]; int length, i, cache_num; @@ -273,6 +273,44 @@ int get_cache_size(int cpu_no, const char *cache_type, unsigned long *cache_size return 0; }
+/* + * get_resctrl_cache_size - Get cache size as reported by resctrl + * @cache_type: Cache level L2/L3 + * @cache_size: pointer to cache_size + * + * Return: = 0 on success, < 0 on failure. + */ +int get_resctrl_cache_size(const char *cache_type, unsigned long *cache_size) +{ + char line[256], cache_prefix[16], *stripped_line, *token; + size_t len; + FILE *fp; + + strcpy(cache_prefix, cache_type); + strncat(cache_prefix, ":", 1); + + fp = fopen(SIZE_PATH, "r"); + if (!fp) { + ksft_print_msg("Failed to open %s : '%s'\n", + SIZE_PATH, strerror(errno)); + return -1; + } + + while (fgets(line, sizeof(line), fp)) { + stripped_line = strstr(line, cache_prefix); + + if (stripped_line) { + len = strlen(cache_prefix); + stripped_line += len; + token = strtok(stripped_line, ";"); + if (sscanf(token, "0=%lu", cache_size) <= 0) + return -1; + } + } + fclose(fp); + return 0; +} + #define CORE_SIBLINGS_PATH "/sys/bus/cpu/devices/cpu"
/* @@ -935,3 +973,30 @@ unsigned int count_bits(unsigned long n)
 	return count;
 }
+
+/**
+ * snc_kernel_support - Compare system reported cache size and resctrl
+ * reported cache size to get an idea if SNC is supported on the kernel side.
+ * If SNC is enabled and the kernel does support it the values should be equal.
+ * If the kernel doesn't support SNC the values will differ.
+ *
+ * Return: 0 if not supported, 1 if supported, < 0 on failure.
+ */
+int snc_kernel_support(void)
+{
+	unsigned long resctrl_cache_size, node_cache_size;
+	int ret;
+
+	ret = get_sys_cache_size(0, "L3", &node_cache_size);
+	if (ret < 0)
+		return ret;
+
+	ret = get_resctrl_cache_size("L3", &resctrl_cache_size);
+	if (ret < 0)
+		return ret;
+
+	if (resctrl_cache_size == node_cache_size)
+		return 1;
+
+	return 0;
+}
Figuring out if SNC is enabled is only one part of the problem, the other being whether the kernel supports it. As there is no easy interface that simply states SNC support in the kernel one can find that information by comparing L3 cache sizes from different sources. Cache size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size will always show the full cache size even if it's split by enabled SNC. On the other hand /sys/fs/resctrl/size has information about L3 size, that with kernel support is adjusted for enabled SNC.
Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways file to provide this information. I was talked out of it then:
https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/
But that discussion didn't consider the question you discuss here: "Does this instance of the kernel support SNC?"
So you have a clever solution. But it seems like a roundabout way for an application to discover whether the kernel has configured resctrl for SNC mode.
Should the kernel provide an info/ file listing the SNC configuration?
If so, what should it be named? "snc_ways" as a kernel variable was later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
-Tony
On 2024-03-06 at 21:54:02 +0000, Luck, Tony wrote:
Figuring out if SNC is enabled is only one part of the problem, the other being whether the kernel supports it. As there is no easy interface that simply states SNC support in the kernel one can find that information by comparing L3 cache sizes from different sources. Cache size reported by /sys/devices/system/node/node0/cpu0/cache/index3/size will always show the full cache size even if it's split by enabled SNC. On the other hand /sys/fs/resctrl/size has information about L3 size, that with kernel support is adjusted for enabled SNC.
Early versions of the kernel SNC patch series added an info/l3_MON/snc_ways file to provide this information. I was talked out of it then:
https://lore.kernel.org/all/f0841866-315b-4727-0a6c-ec60d22ca29c@arm.com/
But that discussion didn't consider the question you discuss here: "Does this instance of the kernel support SNC?"
So you have a clever solution. But it seems like a roundabout way for an application to discover whether the kernel has configured resctrl for SNC mode.
Should the kernel provide an info/ file listing the SNC configuration?
I suppose it would be a benefit for other numa aware applications to have an easy access to this kernel support information.
If so, what should it be named? "snc_ways" as a kernel variable was later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
"snc_nodes_per_l3_cache" seems okay to me.
And I understand that the file content would show the SNC mode, and the presence or absence of this file would tell if the kernel supports SNC?
-Tony
If so, what should it be named? "snc_ways" as a kernel variable was later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
"snc_nodes_per_l3_cache" seems okay to me.
And I understand that the file content would show SNC mode and the presence or absence of this file would tell if kernel supports SNC?
Yes. The existence of the file indicates the kernel is SNC aware.
The value read from the file would give the number of nodes per L3 (obviously :-) )
SNC not supported by this platform or not enabled:
$ cat /sys/fs/resctrl/info/L3_mon/snc_nodes_per_l3_cache
1
SNC2 enabled:
$ cat /sys/fs/resctrl/info/L3_mon/snc_nodes_per_l3_cache
2
etc.
-Tony
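If such a file existed, a selftest (or any other NUMA-aware application) could detect kernel-side SNC support with something as small as the sketch below (illustrative only - the file is just a proposal at this point in the thread):

#include <stdio.h>

/*
 * Returns the number of SNC nodes per L3 cache as reported by the proposed
 * info file, or 1 when the file is absent, i.e. the kernel is not SNC aware.
 */
static int snc_nodes_per_l3_cache(void)
{
	FILE *fp = fopen("/sys/fs/resctrl/info/L3_MON/snc_nodes_per_l3_cache", "r");
	int nodes = 1;

	if (!fp)
		return 1;
	if (fscanf(fp, "%d", &nodes) != 1)
		nodes = 1;
	fclose(fp);

	return nodes;
}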
Hi Tony,
On 3/7/2024 9:18 AM, Luck, Tony wrote:
If so, what should it be named? "snc_ways" as a kernel variable was later replaced by "snc_nodes_per_l3_cache". Is that a good filename?
"snc_nodes_per_l3_cache" seems okay to me.
And I understand that the file content would show SNC mode and the presence or absence of this file would tell if kernel supports SNC?
Yes. The existence of the file indicates the kernel is SNC aware.
The value read from the file would give the number of nodes per L3 (obviously :-) )
SNC not supported by this platform or not enabled:
$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache 1
SNC2 enabled:
$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache 2
This would be useful. I believe "SNC" is architecture specific? What if the file always exists and is named "nodes_per_l3_cache"?
I assume that the internals of handling more nodes per L3 cache should be hidden from user space and it should not be necessary for user space to know if this is because of SNC or potentially some other mechanism on another platform?
I think that may reduce fragmentation of resctrl .... not having resctrl look so different on different architectures but maintains the promise of a generic interface.
I am not sure if this is specific to monitoring though, why not host file in /sys/fs/resctrl/info/L3 ?
Reinette
SNC2 enabled:
$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache 2
This would be useful. I believe "SNC" is architecture specific? What if the file always exists and is named "nodes_per_l3_cache"?
I assume that the internals of handling more nodes per L3 cache should be hidden from user space and it should not be necessary for user space to know if this is because of SNC or potentially some other mechanism on another platform?
I think that may reduce fragmentation of resctrl .... not having resctrl look so different on different architectures but maintains the promise of a generic interface.
I am not sure if this is specific to monitoring though, why not host file in /sys/fs/resctrl/info/L3 ?
Reinette,
On the name change - sure. It doesn't need the "snc_" prefix.
The Intel implementation of SNC has far more effect on monitoring than on control. The user can read separate cache occupancy and memory bandwidth values for each SNC node. But cache allocation bitmasks and memory throttling still have a single control point for each L3 cache instance, not for each node. There are still some impacts on control, e.g. each bit in a CAT bitmask represents less actual space in the L3 cache.
Maybe move it to the top level of the info/ directory:
$ cat /sys/fs/resctrl/info/nodes_per_l3_cache
3
-Tony
Hi Tony,
On 3/7/2024 9:57 AM, Luck, Tony wrote:
SNC2 enabled:
$ cat /sys/fs/resctrl/info/L3_mon/ snc_nodes_per_l3_cache 2
This would be useful. I believe "SNC" is architecture specific? What if the file always exists and is named "nodes_per_l3_cache"?
I assume that the internals of handling more nodes per L3 cache should be hidden from user space and it should not be necessary for user space to know if this is because of SNC or potentially some other mechanism on another platform?
I think that may reduce fragmentation of resctrl .... not having resctrl look so different on different architectures but maintains the promise of a generic interface.
I am not sure if this is specific to monitoring though, why not host file in /sys/fs/resctrl/info/L3 ?
Reinette,
On the name change - sure. It doesn't need the "snc_" prefix.
The Intel implementation of SNC has far more effect on monitoring than on control. The user can read separate cache occupancy and memory bandwidth values for each SNC node. But cache allocation bitmasks and memory throttling still have a single control point for each L3 cache instance, not for each node. There are still some impacts on control, e.g. each bit in a CAT bitmask represents less actual space in the L3 cache.
I understand the impact but I am trying to view this conceptually. The info directory exists to "contain information about the enabled resources" (as per documentation) and it seems appropriate to make this a property of the L3 resource.
Maybe move it to the top level of the info/ directory:
$ cat /sys/fs/resctrl/info/nodes_per_l3_cache 3
Thinking about it even differently. The goal is to give information to userspace so we need to think about what would help user space? For example, what if there is a file in info that shows which CPUs are associated with each domain?
Reinette
Thinking about it even differently. The goal is to give information to userspace so we need to think about what would help user space? For example, what if there is a file in info that shows which CPUs are associated with each domain?
Reinette,
Interesting idea. That would save users from having to chase through /sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain numbers in schemata files and the mon_data/mon_L3_XX values mean.
May be extra useful for ARM which seems to have big random-looking numbers for domains that came out of ACPI tables.
For SNC it would get the user directly to what they probably care about (which CPUs are in which domain).
So something like this for an SNC 2 system:
$ cat /sys/fs/resctrl/info/L3/cpus
0: 0-35,72-107
1: 36-71,108-143
$ cat /sys/fs/resctrl/info/L3_MON/cpus
0: 0-17,72-89
1: 18-35,90-107
2: 36-53,108-125
3: 54-71,126-143
[maybe there is a better name than "cpus" for this file?]
-Tony
Hi Tony,
On 3/7/2024 1:14 PM, Luck, Tony wrote:
Thinking about it even differently. The goal is to give information to userspace so we need to think about what would help user space? For example, what if there is a file in info that shows which CPUs are associated with each domain?
Reinette,
Interesting idea. That would save users from having to chase through /sys/devices/system/cpu/cpu*/cache/index?/* to figure out what the domain numbers in schemata files and the mon_data/mon_L3_XX values mean.
May be extra useful for ARM which seems to have big random-looking numbers for domains that came out of ACPI tables.
For SNC it would get the user directly to what they probably care about (which CPUs are in which domain).
I agree.
So something like this for an SNC 2 system:
$ cat /sys/fs/resctrl/info/L3/cpus 0: 0-35,72-107 1: 36-71,108-143
$ cat /sys/fs/resctrl/info/L3_MON/cpus 0: 0-17,72-89 1: 18-35,90-107 2: 36-53,108-125 3: 54-71,126-143
[maybe there is a better name than "cpus" for this file?]
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Reinette
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Reinette,
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/
$ grep . */domain*
L3/domain_cpu_list:0: 0-35,72-107
L3/domain_cpu_list:1: 36-71,108-143
L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
L3_MON/domain_cpu_list:0: 0-35,72-107
L3_MON/domain_cpu_list:1: 36-71,108-143
L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
MB/domain_cpu_list:0: 0-35,72-107
MB/domain_cpu_list:1: 36-71,108-143
MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
The patch to do this is pretty straightforward.
-Tony
---
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index ae80170a0d1b..c180b80640e3 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -957,6 +957,20 @@ static int rdt_num_closids_show(struct kernfs_open_file *of, return 0; }
+static int rdt_ctrl_cpus_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct resctrl_schema *s = of->kn->parent->priv; + struct rdt_resource *r = s->res; + struct rdt_ctrl_domain *d; + + list_for_each_entry(d, &r->ctrl_domains, hdr.list) + seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n", + d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask)); + + return 0; +} + static int rdt_default_ctrl_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { @@ -1103,6 +1117,19 @@ static int rdt_num_rmids_show(struct kernfs_open_file *of, return 0; }
+static int rdt_mon_cpus_show(struct kernfs_open_file *of, + struct seq_file *seq, void *v) +{ + struct rdt_resource *r = of->kn->parent->priv; + struct rdt_mon_domain *d; + + list_for_each_entry(d, &r->mon_domains, hdr.list) + seq_printf(seq, is_cpu_list(of) ? "%d: %*pbl\n" : "%d: %*pb\n", + d->hdr.id, cpumask_pr_args(&d->hdr.cpu_mask)); + + return 0; +} + static int rdt_mon_features_show(struct kernfs_open_file *of, struct seq_file *seq, void *v) { @@ -1810,6 +1837,21 @@ static struct rftype res_common_files[] = { .seq_show = rdt_num_closids_show, .fflags = RFTYPE_CTRL_INFO, }, + { + .name = "domain_cpu_list", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_ctrl_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_CTRL_INFO, + }, + { + .name = "domain_cpu_map", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_ctrl_cpus_show, + .fflags = RFTYPE_CTRL_INFO, + }, { .name = "mon_features", .mode = 0444, @@ -1824,6 +1866,21 @@ static struct rftype res_common_files[] = { .seq_show = rdt_num_rmids_show, .fflags = RFTYPE_MON_INFO, }, + { + .name = "domain_cpu_list", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_cpus_show, + .flags = RFTYPE_FLAGS_CPUS_LIST, + .fflags = RFTYPE_MON_INFO, + }, + { + .name = "domain_cpu_map", + .mode = 0444, + .kf_ops = &rdtgroup_kf_single_ops, + .seq_show = rdt_mon_cpus_show, + .fflags = RFTYPE_MON_INFO, + }, { .name = "cbm_mask", .mode = 0444,
Hi Tony,
On 3/7/2024 3:16 PM, Tony Luck wrote:
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Reinette,
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/ $ grep . */domain* L3/domain_cpu_list:0: 0-35,72-107 L3/domain_cpu_list:1: 36-71,108-143 L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 L3_MON/domain_cpu_list:0: 0-35,72-107 L3_MON/domain_cpu_list:1: 36-71,108-143 L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 MB/domain_cpu_list:0: 0-35,72-107 MB/domain_cpu_list:1: 36-71,108-143 MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
The patch to do this is pretty straightforward.
Thank you for looking into this. This looks like valuable information for user space. Back to what started this discussion ... I expect user space can compare CPUs associated with control and monitoring domains to learn if SNC is enabled? (And now existence of domain_cpu_list and/or domain_cpu_map can also be used to determine if kernel supports SNC).
Reinette
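A userspace check along the lines described above could look roughly like the sketch below (illustrative only - domain_cpu_list is a proposed interface in this thread, and monitoring having more domains than control is what would indicate SNC):

#include <stdio.h>

/* Count resctrl domains by counting lines in one of the proposed info files. */
static int count_domains(const char *path)
{
	FILE *fp = fopen(path, "r");
	int c, lines = 0;

	if (!fp)
		return -1;	/* file absent: kernel predates this interface */
	while ((c = fgetc(fp)) != EOF)
		if (c == '\n')
			lines++;
	fclose(fp);

	return lines;
}

/*
 * SNC would be enabled (and supported by the kernel) if, e.g.:
 *   count_domains("/sys/fs/resctrl/info/L3_MON/domain_cpu_list") >
 *   count_domains("/sys/fs/resctrl/info/L3/domain_cpu_list")
 */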
Hi guys,
On 07/03/2024 23:16, Tony Luck wrote:
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/ $ grep . */domain* L3/domain_cpu_list:0: 0-35,72-107 L3/domain_cpu_list:1: 36-71,108-143 L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 L3_MON/domain_cpu_list:0: 0-35,72-107 L3_MON/domain_cpu_list:1: 36-71,108-143 L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 MB/domain_cpu_list:0: 0-35,72-107 MB/domain_cpu_list:1: 36-71,108-143 MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this really because that information is, er, wrong on SNC systems. Is it possible to fix that?
From Tony's earlier description of how SNC changes things, the MB controls remain per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA scoped. This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different views of 'the' cache hierarchy.
(I also think that this be over the threshold on 'funny machines look funny' - but I bet someone builds an arm machine that looks like this too!)
Thanks,
James
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
Hi guys,
On 07/03/2024 23:16, Tony Luck wrote:
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/ $ grep . */domain* L3/domain_cpu_list:0: 0-35,72-107 L3/domain_cpu_list:1: 36-71,108-143 L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 L3_MON/domain_cpu_list:0: 0-35,72-107 L3_MON/domain_cpu_list:1: 36-71,108-143 L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 MB/domain_cpu_list:0: 0-35,72-107 MB/domain_cpu_list:1: 36-71,108-143 MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this really because that information is, er, wrong on SNC systems. Is it possible to fix that?
On an SNC system the resctrl domain for L3_MON becomes the SNC node instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
Even without the SNC issue this duplication may be a useful convenience. On Intel, getting from a resctrl domain to its CPUs is a multi-step process: first find which of the indexY directories has level=3, then look for the "id" that matches the domain.
From Tony's earlier description of how SNC changes things, the MB controls remain
per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA scoped. This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different views of 'the' cache hierarchy.
I almost went partly in that direction when I started this epic voyage. The "almost" part was to change the names of the monitoring directories under mon_data from (legacy non-SNC system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
to (2 socket, SNC=2 system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names.
(I also think that this be over the threshold on 'funny machines look funny' - but I bet someone builds an arm machine that looks like this too!)
-Tony
Hi Tony,
On 08/03/2024 18:42, Tony Luck wrote:
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
Hi guys,
On 07/03/2024 23:16, Tony Luck wrote:
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/ $ grep . */domain* L3/domain_cpu_list:0: 0-35,72-107 L3/domain_cpu_list:1: 36-71,108-143 L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 L3_MON/domain_cpu_list:0: 0-35,72-107 L3_MON/domain_cpu_list:1: 36-71,108-143 L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000 MB/domain_cpu_list:0: 0-35,72-107 MB/domain_cpu_list:1: 36-71,108-143 MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this really because that information is, er, wrong on SNC systems. Is it possible to fix that?
On an SNC system the resctrl domain for L3_MON becomes the SNC node instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
Even without the SNC issue this duplication may be a useful convienience. On Intel to get from a resctrl domain is a multi-step process to first find which of the indexY directories has level=3 and then look for the "id" that matches the domain.
From Tony's earlier description of how SNC changes things, the MB controls remain
per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA scoped. This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different views of 'the' cache hierarchy.
I almost went partly in that direction when I started this epic voyage. The "almost" part was to change the names of the monitoring directories under mon_data from (legacy non-SNC system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
to (2 socket, SNC=2 system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM controls on the 'L3', but describe it as a single global L3. The MPAM driver currently hides this by summing the NUMA node counters and reporting it as the global L3's value.
While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names.
True - but the behaviour is different from a non-SNC system. If this software can read the file but goes wrong because the contents of the file represent something different, it's still broken.
Thanks,
James
On 3/15/2024 11:02 AM, James Morse wrote:
On 08/03/2024 18:42, Tony Luck wrote:
On Fri, Mar 08, 2024 at 06:06:45PM +0000, James Morse wrote:
Hi guys,
On 07/03/2024 23:16, Tony Luck wrote:
On Thu, Mar 07, 2024 at 02:39:08PM -0800, Reinette Chatre wrote:
Thank you for the example. I find that significantly easier to understand than a single number in a generic "nodes_per_l3_cache". Especially with potential confusion surrounding inconsistent "nodes" between allocation and monitoring.
How about domain_cpu_list and domain_cpu_map ?
Like this (my test system doesn't have SNC, so all domains are the same):
$ cd /sys/fs/resctrl/info/
$ grep . */domain*
L3/domain_cpu_list:0: 0-35,72-107
L3/domain_cpu_list:1: 36-71,108-143
L3/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
L3_MON/domain_cpu_list:0: 0-35,72-107
L3_MON/domain_cpu_list:1: 36-71,108-143
L3_MON/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
L3_MON/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
MB/domain_cpu_list:0: 0-35,72-107
MB/domain_cpu_list:1: 36-71,108-143
MB/domain_cpu_map:0: 0000,00000fff,ffffff00,0000000f,ffffffff
MB/domain_cpu_map:1: ffff,fffff000,000000ff,fffffff0,00000000
This duplicates the information in /sys/devices/system/cpu/cpuX/cache/indexY ... is this really because that information is, er, wrong on SNC systems? Is it possible to fix that?
On an SNC system the resctrl domain for L3_MON becomes the SNC node instead of the L3 cache instance. With 2, 3, or 4 SNC nodes per L3.
Even without the SNC issue this duplication may be a useful convenience. On Intel, getting from a resctrl domain to the matching cache information is a multi-step process: first find which of the indexY directories has level=3, then look for the "id" that matches the domain.
From Tony's earlier description of how SNC changes things, the MB controls remain
per-socket. To me it feels less invasive to fix the definition of L3 on these platforms to describe how it behaves (assuming that is possible), and define a new 'MB' that is NUMA scoped. This direction of redefining L3 means /sys/fs/resctrl and /sys/devices have different views of 'the' cache hierarchy.
I almost went partly in that direction when I started this epic voyage. The "almost" part was to change the names of the monitoring directories under mon_data from (legacy non-SNC system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_L3_01
to (2 socket, SNC=2 system):
$ ls -l mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 8 10:31 mon_NODE_03
This would be useful for MPAM. I've seen a couple of MPAM systems that have per-NUMA MPAM controls on the 'L3', but describe it as a single global L3. The MPAM driver currently hides this by summing the NUMA node counters and reporting it as the global L3's value.
While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names.
True - but the behaviour is different from a non-SNC system. If this software can read the file but goes wrong because the contents of the file represent something different, it's still broken.
This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider, and trying to think what to do about that makes me go in circles about when user space may expect resctrl to indicate the resource and when user space may expect resctrl to indicate the scope. For example, /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it switches the meaning of the middle term to be "scope" while it still contains the monitoring data of the "L3" resource. So does that mean user space would need to rely on /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource, and then match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know which resource they apply to and learn from the directory name at what scope the measurement is taken?
Reinette
While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names.
True - but the behaviour is different from a non-SNC system. If this software can read the file but goes wrong because the contents of the file represent something different, it's still broken.
This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider, and trying to think what to do about that makes me go in circles about when user space may expect resctrl to indicate the resource and when user space may expect resctrl to indicate the scope. For example, /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it switches the meaning of the middle term to be "scope" while it still contains the monitoring data of the "L3" resource. So does that mean user space would need to rely on /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource, and then match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know which resource they apply to and learn from the directory name at what scope the measurement is taken?
Reinette,
It's both a wave and a particle, depending on the observer.
In SNC systems resources on each socket are divided into 2, 3, or 4 nodes. But the division is complicated. Memory and CPU cores are easy. They are each assigned to an SNC node. The cache is more complicated. The hash function from memory address to cache index is the part that is SNC aware. So memory on SNC node1 will allocate in the cache indices assigned to SNC node1. But that function has to be independent of which CPU is doing the access. That's why I keep mentioning "well behaved NUMA applications" when talking about SNC.
So the resctrl monitoring operations still work on the L3 cache, but in SNC mode they work on a portion of the L3 cache. As long as all accesses are NUMA local you can think of the cache as partitioned between the SNC nodes.
But not everything is well behaved from a NUMA perspective. It would be misleading to describe the occupancy and bandwidth as belonging to an SNC node.
It's also a bit misleading to describe in terms of an L3 cache instance. But doing so doesn't require application changes.
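(By "well behaved" I mean something along the lines of a workload that is pinned so both its CPUs and its memory stay on one node - a rough sketch only, where ./workload is a placeholder and node 1 is just an example:)

$ numactl --cpunodebind=1 --membind=1 ./workload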
-Tony
Hi Tony,
On 3/18/2024 12:34 PM, Luck, Tony wrote:
While that is in some ways a more accurate view, it breaks a lot of legacy monitoring applications that expect the "L3" names.
True - but the behaviour is different from a non-SNC system. If this software can read the file but goes wrong because the contents of the file represent something different, it's still broken.
This is a good point. There is also /sys/fs/resctrl/info/L3_MON to consider, and trying to think what to do about that makes me go in circles about when user space may expect resctrl to indicate the resource and when user space may expect resctrl to indicate the scope. For example, /sys/fs/resctrl/mon_data/mon_L3_00 contains files with data that monitor the "L3" _resource_, no? If we change that to /sys/fs/resctrl/mon_data/mon_NODE_00 then it switches the meaning of the middle term to be "scope" while it still contains the monitoring data of the "L3" resource. So does that mean user space would need to rely on /sys/fs/resctrl/info/L3_MON to obtain the information about which monitoring files (/sys/fs/resctrl/info/L3_MON/mon_features) are related to the particular resource, and then match those filenames with the filenames in /sys/fs/resctrl/mon_data/mon_NODE_00 to know which resource they apply to and learn from the directory name at what scope the measurement is taken?
Reinette,
It's both a wave and a particle, depending on the observer.
In SNC systems resources on each socket are divided into 2, 3, or 4 nodes. But the division is complicated. Memory and CPU cores are easy. They are each assigned to an SNC node. The cache is more complicated. The hash function from memory address to cache index is the part that is SNC aware. So memory on SNC node1 will allocate in the cache indices assigned to SNC node1. But that function has to be independent of which CPU is doing the access. That's why I keep mentioning "well behaved NUMA applications" when talking about SNC.
So the resctrl monitoring operations still work on the L3 cache, but in SNC mode they work on a portion of the L3 cache. As long as all accesses are NUMA local you can think of the cache as partitioned between the SNC nodes.
But not everything is well behaved from a NUMA perspective. It would be misleading to describe the occupancy and bandwidth as belonging to an SNC node.
It's also a bit misleading to describe in terms of an L3 cache instance. But doing so doesn't require application changes.
What is the use case for needing to expose the individual cluster counts? What if resctrl just summed the cluster counts and presented the data as before - per L3 cache instance? I doubt that resctrl would be what applications would use to verify whether they are "well behaved" wrt NUMA.
Reinette
What is the use case for needing to expose the individual cluster counts? What if resctrl just summed the cluster counts and presented the data as before - per L3 cache instance? I doubt that resctrl would be what applications would use to verify whether they are "well behaved" wrt NUMA.
Reinette,
My (perhaps naïve) belief is that in a cloud server environment there are many well behaved NUMA applications. Only presenting the sum would lose the detailed information from each SNC node.
-Tony
What is the use case for needing to expose the individual cluster counts? What if resctrl just summed the cluster counts and presented the data as before - per L3 cache instance? I doubt that resctrl would be what applications would use to verify whether they are "well behaved" wrt NUMA.
Reinette,
My (perhaps naïve) belief is that in a cloud server environment there are many well behaved NUMA applications. Only presenting the sum would lose the detailed information from each SNC node.
Is the answer to "A" or "B" ... why not provide both:
$ ls -l /sys/fs/resctrl/mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03
The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones give the broken out counts.
-Tony
On 3/18/2024 2:04 PM, Luck, Tony wrote:
What is the use case for needing to expose the individual cluster counts? What if resctrl just summed the cluster counts and presented the data as before - per L3 cache instance? I doubt that resctrl would be what applications would use to verify whether they are "well behaved" wrt NUMA.
Reinette,
My (perhaps naïve) belief is that in a cloud server environment there are many well behaved NUMA applications. Only presenting the sum would lose the detailed information from each SNC node.
Is the answer to "A" or "B" ... why not provide both:
$ ls -l /sys/fs/resctrl/mon_data
total 0
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_L3_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_00
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_01
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_02
dr-xr-xr-x. 2 root root 0 Mar 18 14:01 mon_NODE_03
The "L3" entries provide the sum across all SNC nodes sharing the cache. The NODE ones give the broken out counts.
Perhaps ... in this case it may make things easier to understand if those "mon_NODE_*" directories are sub-directories of the appropriate "mon_L3_*" directories.
Reinette
Perhaps ... in this case it may make things easier to understand if those "mon_NODE_*" directories are sub-directories of the appropriate "mon_L3_*" directories.
Reinette,
Like this?
$ tree mon_data/
mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   ├── mbm_total_bytes
│   ├── mon_NODE_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_NODE_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
└── mon_L3_01
    ├── llc_occupancy
    ├── mbm_local_bytes
    ├── mbm_total_bytes
    ├── mon_NODE_02
    │   ├── llc_occupancy
    │   ├── mbm_local_bytes
    │   └── mbm_total_bytes
    └── mon_NODE_03
        ├── llc_occupancy
        ├── mbm_local_bytes
        └── mbm_total_bytes
-Tony
Hi Tony,
On 3/18/2024 3:00 PM, Luck, Tony wrote:
Perhaps ... in this case it may make things easier to understand if those "mon_NODE_*" directories are sub-directories of the appropriate "mon_L3_*" directories.
Reinette,
Like this?
$ tree mon_data/
mon_data/
├── mon_L3_00
│   ├── llc_occupancy
│   ├── mbm_local_bytes
│   ├── mbm_total_bytes
│   ├── mon_NODE_00
│   │   ├── llc_occupancy
│   │   ├── mbm_local_bytes
│   │   └── mbm_total_bytes
│   └── mon_NODE_01
│       ├── llc_occupancy
│       ├── mbm_local_bytes
│       └── mbm_total_bytes
└── mon_L3_01
    ├── llc_occupancy
    ├── mbm_local_bytes
    ├── mbm_total_bytes
    ├── mon_NODE_02
    │   ├── llc_occupancy
    │   ├── mbm_local_bytes
    │   └── mbm_total_bytes
    └── mon_NODE_03
        ├── llc_occupancy
        ├── mbm_local_bytes
        └── mbm_total_bytes
Yes.
Pro: * This is familiar to users when compared to an existing CTRL_MON group's counts, which are the sum of the MON groups within it.
* Users continue to see the clear connection between the files listed in /sys/fs/resctrl/info/L3_MON/mon_features and the files found in "mon_L3*" directories.
* If I understand correctly it also would continue to be useful to Arm [1] while not breaking existing user space since the mon_L3* counts continue to reflect the entire L3 resource.
* This addresses your comment of maintaining the detailed information from each SNC node.
* I do assume that if there is only one SNC node per L3 cache then those mon_NODE_* directories will not be present and, to address the issue that triggered this thread, user space can use presence of multiple "mon_NODE_*" directories per cache instance to know if SNC is enabled.
Con: * Unclear how this may need to change if/when SNC becomes architectural.
* Continues to "muddy" the naming of the directories: mon_<resource>_<id> vs mon_<scope>_<id>. This cannot be turned into agreement with user space where first directory is mon_<resource>_<id> and second directory is mon_<scope>_<id> because then we would need to have, for example, mon_L3_00/mon_L3_00 where the first directory is for the resource and the second is for the scope, which appears redundant.
* Things may get confusing if there is ever a "node" resource.
This is starting to look like an interface that "checks" all the requirements I've seen so far. Looking at the "con" it is difficult for me to see how doing something like this now may cause frustrations later.
Reinette
[1] https://lore.kernel.org/lkml/88430722-67b3-4f7d-8db2-95ee52b6f0b0@arm.com/
Hi Tony,
On 3/18/2024 1:47 PM, Luck, Tony wrote:
What is the use case for needing to expose the individual cluster counts? What if resctrl just summed the cluster counts and presented the data as before - per L3 cache instance? I doubt that resctrl would be what applications would use to verify whether they are "well behaved" wrt NUMA.
Reinette,
My (perhaps naïve) belief is that in a cloud server environment there are many well behaved NUMA applications. Only presenting the sum would lose the detailed information from each SNC node.
Yes ... I understand by providing a sum the values that contributed to the sum are lost.
Could you please help me understand the details by answering my first question: What is the use case for needing to expose the individual cluster counts?
This is a model specific feature so if this is something needed for just a couple of systems I think we should be less inclined to make changes to resctrl interface. I am starting to be concerned about something similar becoming architectural later and then we need to wrangle this model specific resctrl support (which has then become ABI) again to support whatever that may look like.
Reinette
Could you please help me understand the details by answering my first question: What is the use case for needing to expose the individual cluster counts?
This is a model specific feature so if this is something needed for just a couple of systems I think we should be less inclined to make changes to resctrl interface. I am starting to be concerned about something similar becoming architectural later and then we need to wrangle this model specific resctrl support (which has then become ABI) again to support whatever that may look like.
Reinette,
Model specific. But present in multiple consecutive generations (Sapphire Rapids, Emerald Rapids, Granite Rapids, Sierra Forest).
Adding Peter Newman for a resctrl user perspective on SNC, rather than me continue to speculate on possible ways this might be used.
Peter: You will need to dig back a few messages on lore.kernel.org to get context.
-Tony
Hi Tony,
On Mon, Mar 18, 2024 at 3:05 PM Luck, Tony tony.luck@intel.com wrote:
Could you please help me understand the details by answering my first question: What is the use case for needing to expose the individual cluster counts?
This is a model specific feature so if this is something needed for just a couple of systems I think we should be less inclined to make changes to resctrl interface. I am starting to be concerned about something similar becoming architectural later and then we need to wrangle this model specific resctrl support (which has then become ABI) again to support whatever that may look like.
Reinette,
Model specific. But present in multiple consecutive generations (Sapphire Rapids, Emerald Rapids, Granite Rapids, Sierra Forest).
Adding Peter Newman for a resctrl user perspective on SNC, rather than me continue to speculate on possible ways this might be used.
Peter: You will need to dig back a few messages on lore.kernel.org to get context.
Our main concern with supporting SNC in resctrl is that all of the monitoring groups successfully record memory bandwidth from all CPUs, regardless of the RMIDs they're assigned.
I would prefer that we don't complicate the model of resctrl monitoring domains for this feature. On ARM SoCs there will be a plethora of technologies influencing the layout of resources, so we shouldn't start cluttering the model with special cases for each.
I think it's valid for the number of domains in the L3 resource to increase or stay the same when the system is configured for SNC. I don't think the details of how the domains came about is relevant at the resctrl interface level so long as the user has enough information to deduce what the domain is referring to based on knowledge of their system configuration.
I would prefer per-cluster as more information could prove useful in some future investigation, but if you feel the data is misleading, providing the clusters combined is also fine. I would prefer that the choice remains consistent from this point forward on any particular implementation to avoid breaking existing controller software developed for that implementation.
In our main use case, we sum mon_data/*/mbm_total_bytes to determine a group's total bandwidth, so please don't cause this logic to produce the wrong answer.
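Roughly like this, as a simplified sketch of that summation (ignoring counter overflow and per-group paths):

$ awk '{ sum += $1 } END { print sum }' mon_data/*/mbm_total_bytes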
Thanks! -Peter