On Fri, 27 Oct 2023, Maciej Wieczór-Retman wrote:
On 2023-10-24 at 12:26:26 +0300, Ilpo Järvinen wrote:
The CAT test spawns two processes into two different control groups with exclusive schemata. Both processes allocate a buffer matching the size of their allocated LLC block and flush the entire buffer out of the caches. Since the processes read through the buffer only once during the measurement and the whole buffer was flushed beforehand, the test isn't testing CAT.
Rewrite the CAT test to allocate a buffer sized to half of LLC. Then perform a sequence of tests with different LLC alloc sizes starting from half of the CBM bits down to 1-bit CBM. Flush the buffer before each test and read the buffer twice. Observe the LLC misses on the second read through the buffer. As the allocated LLC block gets smaller and smaller, the LLC misses will become larger and larger giving a strong signal on CAT working properly.
The new CAT test is using only a single process because it relies on measured effect against another run of itself rather than another process adding noise. The rest of the system is allocated the CBM bits not used by the CAT test to keep the test isolated.
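For illustration, the shrinking-mask sequence described above could be sketched roughly as follows. This is not the code from the patch: mask_with_bits() and print_mask_sequence() are hypothetical helpers, and which end of the CBM the test keeps is an assumption made here.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch): build a contiguous mask of
 * 'bits' bits whose lowest set bit is at position 'start'.
 */
static unsigned long mask_with_bits(unsigned int start, unsigned int bits)
{
	if (bits == 0)
		return 0;
	if (bits >= sizeof(unsigned long) * 8)
		return ~0UL << start;
	return ((1UL << bits) - 1) << start;
}

/*
 * Walk from half of the CBM bits down to a single bit. At each step the
 * test gets 'n' bits and the rest of the system gets the remaining bits
 * of the full mask. Assumption: the test keeps the low end of the mask.
 */
static void print_mask_sequence(unsigned long full_mask, unsigned int start,
				unsigned int count)
{
	unsigned int n;

	for (n = count / 2; n > 0; n--) {
		unsigned long test_mask = mask_with_bits(start, n);
		unsigned long rest_mask = full_mask & ~test_mask;

		printf("%u bits: test=%lx rest=%lx\n", n, test_mask, rest_mask);
	}
}
```

With a 12-bit CBM (full mask 0xfff, start 0, count 12), the sketch starts at 6 bits (test mask 0x3f, rest 0xfc0) and ends at 1 bit (test mask 0x1, rest 0xffe).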
Replace count_bits() with count_contiguous_bits() to get the first bit position in order to be able to calculate masks based on it.
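As a rough idea of what such a helper has to return, here is a minimal sketch. Only the name count_contiguous_bits() comes from the commit message; the signature and body below are assumptions, not the implementation in the patch.

```c
/*
 * Sketch only: count the lowest run of contiguous set bits in 'val' and
 * report the position of the first set bit through 'start'.
 * The '_sketch' suffix marks this as a hypothetical stand-in.
 */
static unsigned int count_contiguous_bits_sketch(unsigned long val,
						 unsigned int *start)
{
	unsigned int pos = 0;
	unsigned int count = 0;

	if (!val) {
		if (start)
			*start = 0;
		return 0;
	}

	/* Skip the zero bits below the run to find the first set bit. */
	while (!(val & 1)) {
		val >>= 1;
		pos++;
	}

	/* Count the set bits of the run itself. */
	while (val & 1) {
		val >>= 1;
		count++;
	}

	if (start)
		*start = pos;

	return count;
}
```

For a CBM of 0xff0 this yields a count of 8 with the first bit at position 4, which is the information needed to build shifted masks.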
This change has been tested with a number of systems from different generations.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
 tools/testing/selftests/resctrl/cat_test.c  | 286 +++++++++-----------
 tools/testing/selftests/resctrl/fill_buf.c  |   6 +-
 tools/testing/selftests/resctrl/resctrl.h   |   5 +-
 tools/testing/selftests/resctrl/resctrlfs.c |  44 +--
 4 files changed, 137 insertions(+), 204 deletions(-)
diff --git a/tools/testing/selftests/resctrl/cat_test.c b/tools/testing/selftests/resctrl/cat_test.c
index e71690a9bbb3..7518c520c5cc 100644
--- a/tools/testing/selftests/resctrl/cat_test.c
+++ b/tools/testing/selftests/resctrl/cat_test.c
@@ -11,65 +11,68 @@
 #include "resctrl.h"
 #include <unistd.h>

-#define RESULT_FILE_NAME1	"result_cat1"
-#define RESULT_FILE_NAME2	"result_cat2"
+#define RESULT_FILE_NAME	"result_cat"
 #define NUM_OF_RUNS		5
-#define MAX_DIFF_PERCENT	4
-#define MAX_DIFF		1000000

 /*
- * Change schemata. Write schemata to specified
- * con_mon grp, mon_grp in resctrl FS.
- * Run 5 times in order to get average values.
+ * Minimum difference in LLC misses between a test with n+1 bits CBM mask to
+ * the test with n bits. With e.g. 5 vs 4 bits in the CBM mask, the minimum
+ * difference must be at least MIN_DIFF_PERCENT_PER_BIT * (4 - 1) = 3 percent.
+ *
+ * The relationship between number of used CBM bits and difference in LLC
+ * misses is not expected to be linear. With a small number of bits, the
+ * margin is smaller than with larger number of bits. For selftest purposes,
+ * however, linear approach is enough because ultimately only pass/fail
+ * decision has to be made and distinction between strong and stronger
+ * signal is irrelevant.
  */
-static int cat_setup(struct resctrl_val_param *p)
-{
-	char schemata[64];
-	int ret = 0;
-
-	/* Run NUM_OF_RUNS times */
-	if (p->num_of_runs >= NUM_OF_RUNS)
-		return END_OF_TESTS;
-
-	if (p->num_of_runs == 0) {
-		sprintf(schemata, "%lx", p->mask);
-		ret = write_schemata(p->ctrlgrp, schemata, p->cpu_no,
-				     p->resctrl_val);
-	}
-	p->num_of_runs++;
-
-	return ret;
-}
+#define MIN_DIFF_PERCENT_PER_BIT	1

 static int show_results_info(__u64 sum_llc_val, int no_of_bits,
-			     unsigned long cache_span, unsigned long max_diff,
-			     unsigned long max_diff_percent, unsigned long num_of_runs,
-			     bool platform)
+			     unsigned long cache_span, long min_diff_percent,
+			     unsigned long num_of_runs, bool platform,
+			     __s64 *prev_avg_llc_val)
 {
 	__u64 avg_llc_val = 0;
-	float diff_percent;
-	int ret;
+	float avg_diff;
+	int ret = 0;

 	avg_llc_val = sum_llc_val / num_of_runs;
-	diff_percent = ((float)cache_span - avg_llc_val) / cache_span * 100;
+	if (*prev_avg_llc_val) {
+		float delta = (__s64)(avg_llc_val - *prev_avg_llc_val);

-	ret = platform && abs((int)diff_percent) > max_diff_percent;
+		avg_diff = delta / *prev_avg_llc_val;
+		ret = platform && (avg_diff * 100) < (float)min_diff_percent;

-	ksft_print_msg("%s Check cache miss rate within %lu%%\n",
-		       ret ? "Fail:" : "Pass:", max_diff_percent);
+		ksft_print_msg("%s Check cache miss rate changed more than %.1f%%\n",
+			       ret ? "Fail:" : "Pass:", (float)min_diff_percent);

Shouldn't "Fail" and "Pass" be flipped in the ternary operator? Or should the "<" in the condition above be ">"?
I must not touch the ret ? "Fail:" : "Pass:" logic; it's the correct way around. If I touched it, it would break what the calling code assumes about the return value.
(More explanation below).
Now it looks like if (avg_diff * 100) is smaller than the min_diff_percent the test is supposed to fail but the text suggests it's the other way around.
I also ran this selftest and that's the output:
# Pass: Check cache miss rate changed more than 3.0%
# Percent diff=45.8
# Number of bits: 4
# Average LLC val: 322489
# Cache span (lines): 294912
# Pass: Check cache miss rate changed more than 2.0%
# Percent diff=38.0
# Number of bits: 3
# Average LLC val: 445005
# Cache span (lines): 221184
# Pass: Check cache miss rate changed more than 1.0%
# Percent diff=27.2
# Number of bits: 2
# Average LLC val: 566145
# Cache span (lines): 147456
# Pass: Check cache miss rate changed more than 0.0%
# Percent diff=18.3
# Number of bits: 1
# Average LLC val: 669657
# Cache span (lines): 73728
ok 1 CAT: test
The diff percentages are much larger than the thresholds they're supposed to be within, yet the test passes.
No, the whole test logic is changed dramatically by this patch and the failure logic is now reversed because of it. Note how I also altered these things:
- MAX_DIFF_PERCENT -> MIN_DIFF_PERCENT_PER_BIT
- max_diff_percent -> min_diff_percent
- "cache miss rate within" -> "cache miss rate changed more than"
The new CAT test measures the # of cache misses (or, in the case of the L2 CAT test, LLC accesses, which are used as a proxy for L2 misses). Then it takes one bit away from the allocation mask and repeats the measurement.
If the # of LLC misses changes by more than min_diff_percent when the number of bits in the allocation is changed, it is a strong indicator that CAT is working like it should. Based on your numbers above, I'm extremely confident CAT works as expected!
I know for a fact that when the selftest is bound to a wrong resource id (which actually occurs on Broadwells with CoD enabled without one of the later patches in this series), this test is guaranteed to fail 100% of the time because no noticeable difference in LLC misses is measured in that case.
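To make the reversed logic concrete, here is a minimal standalone sketch of the check. It mirrors the arithmetic in show_results_info() from the hunk above but leaves out the platform flag; the function names are hypothetical, and the (n - 1) threshold per bit count is inferred from the comment in the patch and the quoted output rather than taken from the calling code.

```c
#include <stdbool.h>

#define MIN_DIFF_PERCENT_PER_BIT	1

/*
 * Hypothetical helper: minimum required increase (in percent) in LLC
 * misses for a run with 'bits' CBM bits, compared to the previous run
 * that had one bit more. Inferred from the quoted output (4 bits -> 3.0%).
 */
static long min_diff_for_bits(int bits)
{
	return MIN_DIFF_PERCENT_PER_BIT * (bits - 1);
}

/*
 * Returns true when the check FAILS, i.e. when the average number of
 * misses did not grow by at least min_diff_percent over the previous
 * run. A zero prev_avg means there is nothing to compare against yet.
 */
static bool miss_increase_too_small(long long prev_avg, long long avg,
				    long min_diff_percent)
{
	float delta, avg_diff;

	if (!prev_avg)
		return false;

	delta = (float)(avg - prev_avg);
	avg_diff = delta / prev_avg;

	return (avg_diff * 100) < (float)min_diff_percent;
}
```

With the averages from the run quoted above, going from 4 bits (322489 misses) to 3 bits (445005 misses) is an increase of about 38%, well above the 2% minimum for 3 bits, so the check does not fail. A large percentage is therefore the expected, passing outcome.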
@@ -143,54 +168,64 @@ static int cat_test(struct resctrl_val_param *param, size_t span)
 	if (ret)
 		return ret;

+	buf = alloc_buffer(span, 1);
+	if (buf == NULL)
Similar to patch 01/24, wouldn't this: if (!buf) be better?
I've already changed this based on the comment you made against 1/24 :-).