-----Original Message-----
From: Konstantin Belov <konstantin.belov@linaro.org>

Hello colleagues,
Following up on Tim Bird's presentation "Adding benchmarks results support to KTAP/kselftest", I would like to share some thoughts on kernel benchmarking and kernel performance evaluation. Tim suggested sharing these comments with the wider kselftest community for discussion.
The topic of performance evaluation is obviously extremely complex, so I’ve organised my comments into several paragraphs, each of which focuses on a specific aspect. This should make it easier to follow and understand the key points, such as metrics, reference values, results data lake, interpretation of contradictory results, system profiles, analysis and methodology.
# Metrics

A few remarks on benchmark metrics, which were called "values" in the original presentation:
- Metrics must be accompanied by standardised units. This
standardisation ensures consistency across different tests and environments, simplifying accurate comparisons and analysis.
Agreed. I've used the term "metrics" in the past, but the word is somewhat overloaded so I avoided it in my recent presentation. My proposed system includes units, and the possibility of conversion between units, when needed.
- Each metric should be clearly labelled with its nature or kind
(throughput, speed, latency, etc). This classification is essential for proper interpretation of the results and prevents misunderstandings that could lead to incorrect conclusions.
I'm not sure I agree on this.
- The presentation contains "May also include allowable variance", but
variance must be included in the analysis, since we are dealing with statistical calculations over multiple randomised values.
In my tool, including variance along with a reference value is optional. For some types of thresholds, I don't think a variance is necessarily required.
- I would like to note that other statistical parameters are also
worth including in the comparison, such as confidence levels, sample size and so on (see the sketch below).
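As a minimal sketch (purely illustrative, not an existing KTAP/kselftest format), a metric record carrying a unit, a kind, and optional statistical context might look something like this; all of the field names are assumptions:

# Illustrative only: a hypothetical metric record with unit, kind and
# optional statistical context; not an existing KTAP/kselftest schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:
    name: str                    # e.g. "sequential_read"
    value: float                 # measured value
    unit: str                    # standardised unit, e.g. "MB/s"
    kind: str                    # "throughput", "latency", ...
    variance: Optional[float] = None    # optional, as in Tim's proposal
    confidence: Optional[float] = None  # e.g. 0.95 confidence level
    sample_size: Optional[int] = None   # number of repetitions

# Example: a disk-throughput measurement with statistical context attached.
m = MetricResult("sequential_read", 512.3, "MB/s", "throughput",
                 variance=4.1, confidence=0.95, sample_size=30)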
# Reference Values

The concept of "reference values" introduced in the slides could be significantly enhanced by implementing a collaborative, transparent system for data collection and validation. This system could operate as follows:
- Data Collection: Any user could submit benchmark results to a
centralised and public repository. This would allow for a diverse range of hardware configurations and use cases to be represented.
I agree there should be a centralized repository for reference values. IMHO, it will be easier to create reference values if there is also a repository of value results.
- Vendor Validation: Hardware vendors would have the opportunity to
review submitted results pertaining to their products. They could then mark certain results as "Vendor Approved," indicating that the results align with their own testing and expectations.
It would be nice if Vendors provided reference values along with their distributions of Linux.
- Community Review: The broader community of users and experts could
also review and vote on submitted results. Results that receive substantial positive feedback could be marked as "Community Approved," providing an additional layer of validation.
This sounds a little formal to me.
- Automated Validation: Reference values must be checked, validated
and supported by multiple sources. This can only be done automatically, as those processes are time-consuming and require extreme attention to detail.
- Transparency: All submitted results would need to be accompanied by
detailed information about the testing environment, hardware specifications, and methodology used. This would ensure reproducibility and allow others to understand the context of each result.
Indeed, there will need to be a lot of meta-data associated with reference values, in order to make sure that the correct set of reference values are used in testing specific machines, environments, and kernel versions.
- Trust Building: The combination of vendor and community approval
would help establish trust in the reference values. It would mitigate concerns about marketing bias and provide a more reliable basis for performance comparisons.
- Accessibility: The system would be publicly accessible, allowing
anyone to reference and utilise this data in their own testing and analysis.
Yes and yes.
Implementation of such a system would require careful consideration of governance and funding. A community-driven, non-profit organisation sponsored by multiple stakeholders could be an appropriate model. This structure would help maintain neutrality and avoid potential conflicts of interest.
KernelCI seems like a good project to host such a repository. Possibly it could be KCIdb, which has just added values to its database schema.
While the specifics of building and managing such a system would need further exploration, this approach could significantly improve the reliability and usefulness of reference values in benchmark testing. It would foster a more collaborative and transparent environment for performance evaluation in the Linux ecosystem as well as attract interested vendors to submit and review results.
I’m not very informed about the current state of the community in this field, but I’m sure you know better how exactly this can be done.
# Results Data Lake

Along with reference values, it's important to collect results on a regular basis: the kernel evolves, so the results must follow this evolution as well. To do this, a cloud-based data lake is needed (a self-hosted system would be too expensive, from my point of view).
This data lake should be able to collect and process incoming data as well as serve reference values to users. The data processing flow should be quite standard: Collection -> Parsing + Enhancement -> Storage -> Analysis -> Serving.
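As a rough sketch of that flow (purely illustrative; the function names below are assumptions, not an existing KernelCI or KCIdb interface):

# Illustrative sketch of the Collection -> Parsing + Enhancement ->
# Storage -> Analysis -> Serving flow. All functions are hypothetical.
def collect(raw_submission: bytes) -> dict:
    """Accept a raw benchmark submission (e.g. KTAP output plus metadata)."""
    return {"raw": raw_submission.decode()}

def parse_and_enhance(record: dict) -> dict:
    """Parse metrics and attach environment metadata (kernel, hardware, OS)."""
    record["metrics"] = {"response_time": 120}           # parsed from the raw data
    record["env"] = {"kernel": "6.12", "arch": "arm64"}  # enrichment step
    return record

def store(record: dict, db: list) -> None:
    """Persist the enhanced record in the data lake."""
    db.append(record)

def analyse(db: list) -> dict:
    """Compute baselines/trends over the stored records."""
    times = [r["metrics"]["response_time"] for r in db]
    return {"baseline_response_time": sum(times) / len(times)}

def serve(analysis: dict) -> dict:
    """Expose reference values to users (e.g. via an API)."""
    return analysis

db: list = []
store(parse_and_enhance(collect(b"...ktap output...")), db)
print(serve(analyse(db)))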
Tim proposed using file names to identify reference files. I would like to note that such an approach could break down fairly quickly: as the system collects more and more data, more granular and detailed attributes will be needed to identify reference results, which can lead to very long filenames that are hard to use.
No argument there. I used filenames for my proof of concept, but clearly a different meta-data matching system that utilizes more variables will need to be used. I'm currently working on gathering data for a boot-time test to inform the set of meta-data that applies to boot-time performance data.
I propose using UUID4-based identification, which has a very low chance of collision. Those IDs would be keys in a database holding all the information required to clearly identify the relevant results and their corresponding details. Moreover, this approach can easily be extended on the database side if more data is needed.
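A minimal sketch of that idea might look like this (the record layout is an assumption for illustration only, not an agreed schema):

# Illustrative only: a benchmark result keyed by a UUID4, with the
# identifying metadata kept in the database record rather than the key.
import uuid

record_id = str(uuid.uuid4())          # e.g. "550e8400-e29b-41d4-a716-446655440005"
database = {
    record_id: {
        "benchmark": "Test Suite A",   # hypothetical fields mirroring the
        "version": "v1.2.3",           # CLI examples further below
        "target_os": "Ubuntu 22.04",
        "metrics": {"cpu_usage": 70.5, "memory_usage": 2048, "response_time": 120},
        "tags": ["baseline", "v1.0"],
    }
}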
I'm not sure how this solves the matching problem. To determine test outcomes, you need to use the reference data that is most similar to your machine. Having a shared repository of reference values will be useful for testers with limited experience who are working on common hardware. However, I envision that many testers will use previous results from their own board as reference values (after validating the numbers against their requirements).
Yes, a UUID4 is not human-readable, but do we need it to be if we have tools which can provide a better interface?
For example, this could be something like:
request:
  results-cli search -b "Test Suite D" -v "v1.2.3" -o "Ubuntu 22.04" -t "baseline" -m "response_time>100"

response:
  [
    {
      "id": "550e8400-e29b-41d4-a716-446655440005",
      "benchmark": "Test Suite A",
      "version": "v1.2.3",
      "target_os": "Ubuntu 22.04",
      "metrics": {
        "cpu_usage": 70.5,
        "memory_usage": 2048,
        "response_time": 120
      },
      "tags": ["baseline", "v1.0"],
      "created_at": "2024-10-25T10:00:00Z"
    },
    ...
  ]
or:

request:
  results-cli search "<Domain-Specific-Language-Query>"

response:
  [ {}, {}, {}, ... ]
or:

request:
  results-cli get 550e8400-e29b-41d4-a716-446655440005

response:
  {
    "id": "550e8400-e29b-41d4-a716-446655440005",
    "benchmark": "Test Suite A",
    "version": "v1.2.3",
    "target_os": "Ubuntu 22.04",
    "metrics": {
      "cpu_usage": 70.5,
      "memory_usage": 2048,
      "response_time": 120
    },
    "tags": ["baseline", "v1.0"],
    "created_at": "2024-10-25T10:00:00Z"
  }
or:

request:
  curl -X POST http://api.example.com/references/search \
    -d '{ "query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'

response:
  ...
It sounds like you have a database implementation already in mind.
Another point in favour of a DB-based approach is the following: when a user works with particular hardware and/or would like to use a reference, they do not need the full database with all collected reference values, only a small slice of it. This slice can be downloaded from a public repo or accessed via an API.
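A rough sketch of how such a slice could be selected, tying in with the point above about matching reference data to the machine under test (the metadata fields and the scoring are illustrative assumptions):

# Illustrative: pick the reference records whose metadata best matches
# the machine under test. Field names and the threshold are assumptions.
def match_score(record_env: dict, my_env: dict) -> int:
    """Count how many metadata fields match between a record and this machine."""
    return sum(1 for k, v in my_env.items() if record_env.get(k) == v)

def select_slice(records: list[dict], my_env: dict, min_score: int = 2) -> list[dict]:
    """Return only the records similar enough to the local environment."""
    return [r for r in records if match_score(r["env"], my_env) >= min_score]

my_env = {"arch": "arm64", "target_os": "Ubuntu 22.04", "kernel": "6.12"}
records = [
    {"env": {"arch": "arm64", "target_os": "Ubuntu 22.04", "kernel": "6.11"},
     "metrics": {"response_time": 118}},
    {"env": {"arch": "x86_64", "target_os": "Fedora 40", "kernel": "6.12"},
     "metrics": {"response_time": 95}},
]
print(select_slice(records, my_env))   # keeps only the arm64/Ubuntu record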
# Large results dataset

If we collect a large benchmark dataset in one place, accompanied by detailed information about the target systems it was collected from, it will allow us to calculate precise baselines across different combinations of parameters, making performance deviations easier to detect. Long-term trend analysis can identify small changes and correlate them with updates, revealing performance drift.
Another use of such a database is predictive modelling, which can forecast expected results and set dynamic performance thresholds, enabling early issue detection. Anomaly detection also becomes more effective with context, distinguishing unusual deviations from normal behaviour.
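As one deliberately simple illustration of a dynamic threshold, a new result could be checked against a historical baseline with a z-score test; the 3-sigma threshold below is an arbitrary assumption, not something proposed in the slides:

# Illustrative: flag a new result as anomalous if it deviates from the
# historical baseline by more than 3 standard deviations.
from statistics import mean, stdev

def is_anomalous(history: list[float], new_value: float, n_sigma: float = 3.0) -> bool:
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return new_value != baseline
    return abs(new_value - baseline) > n_sigma * spread

history = [118.0, 121.0, 119.5, 120.2, 118.8]   # past response_time samples (ms)
print(is_anomalous(history, 120.5))  # False: within normal variation
print(is_anomalous(history, 160.0))  # True: likely a regression or anomaly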
# Interpretation of contradictory results

It's not clear how to deal with contradictory results when deciding whether a regression is present. For example, suppose we have a set of 10 tests which measure more or less the same thing, say disk performance. It's unclear what to do when one subset of the tests shows degradation and another subset shows a neutral status or improvements. Is there a regression?
If a test shows a regression, then either there has been a regression and the test is accurate, or there has not been a regression and the test or reference values need fixing.
I suppose the availability of historical data can help in such situations, as it can show the behaviour of particular tests over time and allow weights to be assigned in decision-making algorithms, but that's just my guess.
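To illustrate that guess concretely (a purely hypothetical scheme, not something from the presentation), each test's verdict could be weighted by how historically reliable it has been:

# Illustrative: combine contradictory per-test verdicts into one decision,
# weighting each test by a reliability score derived from its history.
def regression_verdict(verdicts: dict[str, int], weights: dict[str, float]) -> bool:
    """verdicts: +1 = regression, 0 = neutral, -1 = improvement."""
    score = sum(verdicts[t] * weights.get(t, 1.0) for t in verdicts)
    return score > 0   # positive weighted score => treat as a regression

verdicts = {"fio_seq_read": 1, "fio_rand_read": 1, "dd_read": 0,
            "bonnie_read": -1, "iozone_read": 0}
# Hypothetical weights: historically noisy tests count for less.
weights = {"fio_seq_read": 1.0, "fio_rand_read": 1.0, "dd_read": 0.5,
           "bonnie_read": 0.3, "iozone_read": 0.5}
print(regression_verdict(verdicts, weights))   # True: regression suspected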
# System Profiles

Tim's idea of reducing results to "pass / fail", and my experience with various people trying to interpret benchmarking results, led me to the idea of "profiles": a set of parameters and metrics collected from a reference system while executing a particular configuration of a particular benchmark.
Numbers alone have no meaning. They have to be converted into some signal indicating that action is needed. At Plumbers one developer indicated that more than just a binary state would be desirable. The testcase outcome is intended to indicate that some issue needs addressing, and there will always be a threshold needed.
Profiles can be used for A/B comparison with pass/fail (or match/no match) outcomes. This approach does not hide or miss details: it captures multiple characteristics of the experiment, such as the presence of outliers/errors or a skewed distribution shape. Interested parties (kernel developers or performance engineers, for example) can dig deeper to find the reason for a mismatch, while for those interested only in high-level results, pass/fail should be enough.
Of course results should be available to allow diagnosing the problem. But for a CI system to indicate that action is needed, there must be some numeric comparison (whether that represents something more expansive, like a curve shape or an aggregation of data points).
Here is how I imagine the structure of a profile:
profile_a:
  system:
    packages:
      - pkg_1
      - pkg_2
      # Additional packages...
    settings:
      - cmdline
      # Additional settings...
    indicators:
      cpu: null
      ram: null
      loadavg: null
      # Additional indicators...
  benchmark:
    settings:
      param_1: null
      param_2: null
      param_x: null
    metrics:
      metric_1: null
      metric_2: null
      metric_x: null
- System Packages, System Settings: Usually we do not pay much
attention to this, but I think it's worth highlighting that the base OS is an important factor, as there are distribution-specific modifications present in the filesystem. Most commonly, developers and researchers use Ubuntu (as the most popular distro) or Debian (as the leaner base that Ubuntu is built on), but distributions apply their own patches to the kernel and system libraries, which may impact performance. Another kind of base OS is cloud OS images, which can be modified by cloud providers to add internal packages and services that could potentially affect performance as well. When comparing, we must take this aspect into account to compare apples to apples.
- System Indicators: These are periodic statistics like CPU
utilisation, RAM consumption, and other parameters collected before, during and after benchmarking.
- Benchmark Settings: Benchmarking systems have multiple parameters,
so it’s important to capture them and use them in analysis.
- Benchmark Metrics: These are, obviously, the benchmark results. It
is not rare for a benchmark to provide more than a single number.
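To make the A/B comparison idea above more concrete, here is a minimal sketch of comparing a measured profile's metrics against a reference profile; the per-metric tolerance and the "higher/lower is better" directions are illustrative assumptions:

# Illustrative: compare a measured profile's metrics against a reference
# profile, reducing the result to pass/fail while keeping the details.
def compare_profiles(reference: dict, measured: dict,
                     directions: dict, tolerance: float = 0.05) -> tuple[bool, dict]:
    """directions: per metric, "higher" or "lower" indicates which way is better."""
    details = {}
    passed = True
    for name, ref in reference.items():
        val = measured[name]
        if directions[name] == "higher":             # e.g. throughput
            ok = val >= ref * (1 - tolerance)
        else:                                         # e.g. latency
            ok = val <= ref * (1 + tolerance)
        details[name] = {"reference": ref, "measured": val, "ok": ok}
        passed = passed and ok
    return passed, details

reference = {"throughput_mb_s": 500.0, "latency_ms": 2.0}
measured  = {"throughput_mb_s": 470.0, "latency_ms": 2.02}
directions = {"throughput_mb_s": "higher", "latency_ms": "lower"}
passed, details = compare_profiles(reference, measured, directions)
print(passed)    # False: throughput dropped more than the 5% tolerance
print(details)   # per-metric breakdown stays available for deeper analysis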
# Analysis

The proposed rules-based analysis will only work for highly deterministic environments and systems, where rules can describe all the relevant aspects. Rule-based systems are easier to understand and implement than other approaches, but only for a small set of rules. However, we are dealing with a live system that constantly evolves, so rules will become outdated extremely fast. It's the same story as with rule-based recommender systems in the early years of machine learning.
If you want to follow a rules-based approach, it's probably worth taking a look at https://www.clipsrules.net as this would allow results to be decoupled from the analysis and avoid reinventing the analysis engine.
Declaring those rules will be error-prone due to the nature of their origin: they must be written and maintained by humans. IMHO, a human-less approach using modern ML methods instead would be more beneficial in the long run.
Well, the rules I was proposing (criteria rules) were about the conversion from numbers to testcase outcomes, and some way to express the direction of operation for numeric comparisons. Not much more than that. There was no intelligence intended beyond simple numeric comparisons of result values with reference values. I'd prefer not to overcomplicate the proposal with analysis of data. It's quite possible that the rules should be baked into the test, rather than having them separate like I proposed. Most tests would have rather obvious comparison directions, so having the rules be separate may be an unnecessary abstraction. (ie, throughput less than a reference value is a regression, or latency more than a reference value is a regression.)
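As an illustration of how small such a criteria rule could be (a sketch under assumed names, not the actual rule format from the proposal):

# Illustrative: a criteria rule is just a reference value plus a
# comparison direction; applying it yields a testcase outcome.
def apply_rule(value: float, reference: float, direction: str) -> str:
    """direction: "min" means the value must not drop below the reference
    (e.g. throughput); "max" means it must not exceed it (e.g. latency)."""
    if direction == "min":
        return "pass" if value >= reference else "fail"
    return "pass" if value <= reference else "fail"

print(apply_rule(480.0, 500.0, "min"))   # throughput below reference -> "fail"
print(apply_rule(1.8, 2.0, "max"))       # latency below reference    -> "pass"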
# Methodology

The methodology used is another aspect which is not directly related to Tim's slides, but it is an important topic for results processing and interpretation; the idea of automated results interpretation could perhaps push towards the use of one methodology or another.
# Next steps

I would be glad to participate in further discussions and share experience to improve kernel performance testing automation, analysis and interpretation of results. If there is interest, I'm open to collaborating on implementing some of these ideas.
Sounds good. I'm tied up with some boot-time initiatives at the moment, but I hope to work on my proposal some more, and find a home (i.e., a public repository) for some kselftest-related reference values, before the end of the year. I plan to submit the actual code for my proposal, along with some tests that utilize it, to the list, and we can have more discussion when that happens.
Thanks for sharing your ideas.
 -- Tim