-----Original Message-----
From: Konstantin Belov <konstantin.belov@linaro.org>

Hello colleagues,
Following up on Tim Bird's presentation "Adding benchmarks results support to KTAP/kselftest", I would like to share some thoughts on kernel benchmarking and kernel performance evaluation. Tim suggested sharing these comments with the wider kselftest community for discussion.
The topic of performance evaluation is obviously extremely complex, so I’ve organised my comments into several paragraphs, each of which focuses on a specific aspect. This should make it easier to follow and understand the key points, such as metrics, reference values, results data lake, interpretation of contradictory results, system profiles, analysis and methodology.
# Metrics

A few remarks on benchmark metrics, which were called "values" in the original presentation:
- Metrics must be accompanied by standardised units. This
standardisation ensures consistency across different tests and environments, simplifying accurate comparisons and analysis.
Agreed. I've used the term "metrics" in the past, but the word is somewhat overloaded so I avoided it in my recent presentation. My proposed system includes units, and the possibility of conversion between units, when needed.
- Each metric should be clearly labelled with its nature or kind
(throughput, speed, latency, etc). This classification is essential for proper interpretation of the results and prevents misunderstandings that could lead to incorrect conclusions.
I'm not sure I agree on this.
- The presentation contains "May also include allowable variance", but
variance must be included in the analysis, since we are dealing with statistical calculations over multiple randomised values.
In my tool, including variance along with a reference value is optional. For some types of thresholds, I don't think a variance is necessarily required.
- I would like to note that other statistical parameters are also
worth including in the comparison, such as confidence levels, sample size and so on (see the sketch below).
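As a minimal sketch (purely illustrative, not an existing KTAP/kselftest format), a metric record carrying a unit, a kind, and optional statistical context might look something like this; all of the field names are assumptions:

# Illustrative only: a hypothetical metric record with unit, kind and
# optional statistical context; not an existing KTAP/kselftest schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:
    name: str                    # e.g. "sequential_read"
    value: float                 # measured value
    unit: str                    # standardised unit, e.g. "MB/s"
    kind: str                    # "throughput", "latency", ...
    variance: Optional[float] = None    # optional, as in Tim's proposal
    confidence: Optional[float] = None  # e.g. 0.95 confidence level
    sample_size: Optional[int] = None   # number of repetitions

# Example: a disk-throughput measurement with statistical context attached.
m = MetricResult("sequential_read", 512.3, "MB/s", "throughput",
                 variance=4.1, confidence=0.95, sample_size=30)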
# Reference Values

The concept of "reference values" introduced in the slides could be significantly enhanced by implementing a collaborative, transparent system for data collection and validation. This system could operate as follows:
- Data Collection: Any user could submit benchmark results to a
centralised and public repository. This would allow for a diverse range of hardware configurations and use cases to be represented.
I agree there should be a centralized repository for reference values. IMHO, it will be easier to create reference values if there is also a repository of value results.
- Vendor Validation: Hardware vendors would have the opportunity to
review submitted results pertaining to their products. They could then mark certain results as "Vendor Approved," indicating that the results align with their own testing and expectations.
It would be nice if Vendors provided reference values along with their distributions of Linux.
- Community Review: The broader community of users and experts could
also review and vote on submitted results. Results that receive substantial positive feedback could be marked as "Community Approved," providing an additional layer of validation.
This sounds a little formal to me.
- Automated Validation: Reference values must be checked, validated
and supported by multiple sources. This can only be done automatically, as those processes are time-consuming and require extreme attention to detail.
- Transparency: All submitted results would need to be accompanied by
detailed information about the testing environment, hardware specifications, and methodology used. This would ensure reproducibility and allow others to understand the context of each result.
Indeed, there will need to be a lot of meta-data associated with reference values, in order to make sure that the correct set of reference values are used in testing specific machines, environments, and kernel versions.
- Trust Building: The combination of vendor and community approval
would help establish trust in the reference values. It would mitigate concerns about marketing bias and provide a more reliable basis for performance comparisons.
- Accessibility: The system would be publicly accessible, allowing
anyone to reference and utilise this data in their own testing and analysis.
Yes and yes.
Implementation of such a system would require careful consideration of governance and funding. A community-driven, non-profit organisation sponsored by multiple stakeholders could be an appropriate model. This structure would help maintain neutrality and avoid potential conflicts of interest.
KernelCI seems like a good project to host such a repository. Possibly it could be KCIdb, which has just added values to its database schema.
While the specifics of building and managing such a system would need further exploration, this approach could significantly improve the reliability and usefulness of reference values in benchmark testing. It would foster a more collaborative and transparent environment for performance evaluation in the Linux ecosystem as well as attract interested vendors to submit and review results.
I’m not very informed about the current state of the community in this field, but I’m sure you know better how exactly this can be done.
# Results Data Lake

Along with reference values, it's important to collect results on a regular basis: the kernel evolves, so the results must follow this evolution as well. To do this, a cloud-based data lake is needed (a self-hosted system would be too expensive, from my point of view).
This data lake should be able to collect and process incoming data as well as serve reference values to users. The data processing flow should be quite standard: Collection -> Parsing + Enhancement -> Storage -> Analysis -> Serving.
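As a rough sketch of that flow (purely illustrative; the function names below are assumptions, not an existing KernelCI or KCIdb interface):

# Illustrative sketch of the Collection -> Parsing + Enhancement ->
# Storage -> Analysis -> Serving flow. All functions are hypothetical.
def collect(raw_submission: bytes) -> dict:
    """Accept a raw benchmark submission (e.g. KTAP output plus metadata)."""
    return {"raw": raw_submission.decode()}

def parse_and_enhance(record: dict) -> dict:
    """Parse metrics and attach environment metadata (kernel, hardware, OS)."""
    record["metrics"] = {"response_time": 120}           # parsed from the raw data
    record["env"] = {"kernel": "6.12", "arch": "arm64"}  # enrichment step
    return record

def store(record: dict, db: list) -> None:
    """Persist the enhanced record in the data lake."""
    db.append(record)

def analyse(db: list) -> dict:
    """Compute baselines/trends over the stored records."""
    times = [r["metrics"]["response_time"] for r in db]
    return {"baseline_response_time": sum(times) / len(times)}

def serve(analysis: dict) -> dict:
    """Expose reference values to users (e.g. via an API)."""
    return analysis

db: list = []
store(parse_and_enhance(collect(b"...ktap output...")), db)
print(serve(analyse(db)))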
Tim proposed using file names to identify reference files. I would like to note that such an approach could break down fairly quickly: as the system collects more and more data, more granular and detailed attributes will be needed to identify reference results, which can lead to very long filenames that are hard to use.
No argument there. I used filenames for my proof of concept, but clearly a different meta-data matching system that utilizes more variables will need to be used. I'm currently working on gathering data for a boot-time test to inform the set of meta-data that applies to boot-time performance data.
I propose using UUID4-based identification, which has a very low chance of collision. Those IDs would be keys in a database holding all the information required to clearly identify the relevant results and their corresponding details. Moreover, this approach can easily be extended on the database side if more data is needed.
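A minimal sketch of that idea might look like this (the record layout is an assumption for illustration only, not an agreed schema):

# Illustrative only: a benchmark result keyed by a UUID4, with the
# identifying metadata kept in the database record rather than the key.
import uuid

record_id = str(uuid.uuid4())          # e.g. "550e8400-e29b-41d4-a716-446655440005"
database = {
    record_id: {
        "benchmark": "Test Suite A",   # hypothetical fields mirroring the
        "version": "v1.2.3",           # CLI examples further below
        "target_os": "Ubuntu 22.04",
        "metrics": {"cpu_usage": 70.5, "memory_usage": 2048, "response_time": 120},
        "tags": ["baseline", "v1.0"],
    }
}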
I'm not sure how this solves the matching problem. To determine test outcomes, you need to use the reference data that is most similar to your machine. Having a shared repository of reference values will be useful for testers with limited experience who are working on common hardware. However, I envision that many testers will use previous results from their own board as reference values (after validating the numbers against their requirements).
Yes, a UUID4 is not human-readable, but do we need it to be if we have tools which can provide a better interface?
For example, this could be something like:
request:
  results-cli search -b "Test Suite D" -v "v1.2.3" -o "Ubuntu 22.04" -t "baseline" -m "response_time>100"

response:
  [
    {
      "id": "550e8400-e29b-41d4-a716-446655440005",
      "benchmark": "Test Suite A",
      "version": "v1.2.3",
      "target_os": "Ubuntu 22.04",
      "metrics": {
        "cpu_usage": 70.5,
        "memory_usage": 2048,
        "response_time": 120
      },
      "tags": ["baseline", "v1.0"],
      "created_at": "2024-10-25T10:00:00Z"
    },
    ...
  ]
or:

request:
  results-cli search "<Domain-Specific-Language-Query>"

response:
  [ {}, {}, {}, ... ]
or:

request:
  results-cli get 550e8400-e29b-41d4-a716-446655440005

response:
  {
    "id": "550e8400-e29b-41d4-a716-446655440005",
    "benchmark": "Test Suite A",
    "version": "v1.2.3",
    "target_os": "Ubuntu 22.04",
    "metrics": {
      "cpu_usage": 70.5,
      "memory_usage": 2048,
      "response_time": 120
    },
    "tags": ["baseline", "v1.0"],
    "created_at": "2024-10-25T10:00:00Z"
  }
or:

request:
  curl -X POST http://api.example.com/references/search \
    -d '{ "query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'

response:
  ...
It sounds like you have a database implementation already in mind.
Another point in favour of a DB-based approach is the following: when a user works with particular hardware and/or would like to use a reference, they do not need the full database with all collected reference values, only a small slice of it. This slice can be downloaded from a public repo or accessed via an API.
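A rough sketch of how such a slice could be selected, tying in with the point above about matching reference data to the machine under test (the metadata fields and the scoring are illustrative assumptions):

# Illustrative: pick the reference records whose metadata best matches
# the machine under test. Field names and the threshold are assumptions.
def match_score(record_env: dict, my_env: dict) -> int:
    """Count how many metadata fields match between a record and this machine."""
    return sum(1 for k, v in my_env.items() if record_env.get(k) == v)

def select_slice(records: list[dict], my_env: dict, min_score: int = 2) -> list[dict]:
    """Return only the records similar enough to the local environment."""
    return [r for r in records if match_score(r["env"], my_env) >= min_score]

my_env = {"arch": "arm64", "target_os": "Ubuntu 22.04", "kernel": "6.12"}
records = [
    {"env": {"arch": "arm64", "target_os": "Ubuntu 22.04", "kernel": "6.11"},
     "metrics": {"response_time": 118}},
    {"env": {"arch": "x86_64", "target_os": "Fedora 40", "kernel": "6.12"},
     "metrics": {"response_time": 95}},
]
print(select_slice(records, my_env))   # keeps only the arm64/Ubuntu record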
# Large results dataset

If we collect a large benchmark dataset in one place, accompanied by detailed information about the target systems it was collected from, it will allow us to calculate precise baselines across different combinations of parameters, making performance deviations easier to detect. Long-term trend analysis can identify small changes and correlate them with updates, revealing performance drift.
Another use of such a database is predictive modelling, which can forecast expected results and set dynamic performance thresholds, enabling early issue detection. Anomaly detection also becomes more effective with context, distinguishing unusual deviations from normal behaviour.
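As one deliberately simple illustration of a dynamic threshold, a new result could be checked against a historical baseline with a z-score test; the 3-sigma threshold below is an arbitrary assumption, not something proposed in the slides:

# Illustrative: flag a new result as anomalous if it deviates from the
# historical baseline by more than 3 standard deviations.
from statistics import mean, stdev

def is_anomalous(history: list[float], new_value: float, n_sigma: float = 3.0) -> bool:
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return new_value != baseline
    return abs(new_value - baseline) > n_sigma * spread

history = [118.0, 121.0, 119.5, 120.2, 118.8]   # past response_time samples (ms)
print(is_anomalous(history, 120.5))  # False: within normal variation
print(is_anomalous(history, 160.0))  # True: likely a regression or anomaly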
# Interpretation of contradictory results

It's not clear how to deal with contradictory results when deciding whether a regression is present. For example, suppose we have a set of 10 tests which measure more or less the same thing, say disk performance. It's unclear what to do when one subset of the tests shows degradation and another subset shows a neutral status or improvements. Is there a regression?
If a test shows a regression, then either there has been a regression and the test is accurate, or there has not been a regression and the test or reference values need fixing.
I suppose the availability of historical data can help in such situations, as it can show the behaviour of particular tests over time and allow weights to be assigned in decision-making algorithms, but that's just my guess.
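To illustrate that guess concretely (a purely hypothetical scheme, not something from the presentation), each test's verdict could be weighted by how historically reliable it has been:

# Illustrative: combine contradictory per-test verdicts into one decision,
# weighting each test by a reliability score derived from its history.
def regression_verdict(verdicts: dict[str, int], weights: dict[str, float]) -> bool:
    """verdicts: +1 = regression, 0 = neutral, -1 = improvement."""
    score = sum(verdicts[t] * weights.get(t, 1.0) for t in verdicts)
    return score > 0   # positive weighted score => treat as a regression

verdicts = {"fio_seq_read": 1, "fio_rand_read": 1, "dd_read": 0,
            "bonnie_read": -1, "iozone_read": 0}
# Hypothetical weights: historically noisy tests count for less.
weights = {"fio_seq_read": 1.0, "fio_rand_read": 1.0, "dd_read": 0.5,
           "bonnie_read": 0.3, "iozone_read": 0.5}
print(regression_verdict(verdicts, weights))   # True: regression suspected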
# System Profiles

Tim's idea of reducing results to "pass / fail", and my experience with various people trying to interpret benchmarking results, led me to the idea of "profiles": a set of parameters and metrics collected from a reference system while executing a particular configuration of a particular benchmark.
Numbers alone have no meaning. They have to be converted into some signal indicating that action is needed. At Plumbers one developer indicated that more than just a binary state would be desirable. The testcase outcome is intended to indicate that some issue needs addressing, and there will always be a threshold needed.
Profiles can be used for A/B comparison with pass/fail (or match/no match) outcomes. This approach does not hide or miss details: it captures multiple characteristics of the experiment, such as the presence of outliers/errors or a skewed distribution shape. Interested parties (kernel developers or performance engineers, for example) can dig deeper to find the reason for a mismatch, while for those interested only in high-level results, pass/fail should be enough.
Of course results should be available to allow diagnosing the problem. But for a CI system to indicate that action is needed, there must be some numeric comparison (whether that represents something more expansive, like a curve shape or an aggregation of data points).
Here is how I imagine the structure of a profile:
profile_a:
  system:
    packages:
      - pkg_1
      - pkg_2
      # Additional packages...
    settings:
      - cmdline
      # Additional settings...
    indicators:
      cpu: null
      ram: null
      loadavg: null
      # Additional indicators...
  benchmark:
    settings:
      param_1: null
      param_2: null
      param_x: null
    metrics:
      metric_1: null
      metric_2: null
      metric_x: null
- System Packages, System Settings: Usually we do not pay much
attention to this, but I think it's worth highlighting that the base OS is an important factor, as there are distribution-specific modifications present in the filesystem. Most commonly, developers and researchers use Ubuntu (as the most popular distro) or Debian (as the leaner base that Ubuntu is built on), but distributions apply their own patches to the kernel and system libraries, which may impact performance. Another kind of base OS is cloud OS images, which can be modified by cloud providers to add internal packages and services that could potentially affect performance as well. When comparing, we must take this aspect into account to compare apples to apples.
- System Indicators: These are periodic statistics like CPU
utilisation, RAM consumption, and other parameters collected before, during and after benchmarking.
- Benchmark Settings: Benchmarking systems have multiple parameters,
so it’s important to capture them and use them in analysis.
- Benchmark Metrics: These are, obviously, the benchmark results. It
is not rare for a benchmark to provide more than a single number.
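To make the A/B comparison idea above more concrete, here is a minimal sketch of comparing a measured profile's metrics against a reference profile; the per-metric tolerance and the "higher/lower is better" directions are illustrative assumptions:

# Illustrative: compare a measured profile's metrics against a reference
# profile, reducing the result to pass/fail while keeping the details.
def compare_profiles(reference: dict, measured: dict,
                     directions: dict, tolerance: float = 0.05) -> tuple[bool, dict]:
    """directions: per metric, "higher" or "lower" indicates which way is better."""
    details = {}
    passed = True
    for name, ref in reference.items():
        val = measured[name]
        if directions[name] == "higher":             # e.g. throughput
            ok = val >= ref * (1 - tolerance)
        else:                                         # e.g. latency
            ok = val <= ref * (1 + tolerance)
        details[name] = {"reference": ref, "measured": val, "ok": ok}
        passed = passed and ok
    return passed, details

reference = {"throughput_mb_s": 500.0, "latency_ms": 2.0}
measured  = {"throughput_mb_s": 470.0, "latency_ms": 2.02}
directions = {"throughput_mb_s": "higher", "latency_ms": "lower"}
passed, details = compare_profiles(reference, measured, directions)
print(passed)    # False: throughput dropped more than the 5% tolerance
print(details)   # per-metric breakdown stays available for deeper analysis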
# Analysis

The proposed rules-based analysis will only work for highly deterministic environments and systems, where rules can describe all the relevant aspects. Rule-based systems are easier to understand and implement than other approaches, but only for a small set of rules. However, we are dealing with a live system that constantly evolves, so rules will become outdated extremely fast. It's the same story as with rule-based recommender systems in the early years of machine learning.
If you want to follow a rules-based approach, it's probably worth taking a look at https://www.clipsrules.net as this would allow results to be decoupled from the analysis and avoid reinventing the analysis engine.
Declaring those rules will be error-prone due to the nature of their origin: they must be written and maintained by humans. IMHO, a human-less approach using modern ML methods instead would be more beneficial in the long run.
Well, the rules I was proposing (criteria rules) were about the conversion from numbers to testcase outcomes, and some way to express the direction of operation for numeric comparisons. Not much more than that. There was no intelligence intended beyond simple numeric comparisons of result values with reference values. I'd prefer not to overcomplicate the proposal with analysis of data. It's quite possible that the rules should be baked into the test, rather than having them separate like I proposed. Most tests would have rather obvious comparison directions, so having the rules be separate may be an unnecessary abstraction. (ie, throughput less than a reference value is a regression, or latency more than a reference value is a regression.)
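As an illustration of how small such a criteria rule could be (a sketch under assumed names, not the actual rule format from the proposal):

# Illustrative: a criteria rule is just a reference value plus a
# comparison direction; applying it yields a testcase outcome.
def apply_rule(value: float, reference: float, direction: str) -> str:
    """direction: "min" means the value must not drop below the reference
    (e.g. throughput); "max" means it must not exceed it (e.g. latency)."""
    if direction == "min":
        return "pass" if value >= reference else "fail"
    return "pass" if value <= reference else "fail"

print(apply_rule(480.0, 500.0, "min"))   # throughput below reference -> "fail"
print(apply_rule(1.8, 2.0, "max"))       # latency below reference    -> "pass"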
# Methodology

The methodology used is another aspect which is not directly related to Tim's slides, but it is an important topic for results processing and interpretation; the idea of automated results interpretation could perhaps push towards the use of one methodology or another.
# Next steps

I would be glad to participate in further discussions and share experience to improve kernel performance testing automation, analysis and interpretation of results. If there is interest, I'm open to collaborating on implementing some of these ideas.
Sounds good. I'm tied up with some boot-time initiatives at the moment, but I hope to work on my proposal some more, and find a home (i.e., a public repository) for some kselftest-related reference values, before the end of the year. I plan to submit the actual code for my proposal, along with some tests that utilize it, to the list, and we can have more discussion when that happens.
Thanks for sharing your ideas.
 -- Tim