Introduce Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning.
It makes DAMOS self-tuned with periodic simple user feedback.
Patchset Changelog
==================
From RFC
(https://lore.kernel.org/damon/20231112194607.61399-1-sj@kernel.org/)
- Wordsmith commit messages and cover letter
Background: DAMOS Control Difficulty
====================================
DAMOS helps users easily implement access pattern aware system
operations. However, controlling DAMOS in the wild is not that easy.
The basic way for DAMOS control is specifying the target access pattern.
In this approach, the user is assumed to well understand the access
pattern and the characteristics of the system and the workloads. Though
there are useful tools for that, it takes time and effort depending on
the complexity and the dynamicity of the system and the workloads.
After all, the access pattern consists of three ranges, namely the size,
the access rate, and the age of the regions. It means users need to
tune six parameters, which is anyway not a simple task.
One of the worst cases would be DAMOS being too aggressive like a
berserker, and therefore consuming too much system resource and making
unwanted radical system operations. To let users avoid such cases,
DAMOS allows users to set the upper-limit of the schemes'
aggressiveness, namely DAMOS quota. DAMOS further provides its
best-effort under the limit by prioritizing regions based on the access
pattern of the regions. For example, users can ask DAMOS to page out up
to 100 MiB of memory regions per second. Then DAMOS pages out regions
that are not accessed for a longer time (colder) first under the limit.
This allows users to set the target access pattern a bit naive with
wider ranges, and focus on tuning only one parameter, the quota. In
other words, the number of parameters to tune can be reduced from six to
one.
Still, however, the optimum value for the quota depends on the system
and the workloads' characteristics, so not that simple. The number of
parameters to tune can also increase again if the user needs to run
multiple schemes.
Aim-oriented Feedback-driven DAMOS Aggressiveness Auto Tuning
=============================================================
Users would use DAMOS since they want to achieve something with it.
They will likely have measurable metrics representing the achievement
and the target number of the metric like SLO, and continuously measure
that anyway. While the additional cost of getting the information is
nearly zero, it could be useful for DAMOS to understand how appropriate
its current aggressiveness is set, and adjust it on its own to make the
metric value more close to the target.
Based on this idea, we introduce a new way of tuning DAMOS with nearly
zero additional effort, namely Aim-oriented Feedback-driven DAMOS
Aggressiveness Auto Tuning. It asks users to provide feedback
representing how well DAMOS is doing relative to the users' aim. Then
DAMOS adjusts its aggressiveness, specifically the quota that provides
the best effort result under the limit, based on the current level of
the aggressiveness and the users' feedback.
Implementation
--------------
The implementation asks users to represent the feedback with score
numbers. The scores could be anything including user-space specific
metrics including latency and throughput of special user-space
workloads, and system metrics including free memory ratio, memory
pressure stall time (PSI), and active to inactive LRU lists size ratio.
The feedback scores and the aggressiveness of the given DAMOS scheme are
assumed to be positively proportional, though. Selecting metrics of the
assumption is the users' responsibility.
The core logic uses the below simple feedback loop algorithm to
calculate the next aggressiveness level of the scheme from the current
aggressiveness level and the current feedback (target_score and
current_score). It calculates the compensation for next aggressiveness
as a proportion of current aggressiveness and distance to the target
score. As a result, it arrives at the near-goal state in a short time
using big steps when it's far from the goal, but avoids making
unnecessarily radical changes that could turn out to be a bad decision
using small steps when its near to the goal.
f(n) = max(1, f(n - 1) * ((target_score - current_score) / target_score + 1))
Note that the compensation value becomes negative when it's over
achieving the goal. That's why the feedback metric and the
aggressiveness of the scheme should be positively proportional. The
distance-adaptive speed manipulation is simply applied.
Example Use Cases
-----------------
If users want to reduce the memory footprint of the system as much as
possible as long as the time spent for handling the resulting memory
pressure is within a threshold, they could use DAMOS scheme that
reclaims cold memory regions aiming for a little level of memory
pressure stall time.
If users want the active/inactive LRU lists well balanced to reduce the
performance impact due to possible future memory pressure, they could
use two schemes. The first one would be set to locate hot pages in the
active LRU list, aiming for a specific active-to-inactive LRU list size
ratio, say, 70%. The second one would be to locate cold pages in the
inactive LRU list, aiming for a specific inactive-to-active LRU list
size ratio, say, 30%. Then, DAMOS will balance the two schemes based on
the goal and feedback.
This aim-oriented auto tuning could also be useful for general
balancing-required access aware system operations such as system memory
auto scaling[3] and tiered memory management[4]. These two example
usages are not what current DAMOS implementation is already supporting,
but require additional DAMOS action developments, though.
Evaluation: subtle memory pressure aiming proactive reclamation
---------------------------------------------------------------
To show if the implementation works as expected, we prepare four
different system configurations on AWS i3.metal instances. The first
setup (original) runs the workload without any DAMOS scheme. The second
setup (not-tuned) runs the workload with a virtual address space-based
proactive reclamation scheme that pages out memory regions that are not
accessed for five seconds or more. The third setup (offline-tuned) runs
the same proactive reclamation DAMOS scheme, but after making it tuned
for each workload offline, using our previous user-space driven
automatic tuning approach, namely DAMOOS[1]. The fourth and final setup
(AFDAA) runs the scheme that is the same as that of 'not-tuned' setup,
but aims to keep 0.5% of 'some' memory pressure stall time (PSI) for the
last 10 seconds using the aiming-oriented auto tuning.
For each setup, we run realistic workloads from PARSEC3 and SPLASH-2X
benchmark suites. For each run, we measure RSS and runtime of the
workload, and 'some' memory pressure stall time (PSI) of the system. We
repeat the runs five times and use averaged measurements.
For simple comparison of the results, we normalize the measurements to
those of 'original'. In the case of the PSI, though, the measurement
for 'original' was zero, so we normalize the value to that of
'not-tuned' scheme's result. The normalized results are shown below.
Not-tuned Offline-tuned AFDAA
RSS 0.622688178226118 0.787950678944904 0.740093483278979
runtime 1.11767826657912 1.0564674983585 1.0910833880499
PSI 1 0.727521443794069 0.308498846350299
The 'not-tuned' scheme achieves about 38.7% memory saving but incur
about 11.7% runtime slowdown. The 'offline-tuned' scheme achieves about
22.2% memory saving with about 5.5% runtime slowdown. It also achieves
about 28.2% memory pressure stall time saving. AFDAA achieves about 26%
memory saving with about 9.1% runtime slowdown. It also achieves about
69.1% memory pressure stall time saving. We repeat this test multiple
times, and get consistent results. AFDAA is now integrated in our daily
DAMON performance test setup.
Apparently the aggressiveness of 'AFDAA' setup is somewhere between
those of 'not-tuned' and 'offline-tuned' setup, since its memory saving
and runtime overhead are between those of the other two setups.
Actually we set the memory pressure stall time goal aiming for this
middle aggressiveness. The difference in the two metrics are not
significant, though. However, it shows significant saving of the memory
pressure stall time, which was the goal of the auto-tuning, over the two
variants. Hence, we conclude the automatic tuning is working as
expected.
Please note that the AFDAA setup is only for the evaluation, and
therefore intentionally set a bit aggressive. It might not be
appropriate for production environments.
The test code is also available[2], so you could reproduce it on your
system and workloads.
Patches Sequence
================
The first four patches implement the core logic and user interfaces for
the auto tuning. The first patch implements the core logic for the auto
tuning, and the API for DAMOS users in the kernel space. The second
patch implements basic file operations of DAMON sysfs directories and
files that will be used for setting the goals and providing the
feedback. The third patch connects the quota goals files inputs to the
DAMOS core logic. Finally the fourth patch implements a dedicated DAMOS
sysfs command for efficiently committing the quota goals feedback.
Two patches for simple tests of the logic and interfaces follow. The
fifth patch implements the core logic unit test. The sixth patch
implements a selftest for the DAMON Sysfs interface for the goals.
Finally, three patches for documentation follows. The seventh patch
documents the design of the feature. The eighth patch updates the API
doc for the new sysfs files. The final eighth patch updates the usage
document for the features.
References
==========
[1] DAOS paper:
https://www.amazon.science/publications/daos-data-access-aware-operating-sy…
[2] Evaluation code:
https://github.com/damonitor/damon-tests/commit/3f884e61193f0166b8724554b6d…
[3] Memory auto scaling RFC idea:
https://lore.kernel.org/damon/20231112195114.61474-1-sj@kernel.org/
[4] DAMON-based tiered memory management RFC idea:
https://lore.kernel.org/damon/20231112195602.61525-1-sj@kernel.org/
SeongJae Park (9):
mm/damon/core: implement goal-oriented feedback-driven quota
auto-tuning
mm/damon/sysfs-schemes: implement files for scheme quota goals setup
mm/damon/sysfs-schemes: commit damos quota goals user input to DAMOS
mm/damon/sysfs-schemes: implement a command for scheme quota goals
only commit
mm/damon/core-test: add a unit test for the feedback loop algorithm
selftests/damon: test quota goals directory
Docs/mm/damon/design: document DAMOS quota auto tuning
Docs/ABI/damon: document DAMOS quota goals
Docs/admin-guide/mm/damon/usage: document for quota goals
.../ABI/testing/sysfs-kernel-mm-damon | 33 ++-
Documentation/admin-guide/mm/damon/usage.rst | 48 +++-
Documentation/mm/damon/design.rst | 13 +
include/linux/damon.h | 19 ++
mm/damon/core-test.h | 32 +++
mm/damon/core.c | 68 ++++-
mm/damon/sysfs-common.h | 3 +
mm/damon/sysfs-schemes.c | 272 +++++++++++++++++-
mm/damon/sysfs.c | 27 ++
tools/testing/selftests/damon/sysfs.sh | 27 ++
10 files changed, 517 insertions(+), 25 deletions(-)
base-commit: b4e0245a831a402cae8634a4dc277a04830ff07a
--
2.34.1
These patches update the output of the vdso_test_abi test program to
bring it into line with expected KTAP usage, the main one being the
first patch which ensures we log distinct test names for each reported
result making it much easier for automated systems to track the status
of the tests.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (3):
kselftest/vDSO: Make test name reporting for vdso_abi_test tooling friendly
kselftest/vDSO: Fix message formatting for clock_id logging
kselftest/vDSO: Use ksft_print_msg() rather than printf in vdso_test_abi
tools/testing/selftests/vDSO/vdso_test_abi.c | 72 +++++++++++++++-------------
1 file changed, 39 insertions(+), 33 deletions(-)
---
base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263
change-id: 20231122-kselftest-vdso-test-name-44fcc7e16a38
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: angquan yu <angquan21(a)gmail.com>
This commit resolves a compiler warning regardingthe
use of non-literal format strings in breakpoint_test.c.
The functions `ksft_test_result_pass` and `ksft_test_result_fail`
were previously called with a variable `msg` directly, which could
potentially lead to format string vulnerabilities.
Changes made:
- Modified the calls to `ksft_test_result_pass` and `ksft_test_result_fail`
by adding a "%s" format specifier. This explicitly declares `msg` as a
string argument, adhering to safer coding practices and resolving
the compiler warning.
This change does not affect the functional behavior of the code but ensures
better code safety and compliance with recommended C programming standards.
The previous warning is "breakpoint_test.c:287:17:
warning: format not a string literal and no format arguments
[-Wformat-security]
287 | ksft_test_result_pass(msg);
| ^~~~~~~~~~~~~~~~~~~~~
breakpoint_test.c:289:17: warning: format not a string literal
and no format arguments [-Wformat-security]
289 | ksft_test_result_fail(msg);
| "
Signed-off-by: angquan yu <angquan21(a)gmail.com>
---
tools/testing/selftests/breakpoints/breakpoint_test.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/breakpoints/breakpoint_test.c b/tools/testing/selftests/breakpoints/breakpoint_test.c
index 3266cc929..d46962a24 100644
--- a/tools/testing/selftests/breakpoints/breakpoint_test.c
+++ b/tools/testing/selftests/breakpoints/breakpoint_test.c
@@ -284,9 +284,9 @@ static void check_success(const char *msg)
nr_tests++;
if (ret)
- ksft_test_result_pass(msg);
+ ksft_test_result_pass("%s", msg);
else
- ksft_test_result_fail(msg);
+ ksft_test_result_fail("%s", msg);
}
static void launch_instruction_breakpoints(char *buf, int local, int global)
--
2.39.2
From: angquan yu <angquan21(a)gmail.com>
In the function 'tools/testing/selftests/breakpoints/run_test' within
step_after_suspend_test.c, the ksft_print_msg function call incorrectly
used '$s' as a format specifier. This commit corrects this typo to use the
proper '%s' format specifier, ensuring the error message from
waitpid() is correctly displayed.
The issue manifested as a compilation warning (too many arguments
for format [-Wformat-extra-args]), potentially obscuring actual
runtime errors and complicating debugging processes.
This fix enhances the clarity of error messages during test failures
and ensures compliance with standard C format string conventions.
Signed-off-by: angquan yu <angquan21(a)gmail.com>
---
tools/testing/selftests/breakpoints/step_after_suspend_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index 2cf6f10ab..b8703c499 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -89,7 +89,7 @@ int run_test(int cpu)
wpid = waitpid(pid, &status, __WALL);
if (wpid != pid) {
- ksft_print_msg("waitpid() failed: $s\n", strerror(errno));
+ ksft_print_msg("waitpid() failed: %s\n", strerror(errno));
return KSFT_FAIL;
}
if (WIFEXITED(status)) {
--
2.39.2
From: angquan yu <angquan21(a)gmail.com>
In tools/testing/selftests/proc/proc-empty->because the return value
of a write call was being ignored. This call was partof a conditional
debugging block (if (0) { ... }), which meant it would neveractually
execute.
This patch removes the unused debug write call. This cleanup resolves
the compi>warning about ignoring the result of write declared with
the warn_unused_resultattribute.
Removing this code also improves the clarity and maintainability of
the function, as it eliminates a non-functional block of code.
This is original warning: proc-empty-vm.c: In function
‘test_proc_pid_statm’ :proc-empty-vm.c:385:17:
warning: ignoring return value of ‘write’
declared with>385 | write(1, buf, rv);|
Signed-off-by: angquan yu <angquan21(a)gmail.com>
---
tools/testing/selftests/proc/proc-empty-vm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/proc/proc-empty-vm.c b/tools/testing/selftests/proc/proc-empty-vm.c
index 5e7020630..d231e61e4 100644
--- a/tools/testing/selftests/proc/proc-empty-vm.c
+++ b/tools/testing/selftests/proc/proc-empty-vm.c
@@ -383,8 +383,10 @@ static int test_proc_pid_statm(pid_t pid)
assert(rv <= sizeof(buf));
if (0) {
ssize_t written = write(1, buf, rv);
+
if (written == -1) {
perror("write failed to /proc/${pid}");
+ return EXIT_FAILURE;
}
}
--
2.39.2
Hi all:
The core frequency is subjected to the process variation in semiconductors.
Not all cores are able to reach the maximum frequency respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part. This means that a fraction of cores can reach
maximum frequency. To find the best process scheduling policy for a given
scenario, OS needs to know the core ordering informed by the platform through
highest performance capability register of the CPPC interface.
Earlier implementations of amd-pstate preferred core only support a static
core ranking and targeted performance. Now it has the ability to dynamically
change the preferred core based on the workload and platform conditions and
accounting for thermals and aging.
Amd-pstate driver utilizes the functions and data structures provided by
the ITMT architecture to enable the scheduler to favor scheduling on cores
which can be get a higher frequency with lower voltage.
We call it amd-pstate preferred core.
Here sched_set_itmt_core_prio() is called to set priorities and
sched_set_itmt_support() is called to enable ITMT feature.
Amd-pstate driver uses the highest performance value to indicate
the priority of CPU. The higher value has a higher priority.
Amd-pstate driver will provide an initial core ordering at boot time.
It relies on the CPPC interface to communicate the core ranking to the
operating system and scheduler to make sure that OS is choosing the cores
with highest performance firstly for scheduling the process. When amd-pstate
driver receives a message with the highest performance change, it will
update the core ranking.
Changes from V10->V11:
- cpufreq: amd-pstate:
- - according Perry's commnts, I replace the string with str_enabled_disable().
Changes from V9->V10:
- cpufreq: amd-pstate:
- - add judgement for highest_perf. When it is less than 255, the
preferred core feature is enabled. And it will set the priority.
- - deleset "static u32 max_highest_perf" etc, because amd p-state
perferred coe does not require specail process for hotpulg.
Changes form V8->V9:
- all:
- - pick up Tested-By flag added by Oleksandr.
- cpufreq: amd-pstate:
- - pick up Review-By flag added by Wyes.
- - ignore modification of bug.
- - add a attribute of prefcore_ranking.
- - modify data type conversion from u32 to int.
- Documentation: amd-pstate:
- - pick up Review-By flag added by Wyes.
Changes form V7->V8:
- all:
- - pick up Review-By flag added by Mario and Ray.
- cpufreq: amd-pstate:
- - use hw_prefcore embeds into cpudata structure.
- - delete preferred core init from cpu online/off.
Changes form V6->V7:
- x86:
- - Modify kconfig about X86_AMD_PSTATE.
- cpufreq: amd-pstate:
- - modify incorrect comments about scheduler_work().
- - convert highest_perf data type.
- - modify preferred core init when cpu init and online.
- acpi: cppc:
- - modify link of CPPC highest performance.
- cpufreq:
- - modify link of CPPC highest performance changed.
Changes form V5->V6:
- cpufreq: amd-pstate:
- - modify the wrong tag order.
- - modify warning about hw_prefcore sysfs attribute.
- - delete duplicate comments.
- - modify the variable name cppc_highest_perf to prefcore_ranking.
- - modify judgment conditions for setting highest_perf.
- - modify sysfs attribute for CPPC highest perf to pr_debug message.
- Documentation: amd-pstate:
- - modify warning: title underline too short.
Changes form V4->V5:
- cpufreq: amd-pstate:
- - modify sysfs attribute for CPPC highest perf.
- - modify warning about comments
- - rebase linux-next
- cpufreq:
- - Moidfy warning about function declarations.
- Documentation: amd-pstate:
- - align with ``amd-pstat``
Changes form V3->V4:
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V2->V3:
- x86:
- - Modify kconfig and description.
- cpufreq: amd-pstate:
- - Add Co-developed-by tag in commit message.
- cpufreq:
- - Modify commit message.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V1->V2:
- acpi: cppc:
- - Add reference link.
- cpufreq:
- - Moidfy link error.
- cpufreq: amd-pstate:
- - Init the priorities of all online CPUs
- - Use a single variable to represent the status of preferred core.
- Documentation:
- - Default enabled preferred core.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
- - Default enabled preferred core.
- - Use a single variable to represent the status of preferred core.
Meng Li (7):
x86: Drop CPU_SUP_INTEL from SCHED_MC_PRIO for the expansion.
acpi: cppc: Add get the highest performance cppc control
cpufreq: amd-pstate: Enable amd-pstate preferred core supporting.
cpufreq: Add a notification message that the highest perf has changed
cpufreq: amd-pstate: Update amd-pstate preferred core ranking
dynamically
Documentation: amd-pstate: introduce amd-pstate preferred core
Documentation: introduce amd-pstate preferrd core mode kernel command
line options
.../admin-guide/kernel-parameters.txt | 5 +
Documentation/admin-guide/pm/amd-pstate.rst | 59 +++++-
arch/x86/Kconfig | 5 +-
drivers/acpi/cppc_acpi.c | 13 ++
drivers/acpi/processor_driver.c | 6 +
drivers/cpufreq/amd-pstate.c | 187 ++++++++++++++++--
drivers/cpufreq/cpufreq.c | 13 ++
include/acpi/cppc_acpi.h | 5 +
include/linux/amd-pstate.h | 10 +
include/linux/cpufreq.h | 5 +
10 files changed, 288 insertions(+), 20 deletions(-)
--
2.34.1
Passing a gfp_t to KUNIT_EXPECT_EQ() causes a cast warning:
lib/kunit/string-stream-test.c:73:9: sparse: sparse: incorrect type in
initializer (different base types) expected long long right_value
got restricted gfp_t const __right
Avoid this by testing stream->gfp for the expected value and passing the
boolean result of this comparison to KUNIT_EXPECT_TRUE(), as was already
done a few lines above in string_stream_managed_init_test().
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Fixes: d1a0d699bfc0 ("kunit: string-stream: Add tests for freeing resource-managed string_stream")
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311181918.0mpCu2Xh-lkp@intel.com/
---
lib/kunit/string-stream-test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/kunit/string-stream-test.c b/lib/kunit/string-stream-test.c
index 06822766f29a..03fb511826f7 100644
--- a/lib/kunit/string-stream-test.c
+++ b/lib/kunit/string-stream-test.c
@@ -72,7 +72,7 @@ static void string_stream_unmanaged_init_test(struct kunit *test)
KUNIT_EXPECT_EQ(test, stream->length, 0);
KUNIT_EXPECT_TRUE(test, list_empty(&stream->fragments));
- KUNIT_EXPECT_EQ(test, stream->gfp, GFP_KERNEL);
+ KUNIT_EXPECT_TRUE(test, (stream->gfp == GFP_KERNEL));
KUNIT_EXPECT_FALSE(test, stream->append_newlines);
KUNIT_EXPECT_TRUE(test, string_stream_is_empty(stream));
--
2.30.2
From: David Woodhouse <dwmw(a)amazon.co.uk>
Using -MD without -MP causes build failures when a header file is deleted
or moved. With -MP, the compiler will emit phony targets for the header
files it lists as dependencies, and the Makefiles won't refuse to attempt
to rebuild a C unit which no longer includes the deleted header.
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
---
tools/testing/selftests/kvm/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..20ea549da570 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -211,7 +211,7 @@ else
LINUX_TOOL_ARCH_INCLUDE = $(top_srcdir)/tools/arch/$(ARCH)/include
endif
CFLAGS += -Wall -Wstrict-prototypes -Wuninitialized -O2 -g -std=gnu99 \
- -Wno-gnu-variable-sized-type-not-at-end -MD\
+ -Wno-gnu-variable-sized-type-not-at-end -MD -MP \
-fno-builtin-memcmp -fno-builtin-memcpy -fno-builtin-memset \
-fno-builtin-strnlen \
-fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) \
--
2.41.0
Changelog:
v6:
* Rebase on top of latest mm-unstable.
* Fix/improve the in-code documentation of the new list_lru
manipulation functions (patch 1)
v5:
* Replace reference getting with an rcu_read_lock() section for
zswap lru modifications (suggested by Yosry)
* Add a new prep patch that allows mem_cgroup_iter() to return
online cgroup.
* Add a callback that updates pool->next_shrink when the cgroup is
offlined (suggested by Yosry Ahmed, Johannes Weiner)
v4:
* Rename list_lru_add to list_lru_add_obj and __list_lru_add to
list_lru_add (patch 1) (suggested by Johannes Weiner and
Yosry Ahmed)
* Some cleanups on the memcg aware LRU patch (patch 2)
(suggested by Yosry Ahmed)
* Use event interface for the new per-cgroup writeback counters.
(patch 3) (suggested by Yosry Ahmed)
* Abstract zswap's lruvec states and handling into
zswap_lruvec_state (patch 5) (suggested by Yosry Ahmed)
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encounter a memcg
that is not online for the global limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Removed a redundant zswap pool getting (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix loongarch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write them out and
improved the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.
Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest
Nhat Pham (3):
list_lru: allows explicit memcg and NUMA node selection
memcontrol: allows mem_cgroup_iter() to check for onlineness
zswap: shrinks zswap pool based on memory pressure
Documentation/admin-guide/mm/zswap.rst | 7 +
drivers/android/binder_alloc.c | 5 +-
fs/dcache.c | 8 +-
fs/gfs2/quota.c | 6 +-
fs/inode.c | 4 +-
fs/nfs/nfs42xattr.c | 8 +-
fs/nfsd/filecache.c | 4 +-
fs/xfs/xfs_buf.c | 6 +-
fs/xfs/xfs_dquot.c | 2 +-
fs/xfs/xfs_qm.c | 2 +-
include/linux/list_lru.h | 54 ++-
include/linux/memcontrol.h | 9 +-
include/linux/mmzone.h | 2 +
include/linux/vm_event_item.h | 1 +
include/linux/zswap.h | 27 +-
mm/list_lru.c | 48 ++-
mm/memcontrol.c | 20 +-
mm/mmzone.c | 1 +
mm/shrinker.c | 4 +-
mm/swap.h | 3 +-
mm/swap_state.c | 26 +-
mm/vmscan.c | 26 +-
mm/vmstat.c | 1 +
mm/workingset.c | 4 +-
mm/zswap.c | 426 +++++++++++++++++---
tools/testing/selftests/cgroup/test_zswap.c | 74 ++--
26 files changed, 629 insertions(+), 149 deletions(-)
base-commit: 40b487ae2620fc9187fee68b09d2cb275de0d60e
--
2.34.1