On Wed, Apr 30, 2025 at 2:15 PM Zi Yan ziy@nvidia.com wrote:
On 28 Apr 2025, at 14:29, Nico Pache wrote:
The new defer option for (m)THPs allows for a more conservative approach to (m)THPs. Document its usage in the transhuge admin-guide.
Reviewed-by: Bagas Sanjaya bagasdotme@gmail.com Signed-off-by: Nico Pache npache@redhat.com
Documentation/admin-guide/mm/transhuge.rst | 31 ++++++++++++++++------ 1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 5c63fe51b3ad..c50253357793 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -88,8 +88,9 @@ In certain cases when hugepages are enabled system wide, application may end up allocating more memory resources. An application may mmap a large region but only touch 1 byte of it, in that case a 2M page might be allocated instead of a 4k page for no good. This is why it's -possible to disable hugepages system-wide and to only have them inside -MADV_HUGEPAGE madvise regions. +possible to disable hugepages system-wide, only have them inside +MADV_HUGEPAGE madvise regions, or defer them away from the page fault +handler to khugepaged.
Embedded systems should enable hugepages only inside madvise regions to eliminate any risk of wasting any precious byte of memory and to @@ -99,6 +100,15 @@ Applications that gets a lot of benefit from hugepages and that don't risk to lose memory by using hugepages, should use madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+Applications that would like to benefit from THPs but would still like a +more memory conservative approach can choose 'defer'. This avoids +inserting THPs at the page fault handler unless they are MADV_HUGEPAGE. +Khugepaged will then scan the mappings for potential collapses into (m)THP
How about the text below? It explicitly states khugepaged behavior.
Khugepaged will then scan all mappings, even those not explicitly marked with MADV_HUGEPAGE, for potential collapses into (m)THPs.
I agree, this reads better. I can modify it on the V6 :)
+pages. Admins using this the 'defer' setting should consider +tweaking khugepaged/max_ptes_none. The current default of 511 may +aggressively collapse your PTEs into PMDs. Lower this value to conserve +more memory (i.e., max_ptes_none=64).
.. _thp_sysfs:
sysfs @@ -109,11 +119,14 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE -regions (to avoid the risk of consuming more memory resources) or enabled -system wide. This can be achieved per-supported-THP-size with one of:: +regions (to avoid the risk of consuming more memory resources), deferred to +khugepaged, or enabled system wide.
+This can be achieved per-supported-THP-size with one of::
echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo defer >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
where <size> is the hugepage size being addressed, the available sizes @@ -136,6 +149,7 @@ The top-level setting (for use with "inherit") can be set by issuing one of the following commands::
echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo defer >/sys/kernel/mm/transparent_hugepage/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo never >/sys/kernel/mm/transparent_hugepage/enabled
@@ -286,7 +300,8 @@ of small pages into one large page:: A higher value leads to use additional memory for programs. A lower value leads to gain less thp performance. Value of max_ptes_none can waste cpu time very little, you can -ignore it. +ignore it. Consider lowering this value when using +``transparent_hugepage=defer``
``max_ptes_swap`` specifies how many pages can be brought in from swap when collapsing a group of pages into a transparent huge page:: @@ -311,14 +326,14 @@ Boot parameters
You can change the sysfs boot time default for the top-level "enabled" control by passing the parameter ``transparent_hugepage=always`` or -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the -kernel command line. +``transparent_hugepage=madvise`` or ``transparent_hugepage=defer`` or +``transparent_hugepage=never`` to the kernel command line.
Alternatively, each supported anonymous THP size can be controlled by passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``, where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``, -``never`` or ``inherit``. +``defer``, ``never`` or ``inherit``.
For example, the following will set 16K, 32K, 64K THP to ``always``, set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
Otherwise, LGTM. Thanks. Reviewed-by: Zi Yan ziy@nvidia.com
-- Best Regards, Yan, Zi