To start with I do apologise for coming to this at v6, I realise it's irritating to have push back at this late stage. This is more so my attempt to understand where this series -sits- so I can properly review it.
So please bear with me here :)
So, I remain very confused. This may just be a _me_ thing here :)
So let me check my understanding:
1. This series introduces this new THP deferred mode. 2. By 'follow-up' really you mean 'inspired by' or 'related to' right? 3. If this series lands before [1], commits 2 - 4 are 'undefined behaviour'.
In my view if 3 is true this series should be RFC until [1] merges.
If I've got it wrong and this needs to land first, we should RFC [1].
That way we can un-RFC once the dependency is met.
We have about 5 [m]THP series in flight at the moment, all touching at least vaguely related stuff, so any help for reviewers would be hugely appreciated thanks :)
On Wed, May 21, 2025 at 04:41:54AM -0600, Nico Pache wrote:
On Tue, May 20, 2025 at 3:43 AM Lorenzo Stoakes lorenzo.stoakes@oracle.com wrote:
On Wed, May 14, 2025 at 09:38:53PM -0600, Nico Pache wrote:
This series is a follow-up to [1], which adds mTHP support to khugepaged. mTHP khugepaged support is a "loose" dependency for the sysfs/sysctl configs to make sense. Without it global="defer" and mTHP="inherit" case is "undefined" behavior.
How can this be a follow up to an unmerged series? I'm confused by that.
Hi Lorenzo,
follow up or loose dependency. Not sure the correct terminology.
See above. Let's nail this down please.
Either way, as I was developing this as a potential solution for the THP internal fragmentation issue, upstream was working on adding mTHPs. By adding a new THP sysctl entry I noticed mTHP would now be missing the same entry. Furthermore I was told mTHP support for khugepaged was a desire, so I began working on it in conjunction. So given the undefined behavior of defer globally while any mix of mTHP settings, it became dependent on the khugepaged support. Either way patch 1 of this series is the core functionality. The rest is to fill the undefined behavior gap.
And you're saying that you're introducing 'undefined behaviour' on the assumption that another series which seems to have quite a bit of discussion let to run will be merged?
This could technically get merged without the mTHP khugepaged changes, but then the reviews would probably all be pointing out what I pointed out above. Chicken or Egg problem...
While I'd understand if this was an RFC just to put the idea out there, you're not proposing it as such?
Nope we've already discussed this in both the MM alignment and thp upstream meetings, no one was opposing it, and a lot of testing was done-- by me, RH's CI, and our perf teams. Ive posted several RFCs before posting a patchset.
Unless there's a really good reason we're doing this way (I may be missing something), can we just have this as an RFC until the series it depends on is settled?
Hopefully paragraph one clears this up! They were built in conjunction, but posting them as one series didn't feel right (and IIRC this was also discussed, and this was decided).
'This was also discussed and this was decided' :)
I'm guessing rather you mean discussion was had with other reviewers and of course our earstwhile THP maintainer David, and you guys decided this made more sense?
Obviously upstream discussion is what counts, but as annoying as it is, one does have to address the concerns of reviewers even if late to a series (again, apologies for this).
So, to be clear - I'm not intending to hold up or block the series, I just want to understand how things are, this is the purpose here.
Thanks!
We've seen cases were customers switching from RHEL7 to RHEL8 see a significant increase in the memory footprint for the same workloads.
Through our investigations we found that a large contributing factor to the increase in RSS was an increase in THP usage.
For workloads like MySQL, or when using allocators like jemalloc, it is often recommended to set /transparent_hugepages/enabled=never. This is in part due to performance degradations and increased memory waste.
This series introduces enabled=defer, this setting acts as a middle ground between always and madvise. If the mapping is MADV_HUGEPAGE, the page fault handler will act normally, making a hugepage if possible. If the allocation is not MADV_HUGEPAGE, then the page fault handler will default to the base size allocation. The caveat is that khugepaged can still operate on pages that are not MADV_HUGEPAGE.
This allows for three things... one, applications specifically designed to use hugepages will get them, and two, applications that don't use hugepages can still benefit from them without aggressively inserting THPs at every possible chance. This curbs the memory waste, and defers the use of hugepages to khugepaged. Khugepaged can then scan the memory for eligible collapsing. Lastly there is the added benefit for those who want THPs but experience higher latency PFs. Now you can get base page performance at the PF handler and Hugepage performance for those mappings after they collapse.
Admins may want to lower max_ptes_none, if not, khugepaged may aggressively collapse single allocations into hugepages.
TESTING:
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- In [1] I provided a script [2] that has multiple access patterns
- lots of general use.
OK so this truly is dependent on the unmerged series? Or isn't it?
Is your testing based on that?
Most of the testing was done in conjunction, but independent testing was also done on this series (including by a large customer that was itching to try the changes, and they were very satisfied with the results).
You should make this very clear in the cover letter.
Because again... that surely makes this series a no-go until we land the prior (which might be changed, and thus necessitate re-testing).
Are you going to provide any of these numbers/data anywhere?
There is a link to the results in this cover letter [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.ht...
Ultimately it's not ok in mm to have a link to a website that might go away any time, these cover letters are 'baked in' to the commit log. Are you sure this website with 'testoutput2' will exist in 10 years time? :)
You should at the very least add a summary of this data in the cover letter, perhaps referring back to this link as 'at the time of writing full results are available at' or something like this.
redis testing. This test was my original case for the defer mode. What I was able to prove was that THP=always leads to increased max_latency cases; hence why it is recommended to disable THPs for redis servers. However with 'defer' we dont have the max_latency spikes and can still get the system to utilize THPs. I further tested this with the mTHP defer setting and found that redis (and probably other jmalloc users) can utilize THPs via defer (+mTHP defer) without a large latency penalty and some potential gains. I uploaded some mmtest results here[3] which compares: stock+thp=never stock+(m)thp=always khugepaged-mthp + defer (max_ptes_none=64)
The results show that (m)THPs can cause some throughput regression in some cases, but also has gains in other cases. The mTHP+defer results have more gains and less losses over the (m)THP=always case.
V6 Changes:
- nits
- rebased dependent series and added review tags
V5 Changes:
- rebased dependent series
- added reviewed-by tag on 2/4
V4 Changes:
- Minor Documentation fixes
- rebased the dependent series [1] onto mm-unstable commit 0e68b850b1d3 ("vmalloc: use atomic_long_add_return_relaxed()")
V3 Changes:
- Combined the documentation commits into one, and moved a section to the khugepaged mthp patchset
V2 Changes:
- base changes on mTHP khugepaged support
- Fix selftests parsing issue
- add mTHP defer option
- add mTHP defer Documentation
[1] - https://lore.kernel.org/all/20250515032226.128900-1-npache@redhat.com/ [2] - https://gitlab.com/npache/khugepaged_mthp_test [3] - https://people.redhat.com/npache/mthp_khugepaged_defer/testoutput2/output.ht...
Nico Pache (4): mm: defer THP insertion to khugepaged mm: document (m)THP defer usage khugepaged: add defer option to mTHP options selftests: mm: add defer to thp setting parser
Documentation/admin-guide/mm/transhuge.rst | 31 +++++++--- include/linux/huge_mm.h | 18 +++++- mm/huge_memory.c | 69 +++++++++++++++++++--- mm/khugepaged.c | 8 +-- tools/testing/selftests/mm/thp_settings.c | 1 + tools/testing/selftests/mm/thp_settings.h | 1 + 6 files changed, 106 insertions(+), 22 deletions(-)
-- 2.49.0
Cheers, Lorenzo