On Mon, Mar 8, 2021 at 11:30 AM David Hildenbrand david@redhat.com wrote:
On 08.03.21 20:11, Yang Shi wrote:
On Mon, Mar 8, 2021 at 11:01 AM Zi Yan ziy@nvidia.com wrote:
On 8 Mar 2021, at 13:11, David Hildenbrand wrote:
On 08.03.21 18:49, Zi Yan wrote:
On 8 Mar 2021, at 11:17, David Hildenbrand wrote:
On 08.03.21 16:22, Zi Yan wrote:
> From: Zi Yan ziy@nvidia.com
>
> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
> <debugfs>/split_huge_pages_in_range_pid, THPs in the process with the
> given pid and virtual address range are split. It is used to test
> split_huge_page function. In addition, a selftest program is added to
> tools/testing/selftests/vm to utilize the interface by splitting
> PMD THPs and PTE-mapped THPs.
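A minimal sketch of driving the interface described in the patch summary above from C. The patch description only specifies the "<pid>,<vaddr_start>,<vaddr_end>" layout; the 0x-prefixed hexadecimal addresses, the helper names format_split_request and request_split, and passing the debugfs file path in as a parameter are assumptions made for illustration.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Build the "<pid>,<vaddr_start>,<vaddr_end>" request string.
 * ASSUMPTION: addresses are written as 0x-prefixed hex; the patch
 * description quoted above does not state the number format.
 * Returns the string length, or -1 if the buffer is too small.
 */
static int format_split_request(char *buf, size_t len, pid_t pid,
                                unsigned long start, unsigned long end)
{
    int n = snprintf(buf, len, "%d,0x%lx,0x%lx", (int)pid, start, end);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write the request to the debugfs file at `path` (hypothetical helper).
 * The caller supplies the full path, i.e. the debugfs mount point
 * followed by split_huge_pages_in_range_pid.
 * Returns 0 on success, -1 on any failure.
 */
static int request_split(const char *path, pid_t pid,
                         unsigned long start, unsigned long end)
{
    char buf[64];
    FILE *f;
    int ok;

    if (format_split_request(buf, sizeof(buf), pid, start, end) < 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs(buf, f) >= 0;
    if (fclose(f) != 0)   /* the write may only fail at flush time */
        ok = 0;

    return ok ? 0 : -1;
}
```

A caller would pass its own pid (or a target pid) and a page-aligned range, for example request_split(path, getpid(), start, end), where path depends on where debugfs is mounted on the system.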
Won't something like
MADV_HUGEPAGE
Access memory
MADV_NOHUGEPAGE
Have a similar effect? What's the benefit of this?
Thanks for checking the patch.
No, MADV_NOHUGEPAGE just replaces VM_HUGEPAGE with VM_NOHUGEPAGE, nothing else will be done.
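For reference, the three-step sequence asked about above can be sketched as follows. The function name hugepage_then_nohugepage and the 4 MiB region size are illustrative assumptions. Per the reply above, the final madvise() call only changes the VMA flags, so any THP that was faulted in during the second step would stay mapped as a huge page.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* ASSUMPTION: 4 MiB, chosen to cover at least one 2 MiB PMD-sized THP. */
#define REGION_SIZE (4UL << 20)

/*
 * Run MADV_HUGEPAGE -> touch memory -> MADV_NOHUGEPAGE on a fresh
 * anonymous mapping. Returns how many steps completed:
 *   0 = mmap failed, 1 = mapped only (MADV_HUGEPAGE rejected, e.g. THP
 *   not configured), 2 = hinted and touched, 3 = MADV_NOHUGEPAGE applied.
 */
static int hugepage_then_nohugepage(void)
{
    int steps = 0;
    char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return steps;
    steps = 1;

    if (madvise(p, REGION_SIZE, MADV_HUGEPAGE) == 0) {
        /* Fault the pages in; the kernel may back them with THPs. */
        memset(p, 1, REGION_SIZE);
        steps = 2;

        /* Only flips VM_HUGEPAGE to VM_NOHUGEPAGE; no split is requested. */
        if (madvise(p, REGION_SIZE, MADV_NOHUGEPAGE) == 0)
            steps = 3;
    }

    munmap(p, REGION_SIZE);
    return steps;
}
```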
Ah, okay - maybe my memory was tricking me. There is some s390x KVM code that forces MADV_NOHUGEPAGE and force-splits everything.
I do wonder, though, if this functionality would be worth a proper user interface (e.g., madvise). There might be actual benefit in having this as a !debug interface.
I think you are aware of the discussion in https://lkml.kernel.org/r/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com
Yes. Thanks for bringing this up.
If there will be an interface to collapse a THP -- "this memory area is worth extra performance now by collapsing a THP if possible" -- it might also be helpful to have the opposite functionality -- "this memory area is not worth a THP, rather use that somewhere else".
MADV_HUGE_COLLAPSE vs. MADV_HUGE_SPLIT
I agree that MADV_HUGE_SPLIT would be useful as the opposite of COLLAPSE when the user might just want PAGESIZE mappings. Right now, HUGE_SPLIT is implicit from mapping changes like mprotect or MADV_DONTNEED.
IMHO, it doesn't sound very useful. MADV_DONTNEED would split the PMD for any partially covered THP. If the range covers the whole THP, the whole THP is going to be freed anyway. All other places in the kernel that need to split a THP are already covered. So I don't see any use case from userspace for just splitting a PMD into PTEs.
THPs are a limited resource. So indicating which virtual memory regions are not performance sensitive right now (e.g., cold pages in a database) and not worth a THP might be quite valuable, no?
Such functionality could be achieved by MADV_COLD or MADV_PAGEOUT, right? Then a subsequent call to MADV_NOHUGEPAGE would prevent collapsing or allocating a THP for that area.
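The two-call approach suggested above could look roughly like this from userspace. This is only a sketch: the helper name mark_region_cold_no_thp and its reclaim_now parameter are hypothetical, and the fallback values for MADV_COLD and MADV_PAGEOUT are the asm-generic ones, supplied only for older libc headers.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers (asm-generic values, Linux 5.4+). */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * Hypothetical helper: mark [addr, addr + len) as cold (or reclaim it
 * right away when reclaim_now is non-zero), then ask the kernel not to
 * back the range with THPs from now on.
 * addr must be page-aligned. Returns 0 on success, -1 with errno set
 * if either madvise() call fails.
 */
static int mark_region_cold_no_thp(void *addr, size_t len, int reclaim_now)
{
    int advice = reclaim_now ? MADV_PAGEOUT : MADV_COLD;

    if (madvise(addr, len, advice) != 0)
        return -1;

    /* Prevents later THP allocation or collapse in this range. */
    return madvise(addr, len, MADV_NOHUGEPAGE);
}
```

Whether this actually hands the THPs back for use elsewhere is exactly the point under discussion in the thread; the sketch only shows the call sequence.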
-- Thanks,
David / dhildenb