On Fri, Oct 25, 2024 at 08:13:26PM +0200, Vlastimil Babka wrote:
On 10/23/24 18:24, Lorenzo Stoakes wrote:
The existing generic pagewalk logic permits the walking of page tables, invoking user-provided mm_walk_ops callbacks at individual page table levels.
This is useful for traversing existing page table entries, but precludes the ability to establish new ones.
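For context, a typical read-only user of the existing API looks something like the below - a minimal sketch only, with the counting callback purely illustrative:

	static int count_pte_entry(pte_t *pte, unsigned long addr,
				   unsigned long next, struct mm_walk *walk)
	{
		unsigned long *nr_present = walk->private;

		/* We can inspect the entry, but have no way to create one. */
		if (pte_present(ptep_get(pte)))
			(*nr_present)++;
		return 0;
	}

	static const struct mm_walk_ops count_ops = {
		.pte_entry	= count_pte_entry,
	};

	/* The caller must hold mmap_read_lock(mm): */
	err = walk_page_range(mm, start, end, &count_ops, &nr_present);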
Existing mechanisms for performing a walk which also installs page table entries as necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere.
Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback in mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal.
If a THP is encountered at either the PMD or PUD level, we split it only if there is an ops->pte_entry() (or an ops->pmd_entry() at PUD level); otherwise, if there is only an ops->install_pte(), we avoid the unnecessary split.
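For illustration, an install-only walker might look something like the below - a sketch under assumptions: my_make_pte() is a hypothetical helper, and the callback fills in *ptep while the walk machinery performs the actual installation:

	/*
	 * Hypothetical install-only ops: with no pte_entry (or pmd_entry),
	 * the walk leaves an existing THP leaf intact rather than
	 * splitting it, per the above.
	 */
	static int my_install_pte(unsigned long addr, unsigned long next,
				  pte_t *ptep, struct mm_walk *walk)
	{
		*ptep = my_make_pte(addr);	/* hypothetical helper */
		return 0;
	}

	static const struct mm_walk_ops install_only_ops = {
		.install_pte	= my_install_pte,
	};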
We do not support hugetlb at this stage.
If the install_pte() callback returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges.
If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range.
If install_pte() is not defined, then this patch makes no functional difference, and all existing logic will work precisely as it did before.
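So, with install_pte() defined, the per-PTE dispatch becomes roughly the following (a sketch of the resulting behaviour, not the literal patch code):

	if (ops->install_pte && pte_none(ptep_get(pte))) {
		pte_t new_pte;

		err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte, walk);
		if (!err) {
			set_pte_at(walk->mm, addr, pte, new_pte);
			/* No TLB flush needed - see below. */
			update_mmu_cache(walk->vma, addr, pte);
		}
	} else if (ops->pte_entry) {
		/* With install_pte set, only reached for existing entries. */
		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
	}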
As we only permit the installation of PTEs where a mapping does not already exist, there is no need for TLB management; however, we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs.
We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function exposed only to mm/internal.h.
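An mm-internal caller might then do something like the below (a sketch; populate_range() and its unwinding policy are illustrative only):

	static int populate_range(struct mm_struct *mm, unsigned long start,
				  unsigned long end)
	{
		int err;

		/* walk_page_range_mm() is declared in mm/internal.h only. */
		err = walk_page_range_mm(mm, start, end, &install_only_ops, NULL);

		/*
		 * On error the walk has aborted, so [start, end) may be
		 * partially populated - it is up to us to unwind or
		 * tolerate that.
		 */
		return err;
	}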
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Thanks!
Just a small subjective suggestion in case you agree and there's a respin or followups:
@@ -109,18 +131,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-		/*
-		 * Check this here so we only break down trans_huge
-		 * pages when we _need_ to
-		 */
-		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
-		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pte_entry))
+		if (walk->action == ACTION_CONTINUE)
 			continue;
+
+		if (!ops->install_pte && !ops->pte_entry)
+			continue; /* Nothing to do. */
+		if (!ops->pte_entry && ops->install_pte &&
+		    pmd_present(*pmd) &&
+		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+			continue; /* Avoid unnecessary split. */
Much better now, thanks, but maybe the last 2 parts could be:
		if (!ops->pte_entry) {
			if (!ops->install_pte)
				continue; /* Nothing to do. */
			else if (pmd_present(*pmd) &&
				 (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
				continue; /* Avoid unnecessary split. */
		}
I quite liked separating out the 'nothing to do' vs. the 'unnecessary split' cases, but I agree it makes it harder to see that the 2nd case is an 'install pte ONLY' case.
Yeah so I think your version is better, but maybe we can find a way to be more expressive somehow... if we could declare vars mid-way through it'd be easier :P
Will improve on respin if it comes up
Or at least put !ops->pte_entry first in both conditions?
Ack yeah that'd be better!
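I.e. something like:

		if (!ops->pte_entry && !ops->install_pte)
			continue; /* Nothing to do. */
		if (!ops->pte_entry && ops->install_pte &&
		    pmd_present(*pmd) &&
		    (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
			continue; /* Avoid unnecessary split. */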
 		if (walk->vma)
 			split_huge_pmd(walk->vma, pmd, addr);
+		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+			continue; /* Nothing to do. */
 
 		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
@@ -148,11 +171,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 again:
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud)) {
-			if (ops->pte_hole)
+			if (ops->install_pte)
+				err = __pmd_alloc(walk->mm, pud, addr);
+			else if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
-			continue;
+			if (!ops->install_pte)
+				continue;
 		}
 
 		walk->action = ACTION_SUBTREE;
@@ -164,14 +190,20 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
-		    walk->action == ACTION_CONTINUE ||
-		    !(ops->pmd_entry || ops->pte_entry))
+		if (walk->action == ACTION_CONTINUE)
 			continue;
+
+		if (!ops->install_pte && !ops->pte_entry && !ops->pmd_entry)
+			continue; /* Nothing to do. */
+		if (!ops->pmd_entry && !ops->pte_entry && ops->install_pte &&
+		    pud_present(*pud) &&
+		    (pud_trans_huge(*pud) || pud_devmap(*pud)))
+			continue; /* Avoid unnecessary split. */
Ditto.
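(Presumably the analogous nesting here, i.e.:

		if (!ops->pmd_entry && !ops->pte_entry) {
			if (!ops->install_pte)
				continue; /* Nothing to do. */
			else if (pud_present(*pud) &&
				 (pud_trans_huge(*pud) || pud_devmap(*pud)))
				continue; /* Avoid unnecessary split. */
		}
)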
Ack!
Thanks!
Cheers!