While testing the split PMD path with lockdep enabled I've got an "Invalid wait context" error caused by split_huge_page_to_list() trying to lock anon_vma->rwsem while inside RCU read section. The issues is due to move_pages_pte() calling split_folio() under RCU read lock. Fix this by unmapping the PTEs and exiting RCU read section before splitting the folio and then retrying. The same retry pattern is used when locking the folio or anon_vma in this function.
Fixes: 94b01c885131 ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan surenb@google.com --- Patch applies over mm-unstable. Please note that the SHA in Fixes tag is unstable.
mm/userfaultfd.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 5e718014e671..71393410e028 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
/* at this point we have src_folio locked */ if (folio_test_large(src_folio)) { + /* split_folio() can block */ + pte_unmap(&orig_src_pte); + pte_unmap(&orig_dst_pte); + src_pte = dst_pte = NULL; err = split_folio(src_folio); if (err) goto out; + goto retry; }
if (!src_anon_vma) {
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
@@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, /* at this point we have src_folio locked */ if (folio_test_large(src_folio)) {
/* split_folio() can block */
pte_unmap(&orig_src_pte);
pte_unmap(&orig_dst_pte);
src_pte = dst_pte = NULL; err = split_folio(src_folio); if (err) goto out;
}goto retry;
Do we also need to clear src_folio and src_folio_pte? If the folio is a thp, I think it means it's pte mapped here. Then after the split we may want to fetch the small folio after the split, not the head one?
On Tue, Jan 2, 2024 at 1:00 AM Peter Xu peterx@redhat.com wrote:
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
@@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
/* at this point we have src_folio locked */ if (folio_test_large(src_folio)) {
/* split_folio() can block */
pte_unmap(&orig_src_pte);
pte_unmap(&orig_dst_pte);
src_pte = dst_pte = NULL; err = split_folio(src_folio); if (err) goto out;
goto retry; }
Do we also need to clear src_folio and src_folio_pte? If the folio is a thp, I think it means it's pte mapped here. Then after the split we may want to fetch the small folio after the split, not the head one?
I think we need to re-fetch the src_folio only if the src_addr falls into a non-head page. Looking at the __split_huge_page(), the head page is skipped in the last loop, so I think it should stay valid. That said, maybe it's just an implementation detail of the __split_huge_page() and I should not rely on that and refetch anyway?
-- Peter Xu
On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Jan 2, 2024 at 1:00 AM Peter Xu peterx@redhat.com wrote:
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
@@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
/* at this point we have src_folio locked */ if (folio_test_large(src_folio)) {
/* split_folio() can block */
pte_unmap(&orig_src_pte);
pte_unmap(&orig_dst_pte);
src_pte = dst_pte = NULL; err = split_folio(src_folio); if (err) goto out;
goto retry; }
Do we also need to clear src_folio and src_folio_pte? If the folio is a thp, I think it means it's pte mapped here. Then after the split we may want to fetch the small folio after the split, not the head one?
I think we need to re-fetch the src_folio only if the src_addr falls into a non-head page. Looking at the __split_huge_page(), the head page is skipped in the last loop, so I think it should stay valid. That said, maybe it's just an implementation detail of the __split_huge_page() and I should not rely on that and refetch anyway?
I'll post a v2 with this fix and re-fetching the folio unconditionally. We also don't need to reset src_folio_pte value because it's used only if src_folio is not NULL. Thanks for catching this, Peter!
-- Peter Xu
On Tue, Jan 2, 2024 at 3:16 PM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan surenb@google.com wrote:
On Tue, Jan 2, 2024 at 1:00 AM Peter Xu peterx@redhat.com wrote:
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
@@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
/* at this point we have src_folio locked */ if (folio_test_large(src_folio)) {
/* split_folio() can block */
pte_unmap(&orig_src_pte);
pte_unmap(&orig_dst_pte);
src_pte = dst_pte = NULL; err = split_folio(src_folio); if (err) goto out;
goto retry; }
Do we also need to clear src_folio and src_folio_pte? If the folio is a thp, I think it means it's pte mapped here. Then after the split we may want to fetch the small folio after the split, not the head one?
I think we need to re-fetch the src_folio only if the src_addr falls into a non-head page. Looking at the __split_huge_page(), the head page is skipped in the last loop, so I think it should stay valid. That said, maybe it's just an implementation detail of the __split_huge_page() and I should not rely on that and refetch anyway?
I'll post a v2 with this fix and re-fetching the folio unconditionally. We also don't need to reset src_folio_pte value because it's used only if src_folio is not NULL.
Posted at https://lore.kernel.org/all/20240102233256.1077959-1-surenb@google.com/
Thanks for catching this, Peter!
-- Peter Xu
linux-kselftest-mirror@lists.linaro.org