Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts.
It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable.
Cc: linux-stable stable@vger.kernel.org Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") Signed-off-by: Peter Xu peterx@redhat.com --- mm/khugepaged.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a19aa140fd52..42ac93b4bd87 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -575,6 +575,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, result = SCAN_PTE_NON_PRESENT; goto out; } + if (pte_uffd_wp(pteval)) { + result = SCAN_PTE_UFFD_WP; + goto out; + } page = vm_normal_page(vma, address, pteval); if (unlikely(!page) || unlikely(is_zone_device_page(page))) { result = SCAN_PAGE_NULL;
On 05.04.23 17:51, Peter Xu wrote:
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts.
It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable.
Cc: linux-stable stable@vger.kernel.org Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") Signed-off-by: Peter Xu peterx@redhat.com
mm/khugepaged.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a19aa140fd52..42ac93b4bd87 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -575,6 +575,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, result = SCAN_PTE_NON_PRESENT; goto out; }
if (pte_uffd_wp(pteval)) {
result = SCAN_PTE_UFFD_WP;
goto out;
page = vm_normal_page(vma, address, pteval); if (unlikely(!page) || unlikely(is_zone_device_page(page))) { result = SCAN_PAGE_NULL;}
Yes, I agree that would be a small race where it could happen.
Reviewed-by: David Hildenbrand david@redhat.com
On Wed, Apr 5, 2023 at 8:51 AM Peter Xu peterx@redhat.com wrote:
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts.
It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable.
Yeah, I agree. But I got confused by userfaultfd_wp(vma) and pte_uffd_wp(pte). If a vma is armed with uffd wp, shall we skip the whole vma? If so, whether it is better to just check vma? We do revalidate vma once reacquiring mmap_lock, so we should be able to bail out earlier.
Cc: linux-stable stable@vger.kernel.org Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") Signed-off-by: Peter Xu peterx@redhat.com
mm/khugepaged.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a19aa140fd52..42ac93b4bd87 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -575,6 +575,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, result = SCAN_PTE_NON_PRESENT; goto out; }
if (pte_uffd_wp(pteval)) {
result = SCAN_PTE_UFFD_WP;
goto out;
} page = vm_normal_page(vma, address, pteval); if (unlikely(!page) || unlikely(is_zone_device_page(page))) { result = SCAN_PAGE_NULL;
-- 2.39.1
On Wed, Apr 05, 2023 at 09:59:15AM -0700, Yang Shi wrote:
On Wed, Apr 5, 2023 at 8:51 AM Peter Xu peterx@redhat.com wrote:
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts.
It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable.
Yeah, I agree. But I got confused by userfaultfd_wp(vma) and pte_uffd_wp(pte). If a vma is armed with uffd wp, shall we skip the whole vma? If so, whether it is better to just check vma? We do revalidate vma once reacquiring mmap_lock, so we should be able to bail out earlier.
Checking against VMA is safe too, the difference is current code still allows thp to be collapsed as long as none of the page is explicitly protected over the thp range, even if the range is registered with userfault-wp. That's also what e1e267c7928f does.
Here we have slightly different handling between anon / file thps (file thps checks against the vma flags), IMHO mostly because we don't scan pgtables when making decisions to collapse a shmem thp, so we made it simple by checking against vma flags. We can make it the same as anon but it might be an overkill just to scan the entries for uffd-wp purpose.
For anon we always scans the pgtable anyway so it's easier to make a more accurate decision.
Thanks,
On Wed, Apr 5, 2023 at 11:10 AM Peter Xu peterx@redhat.com wrote:
On Wed, Apr 05, 2023 at 09:59:15AM -0700, Yang Shi wrote:
On Wed, Apr 5, 2023 at 8:51 AM Peter Xu peterx@redhat.com wrote:
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts.
It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable.
Yeah, I agree. But I got confused by userfaultfd_wp(vma) and pte_uffd_wp(pte). If a vma is armed with uffd wp, shall we skip the whole vma? If so, whether it is better to just check vma? We do revalidate vma once reacquiring mmap_lock, so we should be able to bail out earlier.
Checking against VMA is safe too, the difference is current code still allows thp to be collapsed as long as none of the page is explicitly protected over the thp range, even if the range is registered with userfault-wp. That's also what e1e267c7928f does.
Here we have slightly different handling between anon / file thps (file thps checks against the vma flags), IMHO mostly because we don't scan pgtables when making decisions to collapse a shmem thp, so we made it simple by checking against vma flags. We can make it the same as anon but it might be an overkill just to scan the entries for uffd-wp purpose.
For anon we always scans the pgtable anyway so it's easier to make a more accurate decision.
Aha, I see. It does resolve my confusion. Thanks for elaborating.
The patch looks good to me. Reviewed-by: Yang Shi shy828301@gmail.com
Thanks,
-- Peter Xu
linux-stable-mirror@lists.linaro.org