*Changes in v10:*
- Add specific condition to return error if hugetlb is used with wp async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation

*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code

*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation

*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty flags
Hello,
Note: Soft-dirty pages and pages which have been written-to are synonyms. As the kernel already has a soft-dirty feature, which we have given up on using, we use the written-to terminology while using UFFD async WP under the hood.
This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information whether the pages have been written-to (PAGE_IS_WRITTEN), are file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write-protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE).
It is possible to find and clear soft-dirty pages entirely in userspace, but it isn't efficient:
- mprotect and a SIGSEGV handler for bookkeeping
- userfaultfd wp (synchronous) with the handler for bookkeeping
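For reference, the first userspace-only approach can be sketched roughly as below. This is only an illustration under assumptions, not code from this series: all `track_*` names are made up, the region is limited to 64 pages, and calling mprotect() from a signal handler is a common practice rather than something POSIX guarantees to be async-signal-safe. Every first write to a page costs a full signal delivery, which is where the inefficiency comes from.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACK_MAX_PAGES 64	/* arbitrary limit for this sketch */

static char *track_base;	/* start of the tracked region */
static size_t track_npages;
static long track_page_size;
static volatile sig_atomic_t track_dirty[TRACK_MAX_PAGES];

/* Fault handler: record the page as written, then let the write proceed. */
static void track_on_segv(int sig, siginfo_t *si, void *uctx)
{
	uintptr_t addr = (uintptr_t)si->si_addr;
	uintptr_t base = (uintptr_t)track_base;
	size_t idx;

	(void)sig;
	(void)uctx;

	if (addr < base || addr >= base + track_npages * track_page_size)
		_exit(1);	/* a genuine crash, not one of our pages */

	idx = (addr - base) / track_page_size;
	track_dirty[idx] = 1;
	/* Unprotect only the faulting page so the write can be retried. */
	mprotect(track_base + idx * track_page_size, track_page_size,
		 PROT_READ | PROT_WRITE);
}

/* Map npages of read-only anonymous memory and start tracking writes. */
char *track_start(size_t npages)
{
	struct sigaction sa;

	if (npages == 0 || npages > TRACK_MAX_PAGES)
		return NULL;

	track_page_size = sysconf(_SC_PAGESIZE);
	track_npages = npages;
	track_base = mmap(NULL, npages * track_page_size, PROT_READ,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (track_base == MAP_FAILED)
		return NULL;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = track_on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) != 0)
		return NULL;

	return track_base;
}

/* Returns non-zero if page idx of the tracked region has been written. */
int track_is_dirty(size_t idx)
{
	return idx < track_npages && track_dirty[idx];
}
```

After track_start(4), a store to the first byte of the region faults once, marks page 0 and continues; later stores to the same page run at full speed until the page is protected again.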
Some benchmarks can be seen here [1]. This series addresses gaps in what the kernel offered earlier:
- There is no atomic operation in the kernel to get the soft-dirty/written-to status and clear it.
- The pages which have been written-to cannot be found accurately. (The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU project. The procfs interface is enough for finding the soft-dirty bit status and clearing the soft-dirty bit of all the pages of a process. We have a use case where we need to track the soft-dirty PTE bit for only specific pages, on demand. We need this tracking and clearing mechanism for a region of memory while the process is running, in order to emulate the Windows GetWriteWatch() API.
*(Moved to using UFFD instead of the soft-dirty feature to find pages which have been written-to, from the v7 patch series)*: Stop using the soft-dirty flags for finding which pages have been written to. The feature is too delicate and inaccurate, as it reports more soft-dirty pages than there actually are. There is no interest in correcting it [2][3], as this is how the feature was written years ago and its behaviour shouldn't be changed now. Peter Xu has suggested using the async version of the UFFD WP [4], as it is inherently based on the PTEs.
So in this patch series, I've added a new mode to UFFD, an asynchronous version of write protect. When this variant of UFFD WP is used, the page faults are resolved automatically by the kernel. The pages which have been written-to can be found by reading the pagemap file (!PM_UFFD_WP). This feature can be used to find which pages have been written to since the pages were write-protected. It works just like the soft-dirty flag, but without reporting extra pages which aren't actually soft-dirty.
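A rough userspace sketch of the "!PM_UFFD_WP" check follows. It is not taken from the selftest. The bit positions are the ones documented in Documentation/admin-guide/mm/pagemap.rst (bit 57 uffd-wp, bit 62 swapped, bit 63 present); the `pm_*` function names are made up, and treating "populated and not uffd-wp" as "written" is only valid under the assumption that the range was write-protected through uffd beforehand.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bits of a 64-bit pagemap entry (see pagemap.rst). */
#define PM_UFFD_WP	(1ULL << 57)	/* pte is uffd-wp write-protected */
#define PM_SWAP		(1ULL << 62)	/* page is swapped */
#define PM_PRESENT	(1ULL << 63)	/* page is present */

/*
 * Written-to means: populated and no longer uffd write-protected.
 * Assumes the page was write-protected via uffd before the check.
 */
int pm_entry_written(uint64_t entry)
{
	if (!(entry & (PM_PRESENT | PM_SWAP)))
		return 0;	/* never populated, so never written */
	return !(entry & PM_UFFD_WP);
}

/* Byte offset of the entry describing vaddr inside the pagemap file. */
off_t pm_entry_offset(uintptr_t vaddr, long page_size)
{
	return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* Returns 1 if written, 0 if not, -1 if the entry could not be read. */
int pm_page_written(const void *addr)
{
	uint64_t entry;
	ssize_t n;
	long page_size = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	n = pread(fd, &entry, sizeof(entry),
		  pm_entry_offset((uintptr_t)addr, page_size));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return pm_entry_written(entry);
}
```

Reading one entry per page like this is what the PAGEMAP_SCAN ioctl is meant to avoid for large ranges; the sketch only shows how the bit is interpreted.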
The information about whether a page is file mapped, present or swapped is required for the CRIU project [5][6]. The required mask, any mask, excluded mask and return mask are also required for the CRIU project [5].
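One plausible reading of how the four masks combine is sketched below. This is an assumption for illustration, not the code in fs/proc/task_mmu.c: the flag bit values, the `struct scan_masks` layout and the function names are all hypothetical; only the PAGE_IS_* names come from this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag names from this series; the bit values here are assumed. */
#define PAGE_IS_WRITTEN	(1u << 0)
#define PAGE_IS_FILE	(1u << 1)
#define PAGE_IS_PRESENT	(1u << 2)
#define PAGE_IS_SWAPPED	(1u << 3)

/* Hypothetical container for the four masks. */
struct scan_masks {
	uint32_t required_mask;	/* all of these flags must be set */
	uint32_t anyof_mask;	/* at least one of these, if non-zero */
	uint32_t excluded_mask;	/* none of these may be set */
	uint32_t return_mask;	/* flags reported back to the caller */
};

/* Does a page with the given flags qualify for the output? */
bool page_matches(uint32_t flags, const struct scan_masks *m)
{
	if ((flags & m->required_mask) != m->required_mask)
		return false;
	if (m->anyof_mask && !(flags & m->anyof_mask))
		return false;
	if (flags & m->excluded_mask)
		return false;
	return true;
}

/* Flags actually reported for a matching page. */
uint32_t page_report(uint32_t flags, const struct scan_masks *m)
{
	return flags & m->return_mask;
}
```

For example, with required = PAGE_IS_WRITTEN and excluded = PAGE_IS_FILE, a written anonymous page matches while a written file-backed page does not.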
The IOCTL returns the addresses of the pages which match the specified masks. The page addresses are returned in struct page_region in a compact form. max_pages supports the use case where the user only wants a specific number of pages, so there is no need to find all the pages of interest in the range when max_pages is specified: the IOCTL returns as soon as the maximum number of pages has been found. max_pages is optional. If max_pages is specified, it must be greater than or equal to vec_size. This restriction is needed to handle the worst case, in which each page_region contains the info of only one page and cannot be compacted. This is needed to emulate the Windows GetWriteWatch() API.
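The compaction and the max_pages cut-off described above could look roughly like the sketch below. The `struct region` layout and `compact_pages()` are hypothetical stand-ins, not the struct page_region or the walker from this series; a per-page flags value of 0 is taken to mean "not of interest".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page_region. */
struct region {
	uint64_t start;	/* address of the first page */
	uint64_t len;	/* number of pages in the region */
	uint32_t flags;	/* flags shared by all pages in the region */
};

/*
 * Merge runs of adjacent pages with identical flags into regions.
 * Stops when vec is full or when max_pages pages have been reported
 * (max_pages == 0 means no limit). Returns the number of regions used.
 */
size_t compact_pages(const uint32_t *flags, size_t npages,
		     uint64_t base, uint64_t page_size,
		     struct region *vec, size_t vec_size, size_t max_pages)
{
	size_t n = 0, found = 0;

	for (size_t i = 0; i < npages; i++) {
		uint64_t addr = base + i * page_size;

		if (!flags[i])
			continue;
		if (max_pages && found == max_pages)
			break;

		if (n && vec[n - 1].flags == flags[i] &&
		    vec[n - 1].start + vec[n - 1].len * page_size == addr) {
			vec[n - 1].len++;	/* extends the previous run */
		} else {
			if (n == vec_size)
				break;		/* no room for a new region */
			vec[n].start = addr;
			vec[n].len = 1;
			vec[n].flags = flags[i];
			n++;
		}
		found++;
	}
	return n;
}
```

In the worst case every reported page has different flags from, or is not adjacent to, its predecessor, so each region holds exactly one page.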
The patch series includes a detailed selftest which can be used as an example for the uffd async wp test and PAGEMAP_IOCTL. It shows the interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora....
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.c...
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.c...
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards, Muhammad Usama Anjum
Muhammad Usama Anjum (6):
  userfaultfd: Add UFFD WP Async support
  userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC
  fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs
  tools headers UAPI: Update linux/fs.h with the kernel sources
  mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
  selftests: vm: add pagemap ioctl tests
 Documentation/admin-guide/mm/pagemap.rst     |  24 +
 Documentation/admin-guide/mm/userfaultfd.rst |   7 +
 fs/proc/task_mmu.c                           | 290 ++++++
 fs/userfaultfd.c                             |  20 +-
 include/linux/userfaultfd_k.h                |  11 +
 include/uapi/linux/fs.h                      |  50 ++
 include/uapi/linux/userfaultfd.h             |  10 +-
 mm/memory.c                                  |  23 +-
 tools/include/uapi/linux/fs.h                |  50 ++
 tools/testing/selftests/vm/.gitignore        |   1 +
 tools/testing/selftests/vm/Makefile          |   5 +-
 tools/testing/selftests/vm/pagemap_ioctl.c   | 881 +++++++++++++++++++
 12 files changed, 1364 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
Add a new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page faults on its own. It can be used to track which pages have been written to since the pages were write-protected. It is a very efficient way to track changes, as uffd is by nature pte/pmd based.
UFFD synchronous WP sends the page faults to userspace, where the pages which have been written to can be tracked, but that is not efficient. This is why this asynchronous version is being added. After enabling WP Async, the pages which have been written to can be found in the pagemap file, or the information can be obtained from the PAGEMAP_IOCTL.
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v10:
- Build fix
- Update comments and add error condition to return error from uffd register if hugetlb pages are present when wp async flag is set

Changes in v9:
- Correct the fault resolution with code contributed by Peter

Changes in v7:
- Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
- Handle automatic page fault resolution in better way (thanks to Peter)
update to wp async
uffd wp async
---
 fs/userfaultfd.c                 | 20 ++++++++++++++++++--
 include/linux/userfaultfd_k.h    | 11 +++++++++++
 include/uapi/linux/userfaultfd.h | 10 +++++++++-
 mm/memory.c                      | 23 ++++++++++++++++++++---
 4 files changed, 58 insertions(+), 6 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 15a5bf765d43..422f2530c63e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			goto out_unlock;
 
 		/*
-		 * Note vmas containing huge pages
+		 * Note vmas containing huge pages. Hugetlb isn't supported
+		 * with UFFD_FEATURE_WP_ASYNC.
 		 */
-		if (is_vm_hugetlb_page(cur))
+		if (is_vm_hugetlb_page(cur)) {
+			if (ctx->features & UFFD_FEATURE_WP_ASYNC)
+				goto out_unlock;
+
 			basic_ioctls = true;
+		}
 
 		found = true;
 	}
@@ -1867,6 +1872,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
 
+	/* The unprotection is not supported if in async WP mode */
+	if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
+		return -EINVAL;
+
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
@@ -1950,6 +1959,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+int userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
+
+	return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 9df0b9a762cc..38c92c2beb16 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
 				  unsigned long end, struct list_head *uf);
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
+extern int userfaultfd_wp_async(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
@@ -189,6 +190,11 @@ static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
 	return VM_FAULT_SIGBUS;
 }
 
+static inline void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
+				 unsigned long start, unsigned long len, bool enable_wp)
+{
+}
+
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
 					struct vm_userfaultfd_ctx vm_ctx)
 {
@@ -274,6 +280,11 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline int userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..30a6f32cf564 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
-			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
+			   UFFD_FEATURE_WP_ASYNC)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -203,6 +204,12 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
 	 * write-protection mode is supported on both shmem and hugetlbfs.
+	 *
+	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
+	 * asynchronous mode is supported in which the write fault is automatically
+	 * resolved and write-protection is un-set. It only supports anon and shmem
+	 * (hugetlb isn't supported). It only takes effect when a vma is registered
+	 * with write-protection mode. Otherwise the flag is ignored.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -217,6 +224,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_WP_ASYNC			(1<<13)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/memory.c b/mm/memory.c
index 4000e9f017e0..75331fbf7cb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,8 +3351,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 
 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
-			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			return handle_userfault(vmf, VM_UFFD_WP);
+			if (userfaultfd_wp_async(vma)) {
+				/*
+				 * Nothing needed (cache flush, TLB invalidations,
+				 * etc.) because we're only removing the uffd-wp bit,
+				 * which is completely invisible to the user.
+				 */
+				pte_t pte = pte_clear_uffd_wp(*vmf->pte);
+
+				set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
+				/* Update this to be prepared for following up CoW handling */
+				vmf->orig_pte = pte;
+			} else {
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				return handle_userfault(vmf, VM_UFFD_WP);
+			}
 		}
 
 		/*
@@ -4812,8 +4825,11 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 
 	if (vma_is_anonymous(vmf->vma)) {
 		if (likely(!unshare) &&
-		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
+		    userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
+			if (userfaultfd_wp_async(vmf->vma))
+				goto split;
 			return handle_userfault(vmf, VM_UFFD_WP);
+		}
 		return do_huge_pmd_wp_page(vmf);
 	}
 
@@ -4825,6 +4841,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 		}
 	}
 
+split:
 	/* COW or write-notify handled on pte level: split pmd. */
 	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
On Thu, Feb 02, 2023 at 04:29:10PM +0500, Muhammad Usama Anjum wrote:
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 15a5bf765d43..422f2530c63e 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  			goto out_unlock;
>
>  		/*
> -		 * Note vmas containing huge pages
> +		 * Note vmas containing huge pages. Hugetlb isn't supported
> +		 * with UFFD_FEATURE_WP_ASYNC.
>  		 */

Need to set "ret = -EINVAL;" here. Or..

> -		if (is_vm_hugetlb_page(cur))
> +		if (is_vm_hugetlb_page(cur)) {
> +			if (ctx->features & UFFD_FEATURE_WP_ASYNC)
> +				goto out_unlock;

.. it'll return -EBUSY, which does not sound like the right errcode here.

> +

Drop this empty line?

>  			basic_ioctls = true;
> +		}
>
>  		found = true;
>  	}
Other than that looks good, thanks.
Hi Peter,
Thank you so much for reviewing!
On 2/9/23 2:12 AM, Peter Xu wrote:
> Need to set "ret = -EINVAL;" here. Or..

Will fix in next version.

> .. it'll return -EBUSY, which does not sound like the right errcode here.
>
> Drop this empty line?
>
> Other than that looks good, thanks.
Thank you so much! This wouldn't have been possible without your help.
Hi Muhammad,
On Thu, Feb 02, 2023 at 04:29:10PM +0500, Muhammad Usama Anjum wrote:
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 005e5e306266..30a6f32cf564 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -203,6 +204,12 @@ struct uffdio_api {
>  	 *
>  	 * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
>  	 * write-protection mode is supported on both shmem and hugetlbfs.
> +	 *
> +	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
> +	 * asynchronous mode is supported in which the write fault is automatically
> +	 * resolved and write-protection is un-set. It only supports anon and shmem
> +	 * (hugetlb isn't supported). It only takes effect when a vma is registered
> +	 * with write-protection mode. Otherwise the flag is ignored.
>  	 */

Most of mm/ adheres the 80-character limits. Please make your changes to follow it as well.

> diff --git a/mm/memory.c b/mm/memory.c
> index 4000e9f017e0..75331fbf7cb4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3351,8 +3351,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>
>  	if (likely(!unshare)) {
>  		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> -			pte_unmap_unlock(vmf->pte, vmf->ptl);
> -			return handle_userfault(vmf, VM_UFFD_WP);
> +			if (userfaultfd_wp_async(vma)) {
> +				/*
> +				 * Nothing needed (cache flush, TLB invalidations,
> +				 * etc.) because we're only removing the uffd-wp bit,
> +				 * which is completely invisible to the user.
> +				 */
> +				pte_t pte = pte_clear_uffd_wp(*vmf->pte);
> +
> +				set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> +				/* Update this to be prepared for following up CoW handling */
> +				vmf->orig_pte = pte;
> +			} else {
> +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> +				return handle_userfault(vmf, VM_UFFD_WP);
> +			}

You can revert the condition here and reduce the nesting:

	if (!userfaultfd_wp_async(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}

	/* handle async WP */

>  		}
Hi Mike,
Thanks for reviewing.
On 2/17/23 2:37 PM, Mike Rapoport wrote:
> Most of mm/ adheres the 80-character limits. Please make your changes to
> follow it as well.

Will update in next version.

> You can revert the condition here and reduce the nesting:
>
> 	if (!userfaultfd_wp_async(vma)) {
> 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> 		return handle_userfault(vmf, VM_UFFD_WP);
> 	}
>
> 	/* handle async WP */

I'll update in next version.
Explain the difference created by UFFD_FEATURE_WP_ASYNC to the write protection (UFFDIO_WRITEPROTECT_MODE_WP) mode.
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
 Documentation/admin-guide/mm/userfaultfd.rst | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 83f31919ebb3..4747e7bd5b26 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -221,6 +221,13 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
 you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
 used.
 
+If ``UFFD_FEATURE_WP_ASYNC`` is set while calling ``UFFDIO_API`` ioctl, the
+behaviour of ``UFFDIO_WRITEPROTECT_MODE_WP`` changes such that faults for
+anon and shmem are resolved automatically by the kernel instead of sending
+the message to the userfaultfd. The hugetlb isn't supported. The ``pagemap``
+file can be read to find which pages have ``PM_UFFD_WP`` flag set which
+means they are write-protected.
+
 QEMU/KVM
 ========
On Thu, Feb 02, 2023 at 04:29:11PM +0500, Muhammad Usama Anjum wrote:
Explain the difference created by UFFD_FEATURE_WP_ASYNC to the write protection (UFFDIO_WRITEPROTECT_MODE_WP) mode.
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Documentation/admin-guide/mm/userfaultfd.rst | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 83f31919ebb3..4747e7bd5b26 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -221,6 +221,13 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was used. +If ``UFFD_FEATURE_WP_ASYNC`` is set while calling ``UFFDIO_API`` ioctl, the +behaviour of ``UFFDIO_WRITEPROTECT_MODE_WP`` changes such that faults for
UFFDIO_WRITEPROTECT_MODE_WP is only a flag of UFFDIO_WRITEPROTECT; in async mode the ioctl is forbidden only when that flag is not specified.
+anon and shmem are resolved automatically by the kernel instead of sending +the message to the userfaultfd. The hugetlb isn't supported. The ``pagemap`` +file can be read to find which pages have ``PM_UFFD_WP`` flag set which +means they are write-protected.
Here's my version. Please feel free to do modifications on top.
If the userfaultfd context that a range is registered against with ``UFFDIO_REGISTER_MODE_WP`` has the ``UFFD_FEATURE_WP_ASYNC`` feature enabled, it works in async write-protection mode. It can be seen as a more accurate version of soft-dirty tracking whose results are not easily affected by other operations such as vma merging.
Compared to the generic mode, the async mode does not generate any userfaultfd message when the protected memory range is written. Instead, the kernel resolves the page fault immediately by dropping the uffd-wp bit in the page tables. The user application can collect the "written/dirty" status by looking up the uffd-wp bit of the pages of interest in /proc/<pid>/pagemap.
A page is not tracked by uffd-wp async mode until it is explicitly write-protected by the ``UFFDIO_WRITEPROTECT`` ioctl with the mode flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page fault that was tracked by async mode userfaultfd-wp is invalid.
Currently ``UFFD_FEATURE_WP_ASYNC`` only supports anonymous and shmem memory. Hugetlb is not yet supported.
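To make the "look up the uffd-wp bit" step concrete, here is a minimal sketch of decoding one 64-bit pagemap entry. The helper names are hypothetical; the bit positions (57 for uffd-wp, 63 for present) are the ones documented for pagemap entries, and treating a cleared uffd-wp bit as "written" assumes the page was write-protected beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of a 64-bit pagemap entry (documented pagemap layout). */
#define PM_UFFD_WP	(1ULL << 57)
#define PM_PRESENT	(1ULL << 63)

/* Byte offset of the entry describing vaddr inside the pagemap file. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return vaddr / page_size * sizeof(uint64_t);
}

/* True while the page is still write-protected by userfaultfd. */
bool pagemap_entry_is_wp(uint64_t entry)
{
	return (entry & PM_UFFD_WP) != 0;
}

/*
 * In async mode the kernel drops the uffd-wp bit on the first write, so a
 * previously protected page whose bit is now clear has been written to.
 * Assumption: the caller write-protected the page earlier.
 */
bool pagemap_entry_written(uint64_t entry)
{
	return !pagemap_entry_is_wp(entry);
}
```

A caller would read 8 bytes at pagemap_offset() from the pagemap file and pass them to pagemap_entry_written().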
On 2/9/23 2:31 AM, Peter Xu wrote:
Here's my version. Please feel free to do modifications on top.
I'll replace the documentation with this and add a Suggested-by tag as well. Thanks.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
To get information about which pages have been written-to and/or to write-protect the pages, the following steps must be performed first, in order:
- The userfaultfd file descriptor is created with the userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by the UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through the UFFDIO_REGISTER IOCTL.
Then any part of the registered memory, or the whole memory region, can be write-protected using the UFFDIO_WRITEPROTECT IOCTL or the PAGEMAP_SCAN IOCTL.
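The setup steps above can be sketched from userspace roughly as follows. This is only an illustration, not code from the series: uffd_wp_async_setup() is a hypothetical helper, and it assumes the installed <linux/userfaultfd.h> already defines UFFD_FEATURE_WP_ASYNC (if it does not, the helper simply reports failure instead of guessing the bit value).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns a userfaultfd descriptor with
 * [addr, addr + len) registered in WP mode, async WP enabled and the
 * range write-protected, or -1 on any failure.
 */
int uffd_wp_async_setup(void *addr, size_t len)
{
#ifndef UFFD_FEATURE_WP_ASYNC
	(void)addr;
	(void)len;
	return -1;	/* headers predate the feature */
#else
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	int fd;

	assert(addr && len);

	/* Step 1: create the userfaultfd file descriptor. */
	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;

	/* Steps 2 and 3: enable async WP, register the range, then protect it. */
	if (ioctl(fd, UFFDIO_API, &api) ||
	    ioctl(fd, UFFDIO_REGISTER, &reg) ||
	    ioctl(fd, UFFDIO_WRITEPROTECT, &wp)) {
		close(fd);
		return -1;
	}
	return fd;
#endif
}
```

The range has to be page aligned. On kernels or configurations without the feature (or without permission to use userfaultfd) the helper returns -1.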
struct pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer, an array of struct page_region, and its size are specified as vec and vec_len.
- The optional maximum number of requested pages is specified in max_pages.
- The flags can be specified in the flags field. PAGEMAP_WP_ENGAGE is the only flag added at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
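How the four masks interact may be easier to see in a small userspace model. The sketch below mirrors the filtering done in pagemap_scan_output() in this patch; struct scan_masks and scan_filter() are hypothetical names used only for illustration, and the PAGE_IS_* values are copied from the UAPI hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values copied from the UAPI hunk of this patch. */
#define PAGE_IS_WRITTEN	(1 << 0)
#define PAGE_IS_FILE	(1 << 1)
#define PAGE_IS_PRESENT	(1 << 2)
#define PAGE_IS_SWAPPED	(1 << 3)

/* Hypothetical holder for the four masks of struct pagemap_scan_arg. */
struct scan_masks {
	uint64_t required_mask;
	uint64_t anyof_mask;
	uint64_t excluded_mask;
	uint64_t return_mask;
};

/*
 * Returns the bits that would be reported for a page whose state is cur,
 * or 0 if the page is filtered out. An empty mask imposes no constraint.
 */
uint64_t scan_filter(const struct scan_masks *m, uint64_t cur)
{
	assert(m != NULL);

	/* All required bits must be set. */
	if (m->required_mask && (m->required_mask & cur) != m->required_mask)
		return 0;
	/* At least one of the anyof bits must be set. */
	if (m->anyof_mask && !(m->anyof_mask & cur))
		return 0;
	/* None of the excluded bits may be set. */
	if (m->excluded_mask && (m->excluded_mask & cur))
		return 0;
	/* Only the bits selected by return_mask are reported. */
	return cur & m->return_mask;
}
```

For example, required_mask = PAGE_IS_WRITTEN with excluded_mask = PAGE_IS_FILE selects written anonymous pages and skips written file-backed ones.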
This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi.
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Changes in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  50 +++++++
 2 files changed, 340 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..c6bde19d63d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>

 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif

+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
+	    (pte_swp_uffd_wp_any(pte)))
+		return true;
+	return false;
+}
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
+		return true;
+	return false;
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }

+#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
+#define IS_GET_OP(a)			(a->vec)
+#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
+
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a)				\
+	((a->required_mask & PAGE_IS_WRITTEN) ||	\
+	 (a->anyof_mask & PAGE_IS_WRITTEN))
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region prev;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
+		return -EPERM;
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+	return 0;
+}
+
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+				      struct pagemap_scan_private *p, unsigned long addr,
+				      unsigned int len)
+{
+	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
+	bool cpy = true;
+	struct page_region *prev = &p->prev;
+
+	if (HAS_NO_SPACE(p))
+		return -ENOSPC;
+
+	if (p->max_pages && p->found_pages + len >= p->max_pages)
+		len = p->max_pages - p->found_pages;
+	if (!len)
+		return -EINVAL;
+
+	if (p->required_mask)
+		cpy = ((p->required_mask & cur) == p->required_mask);
+	if (cpy && p->anyof_mask)
+		cpy = (p->anyof_mask & cur);
+	if (cpy && p->excluded_mask)
+		cpy = !(p->excluded_mask & cur);
+	bitmap = cur & p->return_mask;
+	if (cpy && bitmap) {
+		if ((prev->len) && (prev->bitmap == bitmap) &&
+		    (prev->start + prev->len * PAGE_SIZE == addr)) {
+			prev->len += len;
+			p->found_pages += len;
+		} else if (p->vec_index < p->vec_len) {
+			if (prev->len) {
+				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
+				p->vec_index++;
+			}
+			prev->start = addr;
+			prev->len = len;
+			prev->bitmap = bitmap;
+			p->found_pages += len;
+		} else {
+			return -ENOSPC;
+		}
+	}
+	return 0;
+}
+
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
+				     unsigned long *vec_index)
+{
+	struct page_region *prev = &p->prev;
+
+	if (prev->len) {
+		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
+			return -EFAULT;
+		p->vec_index++;
+		(*vec_index)++;
+		prev->len = 0;
+	}
+	return 0;
+}
+
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+					 unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		bool pmd_wt;
+
+		pmd_wt = !is_pmd_uffd_wp(*pmd);
+		/*
+		 * Break huge page into small pages if operation needs to be performed is
+		 * on a portion of the huge page.
+		 */
+		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+		if (IS_GET_OP(p))
+			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
+						  is_swap_pmd(*pmd), p, start,
+						  (end - start)/PAGE_SIZE);
+		spin_unlock(ptl);
+		if (!ret) {
+			if (pmd_wt && IS_WP_ENGAGE_OP(p))
+				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
+		}
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	if (IS_GET_OP(p)) {
+		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
+			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
+						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
+			if (ret)
+				break;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
+		uffd_wp_range(walk->mm, vma, start, addr - start, true);
+
+	cond_resched();
+	return ret;
+}
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
+				 struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (vma)
+		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+					  (end - addr)/PAGE_SIZE);
+	return ret;
+}
+
+/* No hugetlb support is present. */
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+};
+
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
+{
+	unsigned long empty_slots, vec_index = 0;
+	unsigned long __user start, end;
+	unsigned long __start, __end;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	int ret = 0;
+
+	start = (unsigned long)untagged_addr(arg->start);
+	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
+
+	/* Validate memory ranges */
+	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
+		return -EINVAL;
+	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
+	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
+		return -EINVAL;
+
+	/* Detect illegal flags and masks */
+	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
+	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
+	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
+		return -EINVAL;
+	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
+				!arg->return_mask))
+		return -EINVAL;
+	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
+	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
+	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
+		return -EINVAL;
+
+	end = start + arg->len;
+	p.max_pages = arg->max_pages;
+	p.found_pages = 0;
+	p.flags = arg->flags;
+	p.required_mask = arg->required_mask;
+	p.anyof_mask = arg->anyof_mask;
+	p.excluded_mask = arg->excluded_mask;
+	p.return_mask = arg->return_mask;
+	p.prev.len = 0;
+	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+	if (IS_GET_OP(arg)) {
+		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
+		if (!p.vec)
+			return -ENOMEM;
+	} else {
+		p.vec = NULL;
+	}
+	__start = __end = start;
+	while (!ret && __end < end) {
+		p.vec_index = 0;
+		empty_slots = arg->vec_len - vec_index;
+		if (p.vec_len > empty_slots)
+			p.vec_len = empty_slots;
+
+		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (__end > end)
+			__end = end;
+
+		mmap_read_lock(mm);
+		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+		if (!(!ret || ret == -ENOSPC))
+			goto free_data;
+
+		__start = __end;
+		if (IS_GET_OP(arg) && p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index * sizeof(struct page_region))) {
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+	ret = export_prev_to_out(&p, vec, &vec_index);
+	if (!ret)
+		ret = vec_index;
+free_data:
+	if (IS_GET_OP(arg))
+		kfree(p.vec);
+
+	return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+	struct pagemap_scan_arg argument;
+
+	if (cmd == PAGEMAP_SCAN) {
+		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
+			return -EFAULT;
+		return do_pagemap_cmd(mm, &argument);
+	}
+	return -EINVAL;
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)

+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
+
 #endif /* _UAPI_LINUX_FS_H */
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{
- if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return true;
- return false;
Sorry I should have mentioned this earlier: you can directly return here.
return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte);
+}
+static inline bool is_pmd_uffd_wp(pmd_t pmd) +{
- if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
return true;
- return false;
Same here.
+}
#if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) return 0; } +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) +#define IS_GET_OP(a) (a->vec) +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
- (wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a) \
- ((a->required_mask & PAGE_IS_WRITTEN) || \
(a->anyof_mask & PAGE_IS_WRITTEN))
+struct pagemap_scan_private {
- struct page_region *vec;
- struct page_region prev;
- unsigned long vec_len, vec_index;
- unsigned int max_pages, found_pages, flags;
- unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) +{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
Should this be:
(IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma)))
Instead?
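The difference between the two conditions can be checked with a small truth-table sketch. The two booleans stand in for userfaultfd_wp(vma) and userfaultfd_wp_async(vma), IS_WT_REQUIRED(p) is assumed to be true, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as posted: reject only when neither property holds. */
bool reject_posted(bool wp, bool wp_async)
{
	return !wp && !wp_async;
}

/* Condition as suggested in review: reject unless both properties hold. */
bool reject_suggested(bool wp, bool wp_async)
{
	return !wp || !wp_async;
}

/* Number of the four input combinations on which the two forms disagree. */
int count_differences(void)
{
	int n = 0;

	for (int wp = 0; wp <= 1; wp++)
		for (int wp_async = 0; wp_async <= 1; wp_async++)
			if (reject_posted(wp, wp_async) != reject_suggested(wp, wp_async))
				n++;
	return n;
}

/* Both forms agree when neither or both properties hold. */
void check_agreement(void)
{
	assert(reject_posted(false, false) && reject_suggested(false, false));
	assert(!reject_posted(true, true) && !reject_suggested(true, true));
}
```

The two forms disagree exactly when only one of the two properties holds: the posted condition lets such a VMA through, the suggested one returns -EPERM.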
return -EPERM;
- if (vma->vm_flags & VM_PFNMAP)
return 1;
- return 0;
+}
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
struct pagemap_scan_private *p, unsigned long addr,
unsigned int len)
+{
- unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
- bool cpy = true;
- struct page_region *prev = &p->prev;
Nit: switch the above two lines?
- if (HAS_NO_SPACE(p))
return -ENOSPC;
- if (p->max_pages && p->found_pages + len >= p->max_pages)
len = p->max_pages - p->found_pages;
If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC?
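For reference, a userspace model of the first few checks of pagemap_scan_output() as posted, to show when each error code is produced. accept_pages() is a hypothetical name, and the model assumes the caller adds the returned count to found_pages, as the patch does.

```c
#include <assert.h>
#include <errno.h>

/*
 * Returns the number of pages accepted for this call, or a negative errno.
 * max_pages == 0 means "no limit", as in the patch.
 */
long accept_pages(unsigned int max_pages, unsigned int found_pages, unsigned int len)
{
	/* The budget can never be exceeded if callers add the returned count. */
	assert(!max_pages || found_pages <= max_pages);

	if (max_pages && found_pages == max_pages)
		return -ENOSPC;			/* HAS_NO_SPACE() */
	if (max_pages && found_pages + len >= max_pages)
		len = max_pages - found_pages;	/* clamp to the remaining budget */
	if (!len)
		return -EINVAL;
	return len;
}
```

With max_pages = 10 and found_pages = 8, a request for 5 pages is clamped to 2 and succeeds. In this model -ENOSPC only shows up on the following call, once found_pages has reached max_pages.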
- if (!len)
return -EINVAL;
- if (p->required_mask)
cpy = ((p->required_mask & cur) == p->required_mask);
- if (cpy && p->anyof_mask)
cpy = (p->anyof_mask & cur);
- if (cpy && p->excluded_mask)
cpy = !(p->excluded_mask & cur);
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
IIUC you can have:
int pagemap_scan_deposit(struct pagemap_scan_private *p)
{
	struct page_region *prev = &p->prev;

	if (p->vec_index >= p->vec_len)
		return -ENOSPC;

	if (prev->len) {
		memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
		p->vec_index++;
	}

	return 0;
}
Then call it here. I think it can also be called below to replace export_prev_to_out().
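A userspace model of the idea, in case it helps: one pending region is kept in prev, extended while pages are contiguous and carry the same bits, and deposited into the output array otherwise. All names and the 4 KiB page size are assumptions for illustration. Unlike the snippet above, this scan_deposit() checks for a pending region before checking for space and clears prev after storing it, so the same helper can also do the final flush.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

struct page_region {
	uint64_t start, len, bitmap;
};

/* Hypothetical stand-in for the relevant part of pagemap_scan_private. */
struct scan_state {
	struct page_region *vec;	/* output array */
	size_t vec_len, vec_index;
	struct page_region prev;	/* pending region, not yet stored */
};

/* Store the pending region, if any, into the output array. */
int scan_deposit(struct scan_state *p)
{
	assert(p != NULL);

	if (!p->prev.len)
		return 0;
	if (p->vec_index >= p->vec_len)
		return -ENOSPC;
	p->vec[p->vec_index++] = p->prev;
	p->prev.len = 0;
	return 0;
}

/* Add npages starting at addr; merge with the pending region when possible. */
int scan_add(struct scan_state *p, uint64_t addr, uint64_t npages, uint64_t bitmap)
{
	struct page_region *prev = &p->prev;
	int ret;

	if (prev->len && prev->bitmap == bitmap &&
	    prev->start + prev->len * MODEL_PAGE_SIZE == addr) {
		prev->len += npages;
		return 0;
	}

	ret = scan_deposit(p);
	if (ret)
		return ret;

	prev->start = addr;
	prev->len = npages;
	prev->bitmap = bitmap;
	return 0;
}
```

A caller feeds pages through scan_add() and calls scan_deposit() once more at the end to flush the last pending region, which is the role export_prev_to_out() plays in the patch.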
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
} else {
return -ENOSPC;
}
- }
- return 0;
+}
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
unsigned long *vec_index)
+{
- struct page_region *prev = &p->prev;
- if (prev->len) {
if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
return -EFAULT;
p->vec_index++;
(*vec_index)++;
prev->len = 0;
- }
- return 0;
+}
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- unsigned long addr = end;
This assignment is useless?
- spinlock_t *ptl;
- int ret = 0;
- pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
bool pmd_wt;
pmd_wt = !is_pmd_uffd_wp(*pmd);
/*
* Break huge page into small pages if operation needs to be performed is
* on a portion of the huge page.
*/
if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, start);
goto process_smaller_pages;
}
if (IS_GET_OP(p))
ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
is_swap_pmd(*pmd), p, start,
(end - start)/PAGE_SIZE);
spin_unlock(ptl);
if (!ret) {
if (pmd_wt && IS_WP_ENGAGE_OP(p))
uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
}
return ret;
- }
+process_smaller_pages:
- if (pmd_trans_unstable(pmd))
return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
- if (IS_GET_OP(p)) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
if (ret)
break;
}
- }
- pte_unmap_unlock(pte - 1, ptl);
- if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
uffd_wp_range(walk->mm, vma, start, addr - start, true);
- cond_resched();
- return ret;
+}
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- int ret = 0;
- if (vma)
ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
(end - addr)/PAGE_SIZE);
- return ret;
+}
+/* No hugetlb support is present. */ +static const struct mm_walk_ops pagemap_scan_ops = {
- .test_walk = pagemap_scan_test_walk,
- .pmd_entry = pagemap_scan_pmd_entry,
- .pte_hole = pagemap_scan_pte_hole,
+};
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) +{
- unsigned long empty_slots, vec_index = 0;
- unsigned long __user start, end;
- unsigned long __start, __end;
- struct page_region __user *vec;
- struct pagemap_scan_private p;
- int ret = 0;
- start = (unsigned long)untagged_addr(arg->start);
- vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
- /* Validate memory ranges */
- if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
return -EINVAL;
- if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
(!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
return -EINVAL;
- /* Detect illegal flags and masks */
- if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
(arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
(arg->return_mask & ~PAGEMAP_BITS_ALL))
return -EINVAL;
- if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
!arg->return_mask))
return -EINVAL;
- /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
- if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
(arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
return -EINVAL;
I think you said you'd clean this up a bit, but that doesn't seem to have happened.
- end = start + arg->len;
- p.max_pages = arg->max_pages;
- p.found_pages = 0;
- p.flags = arg->flags;
- p.required_mask = arg->required_mask;
- p.anyof_mask = arg->anyof_mask;
- p.excluded_mask = arg->excluded_mask;
- p.return_mask = arg->return_mask;
- p.prev.len = 0;
- p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
- if (IS_GET_OP(arg)) {
p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
if (!p.vec)
return -ENOMEM;
- } else {
p.vec = NULL;
- }
- __start = __end = start;
- while (!ret && __end < end) {
p.vec_index = 0;
empty_slots = arg->vec_len - vec_index;
if (p.vec_len > empty_slots)
p.vec_len = empty_slots;
__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
if (__end > end)
__end = end;
mmap_read_lock(mm);
ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
mmap_read_unlock(mm);
if (!(!ret || ret == -ENOSPC))
goto free_data;
__start = __end;
if (IS_GET_OP(arg) && p.vec_index) {
if (copy_to_user(&vec[vec_index], p.vec,
p.vec_index * sizeof(struct page_region))) {
ret = -EFAULT;
goto free_data;
}
vec_index += p.vec_index;
}
I think you can move copy_to_user() to outside the loop, then call pagemap_scan_deposit() before copy_to_user(), then I think you can drop the ugly export_prev_to_out()..
- }
- ret = export_prev_to_out(&p, vec, &vec_index);
- if (!ret)
ret = vec_index;
+free_data:
- if (IS_GET_OP(arg))
kfree(p.vec);
- return ret;
+}
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{
- struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
- struct mm_struct *mm = file->private_data;
- struct pagemap_scan_arg argument;
- if (cmd == PAGEMAP_SCAN) {
if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
return -EFAULT;
return do_pagemap_cmd(mm, &argument);
- }
- return -EINVAL;
+}
const struct file_operations proc_pagemap_operations = { .llseek = mem_lseek, /* borrow this */ .read = pagemap_read, .open = pagemap_open, .release = pagemap_release,
- .unlocked_ioctl = pagemap_scan_ioctl,
- .compat_ioctl = pagemap_scan_ioctl,
}; #endif /* CONFIG_PROC_PAGE_MONITOR */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index b7b56871029c..1ae9a8684b48 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND) +/* Pagemap ioctl */ +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ +#define PAGE_IS_WRITTEN (1 << 0) +#define PAGE_IS_FILE (1 << 1) +#define PAGE_IS_PRESENT (1 << 2) +#define PAGE_IS_SWAPPED (1 << 3)
+/*
- struct page_region - Page region with bitmap flags
- @start: Start of the region
- @len: Length of the region
- @bitmap: Bits set for the region
- */
+struct page_region {
- __u64 start;
- __u64 len;
- __u64 bitmap;
+};
+/*
- struct pagemap_scan_arg - Pagemap ioctl argument
- @start: Starting address of the region
- @len: Length of the region (All the pages in this length are included)
- @vec: Address of page_region struct array for output
- @vec_len: Length of the page_region struct array
- @max_pages: Optional max return pages
- @flags: Flags for the IOCTL
- @required_mask: Required mask - All of these bits have to be set in the PTE
- @anyof_mask: Any mask - Any of these bits are set in the PTE
- @excluded_mask: Exclude mask - None of these bits are set in the PTE
- @return_mask: Bits that are to be reported in page_region
- */
+struct pagemap_scan_arg {
- __u64 start;
- __u64 len;
- __u64 vec;
- __u64 vec_len;
- __u32 max_pages;
- __u32 flags;
- __u64 required_mask;
- __u64 anyof_mask;
- __u64 excluded_mask;
- __u64 return_mask;
+};
+/* Special flags */ +#define PAGEMAP_WP_ENGAGE (1 << 0)
#endif /* _UAPI_LINUX_FS_H */
2.30.2
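For orientation, here is a hedged sketch of how a userspace caller might fill the argument struct for the atomic "get written-to pages and write-protect them again" operation. The struct and macro definitions are local copies of what this patch revision proposes, with uint64_t/uint32_t standing in for __u64/__u32; they are not taken from a released header. make_get_and_clear_written() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of definitions proposed in this patch revision. */
#define PAGE_IS_WRITTEN   (1 << 0)
#define PAGEMAP_WP_ENGAGE (1 << 0)

struct page_region {
	uint64_t start;
	uint64_t len;
	uint64_t bitmap;
};

struct pagemap_scan_arg {
	uint64_t start;
	uint64_t len;
	uint64_t vec;
	uint64_t vec_len;
	uint32_t max_pages;
	uint32_t flags;
	uint64_t required_mask;
	uint64_t anyof_mask;
	uint64_t excluded_mask;
	uint64_t return_mask;
};

/* Hypothetical helper: report written-to pages and re-protect them. */
static struct pagemap_scan_arg
make_get_and_clear_written(void *start, uint64_t len,
			   struct page_region *vec, uint64_t vec_len)
{
	struct pagemap_scan_arg arg;

	assert(vec != NULL && vec_len > 0);
	memset(&arg, 0, sizeof(arg));
	arg.start = (uint64_t)(uintptr_t)start;
	arg.len = len;
	arg.vec = (uint64_t)(uintptr_t)vec;
	arg.vec_len = vec_len;
	arg.max_pages = 0;		/* 0 leaves the page count unlimited */
	arg.flags = PAGEMAP_WP_ENGAGE;
	arg.required_mask = PAGE_IS_WRITTEN;
	arg.return_mask = PAGE_IS_WRITTEN;
	return arg;
}
```

On a kernel carrying this series, the result would presumably be passed to ioctl() on an open pagemap file descriptor with the PAGEMAP_SCAN command. Going by the code above, a non-negative return value is the number of page_region entries filled.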
On 2/9/23 3:15 AM, Peter Xu wrote:
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
To get information about which pages have been written-to and/or to write-protect the pages, the following must be performed first, in order:
- The userfaultfd file descriptor is created with userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through UFFDIO_REGISTER IOCTL.
Then any part of the registered memory, or the whole memory region, can be write-protected using the UFFDIO_WRITEPROTECT IOCTL or the PAGEMAP_SCAN IOCTL.
struct pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer, an array of struct page_region, and its size are specified as vec and vec_len.
- The optional maximum number of requested pages is specified in max_pages.
- The flags can be specified in the flags field. PAGEMAP_WP_ENGAGE is the only flag added at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi.
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message
Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code
Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl
Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality
Changes in v5:
- Remove tlb flushing even for clear operation
Changes in v4:
- Update the interface and implementation
Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more error checking
Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 50 +++++++ 2 files changed, 340 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{
- if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return true;
- return false;
Sorry I should have mentioned this earlier: you can directly return here.
No problem at all. I'm replacing these two helper functions with the following in the next version so that !present pages don't show as dirty:
static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); }
static inline bool is_pmd_written(pmd_t pmd) { if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) return false; return (pmd_present(pmd) || is_swap_pmd(pmd)); }
return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte);
+}
+static inline bool is_pmd_uffd_wp(pmd_t pmd) +{
- if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
return true;
- return false;
Same here.
+}
#if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) return 0; } +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) +#define IS_GET_OP(a) (a->vec) +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
- (wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a) \
- ((a->required_mask & PAGE_IS_WRITTEN) || \
(a->anyof_mask & PAGE_IS_WRITTEN))
+struct pagemap_scan_private {
- struct page_region *vec;
- struct page_region prev;
- unsigned long vec_len, vec_index;
- unsigned int max_pages, found_pages, flags;
- unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) +{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
Should this be:
(IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma)))
Instead?
Correct. I'll fix this.
return -EPERM;
- if (vma->vm_flags & VM_PFNMAP)
return 1;
- return 0;
+}
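The difference between the posted condition in pagemap_scan_test_walk() and the one suggested in review is an application of De Morgan's laws. A small sketch, with plain booleans standing in for IS_WT_REQUIRED(), userfaultfd_wp() and userfaultfd_wp_async(); the function names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition as posted: reject only when neither property holds. */
static bool rejects_as_posted(bool wt_required, bool uffd_wp, bool wp_async)
{
	return wt_required && !uffd_wp && !wp_async;
}

/* Condition as suggested in review: reject unless both properties hold. */
static bool rejects_as_suggested(bool wt_required, bool uffd_wp, bool wp_async)
{
	return wt_required && (!uffd_wp || !wp_async);
}

/* Count the input combinations on which the two conditions differ. */
static int count_disagreements(void)
{
	int n = 0;

	for (int bits = 0; bits < 8; bits++) {
		bool wt = bits & 4, wp = bits & 2, async = bits & 1;

		if (rejects_as_posted(wt, wp, async) !=
		    rejects_as_suggested(wt, wp, async))
			n++;
	}
	return n;
}
```

The two forms differ exactly when written-to tracking is required and only one of the two userfaultfd properties holds. For example, a VMA registered for write-protect but without the async feature is accepted by the posted check and rejected by the suggested one.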
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
struct pagemap_scan_private *p, unsigned long addr,
unsigned int len)
+{
- unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
- bool cpy = true;
- struct page_region *prev = &p->prev;
Nit: switch the above two lines?
I'll fix this.
- if (HAS_NO_SPACE(p))
return -ENOSPC;
- if (p->max_pages && p->found_pages + len >= p->max_pages)
len = p->max_pages - p->found_pages;
If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC?
The length calculation happens in the functions that call this function. I'll move this out of here to make the logic clearer.
- if (!len)
return -EINVAL;
- if (p->required_mask)
cpy = ((p->required_mask & cur) == p->required_mask);
- if (cpy && p->anyof_mask)
cpy = (p->anyof_mask & cur);
- if (cpy && p->excluded_mask)
cpy = !(p->excluded_mask & cur);
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
IIUC you can have:
int pagemap_scan_deposit(p) { if (p->vec_index >= p->vec_len) return -ENOSPC;
if (p->prev->len) { memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); p->vec_index++; } return 0;
}
Then call it here. I think it can also be called below to replace export_prev_to_out().
No, this isn't possible. We fill up prev until the next range doesn't merge with it. At that point, we put prev into the output buffer and the new range is put into prev. Now that we have shifted to smaller page walks of <= 512 entries, we want to visit all ranges before finally putting prev into the output. Sorry for this somewhat complex method. The problem is that we want to merge consecutive matching regions into one entry in the output. To achieve this across multiple page walks, prev is used.
Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000, a length of 1024 pages, and all of the memory has been written. walk_page_range() will be called 2 times. In the first call, prev will be set with a length of 512. In the second call, prev will be updated to 1024, as the previous range stored in prev can be extended. After this, prev will be stored to the user output buffer, consuming only 1 struct page_region.
If we stored prev to output memory in every walk_page_range() call, we wouldn't get 1 struct page_region with length 1024. Instead we would get 2 page_region structs, each with half the length.
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
} else {
return -ENOSPC;
}
- }
- return 0;
+}
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
unsigned long *vec_index)
+{
- struct page_region *prev = &p->prev;
- if (prev->len) {
if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
return -EFAULT;
p->vec_index++;
(*vec_index)++;
prev->len = 0;
- }
- return 0;
+}
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- unsigned long addr = end;
This assignment is useless?
No, this assignment is used when only the WP_ENGAGE operation is performed on normal-size pages.
- spinlock_t *ptl;
- int ret = 0;
- pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
bool pmd_wt;
pmd_wt = !is_pmd_uffd_wp(*pmd);
/*
* Break huge page into small pages if operation needs to be performed is
* on a portion of the huge page.
*/
if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, start);
goto process_smaller_pages;
}
if (IS_GET_OP(p))
ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
is_swap_pmd(*pmd), p, start,
(end - start)/PAGE_SIZE);
spin_unlock(ptl);
if (!ret) {
if (pmd_wt && IS_WP_ENGAGE_OP(p))
uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
}
return ret;
- }
+process_smaller_pages:
- if (pmd_trans_unstable(pmd))
return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
- if (IS_GET_OP(p)) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
if (ret)
break;
}
- }
- pte_unmap_unlock(pte - 1, ptl);
- if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
uffd_wp_range(walk->mm, vma, start, addr - start, true);
- cond_resched();
- return ret;
+}
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- int ret = 0;
- if (vma)
ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
(end - addr)/PAGE_SIZE);
- return ret;
+}
+/* No hugetlb support is present. */ +static const struct mm_walk_ops pagemap_scan_ops = {
- .test_walk = pagemap_scan_test_walk,
- .pmd_entry = pagemap_scan_pmd_entry,
- .pte_hole = pagemap_scan_pte_hole,
+};
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) +{
- unsigned long empty_slots, vec_index = 0;
- unsigned long __user start, end;
- unsigned long __start, __end;
- struct page_region __user *vec;
- struct pagemap_scan_private p;
- int ret = 0;
- start = (unsigned long)untagged_addr(arg->start);
- vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
- /* Validate memory ranges */
- if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
return -EINVAL;
- if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
(!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
return -EINVAL;
- /* Detect illegal flags and masks */
- if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
(arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
(arg->return_mask & ~PAGEMAP_BITS_ALL))
return -EINVAL;
- if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
!arg->return_mask))
return -EINVAL;
- /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
- if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
(arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
return -EINVAL;
I think you said you'll clean this up a bit. I don't think so..
You had shown a really clean way to lay out all these error checking conditions, but I wasn't able to express the current conditions that nicely. I had done at least something to make them look better. Sorry, I'll revisit this and try to come up with error checking conditions that are easier to follow.
On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
On 2/9/23 3:15 AM, Peter Xu wrote:
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{
- if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return true;
- return false;
Sorry I should have mentioned this earlier: you can directly return here.
No problem at all. I'm replacing these two helper functions with the following in the next version so that !present pages don't show as dirty:
static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); }
Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)?
The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none.
Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for?
static inline bool is_pmd_written(pmd_t pmd) { if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) return false; return (pmd_present(pmd) || is_swap_pmd(pmd)); }
[...]
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
IIUC you can have:
int pagemap_scan_deposit(p) { if (p->vec_index >= p->vec_len) return -ENOSPC;
if (p->prev->len) { memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); p->vec_index++; } return 0;
}
Then call it here. I think it can also be called below to replace export_prev_to_out().
No, this isn't possible. We fill up prev until the next range doesn't merge with it. At that point, we put prev into the output buffer and the new range is put into prev. Now that we have shifted to smaller page walks of <= 512 entries, we want to visit all ranges before finally putting prev into the output. Sorry for this somewhat complex method. The problem is that we want to merge consecutive matching regions into one entry in the output. To achieve this across multiple page walks, prev is used.
Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000, a length of 1024 pages, and all of the memory has been written. walk_page_range() will be called 2 times. In the first call, prev will be set with a length of 512. In the second call, prev will be updated to 1024, as the previous range stored in prev can be extended. After this, prev will be stored to the user output buffer, consuming only 1 struct page_region.
If we stored prev to output memory in every walk_page_range() call, we wouldn't get 1 struct page_region with length 1024. Instead we would get 2 page_region structs, each with half the length.
I didn't mean to merge PREV for each pgtable walk. What I meant is I think with such a pagemap_scan_deposit() you can rewrite it as:
if (cpy && bitmap) { if ((prev->len) && (prev->bitmap == bitmap) && (prev->start + prev->len * PAGE_SIZE == addr)) { prev->len += len; p->found_pages += len; } else { if (pagemap_scan_deposit(p)) return -ENOSPC; prev->start = addr; prev->len = len; prev->bitmap = bitmap; p->found_pages += len; } }
Then you can reuse pagemap_scan_deposit() right before returning to userspace, just to flush PREV to p->vec in a single helper. It also makes the code slightly easier to read.
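To make the suggestion concrete, here is a userspace sketch in the spirit of pagemap_scan_deposit(): one helper flushes the pending region, and the output path either merges into prev or deposits and starts a new region. It assumes 4 KiB pages and omits the found_pages accounting. struct scan_state and the function names are stand-ins. Checking prev.len before the capacity check and clearing prev after a flush are choices of this sketch, not part of the reviewed code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096)	/* assumed page size */

struct page_region {
	uint64_t start;
	uint64_t len;		/* in pages */
	uint64_t bitmap;
};

struct scan_state {		/* stand-in for pagemap_scan_private */
	struct page_region *vec;
	size_t vec_len, vec_index;
	struct page_region prev;
};

/* Flush the pending region, if any, into the output vector. */
static int scan_deposit(struct scan_state *p)
{
	if (!p->prev.len)
		return 0;
	if (p->vec_index >= p->vec_len)
		return -ENOSPC;
	p->vec[p->vec_index++] = p->prev;
	p->prev.len = 0;
	return 0;
}

/* Merge the range into prev, or deposit prev and start a new region. */
static int scan_output(struct scan_state *p, uint64_t addr, uint64_t len,
		       uint64_t bitmap)
{
	struct page_region *prev = &p->prev;
	int ret;

	assert(len > 0);
	if (prev->len && prev->bitmap == bitmap &&
	    prev->start + prev->len * SKETCH_PAGE_SIZE == addr) {
		prev->len += len;
		return 0;
	}
	ret = scan_deposit(p);
	if (ret)
		return ret;
	prev->start = addr;
	prev->len = len;
	prev->bitmap = bitmap;
	return 0;
}
```

With this shape, two adjacent 512-page ranges reported by separate walks stay in prev and become a single 1024-page entry once scan_deposit() is called at the end, which is the merging behaviour discussed earlier in the thread.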
On 2/14/23 2:42 AM, Peter Xu wrote:
On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
On 2/9/23 3:15 AM, Peter Xu wrote:
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{
- if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return true;
- return false;
Sorry I should have mentioned this earlier: you can directly return here.
No problem at all. I'm replacing these two helper functions with the following in the next version so that !present pages don't show as dirty:
static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); }
Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)?
The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none.
Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it simpler.
static inline bool is_pmd_written(pmd_t pmd) { if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) return false; return (pmd_present(pmd) || is_swap_pmd(pmd)); }
[...]
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
IIUC you can have:
int pagemap_scan_deposit(p) { if (p->vec_index >= p->vec_len) return -ENOSPC;
if (p->prev->len) { memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); p->vec_index++; } return 0;
}
Then call it here. I think it can also be called below to replace export_prev_to_out().
No, this isn't possible. We fill up prev until the next range doesn't merge with it. At that point, we put prev into the output buffer and the new range is put into prev. Now that we have shifted to smaller page walks of <= 512 entries, we want to visit all ranges before finally putting prev into the output. Sorry for this somewhat complex method. The problem is that we want to merge consecutive matching regions into one entry in the output. To achieve this across multiple page walks, prev is used.
Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000, a length of 1024 pages, and all of the memory has been written. walk_page_range() will be called 2 times. In the first call, prev will be set with a length of 512. In the second call, prev will be updated to 1024, as the previous range stored in prev can be extended. After this, prev will be stored to the user output buffer, consuming only 1 struct page_region.
If we stored prev to output memory in every walk_page_range() call, we wouldn't get 1 struct page_region with length 1024. Instead we would get 2 page_region structs, each with half the length.
I didn't mean to merge PREV for each pgtable walk. What I meant is I think with such a pagemap_scan_deposit() you can rewrite it as:
if (cpy && bitmap) { if ((prev->len) && (prev->bitmap == bitmap) && (prev->start + prev->len * PAGE_SIZE == addr)) { prev->len += len; p->found_pages += len; } else { if (pagemap_scan_deposit(p)) return -ENOSPC; prev->start = addr; prev->len = len; prev->bitmap = bitmap; p->found_pages += len; } }
Then you can reuse pagemap_scan_deposit() right before returning to userspace, just to flush PREV to p->vec in a single helper. It also makes the code slightly easier to read.
Yeah, this would have worked as you have described. But in pagemap_scan_output() we flush prev to p->vec, whereas later in export_prev_to_out() we need to flush prev directly to user memory.
On Tue, Feb 14, 2023 at 12:57:21PM +0500, Muhammad Usama Anjum wrote:
On 2/14/23 2:42 AM, Peter Xu wrote:
On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
On 2/9/23 3:15 AM, Peter Xu wrote:
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{
- if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return true;
- return false;
Sorry I should have mentioned this earlier: you can directly return here.
No problem at all. I'm replacing these two helper functions with following in next version so that !present pages don't show as dirty:
static inline bool is_pte_written(pte_t pte)
{
	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
	    (pte_swp_uffd_wp_any(pte)))
		return false;
	return (pte_present(pte) || is_swap_pte(pte));
}
Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)?
The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none.
Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it simpler.
Ah I think I see what you wanted to do now.. But I'm afraid it won't work for all cases.
So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't persist on anon (but none) ptes, then we got it lost and we cannot identify it from pages being written. Your solution will solve problem for anonymous, but I think it'll break file memories.
Example:
Consider one shmem page that got mapped, write protected (using UFFDIO_WP ioctl), written again (removing uffd-wp bit automatically), then zapped. The pte will be pte_none() but it's actually written, afaiu.
Maybe it's time we introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need to install pte markers for anonymous too (then it will work similarly to shmem/hugetlbfs, in that we'll report writing to zero pages), then you'll need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think you can keep using the old check and it should start to work.
Please let me know if my understanding is correct above.
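The concern can be illustrated with a toy userspace model of the two checks. Everything here is hypothetical and heavily simplified: struct toy_pte, the pte_kind enum and the helper names are invented for the sketch, and the model ignores pte markers entirely. It only shows why adding a "!pte_none()" condition loses the written state of a shmem page that was written and then zapped.

```c
#include <stdbool.h>

/* The three pte types mentioned in the discussion. */
enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/* Toy pte: the uffd-wp bit only means something on a non-none entry. */
struct toy_pte {
	enum pte_kind kind;
	bool uffd_wp;
};

static bool toy_is_uffd_wp(struct toy_pte pte)
{
	return pte.kind != PTE_NONE && pte.uffd_wp;
}

/* Old check: a page counts as written when it is not write-protected. */
static bool written_old(struct toy_pte pte)
{
	return !toy_is_uffd_wp(pte);
}

/* Proposed check: additionally require the pte to be non-none. */
static bool written_proposed(struct toy_pte pte)
{
	if (toy_is_uffd_wp(pte))
		return false;
	return pte.kind != PTE_NONE;
}

/* The shmem sequence from the example: map, protect, write, zap. */
static struct toy_pte shmem_written_then_zapped(void)
{
	struct toy_pte pte = { PTE_PRESENT, false };	/* mapped */

	pte.uffd_wp = true;	/* write-protected via userfaultfd */
	pte.uffd_wp = false;	/* write happens, uffd-wp bit is dropped */
	pte.kind = PTE_NONE;	/* zapped, nothing records the write */
	return pte;
}
```

In this model the old check still reports the zapped shmem page as written, while the proposed check reports it as clean. The flip side is that the old check also reports a never-touched none pte as written, which is the anonymous-memory problem the proposed check was trying to solve.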
I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some patch in the meantime. That's something we wanted before too, when the app cares about zero pages on anon. We used to populate the pages before doing ioctl(UFFDIO_WP) to make sure zero pages would be reported too, but that flag should be more efficient.
static inline bool is_pmd_written(pmd_t pmd)
{
	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
		return false;
	return (pmd_present(pmd) || is_swap_pmd(pmd));
}
[...]
+	bitmap = cur & p->return_mask;
+	if (cpy && bitmap) {
+		if ((prev->len) && (prev->bitmap == bitmap) &&
+		    (prev->start + prev->len * PAGE_SIZE == addr)) {
+			prev->len += len;
+			p->found_pages += len;
+		} else if (p->vec_index < p->vec_len) {
+			if (prev->len) {
+				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
+				p->vec_index++;
+			}
IIUC you can have:
int pagemap_scan_deposit(p)
{
	if (p->vec_index >= p->vec_len)
		return -ENOSPC;

	if (p->prev->len) {
		memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
		p->vec_index++;
	}
	return 0;
}
Then call it here. I think it can also be called below to replace export_prev_to_out().
No, this isn't possible. We fill up prev until the next range doesn't merge with it. At that point, we put prev into the output buffer and the new range is put into prev. Now that we have shifted to smaller page walks of <= 512 entries, we want to visit all ranges before finally putting prev into the output. Sorry for this somewhat complex method. The problem is that we want to merge consecutive matching regions into one entry in the output. To achieve this across multiple page walks, prev is used.
Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000, a length of 1024 pages, and all of the memory has been written. walk_page_range() will be called 2 times. In the first call, prev will be set with a length of 512. In the second call, prev will be updated to 1024, as the previous range stored in prev can be extended. After this, prev will be stored to the user output buffer, consuming only 1 struct page_region.
If we stored prev to output memory in every walk_page_range() call, we wouldn't get 1 struct page_region with length 1024. Instead we would get 2 page_region structs, each with half the length.
I didn't mean to merge PREV for each pgtable walk. What I meant is I think with such a pagemap_scan_deposit() you can rewrite it as:
if (cpy && bitmap) {
	if ((prev->len) && (prev->bitmap == bitmap) &&
	    (prev->start + prev->len * PAGE_SIZE == addr)) {
		prev->len += len;
		p->found_pages += len;
	} else {
		if (pagemap_scan_deposit(p))
			return -ENOSPC;
		prev->start = addr;
		prev->len = len;
		prev->bitmap = bitmap;
		p->found_pages += len;
	}
}
Then you can reuse pagemap_scan_deposit() before returning to userspace, just to flush PREV to p->vec properly in a single helper. It also makes the code slightly easier to read.
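As a rough illustration of this idea, the following userspace sketch keeps a pending region in prev, merges adjacent ranges with the same bitmap, and uses one deposit helper both when a new range does not merge and for the final flush. It is a toy model, not the patch: struct toy_scan, the toy_* function names and the fixed 4096-byte TOY_PAGE_SIZE are assumptions, found_pages accounting is left out, and unlike the snippet above the helper checks for a pending region before checking for space, so an empty flush never fails.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumed page size for the model */
#define TOY_WALK_PAGES 512ULL	/* pages covered by one simulated walk */

struct page_region {
	uint64_t start;
	uint64_t len;		/* length in pages */
	uint64_t bitmap;
};

struct toy_scan {
	struct page_region *vec;	/* output array */
	size_t vec_len, vec_index;
	struct page_region prev;	/* pending region, not yet in vec */
};

/* Move the pending region, if any, into the output array. */
static int toy_deposit(struct toy_scan *p)
{
	if (!p->prev.len)
		return 0;
	if (p->vec_index >= p->vec_len)
		return -ENOSPC;
	p->vec[p->vec_index++] = p->prev;
	p->prev.len = 0;
	return 0;
}

/* Merge the range into prev when adjacent and equal, else start a new one. */
static int toy_output(struct toy_scan *p, uint64_t addr, uint64_t len,
		      uint64_t bitmap)
{
	struct page_region *prev = &p->prev;
	int ret;

	if (prev->len && prev->bitmap == bitmap &&
	    prev->start + prev->len * TOY_PAGE_SIZE == addr) {
		prev->len += len;
		return 0;
	}
	ret = toy_deposit(p);
	if (ret)
		return ret;
	prev->start = addr;
	prev->len = len;
	prev->bitmap = bitmap;
	return 0;
}

/* Scan "pages" written pages in walks of 512; return regions produced. */
static size_t toy_scan_written(struct page_region *vec, size_t vec_len,
			       uint64_t start, uint64_t pages)
{
	struct toy_scan p = { .vec = vec, .vec_len = vec_len };
	uint64_t done;

	for (done = 0; done < pages; done += TOY_WALK_PAGES) {
		uint64_t chunk = pages - done < TOY_WALK_PAGES ?
				 pages - done : TOY_WALK_PAGES;

		if (toy_output(&p, start + done * TOY_PAGE_SIZE, chunk, 1))
			break;
	}
	toy_deposit(&p);	/* final flush of the pending region */
	return p.vec_index;
}
```

With the 1024-page example discussed earlier in this thread, toy_scan_written() is expected to fill a single page_region of length 1024, because the second walk extends prev rather than depositing it.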
Yeah, this would have worked as you have described. But in pagemap_scan_output(), we are flushing prev to p->vec. But later in export_prev_to_out() we need to flush prev to user_memory directly.
I think there's a loop to copy_to_user(). Could you use the new helper so the copy_to_user() loop will work without export_prev_to_out()?
I really hope we can get rid of export_prev_to_out(). Thanks,
On 2/15/23 1:59 AM, Peter Xu wrote: [..]
static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); }
Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)?
The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none.
Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it simpler.
Ah I think I see what you wanted to do now.. But I'm afraid it won't work for all cases.
So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't persist on anon (but none) ptes, then we got it lost and we cannot identify it from pages being written. Your solution will solve problem for anonymous, but I think it'll break file memories.
Example:
Consider one shmem page that got mapped, write protected (using UFFDIO_WP ioctl), written again (removing uffd-wp bit automatically), then zapped. The pte will be pte_none() but it's actually written, afaiu.
Maybe it's time we introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need to install pte markers for anonymous too (then it will work similarly to shmem/hugetlbfs, in that we'll report writing to zero pages), then you'll need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think you can keep using the old check and it should start to work.
Please let me know if my understanding is correct above.
Thank you for identifying it. Your understanding seems on point. I'll research PTE markers. I'm looking at your patches about it [1]. Can you refer me to the "mm alignment sessions" discussion, in the form of a presentation or a transcript, if one is available?
I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some patch in the meantime. That's something we wanted before too, when the app cares about zero pages on anon. We used to populate the pages before doing ioctl(UFFDIO_WP) to make sure zero pages would be reported too, but that flag should be more efficient.
Is this discussion public? For which application were you looking into this? I'll dig in to see how I can contribute to it.
static inline bool is_pmd_written(pmd_t pmd) { if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) || (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd))) return false; return (pmd_present(pmd) || is_swap_pmd(pmd)); }
[...]
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
IIUC you can have:
int pagemap_scan_deposit(p) { if (p->vec_index >= p->vec_len) return -ENOSPC;
if (p->prev->len) { memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); p->vec_index++; } return 0;
}
Then call it here. I think it can also be called below to replace export_prev_to_out().
No, this isn't possible. We fill up prev until the next range doesn't merge with it. At that point, we put prev into the output buffer and the new range is put into prev. Now that we have shifted to smaller page walks of <= 512 entries, we want to visit all ranges before finally putting prev into the output. Sorry for this somewhat complex method. The problem is that we want to merge consecutive matching regions into one entry in the output. To achieve this across multiple page walks, prev is used.
Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000, a length of 1024 pages, and all of the memory has been written. walk_page_range() will be called 2 times. In the first call, prev will be set with a length of 512. In the second call, prev will be updated to 1024, as the previous range stored in prev can be extended. After this, prev will be stored to the user output buffer, consuming only 1 struct page_region.
If we stored prev to output memory in every walk_page_range() call, we wouldn't get 1 struct page_region with length 1024. Instead we would get 2 page_region structs, each with half the length.
I didn't mean to merge PREV for each pgtable walk. What I meant is I think with such a pagemap_scan_deposit() you can rewrite it as:
if (cpy && bitmap) { if ((prev->len) && (prev->bitmap == bitmap) && (prev->start + prev->len * PAGE_SIZE == addr)) { prev->len += len; p->found_pages += len; } else { if (pagemap_scan_deposit(p)) return -ENOSPC; prev->start = addr; prev->len = len; prev->bitmap = bitmap; p->found_pages += len; } }
Then you can reuse pagemap_scan_deposit() before returning to userspace, just to flush PREV to p->vec properly in a single helper. It also makes the code slightly easier to read.
Yeah, this would have worked as you have described. But in pagemap_scan_output(), we are flushing prev to p->vec. But later in export_prev_to_out() we need to flush prev to user_memory directly.
I think there's a loop to copy_to_user(). Could you use the new helper so the copy_to_user() loop will work without export_prev_to_out()?
I really hope we can get rid of export_prev_to_out(). Thanks,
I truly understand how you feel about export_prev_to_out(). It is really difficult to understand. Even I had to try hard to come up with the current code, which avoids consuming a lot of kernel memory while giving the user compact output. I can surely map both of these with a dirty-looking macro, but I'm unable to find a decent macro to replace them. I think I'll put a comment somewhere to explain what's going on.
On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote:
On 2/15/23 1:59 AM, Peter Xu wrote: [..]
static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); }
Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)?
The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none.
Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it simpler.
Ah I think I see what you wanted to do now.. But I'm afraid it won't work for all cases.
So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't persist on anon (but none) ptes, then we got it lost and we cannot identify it from pages being written. Your solution will solve problem for anonymous, but I think it'll break file memories.
Example:
Consider one shmem page that got mapped, write protected (using UFFDIO_WP ioctl), written again (removing uffd-wp bit automatically), then zapped. The pte will be pte_none() but it's actually written, afaiu.
Maybe it's time we introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need to install pte markers for anonymous too (then it will work similarly to shmem/hugetlbfs, in that we'll report writing to zero pages), then you'll need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think you can keep using the old check and it should start to work.
Please let me know if my understanding is correct above.
Thank you for identifying it. Your understanding seems on point. I'll research PTE markers. I'm looking at your patches about it [1]. Can you refer me to the "mm alignment sessions" discussion, in the form of a presentation or a transcript, if one is available?
No worry now, after a second thought I think zero page is better than pte markers, and I've got a patch that works for it here by injecting zero pages for anonymous:
https://lore.kernel.org/all/20230215210257.224243-1-peterx@redhat.com/
I think we'd also better enforce that your new WP_ASYNC feature bit depends on this one, so fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE.
Could you please try by rebasing your work upon this one? Hope it'll work for you already. Note again that you'll need to go back to the old is_pte|pmd_written() to make things work always, I think.
[...]
I truly understand how you feel about export_prev_to_out(). It is really difficult to understand. Even I had to try hard to come up with the current code, which avoids consuming a lot of kernel memory while giving the user compact output. I can surely map both of these with a dirty-looking macro, but I'm unable to find a decent macro to replace them. I think I'll put a comment somewhere to explain what's going on.
So maybe I still missed something? I'll read the new version when it comes.
Thanks,
On 2/16/23 2:12 AM, Peter Xu wrote:
On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote:
On 2/15/23 1:59 AM, Peter Xu wrote: [..]
static inline bool is_pte_written(pte_t pte) { if ((pte_present(pte) && pte_uffd_wp(pte)) || (pte_swp_uffd_wp_any(pte))) return false; return (pte_present(pte) || is_swap_pte(pte)); }
Could you explain why you don't want to return dirty for !present? A page can be written then swapped out. Don't you want to know that happened (from dirty tracking POV)?
The code looks weird to me too.. We only have three types of ptes: (1) present, (2) swap, (3) none.
Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it simpler.
Ah I think I see what you wanted to do now.. But I'm afraid it won't work for all cases.
So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't persist on anon (but none) ptes, then we got it lost and we cannot identify it from pages being written. Your solution will solve problem for anonymous, but I think it'll break file memories.
Example:
Consider one shmem page that got mapped, write protected (using UFFDIO_WP ioctl), written again (removing uffd-wp bit automatically), then zapped. The pte will be pte_none() but it's actually written, afaiu.
Maybe it's time we introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need to install pte markers for anonymous too (then it will work similarly to shmem/hugetlbfs, in that we'll report writing to zero pages), then you'll need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think you can keep using the old check and it should start to work.
Please let me know if my understanding is correct above.
Thank you for identifying it. Your understanding seems on point. I'll research PTE markers. I'm looking at your patches about it [1]. Can you refer me to the "mm alignment sessions" discussion, in the form of a presentation or a transcript, if one is available?
No worry now, after a second thought I think zero page is better than pte markers, and I've got a patch that works for it here by injecting zero pages for anonymous:
https://lore.kernel.org/all/20230215210257.224243-1-peterx@redhat.com/
I think we'd also better enforce that your new WP_ASYNC feature bit depends on this one, so fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE.
Could you please try by rebasing your work upon this one? Hope it'll work for you already. Note again that you'll need to go back to the old is_pte|pmd_written() to make things work always, I think.
Thank you so much for sending the ZEROPAGE patch. I've rebased my patches on top of it and all my tests for anon memory are passing. Now we don't need to touch the page before engaging wp. This is what we wanted to achieve. So the !wp flag can be easily translated to the soft-dirty flag (is_pte_soft_dirty = !is_pte_wp).
I've only a few file mem and shmem tests. I'll write more tests.
[...]
I truly understand how you feel about export_prev_to_out(). It is really difficult to understand. Even I had to try hard to come up with the current code, which avoids consuming a lot of kernel memory while giving the user compact output. I can surely map both of these with a dirty-looking macro, but I'm unable to find a decent macro to replace them. I think I'll put a comment somewhere to explain what's going on.
So maybe I still missed something? I'll read the new version when it comes.
Let's reconvene on the next patches if you feel they can be improved.
Thanks,
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
...
Hi Muhammad! I'm really sorry for not commenting on this code; I'm just out of time and I fear I can't look with precise care, at least for some time. Hopefully other CRIU guys pick it up. Anyway, here are a few comments from a glance.
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
struct pagemap_scan_private *p, unsigned long addr,
unsigned int len)
+{
This is a big function, and usually that's a signal not to declare it as "inline" unless there is a very serious reason to.
- unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
- bool cpy = true;
- struct page_region *prev = &p->prev;
- if (HAS_NO_SPACE(p))
return -ENOSPC;
- if (p->max_pages && p->found_pages + len >= p->max_pages)
len = p->max_pages - p->found_pages;
- if (!len)
return -EINVAL;
- if (p->required_mask)
cpy = ((p->required_mask & cur) == p->required_mask);
- if (cpy && p->anyof_mask)
cpy = (p->anyof_mask & cur);
- if (cpy && p->excluded_mask)
cpy = !(p->excluded_mask & cur);
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
You can exit early here simply
if (!cpy || !bitmap) return 0;
saving one tab for the code below.
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
} else {
return -ENOSPC;
}
- }
- return 0;
+}
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
unsigned long *vec_index)
+{
No need for inline either.
- struct page_region *prev = &p->prev;
- if (prev->len) {
if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
return -EFAULT;
p->vec_index++;
(*vec_index)++;
prev->len = 0;
- }
- return 0;
+}
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
+{
Same, no need for inline. I have a few more comments in mind; I will try to collect them tomorrow.
Hi Cyrill,
Thank you for your time and review.
On 2/9/23 3:22 AM, Cyrill Gorcunov wrote:
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
...
Hi Muhammad! I'm really sorry for not commenting on this code; I'm just out of time and I fear I can't look with precise care, at least for some time. Hopefully other CRIU guys pick it up. Anyway, here are a few comments from a glance.
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
struct pagemap_scan_private *p, unsigned long addr,
unsigned int len)
+{
This is a big function, and usually that's a signal not to declare it as "inline" unless there is a very serious reason to.
I'll remove all these inline in next revision.
- unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
- bool cpy = true;
- struct page_region *prev = &p->prev;
- if (HAS_NO_SPACE(p))
return -ENOSPC;
- if (p->max_pages && p->found_pages + len >= p->max_pages)
len = p->max_pages - p->found_pages;
- if (!len)
return -EINVAL;
- if (p->required_mask)
cpy = ((p->required_mask & cur) == p->required_mask);
- if (cpy && p->anyof_mask)
cpy = (p->anyof_mask & cur);
- if (cpy && p->excluded_mask)
cpy = !(p->excluded_mask & cur);
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
You can exit early here simply
if (!cpy || !bitmap) return 0;
I'm avoiding an extra return here.
saving one tab for the code below.
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
} else {
return -ENOSPC;
}
- }
- return 0;
+}
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
unsigned long *vec_index)
+{
No need for inline either.
- struct page_region *prev = &p->prev;
- if (prev->len) {
if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
return -EFAULT;
p->vec_index++;
(*vec_index)++;
prev->len = 0;
- }
- return 0;
+}
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
+{
Same, no need for inline. I have a few more comments in mind; I will try to collect them tomorrow.
Your review would be much appreciated.
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
To get information about which pages have been written-to and/or to write-protect the pages, the following must be performed first, in order:
- The userfaultfd file descriptor is created with userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through UFFDIO_REGISTER IOCTL.
Then any part of the registered memory, or the whole memory region, can be write-protected using the UFFDIO_WRITEPROTECT IOCTL or the PAGEMAP_SCAN IOCTL.
struct pagemap_scan_args is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer of struct page_region array and size is specified as vec and vec_len.
- The optional maximum number of requested pages is specified in max_pages.
- The flags can be specified in the flags field. PAGEMAP_WP_ENGAGE is the only flag added at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi.
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message
Changes in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code
Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl
Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality
Changes in v5:
- Remove tlb flushing even for clear operation
Changes in v4:
- Update the interface and implementation
Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more error checking
Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
 fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  50 +++++++
 2 files changed, 340 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..c6bde19d63d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
+	    (pte_swp_uffd_wp_any(pte)))
+		return true;
+	return false;
+}
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
+		return true;
+	return false;
+}
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
+#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
+#define IS_GET_OP(a)			(a->vec)
+#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a)				\
+	((a->required_mask & PAGE_IS_WRITTEN) ||	\
+	 (a->anyof_mask & PAGE_IS_WRITTEN))
All these macros are specific to pagemap_scan_ioctl() and should be namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or similar flag rather than using arg->vec.
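For illustration, the two macros could look roughly like this as static inlines. This is a userspace sketch: struct pm_scan_private_model carries only the fields the helpers need, the pm_scan_* names are just one possible namespacing, and the PAGE_IS_WRITTEN value is inferred from PAGEMAP_SCAN_BITMAP() rather than taken from the uapi header.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_IS_WRITTEN (1ULL << 0)	/* assumed bit position */

/* Cut-down stand-in for struct pagemap_scan_private. */
struct pm_scan_private_model {
	unsigned int max_pages, found_pages;
	uint64_t required_mask, anyof_mask;
};

/* Replaces HAS_NO_SPACE(): true once the page budget is used up. */
static inline bool pm_scan_has_no_space(const struct pm_scan_private_model *p)
{
	return p->max_pages && p->found_pages == p->max_pages;
}

/* Replaces IS_WT_REQUIRED(): true if the caller filters on written pages. */
static inline bool pm_scan_is_wt_required(const struct pm_scan_private_model *p)
{
	return (p->required_mask & PAGE_IS_WRITTEN) ||
	       (p->anyof_mask & PAGE_IS_WRITTEN);
}
```

Compared with the macros, the functions type-check their argument and evaluate it once, which is the usual reason to prefer static inlines here.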
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region prev;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
Please keep the lines under 80 characters limit.
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
return -EPERM;
- if (vma->vm_flags & VM_PFNMAP)
return 1;
- return 0;
+}
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
struct pagemap_scan_private *p, unsigned long addr,
unsigned int len)
+{
- unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
- bool cpy = true;
- struct page_region *prev = &p->prev;
- if (HAS_NO_SPACE(p))
return -ENOSPC;
- if (p->max_pages && p->found_pages + len >= p->max_pages)
len = p->max_pages - p->found_pages;
- if (!len)
return -EINVAL;
- if (p->required_mask)
cpy = ((p->required_mask & cur) == p->required_mask);
- if (cpy && p->anyof_mask)
cpy = (p->anyof_mask & cur);
- if (cpy && p->excluded_mask)
cpy = !(p->excluded_mask & cur);
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
} else {
return -ENOSPC;
}
- }
- return 0;
Please don't save on empty lines. Empty lines between logical pieces improve readability.
+}
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
unsigned long *vec_index)
+{
- struct page_region *prev = &p->prev;
- if (prev->len) {
if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
return -EFAULT;
p->vec_index++;
(*vec_index)++;
prev->len = 0;
- }
- return 0;
+}
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- unsigned long addr = end;
- spinlock_t *ptl;
- int ret = 0;
- pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
bool pmd_wt;
pmd_wt = !is_pmd_uffd_wp(*pmd);
/*
* Break huge page into small pages if operation needs to be performed is
* on a portion of the huge page.
*/
if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, start);
goto process_smaller_pages;
}
if (IS_GET_OP(p))
ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
is_swap_pmd(*pmd), p, start,
(end - start)/PAGE_SIZE);
spin_unlock(ptl);
if (!ret) {
if (pmd_wt && IS_WP_ENGAGE_OP(p))
uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
}
return ret;
- }
+process_smaller_pages:
- if (pmd_trans_unstable(pmd))
return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
- if (IS_GET_OP(p)) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
if (ret)
break;
}
- }
- pte_unmap_unlock(pte - 1, ptl);
- if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
uffd_wp_range(walk->mm, vma, start, addr - start, true);
- cond_resched();
- return ret;
+}
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- int ret = 0;
- if (vma)
ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
(end - addr)/PAGE_SIZE);
- return ret;
+}
+/* No hugetlb support is present. */ +static const struct mm_walk_ops pagemap_scan_ops = {
- .test_walk = pagemap_scan_test_walk,
- .pmd_entry = pagemap_scan_pmd_entry,
- .pte_hole = pagemap_scan_pte_hole,
+};
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) +{
- unsigned long empty_slots, vec_index = 0;
- unsigned long __user start, end;
- unsigned long __start, __end;
- struct page_region __user *vec;
- struct pagemap_scan_private p;
- int ret = 0;
- start = (unsigned long)untagged_addr(arg->start);
- vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
- /* Validate memory ranges */
- if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
return -EINVAL;
- if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
(!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
return -EINVAL;
- /* Detect illegal flags and masks */
- if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
(arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
(arg->return_mask & ~PAGEMAP_BITS_ALL))
return -EINVAL;
- if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
!arg->return_mask))
return -EINVAL;
- /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
- if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
(arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
return -EINVAL;
I'd split argument validation into a separate function and split the OR'ed conditions into separate if statements, e.g.:

bool pm_scan_args_valid(struct pagemap_scan_arg *arg)
{
	if (IS_GET_OP(arg)) {
		if (!arg->return_mask)
			return false;

		if (!arg->required_mask && !arg->anyof_mask &&
		    !arg->excluded_mask)
			return false;
	}

	/* ... */

	return true;
}
- end = start + arg->len;
- p.max_pages = arg->max_pages;
- p.found_pages = 0;
- p.flags = arg->flags;
- p.required_mask = arg->required_mask;
- p.anyof_mask = arg->anyof_mask;
- p.excluded_mask = arg->excluded_mask;
- p.return_mask = arg->return_mask;
- p.prev.len = 0;
- p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
- if (IS_GET_OP(arg)) {
p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
if (!p.vec)
return -ENOMEM;
- } else {
p.vec = NULL;
- }
- __start = __end = start;
- while (!ret && __end < end) {
p.vec_index = 0;
empty_slots = arg->vec_len - vec_index;
if (p.vec_len > empty_slots)
p.vec_len = empty_slots;
__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
if (__end > end)
__end = end;
mmap_read_lock(mm);
ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
mmap_read_unlock(mm);
if (!(!ret || ret == -ENOSPC))
goto free_data;
__start = __end;
if (IS_GET_OP(arg) && p.vec_index) {
if (copy_to_user(&vec[vec_index], p.vec,
p.vec_index * sizeof(struct page_region))) {
ret = -EFAULT;
goto free_data;
}
vec_index += p.vec_index;
}
- }
- ret = export_prev_to_out(&p, vec, &vec_index);
- if (!ret)
ret = vec_index;
+free_data:
- if (IS_GET_OP(arg))
kfree(p.vec);
- return ret;
+}
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{
- struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
- struct mm_struct *mm = file->private_data;
- struct pagemap_scan_arg argument;
- if (cmd == PAGEMAP_SCAN) {
if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
return -EFAULT;
return do_pagemap_cmd(mm, &argument);
- }
- return -EINVAL;
+}
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
+
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
 
 #endif /* _UAPI_LINUX_FS_H */
2.30.2
On 2/17/23 3:10 PM, Mike Rapoport wrote:
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
To get information about which pages have been written-to and/or to write-protect the pages, the following must be performed first, in order:
- The userfaultfd file descriptor is created with userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through UFFDIO_REGISTER IOCTL.
Then any part of the registered memory, or the whole memory region, can be write-protected using the UFFDIO_WRITEPROTECT IOCTL or the PAGEMAP_SCAN IOCTL.
struct pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer of struct page_region array and size is specified as vec and vec_len.
- The optional maximum requested pages are specified in the max_pages.
- The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE is the only added flag at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
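As a rough illustration of how the fields above fit together, a caller might fill the argument as sketched below. The struct layout and bit values are copied from this revision of the patch and may change in later revisions; the helper name pm_scan_arg_init_written() is made up for the example. The sketch only prepares the argument; the actual call would be an ioctl(PAGEMAP_SCAN) on an open pagemap file, on a kernel carrying this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values copied from this revision of the patch; they may change later. */
#define PAGE_IS_WRITTEN		(1 << 0)
#define PAGEMAP_WP_ENGAGE	(1 << 0)

struct page_region {
	uint64_t start;
	uint64_t len;
	uint64_t bitmap;
};

struct pagemap_scan_arg {
	uint64_t start;
	uint64_t len;
	uint64_t vec;
	uint64_t vec_len;
	uint32_t max_pages;
	uint32_t flags;
	uint64_t required_mask;
	uint64_t anyof_mask;
	uint64_t excluded_mask;
	uint64_t return_mask;
};

/*
 * Hypothetical helper: ask for the written-to pages in [addr, addr + len)
 * and write-protect them again in the same call. The caller is expected to
 * pass a page-aligned addr.
 */
static void pm_scan_arg_init_written(struct pagemap_scan_arg *arg,
				     void *addr, uint64_t len,
				     struct page_region *vec, uint64_t vec_len)
{
	assert(vec && vec_len);	/* a get operation needs an output buffer */

	memset(arg, 0, sizeof(*arg));
	arg->start = (uint64_t)(uintptr_t)addr;
	arg->len = len;
	arg->vec = (uint64_t)(uintptr_t)vec;
	arg->vec_len = vec_len;
	arg->max_pages = 0;	/* optional limit, 0 leaves it unset */
	arg->flags = PAGEMAP_WP_ENGAGE;
	arg->required_mask = PAGE_IS_WRITTEN;
	arg->return_mask = PAGE_IS_WRITTEN;
}
```

Only PAGE_IS_WRITTEN is used in the masks here because, in this revision, the other bits are rejected in required_mask and anyof_mask when PAGEMAP_WP_ENGAGE is set.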
This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message
Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code
Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl
Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality
Changes in v5:
- Remove tlb flushing even for clear operation
Changes in v4:
- Update the interface and implementation
Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more error checking
Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 50 +++++++ 2 files changed, 340 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index e35a0398db63..c6bde19d63d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -19,6 +19,7 @@ #include <linux/shmem_fs.h> #include <linux/uaccess.h> #include <linux/pkeys.h> +#include <linux/minmax.h> #include <asm/elf.h> #include <asm/tlb.h> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma, } #endif +static inline bool is_pte_uffd_wp(pte_t pte) +{
- if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return true;
- return false;
+}
+static inline bool is_pmd_uffd_wp(pmd_t pmd) +{
- if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
return true;
- return false;
+}
#if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp) @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file) return 0; } +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) +#define IS_GET_OP(a) (a->vec) +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
- (wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a) \
- ((a->required_mask & PAGE_IS_WRITTEN) || \
(a->anyof_mask & PAGE_IS_WRITTEN))
All these macros are specific to pagemap_scan_ioctl() and should be namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
Will do in next version.
And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or similar flag rather than using arg->vec.
I had it in the first revisions, but the explicit GET_OP was removed in a previous iteration after some feedback. Peter has also suggested this. I'll add the GET_OP flag again.
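To make the suggestion above concrete, the two helper macros could look roughly like this as static inlines. The names pm_scan_has_no_space() and pm_scan_is_wt_required() are only placeholders following the PM_SCAN naming proposed above, and the struct is trimmed to the fields the helpers touch, so this is a sketch rather than what the next version will contain.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_IS_WRITTEN	(1 << 0)	/* value from this patch */

/* Trimmed copy of the private walk state from the patch. */
struct pagemap_scan_private {
	unsigned int max_pages, found_pages;
	unsigned long required_mask, anyof_mask;
};

/* Was HAS_NO_SPACE(): the optional page limit has been reached. */
static inline bool pm_scan_has_no_space(const struct pagemap_scan_private *p)
{
	/* The walk clamps len, so the limit should never be exceeded. */
	assert(!p->max_pages || p->found_pages <= p->max_pages);

	return p->max_pages && p->found_pages == p->max_pages;
}

/* Was IS_WT_REQUIRED(): the caller filters on the written-to bit. */
static inline bool pm_scan_is_wt_required(const struct pagemap_scan_private *p)
{
	return (p->required_mask & PAGE_IS_WRITTEN) ||
	       (p->anyof_mask & PAGE_IS_WRITTEN);
}
```

Compared with the macros, the inline functions get type checking on the argument and avoid the unparenthesized macro parameter.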
+struct pagemap_scan_private {
- struct page_region *vec;
- struct page_region prev;
- unsigned long vec_len, vec_index;
- unsigned int max_pages, found_pages, flags;
- unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
Please keep the lines under 80 characters limit.
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
return -EPERM;
- if (vma->vm_flags & VM_PFNMAP)
return 1;
- return 0;
+}
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
struct pagemap_scan_private *p, unsigned long addr,
unsigned int len)
+{
- unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
- bool cpy = true;
- struct page_region *prev = &p->prev;
- if (HAS_NO_SPACE(p))
return -ENOSPC;
- if (p->max_pages && p->found_pages + len >= p->max_pages)
len = p->max_pages - p->found_pages;
- if (!len)
return -EINVAL;
- if (p->required_mask)
cpy = ((p->required_mask & cur) == p->required_mask);
- if (cpy && p->anyof_mask)
cpy = (p->anyof_mask & cur);
- if (cpy && p->excluded_mask)
cpy = !(p->excluded_mask & cur);
- bitmap = cur & p->return_mask;
- if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else if (p->vec_index < p->vec_len) {
if (prev->len) {
memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
p->vec_index++;
}
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
} else {
return -ENOSPC;
}
- }
- return 0;
Please don't save on empty lines. Empty lines between logical pieces improve readability.
Sorry, I'll add them.
+}
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
unsigned long *vec_index)
+{
- struct page_region *prev = &p->prev;
- if (prev->len) {
if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
return -EFAULT;
p->vec_index++;
(*vec_index)++;
prev->len = 0;
- }
- return 0;
+}
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- unsigned long addr = end;
- spinlock_t *ptl;
- int ret = 0;
- pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
bool pmd_wt;
pmd_wt = !is_pmd_uffd_wp(*pmd);
/*
* Break huge page into small pages if operation needs to be performed is
* on a portion of the huge page.
*/
if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmd, start);
goto process_smaller_pages;
}
if (IS_GET_OP(p))
ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
is_swap_pmd(*pmd), p, start,
(end - start)/PAGE_SIZE);
spin_unlock(ptl);
if (!ret) {
if (pmd_wt && IS_WP_ENGAGE_OP(p))
uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
}
return ret;
- }
+process_smaller_pages:
- if (pmd_trans_unstable(pmd))
return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
- if (IS_GET_OP(p)) {
for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
if (ret)
break;
}
- }
- pte_unmap_unlock(pte - 1, ptl);
- if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
uffd_wp_range(walk->mm, vma, start, addr - start, true);
- cond_resched();
- return ret;
+}
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
struct mm_walk *walk)
+{
- struct pagemap_scan_private *p = walk->private;
- struct vm_area_struct *vma = walk->vma;
- int ret = 0;
- if (vma)
ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
(end - addr)/PAGE_SIZE);
- return ret;
+}
+/* No hugetlb support is present. */ +static const struct mm_walk_ops pagemap_scan_ops = {
- .test_walk = pagemap_scan_test_walk,
- .pmd_entry = pagemap_scan_pmd_entry,
- .pte_hole = pagemap_scan_pte_hole,
+};
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) +{
- unsigned long empty_slots, vec_index = 0;
- unsigned long __user start, end;
- unsigned long __start, __end;
- struct page_region __user *vec;
- struct pagemap_scan_private p;
- int ret = 0;
- start = (unsigned long)untagged_addr(arg->start);
- vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
- /* Validate memory ranges */
- if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
return -EINVAL;
- if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
(!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
return -EINVAL;
- /* Detect illegal flags and masks */
- if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
(arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
(arg->return_mask & ~PAGEMAP_BITS_ALL))
return -EINVAL;
- if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
!arg->return_mask))
return -EINVAL;
- /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
- if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
(arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
return -EINVAL;
I'd split argument validation into a separate function and split the OR'ed conditions into separate if statements, e.g.:

bool pm_scan_args_valid(struct pagemap_scan_arg *arg)
{
	if (IS_GET_OP(arg)) {
		if (!arg->return_mask)
			return false;

		if (!arg->required_mask && !arg->anyof_mask &&
		    !arg->excluded_mask)
			return false;
	}

	/* ... */

	return true;
}
This seems a very good way. Thank you so much!
- end = start + arg->len;
- p.max_pages = arg->max_pages;
- p.found_pages = 0;
- p.flags = arg->flags;
- p.required_mask = arg->required_mask;
- p.anyof_mask = arg->anyof_mask;
- p.excluded_mask = arg->excluded_mask;
- p.return_mask = arg->return_mask;
- p.prev.len = 0;
- p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
- if (IS_GET_OP(arg)) {
p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
if (!p.vec)
return -ENOMEM;
- } else {
p.vec = NULL;
- }
- __start = __end = start;
- while (!ret && __end < end) {
p.vec_index = 0;
empty_slots = arg->vec_len - vec_index;
if (p.vec_len > empty_slots)
p.vec_len = empty_slots;
__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
if (__end > end)
__end = end;
mmap_read_lock(mm);
ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
mmap_read_unlock(mm);
if (!(!ret || ret == -ENOSPC))
goto free_data;
__start = __end;
if (IS_GET_OP(arg) && p.vec_index) {
if (copy_to_user(&vec[vec_index], p.vec,
p.vec_index * sizeof(struct page_region))) {
ret = -EFAULT;
goto free_data;
}
vec_index += p.vec_index;
}
- }
- ret = export_prev_to_out(&p, vec, &vec_index);
- if (!ret)
ret = vec_index;
+free_data:
- if (IS_GET_OP(arg))
kfree(p.vec);
- return ret;
+}
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{
- struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
- struct mm_struct *mm = file->private_data;
- struct pagemap_scan_arg argument;
- if (cmd == PAGEMAP_SCAN) {
if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
return -EFAULT;
return do_pagemap_cmd(mm, &argument);
- }
- return -EINVAL;
+}
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
+
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
 
 #endif /* _UAPI_LINUX_FS_H */
2.30.2
On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote:
+#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE) +#define IS_GET_OP(a) (a->vec) +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
- (wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a) \
- ((a->required_mask & PAGE_IS_WRITTEN) || \
(a->anyof_mask & PAGE_IS_WRITTEN))
All these macros are specific to pagemap_scan_ioctl() and should be namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
Will do in next version.
I'd prefer to keep IS_WP_ENGAGE_OP() and IS_GET_OP(), renamed to PM_SCAN_OP_IS_WP() and PM_SCAN_OP_IS_GET(), rather than open-coding them, as the macros seem more readable to me. I can open-code them if you insist.
On Mon, Feb 20, 2023 at 04:38:10PM +0500, Muhammad Usama Anjum wrote:
[...]
I'd suggest seeing how the rework of pagemap_scan_pmd_entry() pans out. An open-coded '&' is surely clearer than a macro/function, but if it's buried in a long sequence of conditions, it may not be such a clear win.
-- BR, Muhammad Usama Anjum
On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum usama.anjum@collabora.com wrote: [...]
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
[...]
May I suggest a slightly modified interface for the flags?
As I understand, the return_mask is what is applied to page flags to aggregate the list. This is a separate thing, and I think it doesn't need changes except maybe an improvement in the documentation and visual distinction.
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting responsibilities. I suggest to rework that to:
1. negated_flags: page flags which are to be negated before applying the page selection using following masks;
2. required_flags: flags which all have to be set in the (negation-applied) page flags;
3. anyof_flags: flags of which at least one has to be set in the (negation-applied) page flags;
IOW, the resulting algorithm would be:
tested_flags = page_flags ^ negated_flags;
if (~tested_flags & required_flags)
	skip page;
if (!(tested_flags & anyof_flags))
	skip page;

aggregate_on(page_flags & return_flags);
Best Regards Michał Mirosław
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum usama.anjum@collabora.com wrote: [...]
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
[...]
The interface was suggested by Andrei back on the review of v3 [1]:
I mean we should be able to specify for what pages we need to get info for. An ioctl argument can have these four fields:
- required bits (rmask & mask == mask) - all bits from this mask have to be set.
- any of these bits (amask & mask != 0) - any of these bits is set.
- exclude masks (emask & mask == 0) - none of these bits are set.
- return mask - bits that have to be reported to user.
May I suggest a slightly modified interface for the flags?
I've added everyone who may be interested in making interface better.
As I understand, the return_mask is what is applied to page flags to aggregate the list. This is a separate thing, and I think it doesn't need changes except maybe an improvement in the documentation and visual distinction.
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
responsibilities. I suggest to rework that to:
- negated_flags: page flags which are to be negated before applying
the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:

Page Flag   negated_flags   Result (XOR)
    0             0              0
    0             1              1
    1             0              1
    1             1              0

If a page flag is 0 and negated_flag is 1, the result would be 1, which has changed the page flag. It isn't making sense to me. Why is the page flag bit being flipped?
When Andrei proposed these masks, they seemed like a fancy way of filtering inside the kernel and they were straightforward to understand. These masks would help his use cases for CRIU, so I included them. Could you please elaborate on the purpose of the negation?
- required_flags: flags which all have to be set in the (negation-applied) page flags;
- anyof_flags: flags of which at least one has to be set in the (negation-applied) page flags;
IOW, the resulting algorithm would be:
tested_flags = page_flags ^ negated_flags;
if (~tested_flags & required_flags)
	skip page;
if (!(tested_flags & anyof_flags))
	skip page;

aggregate_on(page_flags & return_flags);
Best Regards Michał Mirosław
[1] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call). (Note: the XOR is applied only to the value of the flags for the purpose of testing page-selection criteria.)
So:
1. if a flag is not set in negated_flags, but set in required_flags, then it means "this flag must be one" - equivalent to it being set in required_flag (in your current version of the API).
2. if a flag is set in negated_flags and also in required_flags, then it means "this flag must be zero" - equivalent to it being set in excluded_flags.

The same thing goes for anyof_flags: if a flag is set in anyof_flags, then for it to be considered matched:
1. it must have a value of 1 if it is not set in negated_flags
2. it must have a value of 0 if it is set in negated_flags
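One way to read the rules above is as a mechanical translation from the current required_mask/excluded_mask pair into the proposed negated_flags/required_flags pair. The sketch below assumes the input masks do not overlap; struct pm_filter and pm_filter_from_masks() are invented for the illustration and are not part of the patch or of the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the proposed selection flags. */
struct pm_filter {
	uint64_t negated_flags;
	uint64_t required_flags;
	uint64_t anyof_flags;
};

/*
 * "Must be one" flags stay in required_flags as they are. "Must be zero"
 * flags are also required, but marked as negated so that their value is
 * inverted before the test.
 */
static struct pm_filter pm_filter_from_masks(uint64_t required_mask,
					     uint64_t anyof_mask,
					     uint64_t excluded_mask)
{
	struct pm_filter f;

	/* This sketch only handles non-overlapping masks. */
	assert(!(required_mask & excluded_mask));
	assert(!(anyof_mask & excluded_mask));

	f.negated_flags = excluded_mask;
	f.required_flags = required_mask | excluded_mask;
	f.anyof_flags = anyof_mask;
	return f;
}
```

With this encoding a flag cannot be both "must be one" and "must be zero", which is the invalid combination the current pair of masks allows.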
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention? The example code has a bug though, in that if anyof_flags is zero it will never match. Let me fix the selection part:
// calc. a mask of flags that have expected ("active") values
tested_flags = page_flags ^ negated_flags;
// are all required flags in "active" state? [== all zero when negated]
if (~tested_flags & required_flags)
	skip page;
// is any extra flag "active"?
if (anyof_flags && !(tested_flags & anyof_flags))
	skip page;
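For reference, the corrected selection above could be written as a small C predicate along these lines. This is only a sketch of the proposal: the function name pm_page_selected() and the bit values used in the examples are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool pm_page_selected(uint64_t page_flags, uint64_t negated_flags,
			     uint64_t required_flags, uint64_t anyof_flags)
{
	/* Flags whose value is the expected ("active") one. */
	uint64_t tested_flags = page_flags ^ negated_flags;

	/* Every required flag has to be active. */
	if (~tested_flags & required_flags)
		return false;

	/* If anyof_flags is given, at least one of them has to be active. */
	if (anyof_flags && !(tested_flags & anyof_flags))
		return false;

	return true;
}

/* Example: select pages that are written (bit 0) and not file-backed (bit 1). */
static void pm_page_selected_examples(void)
{
	const uint64_t written = 1 << 0, file = 1 << 1;

	assert(pm_page_selected(written, file, written | file, 0));
	assert(!pm_page_selected(written | file, file, written | file, 0));
	assert(!pm_page_selected(0, file, written | file, 0));
}
```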
Best Regards Michał Mirosław
On 2/21/23 5:42 PM, Michał Mirosław wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a page to get selected, the page flags must fulfill the criterion of all the specified masks.
If a flag is present in both required_mask and excluded_mask, the required_mask would select a page, but the excluded_mask would drop it. So the page would be dropped. It is the responsibility of the user to specify the flags correctly.
matched = true;
if (p->required_mask)
	matched = ((p->required_mask & bitmap) == p->required_mask);
if (matched && p->anyof_mask)
	matched = (p->anyof_mask & bitmap);
if (matched && p->excluded_mask)
	matched = !(p->excluded_mask & bitmap);

if (matched && bitmap) {
	// page selected
}
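The same logic as a standalone function, in case it helps to experiment with mask combinations in userspace. This is a sketch mirroring the snippet above, not the kernel code; pm_masks_select() and the example function are names made up here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool pm_masks_select(uint64_t bitmap, uint64_t required_mask,
			    uint64_t anyof_mask, uint64_t excluded_mask)
{
	bool matched = true;

	if (required_mask)
		matched = (required_mask & bitmap) == required_mask;
	if (matched && anyof_mask)
		matched = (anyof_mask & bitmap) != 0;
	if (matched && excluded_mask)
		matched = !(excluded_mask & bitmap);

	/* As in the snippet above, a page with no flags set is not selected. */
	return matched && bitmap;
}

/* A flag present in both required_mask and excluded_mask matches no page. */
static void pm_masks_conflict_example(void)
{
	const uint64_t flag = 1 << 0;

	assert(!pm_masks_select(flag, flag, 0, flag));	/* dropped by excluded_mask */
	assert(!pm_masks_select(0, flag, 0, flag));	/* fails required_mask */
}
```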
Do you accept this behavior of the masks after this explanation?
(Note: the XOR is applied only to the value of the flags for the purpose of testing page-selection criteria.)
So:
- if a flag is not set in negated_flags, but set in required_flags, then it means "this flag must be one" - equivalent to it being set in required_flag (in your current version of the API).
- if a flag is set in negated_flags and also in required_flags, then it means "this flag must be zero" - equivalent to it being set in excluded_flags.
Let's translate the words into a table:

pageflags   required_flags   negated_flags   matched
    1             1                0           yes
    0             1                1           yes
The same thing goes for anyof_flags: if a flag is set in anyof_flags, then for it to be considered matched:
- it must have a value of 1 if it is not set in negated_flags
- it must have a value of 0 if it is set in negated_flags
    pageflags  anyof_flags  negated_flags  matched
    1          1            0              yes
    0          1            1              yes
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
The example code has a bug though, in that if anyof_flags is zero it will never match. Let me fix the selection part:
    // calc. a mask of flags that have expected ("active") values
    tested_flags = page_flags ^ negated_flags;
    // are all required flags in "active" state? [== all zero when negated]
    if (~tested_flags & required_mask)
        skip page;
    // is any extra flag "active"?
    if (anyof_flags && !(tested_flags & anyof_flags))
        skip page;
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Best Regards Michał Mirosław
On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/21/23 5:42 PM, Michał Mirosław wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
responsibilities. I suggest to rework that to:
- negated_flags: page flags which are to be negated before applying
the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:

    Page Flag  negated_flags  result
    0          0              0
    0          1              1
    1          0              1
    1          1              0
If a page flag is 0 and negated_flag is 1, the result would be 1, which changes the page flag. That doesn't make sense to me. Why is the page flag bit being flipped?
When Andrei proposed these masks, they seemed like a fancy way of filtering inside the kernel, and they were straightforward to understand. These masks would help his use cases for CRIU, so I included them. Can you please elaborate on the purpose of the negation?
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a page to get selected, the page flags must fulfill the criterion of all the specified masks.
[Please see the comment below.]
[...]
Lets translate words into table:
[Yes, those tables captured the intent correctly.]
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
The example code has a bug though, in that if anyof_flags is zero it will never match. Let me fix the selection part:
    // calc. a mask of flags that have expected ("active") values
    tested_flags = page_flags ^ negated_flags;
    // are all required flags in "active" state? [== all zero when negated]
    if (~tested_flags & required_mask)
        skip page;
    // is any extra flag "active"?
    if (anyof_flags && !(tested_flags & anyof_flags))
        skip page;
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags. IOW my proposal is to replace branches in the masks interpretation (if in one set then matches but if in another set then doesn't; if flags match ... ) with plain calculation (flag is matching when equals ~negated_flags; if flags match the masks ...).
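A minimal sketch of the matched_values variant suggested above, assuming matched_values = ~negated_flags as stated. The function name page_matches() and the 32-bit flag width are assumptions made for illustration; nothing here comes from the patch itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical selection check in which the caller passes matched_values
 * (= ~negated_flags): the value each tested flag is expected to have.
 */
static bool page_matches(uint32_t page_flags, uint32_t matched_values,
                         uint32_t required_mask, uint32_t anyof_mask)
{
    /* A bit is 1 here when the flag equals its expected value. */
    uint32_t equal = ~(page_flags ^ matched_values);

    /* Every flag in required_mask must equal its expected value. */
    if (~equal & required_mask)
        return false;
    /* If anyof_mask is non-zero, at least one of its flags must match. */
    if (anyof_mask && !(equal & anyof_mask))
        return false;
    return true;
}
```

Since page_flags ^ ~matched_values equals ~(page_flags ^ matched_values), `equal` holds the same bits as tested_flags in the XOR formulation, so both forms select the same pages.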
Best Regards Michał Mirosław
On 2/22/23 3:44 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/21/23 5:42 PM, Michał Mirosław wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
responsibilities. I suggest to rework that to:
- negated_flags: page flags which are to be negated before applying
the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:

    Page Flag  negated_flags  result
    0          0              0
    0          1              1
    1          0              1
    1          1              0
If a page flag is 0 and negated_flag is 1, the result would be 1, which changes the page flag. That doesn't make sense to me. Why is the page flag bit being flipped?
When Andrei proposed these masks, they seemed like a fancy way of filtering inside the kernel, and they were straightforward to understand. These masks would help his use cases for CRIU, so I included them. Can you please elaborate on the purpose of the negation?
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a page to get selected, the page flags must fulfill the criterion of all the specified masks.
[Please see the comment below.]
[...]
Lets translate words into table:
[Yes, those tables captured the intent correctly.]
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero.
The example code has a bug though, in that if anyof_flags is zero it will never match. Let me fix the selection part:
    // calc. a mask of flags that have expected ("active") values
    tested_flags = page_flags ^ negated_flags;
    // are all required flags in "active" state? [== all zero when negated]
    if (~tested_flags & required_mask)
        skip page;
    // is any extra flag "active"?
    if (anyof_flags && !(tested_flags & anyof_flags))
        skip page;
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags.
There is no ordering in the current masks scheme. No mask takes precedence. For a page to get selected, the definitions of all the masks must be fulfilled. You have come up with a good example: what if required_mask = excluded_mask? In this case, no page will fulfill the criteria and hence no page would be selected. It is the user's fault if he doesn't understand the definitions of these masks correctly.
Now thinking about it, I can add an error check which would return an error if any bit is set in both the required and excluded masks. Would you like that? Let's put this check in place. (Previously I'd left it to the user's wisdom not to do this. If he specifies the same bits in both masks, he'll get no addresses out of the syscall.)
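The argument checks described in this message could look roughly like the following. This is a sketch under assumptions: the helper name validate_masks() is hypothetical, and -EINVAL is assumed as the error value, since the thread only says that an error is returned.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical validation of the mask arguments as described above. */
static int validate_masks(uint32_t required_mask, uint32_t anyof_mask,
                          uint32_t excluded_mask, uint32_t return_mask)
{
    /* At least one of the three selection masks must be specified. */
    if (!(required_mask | anyof_mask | excluded_mask))
        return -EINVAL;
    /* The return_mask must always be specified. */
    if (!return_mask)
        return -EINVAL;
    /* A bit set in both required and excluded can never match. */
    if (required_mask & excluded_mask)
        return -EINVAL;
    return 0;
}
```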
IOW my proposal is to replace branches in the masks interpretation (if in one set then matches but if in another set then doesn't; if flags match ... ) with plain calculation (flag is matching when equals ~negated_flags; if flags match the masks ...).
Best Regards Michał Mirosław
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/22/23 3:44 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/21/23 5:42 PM, Michał Mirosław wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
responsibilities. I suggest to rework that to:
- negated_flags: page flags which are to be negated before applying
the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:

    Page Flag  negated_flags  result
    0          0              0
    0          1              1
    1          0              1
    1          1              0
If a page flag is 0 and negated_flag is 1, the result would be 1, which changes the page flag. That doesn't make sense to me. Why is the page flag bit being flipped?
When Andrei proposed these masks, they seemed like a fancy way of filtering inside the kernel, and they were straightforward to understand. These masks would help his use cases for CRIU, so I included them. Can you please elaborate on the purpose of the negation?
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a page to get selected, the page flags must fulfill the criterion of all the specified masks.
[Please see the comment below.]
[...]
Lets translate words into table:
[Yes, those tables captured the intent correctly.]
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero.
Why do you need those restrictions? I'd guess it is valid to request a list of all pages with zero return_mask - this will return a compact list of used ranges of the virtual address space.
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags.
There is no ordering in the current masks scheme. No mask takes precedence. For a page to get selected, the definitions of all the masks must be fulfilled. You have come up with a good example: what if required_mask = excluded_mask? In this case, no page will fulfill the criteria and hence no page would be selected. It is the user's fault if he doesn't understand the definitions of these masks correctly.
Now thinking about it, I can add an error check which would return an error if any bit is set in both the required and excluded masks. Would you like that? Let's put this check in place. (Previously I'd left it to the user's wisdom not to do this. If he specifies the same bits in both masks, he'll get no addresses out of the syscall.)
This error case is (one of) the problems I propose avoiding. You also need much more text to describe the required/excluded flags interactions and edge cases than saying that a flag must have a value equal to the corresponding bit in ~negated_flags to be matched by the required/anyof masks.
IOW my proposal is to replace branches in the masks interpretation (if in one set then matches but if in another set then doesn't; if flags match ... ) with plain calculation (flag is matching when equals ~negated_flags; if flags match the masks ...).
Best Regards Michał Mirosław
On 2/22/23 4:48 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/22/23 3:44 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/21/23 5:42 PM, Michał Mirosław wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
> For the page-selection mechanism, currently required_mask and
> excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
> responsibilities. I suggest to rework that to:
> 1. negated_flags: page flags which are to be negated before applying
> the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:

    Page Flag  negated_flags  result
    0          0              0
    0          1              1
    1          0              1
    1          1              0
If a page flag is 0 and negated_flag is 1, the result would be 1, which changes the page flag. That doesn't make sense to me. Why is the page flag bit being flipped?
When Andrei proposed these masks, they seemed like a fancy way of filtering inside the kernel, and they were straightforward to understand. These masks would help his use cases for CRIU, so I included them. Can you please elaborate on the purpose of the negation?
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a page to get selected, the page flags must fulfill the criterion of all the specified masks.
[Please see the comment below.]
[...]
Lets translate words into table:
[Yes, those tables captured the intent correctly.]
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero.
Why do you need those restrictions? I'd guess it is valid to request a list of all pages with zero return_mask - this will return a compact list of used ranges of the virtual address space.
At this time, we support 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that the user mentions his flags of interest in the return_mask. If he wants only 1 flag, he'll specify it. Admittedly, if the user wants only 1 flag, at first it doesn't make much sense to have to mention it in the return_mask. But we want uniformity. If the user wants 2 or more flags returned, return_mask becomes compulsory. So to keep things simple and generic for any number of flags of interest, the return_mask must be specified even if there is only 1 flag of interest.
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags.
There is no ordering in the current masks scheme. No mask takes precedence. For a page to get selected, the definitions of all the masks must be fulfilled. You have come up with a good example: what if required_mask = excluded_mask? In this case, no page will fulfill the criteria and hence no page would be selected. It is the user's fault if he doesn't understand the definitions of these masks correctly.
Now thinking about it, I can add an error check which would return an error if any bit is set in both the required and excluded masks. Would you like that? Let's put this check in place. (Previously I'd left it to the user's wisdom not to do this. If he specifies the same bits in both masks, he'll get no addresses out of the syscall.)
This error case is (one of) the problems I propose avoiding. You also need much more text to describe the required/excluded flags interactions and edge cases than saying that a flag must have a value equal to the corresponding bit in ~negated_flags to be matched by the required/anyof masks.
I've found excluded_mask very intuitive compared to negated_mask, which is so difficult to understand that I don't know how to use it correctly. Let's take an example: I want pages which are PAGE_IS_WRITTEN and are not PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or PAGE_IS_SWAPPED. This can be specified as:

    required_mask = PAGE_IS_WRITTEN
    excluded_mask = PAGE_IS_FILE
    anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(a) assume page_flags = 0b1111
    skip page as 0b1111 & 0b0010 = true

(b) assume page_flags = 0b1001
    select page as 0b1001 & 0b0010 = false

It seemed intuitive. Right? How would you achieve the same thing with negated_mask?

    required_mask = PAGE_IS_WRITTEN
    negated_mask = PAGE_IS_FILE
    anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(1) assume page_flags = 0b1111
    tested_flags = 0b1111 ^ 0b0010 = 0b1101

(2) assume page_flags = 0b1001
    tested_flags = 0b1001 ^ 0b0010 = 0b1011

In (1), we wanted to skip pages which have PAGE_IS_FILE set. But negated_mask has just masked it, so the page is still tested for selection and would get selected. That is wrong.

In (2), the PAGE_IS_FILE bit of page_flags was 0 and got changed to 1 (PAGE_IS_FILE) in tested_flags.
IOW my proposal is to replace branches in the masks interpretation (if in one set then matches but if in another set then doesn't; if flags match ... ) with plain calculation (flag is matching when equals ~negated_flags; if flags match the masks ...).
Best Regards Michał Mirosław
On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/22/23 4:48 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
[...]
BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero.
Why do you need those restrictions? I'd guess it is valid to request a list of all pages with zero return_mask - this will return a compact list of used ranges of the virtual address space.
At this time, we support 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that the user mentions his flags of interest in the return_mask. If he wants only 1 flag, he'll specify it. Admittedly, if the user wants only 1 flag, at first it doesn't make much sense to have to mention it in the return_mask. But we want uniformity. If the user wants 2 or more flags returned, return_mask becomes compulsory. So to keep things simple and generic for any number of flags of interest, the return_mask must be specified even if there is only 1 flag of interest.
I'm not sure why we want uniformity in the case of 1 flag. If a user specifies a single required flag, I'd expect he doesn't need to look at the flags returned, as those will duplicate the information from the mere presence of a page. A user might also require a single flag, but want all of them returned. Both requests - return 1 flag and return 0 flags - would give meaningful output, so why force one way or the other? Allowing both will also enable users to express the intent: they need either just a list of pages, or they need a list with per-page flags - the need would follow from the code structure or other factors.
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags.
There is no ordering in the current masks scheme. No mask takes precedence. For a page to get selected, the definitions of all the masks must be fulfilled. You have come up with a good example: what if required_mask = excluded_mask? In this case, no page will fulfill the criteria and hence no page would be selected. It is the user's fault if he doesn't understand the definitions of these masks correctly.
Now thinking about it, I can add an error check which would return an error if any bit is set in both the required and excluded masks. Would you like that? Let's put this check in place. (Previously I'd left it to the user's wisdom not to do this. If he specifies the same bits in both masks, he'll get no addresses out of the syscall.)
This error case is (one of) the problems I propose avoiding. You also need much more text to describe the required/excluded flags interactions and edge cases than saying that a flag must have a value equal to the corresponding bit in ~negated_flags to be matched by the required/anyof masks.
I've found excluded_mask very intuitive compared to negated_mask, which is so difficult to understand that I don't know how to use it correctly. Let's take an example: I want pages which are PAGE_IS_WRITTEN and are not PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or PAGE_IS_SWAPPED. This can be specified as:

    required_mask = PAGE_IS_WRITTEN
    excluded_mask = PAGE_IS_FILE
    anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(a) assume page_flags = 0b1111
    skip page as 0b1111 & 0b0010 = true

(b) assume page_flags = 0b1001
    select page as 0b1001 & 0b0010 = false

It seemed intuitive. Right? How would you achieve the same thing with negated_mask?

    required_mask = PAGE_IS_WRITTEN
    negated_mask = PAGE_IS_FILE
    anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(1) assume page_flags = 0b1111
    tested_flags = 0b1111 ^ 0b0010 = 0b1101

(2) assume page_flags = 0b1001
    tested_flags = 0b1001 ^ 0b0010 = 0b1011

In (1), we wanted to skip pages which have PAGE_IS_FILE set. But negated_mask has just masked it, so the page is still tested for selection and would get selected. That is wrong.

In (2), the PAGE_IS_FILE bit of page_flags was 0 and got changed to 1 (PAGE_IS_FILE) in tested_flags.
I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
    required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
    negated_flags = PAGE_IS_FILE; // flags I want zero
I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAP=1, so:
anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAP;
Another case: I want to analyse a process' working set:
    required_mask = 0;
    negated_flags = PAGE_IS_FILE;
    anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN;
-> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0].
To clarify a bit: negated_flags doesn't mask anything: the field inverts values of the flags (marks some "active low", if you consider electronic signal analogy).
Best Regards Michał Mirosław
On 2/23/23 1:41 PM, Michał Mirosław wrote:
On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/22/23 4:48 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
[...]
> BTW, I think I assumed that both conditions (all flags in
> required_flags and at least one in anyof_flags is present) need to be
> true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero.
Why do you need those restrictions? I'd guess it is valid to request a list of all pages with zero return_mask - this will return a compact list of used ranges of the virtual address space.
At this time, we support 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that the user mentions his flags of interest in the return_mask. If he wants only 1 flag, he'll specify it. Admittedly, if the user wants only 1 flag, at first it doesn't make much sense to have to mention it in the return_mask. But we want uniformity. If the user wants 2 or more flags returned, return_mask becomes compulsory. So to keep things simple and generic for any number of flags of interest, the return_mask must be specified even if there is only 1 flag of interest.
I'm not sure why we want uniformity in the case of 1 flag. If a user specifies a single required flag, I'd expect he doesn't need to look at the flags returned, as those will duplicate the information from the mere presence of a page. A user might also require a single flag, but want all of them returned. Both requests - return 1 flag and return 0 flags - would give meaningful output, so why force one way or the other? Allowing both will also enable users to express the intent: they need either just a list of pages, or they need a list with per-page flags - the need would follow from the code structure or other factors.
We can add as much flexibility as people ask for while keeping the code simple. But it is going to be dirty to add an error check which detects whether return_mask = 0 and whether there is only 1 flag of interest mentioned by the user. The following error check is essential to return deterministic output. Do you think this case is worth supporting, rather than going with the generality for both the 1-flag and the multi-flag cases?

    if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1)
        return error;
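For readers outside the kernel, the check above can be approximated in userspace as follows. This is a sketch only: count_set_bits() stands in for the kernel's hweight_long(), the name check_return_mask() is hypothetical, and -EINVAL is an assumed error value.

```c
#include <errno.h>
#include <stdint.h>

/* Portable population count, standing in for hweight_long(). */
static unsigned int count_set_bits(uint32_t v)
{
    unsigned int n = 0;

    while (v) {
        v &= v - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Hypothetical check: a zero return_mask is accepted only when exactly
 * one flag of interest is named in required_mask | any_mask.
 */
static int check_return_mask(uint32_t return_mask, uint32_t required_mask,
                             uint32_t any_mask)
{
    if (return_mask == 0 && count_set_bits(required_mask | any_mask) != 1)
        return -EINVAL;
    return 0;
}
```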
After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags.
There is no ordering in the current masks scheme. No mask takes precedence. For a page to get selected, the definitions of all the masks must be fulfilled. You have come up with a good example: what if required_mask = excluded_mask? In this case, no page will fulfill the criteria and hence no page would be selected. It is the user's fault if he doesn't understand the definitions of these masks correctly.
Now thinking about it, I can add an error check which would return an error if any bit is set in both the required and excluded masks. Would you like that? Let's put this check in place. (Previously I'd left it to the user's wisdom not to do this. If he specifies the same bits in both masks, he'll get no addresses out of the syscall.)
This error case is (one of) the problems I propose avoiding. You also need much more text to describe the required/excluded flags interactions and edge cases than saying that a flag must have a value equal to the corresponding bit in ~negated_flags to be matched by the required/anyof masks.
I've found excluded_mask very intuitive compared to negated_mask, which is so difficult to understand that I don't know how to use it correctly. Let's take an example: I want pages which are PAGE_IS_WRITTEN and are not PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or PAGE_IS_SWAPPED. This can be specified as:

    required_mask = PAGE_IS_WRITTEN
    excluded_mask = PAGE_IS_FILE
    anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(a) assume page_flags = 0b1111
    skip page as 0b1111 & 0b0010 = true

(b) assume page_flags = 0b1001
    select page as 0b1001 & 0b0010 = false

It seemed intuitive. Right? How would you achieve the same thing with negated_mask?

    required_mask = PAGE_IS_WRITTEN
    negated_mask = PAGE_IS_FILE
    anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(1) assume page_flags = 0b1111
    tested_flags = 0b1111 ^ 0b0010 = 0b1101

(2) assume page_flags = 0b1001
    tested_flags = 0b1001 ^ 0b0010 = 0b1011

In (1), we wanted to skip pages which have PAGE_IS_FILE set. But negated_mask has just masked it, so the page is still tested for selection and would get selected. That is wrong.

In (2), the PAGE_IS_FILE bit of page_flags was 0 and got changed to 1 (PAGE_IS_FILE) in tested_flags.
I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
    required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
    negated_flags = PAGE_IS_FILE; // flags I want zero
You want PAGE_IS_FILE to be zero and at the same time you are requiring PAGE_IS_FILE. It is confusing. Let's go with the excluded mask; excluded_mask must never have any bit in common with required_mask. Let's stay with this, as it is intuitive and easy to use from the user's perspective. Andrei and Danylo suggested this mask scheme and have use cases for it. Andrei and Danylo, please comment as well.
I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAPPED=1, so:
anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED;
Another case: I want to analyse a process' working set:
required_mask = 0;
negated_flags = PAGE_IS_FILE;
anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN;
-> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0].
To clarify a bit: negated_flags doesn't mask anything: the field inverts values of the flags (marks some "active low", if you consider electronic signal analogy).
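The negated_flags proposal can be sketched in C as follows. This is only an illustration of the semantics described in this thread, not code from the patch: the helper names are hypothetical, the flag values are taken from the uapi hunk of the patch, and treating a zero anyof_mask as "no constraint" is an assumption carried over from the existing implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the patch's include/uapi/linux/fs.h hunk. */
#define PAGE_IS_WRITTEN (UINT64_C(1) << 0)
#define PAGE_IS_FILE    (UINT64_C(1) << 1)
#define PAGE_IS_PRESENT (UINT64_C(1) << 2)
#define PAGE_IS_SWAPPED (UINT64_C(1) << 3)

/*
 * Hypothetical helper: negated_flags inverts the tested value of the
 * listed flags ("active low"); it does not mask them out. The inverted
 * value is used only for matching, never reported back.
 */
bool page_selected_negated(uint64_t page_flags, uint64_t required_mask,
			   uint64_t anyof_mask, uint64_t negated_flags)
{
	uint64_t tested = page_flags ^ negated_flags;

	if ((tested & required_mask) != required_mask)
		return false;
	/* Assumption: an empty anyof_mask places no constraint. */
	if (anyof_mask && !(tested & anyof_mask))
		return false;
	return true;
}

/* The two examples from this thread, expressed with negated_flags. */
void negated_flags_examples(void)
{
	/* WRITTEN=1, FILE=0, and one of PRESENT/SWAPPED. */
	uint64_t required = PAGE_IS_WRITTEN | PAGE_IS_FILE;
	uint64_t anyof = PAGE_IS_PRESENT | PAGE_IS_SWAPPED;
	uint64_t negated = PAGE_IS_FILE;

	assert(!page_selected_negated(0xF, required, anyof, negated)); /* FILE set */
	assert(page_selected_negated(0x9, required, anyof, negated));  /* FILE clear */

	/* Working set: WRITTEN=1 or FILE=0. */
	anyof = PAGE_IS_FILE | PAGE_IS_WRITTEN;
	assert(page_selected_negated(0, 0, anyof, negated));
	assert(!page_selected_negated(PAGE_IS_FILE, 0, anyof, negated));
}
```

With this reading, case (1) above (page_flags = 0b1111) is skipped because the inverted PAGE_IS_FILE bit no longer satisfies required_mask.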
Best Regards Michał Mirosław
On Thu, 23 Feb 2023 at 10:23, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/23/23 1:41 PM, Michał Mirosław wrote:
On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 2/22/23 4:48 PM, Michał Mirosław wrote:
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
[...]
>> BTW, I think I assumed that both conditions (all flags in required_flags and at least one in anyof_flags is present) need to be true for the page to be selected - is this your intention?
> All the masks are optional. If all or any of the 3 masks are specified, the page flags must pass these masks to get selected.
This explanation contradicts in part the introductory paragraph, but this version seems more useful as you can pass all masks zero to have all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me rephrase. All or at least any 1 of the 3 masks (required, any, exclude) must be specified. The return_mask must always be specified. Error is returned if all 3 masks (required, anyof, exclude) are zero or return_mask is zero.
Why do you need those restrictions? I'd guess it is valid to request a list of all pages with zero return_mask - this will return a compact list of used ranges of the virtual address space.
At this time, we support 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE, PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that the user mentions the flags of interest in the return_mask. If the user wants only 1 flag, they specify it. Admittedly, if the user wants only 1 flag, it initially doesn't seem to make sense to mention it in the return_mask. But we want uniformity: if the user wants 2 or more flags returned, return_mask becomes compulsory. So to keep things simple and generic for any number of flags of interest, the return_mask must be specified even if there is only 1 flag of interest.
I'm not sure why do we want uniformity in the case of 1 flag? If a user specifies a single required flag, I'd expect he doesn't need to look at the flags returned as those will duplicate the information from mere presence of a page. A user might also require a single flag, but want all of them returned. Both requests - return 1 flag and return 0 flags would give meaningful output, so why force one way or the other? Allowing two will also enable users to express the intent: they need either just a list of pages, or they need a list with per-page flags - the need would follow from the code structure or other factors.
We can add as much flexibility as people ask for while keeping the code simple. But it is going to be dirty to add an error check which allows return_mask = 0 only if the user has mentioned exactly 1 flag of interest. The following error check would be essential to return deterministic output. Do you think this case is worth supporting, rather than going with the general rule for both the 1-flag and multi-flag cases?
if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1) return error;
Why would you want to add this error check? If a user requires multiple flags but cares only about a list of matching pages, then it would be natural to express this intent as return_mask = 0.
> After taking a while to understand this and compare with already present flag system, `negated flags` is comparatively difficult to understand while already present flags seem easier.
Maybe replacing negated_flags in the API with matched_values = ~negated_flags would make this better?
We compare having to understand XOR vs having to understand ordering of required_flags and excluded_flags.
There is no ordering in the current masks scheme. No mask takes priority. For a page to get selected, the definitions of all the masks must be fulfilled. You have come up with a good example: what if required_mask = exclude_mask? In this case, no page will fulfill the criterion and hence no page would be selected. It is the user's fault for not understanding the definitions of these masks correctly.
Now thinking about it, I can add an error check which would return an error if a bit is set in both the required and excluded masks. Would you like it? Let's put this check in place. (Previously I'd left it to the user's wisdom not to do this. If the same bits are specified in both masks, no addresses come out of the syscall.)
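A minimal sketch of such a check, assuming the flag values from the uapi hunk of the patch. The function name check_scan_masks() is hypothetical, and the flags are widened to 64 bits here so that negating PAGEMAP_BITS_ALL covers the whole __u64 mask.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values as defined in the patch, widened to 64 bits for this sketch. */
#define PAGE_IS_WRITTEN (UINT64_C(1) << 0)
#define PAGE_IS_FILE    (UINT64_C(1) << 1)
#define PAGE_IS_PRESENT (UINT64_C(1) << 2)
#define PAGE_IS_SWAPPED (UINT64_C(1) << 3)
#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
			  PAGE_IS_PRESENT | PAGE_IS_SWAPPED)

/*
 * Hypothetical validation helper: returns 0 if the masks are usable,
 * -EINVAL otherwise.
 */
int check_scan_masks(uint64_t required_mask, uint64_t anyof_mask,
		     uint64_t excluded_mask, uint64_t return_mask)
{
	/* Reject bits that are not defined for this interface. */
	if ((required_mask | anyof_mask | excluded_mask | return_mask) &
	    ~PAGEMAP_BITS_ALL)
		return -EINVAL;

	/* A flag cannot be both required and excluded: nothing would match. */
	if (required_mask & excluded_mask)
		return -EINVAL;

	return 0;
}

/* Example: the conflicting request discussed above is rejected. */
void check_scan_masks_example(void)
{
	assert(check_scan_masks(PAGE_IS_FILE, 0, PAGE_IS_FILE,
				PAGE_IS_WRITTEN) == -EINVAL);
}
```

Rejecting the overlap up front turns a silently empty result into an explicit error for the caller.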
This error case is (one of) the problems I propose avoiding. You also need much more text to describe the required/excluded flags interactions and edge cases than saying that a flag must have a value equal to the corresponding bit in ~negated_flags to be matched by the required/anyof masks.
I've found excluded_mask very intuitive compared to negated_mask, which is so difficult to understand that I don't know how to use it correctly. Let's take an example: I want pages which are PAGE_IS_WRITTEN and are not PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or PAGE_IS_SWAPPED. This can be specified as:
required_mask = PAGE_IS_WRITTEN
excluded_mask = PAGE_IS_FILE
anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED
(a) assume page_flags = 0b1111 skip page as 0b1111 & 0b0010 = true
(b) assume page_flags = 0b1001 select page as 0b1001 & 0b0010 = false
It seems intuitive, right? How would you achieve the same thing with negated_mask?
required_mask = PAGE_IS_WRITTEN
negated_mask = PAGE_IS_FILE
anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED
(1) assume page_flags = 0b1111 tested_flags = 0b1111 ^ 0b0010 = 0b1101
(2) assume page_flags = 0b1001 tested_flags = 0b1001 ^ 0b0010 = 0b1011
In (1), we wanted to skip pages which have PAGE_IS_FILE set. But negated_mask has just masked it and page is still getting tested if it should be selected and it would get selected. It is wrong.
In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or PAGE_IS_FILE in tested_flags.
I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
negated_flags = PAGE_IS_FILE; // flags I want zero
You want PAGE_IS_FILE to be zero and at the same time you are requiring PAGE_IS_FILE. It is confusing.
Ok, I believe the misunderstanding comes from the naming. I "require" the flag to be a particular value - hence include it in "required_flags" and specify the required value in ~negated_flags. You "require" the flag to be set (equal 1) and so include it in "required_flags" and you "require" the flag to be clear (equal to 0) so include it in "excluded_flags". Both approaches are correct, but I would not consider one "easier" than the other. The former is more general, though - makes any_of also able to match on flags cleared and removes the possibility of a conflicting case of a flag present in both sets.
Maybe considered_flags or matched_flags then would make the field better understandable?
Best Regards Michał Mirosław
On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław emmir@google.com wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
responsibilities. I suggest to rework that to:
- negated_flags: page flags which are to be negated before applying
the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:
Page Flag   negated_flags   Result
    0             0            0
    0             1            1
    1             0            1
    1             1            0
If a page flag is 0 and negated_flag is 1, the result would be 1, which has changed the page flag. It isn't making sense to me. Why is the page flag bit being flipped?
When Andrei had proposed these masks, they seemed like a fancy way of filtering inside the kernel and it was straightforward to understand. These masks would help his use cases for CRIU, so I'd included them. Can you please elaborate on the purpose of the negation?
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call). (Note: the XOR is applied only to the value of the flags for the purpose of testing page-selection criteria.)
Michał,
Your API isn't much different from the current one, but it requires a bit more brain activity for understanding.
The current set of masks can be easily translated to the new one:
negated_flags = excluded_flags
required_flags_new = excluded_flags | required_flags
As for invalid values, I think it is an advantage of the current API. I mean we can easily detect invalid values and return EINVAL. With your API, such mistakes will be undetectable.
As for priorities, I don't see this problem here, if I'm not missing something.
We can rewrite the code this way:
```
if (required_mask && ((page_flags & required_mask) != required_mask))
	skip page;
if (anyof_mask && !(page_flags & anyof_mask))
	skip page;
if (page_flags & excluded_mask)
	skip page;
```
I think the result is always the same no matter in what order each mask is applied.
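The order-independence claim can be checked with a small C sketch. The helper names below are hypothetical and the flag values are taken from the uapi hunk of the patch; the second function applies the same three tests in the opposite order, and the counter compares both over every 4-bit combination of page flags and masks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the patch's include/uapi/linux/fs.h hunk. */
#define PAGE_IS_WRITTEN (UINT64_C(1) << 0)
#define PAGE_IS_FILE    (UINT64_C(1) << 1)
#define PAGE_IS_PRESENT (UINT64_C(1) << 2)
#define PAGE_IS_SWAPPED (UINT64_C(1) << 3)

/* Tests applied in the order required, anyof, excluded. */
bool page_selected(uint64_t page_flags, uint64_t required_mask,
		   uint64_t anyof_mask, uint64_t excluded_mask)
{
	if (required_mask && (page_flags & required_mask) != required_mask)
		return false;
	if (anyof_mask && !(page_flags & anyof_mask))
		return false;
	if (page_flags & excluded_mask)
		return false;
	return true;
}

/* Same tests in the opposite order (hypothetical variant). */
bool page_selected_reversed(uint64_t page_flags, uint64_t required_mask,
			    uint64_t anyof_mask, uint64_t excluded_mask)
{
	if (page_flags & excluded_mask)
		return false;
	if (anyof_mask && !(page_flags & anyof_mask))
		return false;
	if (required_mask && (page_flags & required_mask) != required_mask)
		return false;
	return true;
}

/* Count disagreements over all 4-bit values of flags and masks. */
unsigned int count_order_mismatches(void)
{
	unsigned int mismatches = 0;

	for (uint64_t f = 0; f < 16; f++)
		for (uint64_t r = 0; r < 16; r++)
			for (uint64_t a = 0; a < 16; a++)
				for (uint64_t e = 0; e < 16; e++)
					if (page_selected(f, r, a, e) !=
					    page_selected_reversed(f, r, a, e))
						mismatches++;
	return mismatches;
}

/* Examples (a) and (b) from earlier in this thread. */
void page_selected_examples(void)
{
	uint64_t anyof = PAGE_IS_PRESENT | PAGE_IS_SWAPPED;

	assert(!page_selected(0xF, PAGE_IS_WRITTEN, anyof, PAGE_IS_FILE));
	assert(page_selected(0x9, PAGE_IS_WRITTEN, anyof, PAGE_IS_FILE));
}
```

Since each test can only reject a page and none of them modifies page_flags, the page is selected exactly when all three tests pass, whatever the order.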
Thanks, Andrei
On Fri, 24 Feb 2023 at 03:20, Andrei Vagin avagin@gmail.com wrote:
On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław emmir@google.com wrote:
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Michał,
Thank you so much for comment!
On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
For the page-selection mechanism, currently required_mask and excluded_mask have conflicting
They are opposite of each other: All the set bits in required_mask must be set for the page to be selected. All the set bits in excluded_mask must _not_ be set for the page to be selected.
responsibilities. I suggest to rework that to:
- negated_flags: page flags which are to be negated before applying
the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at the truth table:
Page Flag   negated_flags   Result
    0             0            0
    0             1            1
    1             0            1
    1             1            0
If a page flag is 0 and negated_flag is 1, the result would be 1, which has changed the page flag. It isn't making sense to me. Why is the page flag bit being flipped?
When Andrei had proposed these masks, they seemed like a fancy way of filtering inside the kernel and it was straightforward to understand. These masks would help his use cases for CRIU, so I'd included them. Can you please elaborate on the purpose of the negation?
The XOR is a way to invert the tested value of a flag (from positive to negative and the other way) without having the API with invalid values (with required_flags and excluded_flags you need to define a rule about what happens if a flag is present in both of the masks - either prioritise one mask over the other or reject the call). (Note: the XOR is applied only to the value of the flags for the purpose of testing page-selection criteria.)
Michał,
Your API isn't much different from the current one, but it requires a bit more brain activity for understanding.
The current set of masks can be easily translated to the new one:
negated_flags = excluded_flags
required_flags_new = excluded_flags | required_flags
As for invalid values, I think it is an advantage of the current API. I mean we can easily detect invalid values and return EINVAL. With your API, such mistakes will be undetectable.
As for priorities, I don't see this problem here, if I'm not missing something.
We can rewrite the code this way:
if (required_mask && ((page_flags & required_mask) != required_mask))
	skip page;
if (anyof_mask && !(page_flags & anyof_mask))
	skip page;
if (page_flags & excluded_mask)
	skip page;
I think the result is always the same no matter in what order each mask is applied.
Hi,
I would not want the discussion to wander into easier/harder territory, as that highly depends on the experience one has. What I'm arguing about is the consistency of the API. Let me expand a bit on that.
We have two ways to look at the page_flags:
A. the field represents a *set of elements* (tags, attributes) present on the page;
B. the field represents a bitfield (structure; a fixed set of boolean fields having a value of 0 or 1).
From A follows the include/exclude way of API design for matching the flags, and from B the matched mask (which flags to check) + value set (what values to require).
My argument is that B is consistent with how the flags are used in the kernel: we don't have operations that add or remove flags, but we have operations that set or change their value.
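View B can be sketched as a mask plus a value set. The struct and function names below are hypothetical and only illustrate how a request written in the include/exclude style of view A maps onto it; the flag values come from the uapi hunk of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in the patch's include/uapi/linux/fs.h hunk. */
#define PAGE_IS_WRITTEN (UINT64_C(1) << 0)
#define PAGE_IS_FILE    (UINT64_C(1) << 1)
#define PAGE_IS_PRESENT (UINT64_C(1) << 2)
#define PAGE_IS_SWAPPED (UINT64_C(1) << 3)

/* View B: which flags to check, and what value each must have. */
struct flag_match {
	uint64_t mask;	/* flags that are examined */
	uint64_t value;	/* required value of every examined flag */
};

/*
 * Translate view A (required set, excluded set) into view B.
 * A bit present in both sets has no meaning in view B, so this
 * sketch insists that the caller has already ruled that out.
 */
struct flag_match flag_match_from_sets(uint64_t required, uint64_t excluded)
{
	struct flag_match m;

	assert((required & excluded) == 0);
	m.mask = required | excluded;
	m.value = required;
	return m;
}

/* A page matches when every examined flag has the requested value. */
bool flags_match(uint64_t page_flags, struct flag_match m)
{
	return (page_flags & m.mask) == (m.value & m.mask);
}
```

In this form a flag is either examined or not, so a request can never ask for the same flag to be both 1 and 0.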
Best Regards Michał Mirosław
On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
To get information about which pages have been written-to and/or to write protect the pages, the following steps must be performed first, in order:
- The userfaultfd file descriptor is created with userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through UFFDIO_REGISTER IOCTL.
Then any part of the registered memory or the whole memory region can be write protected using the UFFDIO_WRITEPROTECT IOCTL or PAGEMAP_SCAN IOCTL.
struct pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer of struct page_region array and size is specified as vec and vec_len.
- The optional maximum requested pages are specified in the max_pages.
- The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE is the only added flag at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
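As an illustration of the fields listed above, the following userspace sketch fills the argument for an atomic "get written-to pages and write protect them again" request. The struct layouts and flag values mirror this revision of the patch and are redefined locally because they come from a proposed header; make_written_scan() is a hypothetical helper, and the ioctl call itself is only mentioned in a comment since it needs a kernel carrying this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the definitions proposed in include/uapi/linux/fs.h. */
#define PAGE_IS_WRITTEN   (1 << 0)
#define PAGEMAP_WP_ENGAGE (1 << 0)

struct page_region {
	uint64_t start;
	uint64_t len;
	uint64_t bitmap;
};

struct pagemap_scan_arg {
	uint64_t start;
	uint64_t len;
	uint64_t vec;
	uint64_t vec_len;
	uint32_t max_pages;
	uint32_t flags;
	uint64_t required_mask;
	uint64_t anyof_mask;
	uint64_t excluded_mask;
	uint64_t return_mask;
};

/*
 * Hypothetical helper: build a request that reports pages which have
 * been written-to and write protects them in the same call.
 */
struct pagemap_scan_arg make_written_scan(void *addr, uint64_t len,
					  struct page_region *vec,
					  uint64_t vec_len)
{
	struct pagemap_scan_arg arg;

	/* A get operation needs an output buffer with at least one slot. */
	assert(vec != NULL && vec_len > 0);

	memset(&arg, 0, sizeof(arg));
	arg.start = (uint64_t)(uintptr_t)addr;
	arg.len = len;
	arg.vec = (uint64_t)(uintptr_t)vec;
	arg.vec_len = vec_len;
	arg.max_pages = 0;			/* optional limit, unused here */
	arg.flags = PAGEMAP_WP_ENGAGE;
	arg.required_mask = PAGE_IS_WRITTEN;	/* only written-to pages */
	arg.return_mask = PAGE_IS_WRITTEN;	/* must be non-zero for a get */

	/*
	 * The caller would then issue, on a kernel with this series:
	 *	ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
	 */
	return arg;
}
```

Only PAGE_IS_WRITTEN is used in the masks because this revision rejects the other flags in required_mask and anyof_mask when PAGEMAP_WP_ENGAGE is set.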
This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi.
I was not involved before, so I am not commenting on the API and code to avoid making unhelpful noise.
Having said that, some things in the code seem quite dirty and make the code hard to understand.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message
Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code
Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl
Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality
Changes in v5:
- Remove tlb flushing even for clear operation
Changes in v4:
- Update the interface and implementation
Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more error checking
Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
 fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  50 +++++++
 2 files changed, 340 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..c6bde19d63d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>

 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif

+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
+	    (pte_swp_uffd_wp_any(pte)))
+		return true;
+	return false;
+}
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
+		return true;
+	return false;
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }

+#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
+#define IS_GET_OP(a)			(a->vec)
+#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
I think that in general it is better to have an inline function instead of a macro when possible, as it is clearer and checks types. Anyhow, IMHO most of these macros would be better open-coded.
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a)				\
+	((a->required_mask & PAGE_IS_WRITTEN) ||	\
+	 (a->anyof_mask & PAGE_IS_WRITTEN))
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region prev;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
+		return -EPERM;
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+	return 0;
+}
+
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+				      struct pagemap_scan_private *p, unsigned long addr,
+				      unsigned int len)
+{
+	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
+	bool cpy = true;
+	struct page_region *prev = &p->prev;
+
+	if (HAS_NO_SPACE(p))
+		return -ENOSPC;
+	if (p->max_pages && p->found_pages + len >= p->max_pages)
+		len = p->max_pages - p->found_pages;
+	if (!len)
+		return -EINVAL;
+	if (p->required_mask)
+		cpy = ((p->required_mask & cur) == p->required_mask);
+	if (cpy && p->anyof_mask)
+		cpy = (p->anyof_mask & cur);
+	if (cpy && p->excluded_mask)
+		cpy = !(p->excluded_mask & cur);
+	bitmap = cur & p->return_mask;
+	if (cpy && bitmap) {
+		if ((prev->len) && (prev->bitmap == bitmap) &&
+		    (prev->start + prev->len * PAGE_SIZE == addr)) {
+			prev->len += len;
The use of "len" both for bytes and pages is very confusing. Consider changing the name to n_pages or something similar.
+			p->found_pages += len;
+		} else if (p->vec_index < p->vec_len) {
+			if (prev->len) {
+				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
+				p->vec_index++;
+			}
+			prev->start = addr;
+			prev->len = len;
+			prev->bitmap = bitmap;
+			p->found_pages += len;
+		} else {
+			return -ENOSPC;
+		}
+	}
+	return 0;
+}
+
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
+				     unsigned long *vec_index)
+{
+	struct page_region *prev = &p->prev;
+
+	if (prev->len) {
+		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
+			return -EFAULT;
+		p->vec_index++;
+		(*vec_index)++;
+		prev->len = 0;
+	}
+	return 0;
+}
+
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+					 unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		bool pmd_wt;
+
+		pmd_wt = !is_pmd_uffd_wp(*pmd);
+		/*
+		 * Break huge page into small pages if operation needs to be performed is
+		 * on a portion of the huge page.
+		 */
+		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
I think that such gotos are really confusing and should be avoided. Using 'else' could have easily prevented the need for the goto. It is not the best solution though, since I think it would have been better to invert the conditions.
+		}
+		if (IS_GET_OP(p))
+			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
+						  is_swap_pmd(*pmd), p, start,
+						  (end - start)/PAGE_SIZE);
+		spin_unlock(ptl);
+		if (!ret) {
+			if (pmd_wt && IS_WP_ENGAGE_OP(p))
+				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
+		}
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	if (IS_GET_OP(p)) {
+		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
+			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
+						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
+			if (ret)
+				break;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
We might not have entered the loop, in which case pte - 1 would be wrong.
+	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
What does 'addr - start' mean? If you want to say they are not equal, why not say so?
+		uffd_wp_range(walk->mm, vma, start, addr - start, true);
+	cond_resched();
+	return ret;
+}
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
+				 struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (vma)
+		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+					  (end - addr)/PAGE_SIZE);
+	return ret;
+}
+
+/* No hugetlb support is present. */
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+};
+
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
+{
+	unsigned long empty_slots, vec_index = 0;
+	unsigned long __user start, end;
The whole point of the __user attribute is that it is applied to pointers.
+	unsigned long __start, __end;
I think such names do not convey sufficient information.
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	int ret = 0;
+
+	start = (unsigned long)untagged_addr(arg->start);
+	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
+
+	/* Validate memory ranges */
+	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
+		return -EINVAL;
+	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
+	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
+		return -EINVAL;
+
+	/* Detect illegal flags and masks */
+	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
+	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
+	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
Using bitwise OR to check
(arg->required_mask | arg->anyof_mask | arg->excluded_mask | arg->return_mask) & ~PAGEMAP_BITS_ALL
would have been much cleaner, IMHO.
+		return -EINVAL;
+	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
+	    !arg->return_mask))
+		return -EINVAL;
+	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
+	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
+	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
+		return -EINVAL;
+
+	end = start + arg->len;
+	p.max_pages = arg->max_pages;
+	p.found_pages = 0;
+	p.flags = arg->flags;
+	p.required_mask = arg->required_mask;
+	p.anyof_mask = arg->anyof_mask;
+	p.excluded_mask = arg->excluded_mask;
+	p.return_mask = arg->return_mask;
+	p.prev.len = 0;
+	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+	if (IS_GET_OP(arg)) {
+		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
+		if (!p.vec)
+			return -ENOMEM;
+	} else {
+		p.vec = NULL;
I find it cleaner to initialize 'p.vec = NULL' unconditionally before IS_GET_OP() check.
+	}
+	__start = __end = start;
+	while (!ret && __end < end) {
+		p.vec_index = 0;
+		empty_slots = arg->vec_len - vec_index;
+		if (p.vec_len > empty_slots)
+			p.vec_len = empty_slots;
+
+		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (__end > end)
+			__end = end;
Easier to understand using min().
+		mmap_read_lock(mm);
+		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+		if (!(!ret || ret == -ENOSPC))
Double negations complicate things unnecessarily.
And if you already "break" on ret, why do you check the condition in the while loop?
+			goto free_data;
+
+		__start = __end;
+		if (IS_GET_OP(arg) && p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index * sizeof(struct page_region))) {
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+	ret = export_prev_to_out(&p, vec, &vec_index);
+	if (!ret)
+		ret = vec_index;
+free_data:
+	if (IS_GET_OP(arg))
+		kfree(p.vec);
Just call it unconditionally.
+	return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+	struct pagemap_scan_arg argument;
+
+	if (cmd == PAGEMAP_SCAN) {
+		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
+			return -EFAULT;
+		return do_pagemap_cmd(mm, &argument);
+	}
+	return -EINVAL;
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)

+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
These names are way too generic and are likely to be misused for the wrong purpose. The "_IS_" part seems confusing as well. So I think the naming needs to be fixed and some new type (using typedef) or enum should be introduced to hold these flags. I understand it is part of uapi and it is less common there, but it is not unheard of and does make things clearer.
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
I presume in bytes. Would be useful to mention.
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
+
 #endif /* _UAPI_LINUX_FS_H */
Hello Nadav,
Thank you so much for reviewing!
On 2/19/23 6:52 PM, Nadav Amit wrote:
On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
To get information about which pages have been written-to and/or to write protect the pages, the following steps must be performed first, in order:
- The userfaultfd file descriptor is created with userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode through UFFDIO_REGISTER IOCTL.
Then any part of the registered memory or the whole memory region can be write protected using the UFFDIO_WRITEPROTECT IOCTL or PAGEMAP_SCAN IOCTL.
struct pagemap_scan_arg is used as the argument of the IOCTL. In this struct:
- The range is specified through start and len.
- The output buffer of struct page_region array and size is specified as
vec and vec_len.
- The optional maximum requested pages are specified in the max_pages.
- The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
is the only added flag at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask and return_mask.
This IOCTL can be extended to get information about more PTE bits. This IOCTL doesn't support hugetlbs at the moment. No information about hugetlb can be obtained. This patch has evolved from a basic patch from Gabriel Krisman Bertazi.
I was not involved before, so I am not commenting on the API and code to avoid making unhelpful noise.
Having said that, some things in the code seem quite dirty and make the code hard to understand.
There is a new proposal about the flags in the interface. I'll include you in that discussion.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message
Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code
Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl
Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality
Changes in v5:
- Remove tlb flushing even for clear operation
Changes in v4:
- Update the interface and implementation
Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
error checking
Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
 fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  50 +++++++
 2 files changed, 340 insertions(+)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..c6bde19d63d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>

 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif

+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
+	    (pte_swp_uffd_wp_any(pte)))
+		return true;
+	return false;
+}
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
+		return true;
+	return false;
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }

+#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
+#define IS_GET_OP(a)			(a->vec)
+#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
I think that in general it is better to have an inline function instead of macros when possible, as it is clearer and checks types. Anyhow, IMHO most of these macros are better open-coded.
I'll update most of these in next version.
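As a rough sketch of what the inline-function variant could look like, assuming a reduced stand-in for struct pagemap_scan_private (the struct and function names below are hypothetical, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEMAP_WP_ENGAGE	(1 << 0)

/* Reduced stand-in for struct pagemap_scan_private: only the fields read here. */
struct scan_state {
	void *vec;
	unsigned int max_pages, found_pages, flags;
};

static inline bool is_wp_engage_op(const struct scan_state *p)
{
	return (p->flags & PAGEMAP_WP_ENGAGE) != 0;
}

static inline bool is_get_op(const struct scan_state *p)
{
	return p->vec != NULL;
}

static inline bool has_no_space(const struct scan_state *p)
{
	/* The walk clamps its counts, so found_pages never exceeds a set limit. */
	assert(!p->max_pages || p->found_pages <= p->max_pages);
	return p->max_pages && p->found_pages == p->max_pages;
}
```

One thing the typed version makes visible: the macros in the patch are applied both to struct pagemap_scan_arg and to struct pagemap_scan_private, which works only because macros do not check types. With inline functions that dual use has to be resolved explicitly, for example by copying the argument into the private struct first.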
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \ + (wt | file << 1 | present << 2 | swap << 3) +#define IS_WT_REQUIRED(a) \ + ((a->required_mask & PAGE_IS_WRITTEN) || \ + (a->anyof_mask & PAGE_IS_WRITTEN))
+struct pagemap_scan_private { + struct page_region *vec; + struct page_region prev; + unsigned long vec_len, vec_index; + unsigned int max_pages, found_pages, flags; + unsigned long required_mask, anyof_mask, excluded_mask, return_mask; +};
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) +{ + struct pagemap_scan_private *p = walk->private; + struct vm_area_struct *vma = walk->vma;
+ if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma)) + return -EPERM; + if (vma->vm_flags & VM_PFNMAP) + return 1; + return 0; +}
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, + struct pagemap_scan_private *p, unsigned long addr, + unsigned int len) +{ + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap); + bool cpy = true; + struct page_region *prev = &p->prev;
+ if (HAS_NO_SPACE(p)) + return -ENOSPC;
+ if (p->max_pages && p->found_pages + len >= p->max_pages) + len = p->max_pages - p->found_pages; + if (!len) + return -EINVAL;
+ if (p->required_mask) + cpy = ((p->required_mask & cur) == p->required_mask); + if (cpy && p->anyof_mask) + cpy = (p->anyof_mask & cur); + if (cpy && p->excluded_mask) + cpy = !(p->excluded_mask & cur); + bitmap = cur & p->return_mask; + if (cpy && bitmap) { + if ((prev->len) && (prev->bitmap == bitmap) && + (prev->start + prev->len * PAGE_SIZE == addr)) { + prev->len += len;
The use of "len" both for bytes and pages is very confusing. Consider changing the name to n_pages or something similar.
Will update in next version.
+ p->found_pages += len; + } else if (p->vec_index < p->vec_len) { + if (prev->len) { + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region)); + p->vec_index++; + } + prev->start = addr; + prev->len = len; + prev->bitmap = bitmap; + p->found_pages += len; + } else { + return -ENOSPC; + } + } + return 0; +}
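To make the filtering rules in pagemap_scan_output() easier to follow, here is a minimal standalone sketch of the same mask logic. The bit values are copied from the uapi hunk; struct scan_masks and page_report_bits() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_IS_WRITTEN		(1 << 0)
#define PAGE_IS_FILE		(1 << 1)
#define PAGE_IS_PRESENT		(1 << 2)
#define PAGE_IS_SWAPPED		(1 << 3)
#define PAGEMAP_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE | \
				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)

struct scan_masks {
	uint64_t required, anyof, excluded, ret;
};

/*
 * Return the bits to report for a page whose state is 'cur', or 0 if the
 * page is filtered out. An empty mask means "no constraint".
 */
static uint64_t page_report_bits(const struct scan_masks *m, uint64_t cur)
{
	assert((cur & ~(uint64_t)PAGEMAP_BITS_ALL) == 0);

	if (m->required && (cur & m->required) != m->required)
		return 0;	/* not all required bits are set */
	if (m->anyof && !(cur & m->anyof))
		return 0;	/* none of the "any of" bits is set */
	if (m->excluded && (cur & m->excluded))
		return 0;	/* an excluded bit is set */
	return cur & m->ret;	/* may still be 0: nothing to report */
}
```

Note that, as in the patch, a page that passes all three filters is still skipped when none of its bits are in return_mask.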
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec, + unsigned long *vec_index) +{ + struct page_region *prev = &p->prev;
+ if (prev->len) { + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region))) + return -EFAULT; + p->vec_index++; + (*vec_index)++; + prev->len = 0; + } + return 0; +}
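The "pending region" pattern shared by pagemap_scan_output() and export_prev_to_out() can be sketched in plain C as follows. This is a simplified illustration, not the patch's code: it uses n_pages for page counts as suggested in the review, assumes a 4 KiB page size, and region_add()/region_flush() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

struct page_region {
	uint64_t start;
	uint64_t len;		/* in pages */
	uint64_t bitmap;
};

struct region_out {
	struct page_region *vec;
	uint64_t vec_len, vec_index;
	struct page_region prev;	/* pending region, not yet in vec */
};

/* Extend the pending region if the new run is adjacent and has the same bits. */
static int region_add(struct region_out *o, uint64_t addr, uint64_t n_pages,
		      uint64_t bitmap)
{
	struct page_region *prev = &o->prev;

	assert(n_pages > 0);
	if (prev->len && prev->bitmap == bitmap &&
	    prev->start + prev->len * SKETCH_PAGE_SIZE == addr) {
		prev->len += n_pages;
		return 0;
	}
	if (prev->len) {
		if (o->vec_index >= o->vec_len)
			return -ENOSPC;
		o->vec[o->vec_index++] = *prev;
	}
	prev->start = addr;
	prev->len = n_pages;
	prev->bitmap = bitmap;
	return 0;
}

/* Write out the pending region, if any, once the walk is finished. */
static int region_flush(struct region_out *o)
{
	if (o->prev.len) {
		if (o->vec_index >= o->vec_len)
			return -ENOSPC;
		o->vec[o->vec_index++] = o->prev;
		o->prev.len = 0;
	}
	return 0;
}
```

Keeping the last region pending is what allows adjacent pages with identical flags to be reported as one entry, at the cost of needing the final flush.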
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, + unsigned long end, struct mm_walk *walk) +{ + struct pagemap_scan_private *p = walk->private; + struct vm_area_struct *vma = walk->vma; + unsigned long addr = end; + spinlock_t *ptl; + int ret = 0; + pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + bool pmd_wt;
+ pmd_wt = !is_pmd_uffd_wp(*pmd); + /* + * Break huge page into small pages if operation needs to be performed is + * on a portion of the huge page. + */ + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) { + spin_unlock(ptl); + split_huge_pmd(vma, pmd, start); + goto process_smaller_pages;
I think that such goto's are really confusing and should be avoided. Using 'else' could have easily prevented the need for goto. It is not the best solution though, since I think it would have been better to invert the conditions.
Yeah, else can be used here. But then we'll have to add a tab to all the code after adding else. We already have so many tabs and very little space to write code. Not sure which is better.
+ } + if (IS_GET_OP(p)) + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd), + is_swap_pmd(*pmd), p, start, + (end - start)/PAGE_SIZE); + spin_unlock(ptl); + if (!ret) { + if (pmd_wt && IS_WP_ENGAGE_OP(p)) + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true); + } + return ret; + } +process_smaller_pages: + if (pmd_trans_unstable(pmd)) + return 0; +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); + if (IS_GET_OP(p)) { + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) { + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file, + pte_present(*pte), is_swap_pte(*pte), p, addr, 1); + if (ret) + break; + } + } + pte_unmap_unlock(pte - 1, ptl);
We might not have entered the loop, in which case pte - 1 would be wrong.
+ if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
What does 'addr - start' mean? If you want to say they are not equal, why not say so?
This has been revamped in the next version.
+ uffd_wp_range(walk->mm, vma, start, addr - start, true);
+ cond_resched(); + return ret; +}
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth, + struct mm_walk *walk) +{ + struct pagemap_scan_private *p = walk->private; + struct vm_area_struct *vma = walk->vma; + int ret = 0;
+ if (vma) + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr, + (end - addr)/PAGE_SIZE); + return ret; +}
+/* No hugetlb support is present. */ +static const struct mm_walk_ops pagemap_scan_ops = { + .test_walk = pagemap_scan_test_walk, + .pmd_entry = pagemap_scan_pmd_entry, + .pte_hole = pagemap_scan_pte_hole, +};
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg) +{ + unsigned long empty_slots, vec_index = 0; + unsigned long __user start, end;
The whole point of __user (attribute) is to be assigned to pointers.
I'll remove it.
+ unsigned long __start, __end;
I think such names do not convey sufficient information.
I'll update it.
+ struct page_region __user *vec; + struct pagemap_scan_private p; + int ret = 0;
+ start = (unsigned long)untagged_addr(arg->start); + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
+ /* Validate memory ranges */ + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len))) + return -EINVAL; + if (IS_GET_OP(arg) && ((arg->vec_len == 0) || + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region))))) + return -EINVAL;
+ /* Detect illegal flags and masks */ + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) || + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) || + (arg->return_mask & ~PAGEMAP_BITS_ALL))
Using bitwise or to check
(arg->required_mask | arg->anyof_mask | arg->excluded_mask | arg->return_mask) & ~PAGEMAP_BITS_ALL
Would have been much cleaner, IMHO.
I'll update it.
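A minimal standalone sketch of the combined check suggested above, with the bit values copied from the uapi hunk (masks_valid() is a hypothetical name):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_IS_WRITTEN		(1 << 0)
#define PAGE_IS_FILE		(1 << 1)
#define PAGE_IS_PRESENT		(1 << 2)
#define PAGE_IS_SWAPPED		(1 << 3)
#define PAGEMAP_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE | \
				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)

static_assert(PAGEMAP_BITS_ALL == 0xF, "this version defines four page bits");

/* OR the masks together once, then test for any bit outside the known set. */
static bool masks_valid(uint64_t required, uint64_t anyof, uint64_t excluded,
			uint64_t ret)
{
	return !((required | anyof | excluded | ret) &
		 ~(uint64_t)PAGEMAP_BITS_ALL);
}
```

The result is the same as testing each mask separately, since an unknown bit in any one mask survives the OR.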
+ return -EINVAL; + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) || + !arg->return_mask)) + return -EINVAL; + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */ + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) || + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS))) + return -EINVAL;
+ end = start + arg->len; + p.max_pages = arg->max_pages; + p.found_pages = 0; + p.flags = arg->flags; + p.required_mask = arg->required_mask; + p.anyof_mask = arg->anyof_mask; + p.excluded_mask = arg->excluded_mask; + p.return_mask = arg->return_mask; + p.prev.len = 0; + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+ if (IS_GET_OP(arg)) { + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL); + if (!p.vec) + return -ENOMEM; + } else { + p.vec = NULL;
I find it cleaner to initialize 'p.vec = NULL' unconditionally before IS_GET_OP() check.
It'll get updated.
+ } + __start = __end = start; + while (!ret && __end < end) { + p.vec_index = 0; + empty_slots = arg->vec_len - vec_index; + if (p.vec_len > empty_slots) + p.vec_len = empty_slots;
+ __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; + if (__end > end) + __end = end;
Easier to understand using min().
Will update.
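For reference, the chunk computation written with a min-style helper might look like the sketch below. PAGEMAP_WALK_SIZE is PMD_SIZE in fs/proc/task_mmu.c; the 2 MiB value here is an assumption (x86-64 with 4 KiB pages), and min_u64() and next_walk_end() are hypothetical userspace stand-ins for the kernel's min() and the open-coded logic.

```c
#include <assert.h>
#include <stdint.h>

#define WALK_SIZE	0x200000ULL		/* assumption: PMD_SIZE of 2 MiB */
#define WALK_MASK	(~(WALK_SIZE - 1))

static inline uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* End of the next walk chunk: the next WALK_SIZE boundary, clamped to 'end'. */
static uint64_t next_walk_end(uint64_t cur, uint64_t end)
{
	assert(cur < end);
	return min_u64((cur + WALK_SIZE) & WALK_MASK, end);
}
```

With these values, an unaligned first chunk stops at the next 2 MiB boundary, later chunks are whole 2 MiB blocks, and the last one is cut at end.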
+ mmap_read_lock(mm); + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p); + mmap_read_unlock(mm); + if (!(!ret || ret == -ENOSPC))
Double negations complicate things unnecessarily.
And if you already "break" on ret, why do you check the condition in the while loop?
Ohh, good catch.
+ goto free_data;
+ __start = __end; + if (IS_GET_OP(arg) && p.vec_index) { + if (copy_to_user(&vec[vec_index], p.vec, + p.vec_index * sizeof(struct page_region))) { + ret = -EFAULT; + goto free_data; + } + vec_index += p.vec_index; + } + } + ret = export_prev_to_out(&p, vec, &vec_index); + if (!ret) + ret = vec_index; +free_data: + if (IS_GET_OP(arg)) + kfree(p.vec);
Just call it unconditionally.
I didn't know it. I'll do it.
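The two cleanups discussed here (initialize the pointer unconditionally, free it unconditionally) combine into the shape sketched below. This is a userspace analogue using calloc()/free() instead of kmalloc_array()/kfree(); scan_with_buffer() and struct region are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct region {
	unsigned long start, len, bitmap;
};

/* Userspace stand-in for the allocation and cleanup shape of do_pagemap_cmd(). */
static int scan_with_buffer(size_t vec_len, bool want_output)
{
	struct region *vec = NULL;	/* initialized unconditionally */
	int ret = 0;

	assert(vec_len > 0);
	if (want_output) {
		vec = calloc(vec_len, sizeof(*vec));
		if (!vec)
			return -ENOMEM;
	}

	/* ... the page walk would go here ... */

	free(vec);	/* free(NULL) is a no-op, as is kfree(NULL) in the kernel */
	return ret;
}
```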
+ return ret; +}
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg; + struct mm_struct *mm = file->private_data; + struct pagemap_scan_arg argument;
+ if (cmd == PAGEMAP_SCAN) { + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg))) + return -EFAULT; + return do_pagemap_cmd(mm, &argument); + } + return -EINVAL; +}
const struct file_operations proc_pagemap_operations = { .llseek = mem_lseek, /* borrow this */ .read = pagemap_read, .open = pagemap_open, .release = pagemap_release, + .unlocked_ioctl = pagemap_scan_ioctl, + .compat_ioctl = pagemap_scan_ioctl, }; #endif /* CONFIG_PROC_PAGE_MONITOR */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index b7b56871029c..1ae9a8684b48 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND) +/* Pagemap ioctl */ +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ +#define PAGE_IS_WRITTEN (1 << 0) +#define PAGE_IS_FILE (1 << 1) +#define PAGE_IS_PRESENT (1 << 2) +#define PAGE_IS_SWAPPED (1 << 3)
These names are way too generic and are likely to be misused for the wrong purpose. The "_IS_" part seems confusing as well. So I think the naming needs to be fixed and some new type (using typedef) or enum should be introduced to hold these flags. I understand it is part of uapi and it is less common there, but it is not unheard of and does make things clearer.
Do you think PM_SCAN_PAGE_IS_* work here?
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start: Start of the region
+ * @len: Length of the region
+ * bitmap: Bits sets for the region
+ */
+struct page_region { + __u64 start; + __u64 len;
I presume in bytes. Would be useful to mention.
Length of region in pages.
+ __u64 bitmap; +};
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start: Starting address of the region
+ * @len: Length of the region (All the pages in this length are
+ *       included)
+ * @vec: Address of page_region struct array for output
+ * @vec_len: Length of the page_region struct array
+ * @max_pages: Optional max return pages
+ * @flags: Flags for the IOCTL
+ * @required_mask: Required mask - All of these bits have to be set
+ *                 in the PTE
+ * @anyof_mask: Any mask - Any of these bits are set in the PTE
+ * @excluded_mask: Exclude mask - None of these bits are set in the PTE
+ * @return_mask: Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg { + __u64 start; + __u64 len; + __u64 vec; + __u64 vec_len; + __u32 max_pages; + __u32 flags; + __u64 required_mask; + __u64 anyof_mask; + __u64 excluded_mask; + __u64 return_mask; +};
+/* Special flags */ +#define PAGEMAP_WP_ENGAGE (1 << 0)
#endif /* _UAPI_LINUX_FS_H */
On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+		unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		bool pmd_wt;
+		pmd_wt = !is_pmd_uffd_wp(*pmd);
+		/*
+		 * Break huge page into small pages if operation needs to be performed is
+		 * on a portion of the huge page.
+		 */
+		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
I think that such goto's are really confusing and should be avoided. Using 'else' could have easily prevented the need for goto. It is not the best solution though, since I think it would have been better to invert the conditions.
Yeah, else can be used here. But then we'll have to add a tab to all the code after adding else. We already have so many tabs and very little space to write code. Not sure which is better.
goto’s are usually not the right solution. You can extract things into a different function if you have to.
I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the lock and dropping it in such a case. I think that the code can be simplified and additional condition nesting can be avoided.
--- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND) +/* Pagemap ioctl */ +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ +#define PAGE_IS_WRITTEN (1 << 0) +#define PAGE_IS_FILE (1 << 1) +#define PAGE_IS_PRESENT (1 << 2) +#define PAGE_IS_SWAPPED (1 << 3)
These names are way too generic and are likely to be misused for the wrong purpose. The "_IS_" part seems confusing as well. So I think the naming needs to be fixed and some new type (using typedef) or enum should be introduced to hold these flags. I understand it is part of uapi and it is less common there, but it is not unheard of and does make things clearer.
Do you think PM_SCAN_PAGE_IS_* work here?
Can we lose the IS somehow?
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start: Start of the region
+ * @len: Length of the region
+ * bitmap: Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
I presume in bytes. Would be useful to mention.
Length of region in pages.
Very unintuitive to me I must say. If the start is an address, I would expect the len to be in bytes.
Hi Nadav, Mike, Michał,
Can you please share your thoughts at [A] below?
On 2/23/23 12:10 AM, Nadav Amit wrote:
On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+		unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		bool pmd_wt;
+		pmd_wt = !is_pmd_uffd_wp(*pmd);
+		/*
+		 * Break huge page into small pages if operation needs to be performed is
+		 * on a portion of the huge page.
+		 */
+		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
I think that such goto's are really confusing and should be avoided. Using 'else' could have easily prevented the need for goto. It is not the best solution though, since I think it would have been better to invert the conditions.
Yeah, else can be used here. But then we'll have to add a tab to all the code after adding else. We already have so many tabs and very little space to write code. Not sure which is better.
goto’s are usually not the right solution. You can extract things into a different function if you have to.
I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the lock and dropping it in such a case. I think that the code can be simplified and additional condition nesting can be avoided.
Lock is taken and we check if pmd has UFFD_WP set or not. In the next version, the GET check has been removed as we have dropped WP_ENGAGE + !GET operation. So get is always specified and condition isn't needed.
Please comment on next version if you want anything more optimized.
--- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t; #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ RWF_APPEND) +/* Pagemap ioctl */ +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */ +#define PAGE_IS_WRITTEN (1 << 0) +#define PAGE_IS_FILE (1 << 1) +#define PAGE_IS_PRESENT (1 << 2) +#define PAGE_IS_SWAPPED (1 << 3)
These names are way too generic and are likely to be misused for the wrong purpose. The "_IS_" part seems confusing as well. So I think the naming needs to be fixed and some new type (using typedef) or enum should be introduced to hold these flags. I understand it is part of uapi and it is less common there, but it is not unheard of and does make things clearer.
Do you think PM_SCAN_PAGE_IS_* work here?
Can we lose the IS somehow?
[A] Do you think these names would work better: PM_SCAN_WRITTEN_PAGE, PM_SCAN_FILE_PAGE, PM_SCAN_SWAP_PAGE, PM_SCAN_PRESENT_PAGE?
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start: Start of the region
+ * @len: Length of the region
+ * bitmap: Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
I presume in bytes. Would be useful to mention.
Length of region in pages.
Very unintuitive to me I must say. If the start is an address, I would expect the len to be in bytes.
The PAGEMAP_SCAN ioctl works at page granularity. We tell the user whether a page has certain flags or not. Keeping the length in bytes doesn't make sense.
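Whichever unit ends up in the interface, a consumer of the output has to combine an address with a page count in this version. A small sketch of that conversion follows; region_end() is a hypothetical helper, and in real code the page size would come from sysconf(_SC_PAGESIZE) rather than a constant.

```c
#include <assert.h>
#include <stdint.h>

/* Layout copied from the uapi hunk of this patch. */
struct page_region {
	uint64_t start;		/* address */
	uint64_t len;		/* number of pages, not bytes */
	uint64_t bitmap;
};

/* First address past the region: start plus len pages. */
static uint64_t region_end(const struct page_region *r, uint64_t page_size)
{
	/* Page sizes are powers of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return r->start + r->len * page_size;
}
```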
On Feb 22, 2023, at 11:10 PM, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hi Nadav, Mike, Michał,
Can you please share your thoughts at [A] below?
I promised I won't talk about the API, but was persuaded to reconsider. I have a general question regarding the suitability of the currently proposed high-level API. To explore some alternatives, I'd like to suggest an alternative that may have some advantages. If these have already been considered and dismissed, feel free to ignore.
I believe we have two distinct usage scenarios: (1) vectored reads from pagemap, and (2) atomic UFFD WP-read/protect. It's possible that these require separate interfaces.
Regarding vectored reads, I believe the simplest solution is to maintain the current pagemap entry format for output and extend it if necessary. The input can be a vector of ranges. I'm uncertain about the purpose of fields such as 'anyof_mask' in 'pagemap_scan_arg', so I can't confirm their necessity and whether the input needs to be made more complicated. There is a possibility that fields such as 'anyof_mask' might expose internal APIs, so I hope they’re not required.
For the atomic operation of 'PAGE_IS_WRITTEN' + 'PAGEMAP_WP_ENGAGE', a different mechanism might be necessary. This function appears to be UFFD-specific. Instead of the proposed IOCTL, an alternative option is to use 'UFFD_FEATURE_WP_ASYNC' to log the pages that were written, similar to page-modification logging on Intel. Since this feature appears to be specific to UFFD, I believe it would be more appropriate to include the log as part of the UFFD mechanism rather than the pagemap.
From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full.
I am sorry to chime in so late, but I think the complications that the proposed mechanism might raise are not negligible. And anyhow this patch-set still requires quite a bit of work before it can be merged.
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full.
Yes this is an interesting question to think about..
Keeping the data in the pgtable has one good thing that it doesn't need any complexity on maintaining the log, and no possibility of "log full".
If a "log full" condition is possible, then the next question is whether we should let the worker wait for the monitor if the monitor is not fast enough to collect the data. That adds a slight dependency between the two threads, which I think can make the tracking harder or impossible in latency-sensitive workloads.
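To make the "log full" concern concrete, here is a toy fixed-size log in plain C. It is only an illustration of the producer/consumer dependency being discussed, not a proposal for the actual uffd interface; struct wp_log and its helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOG_CAP 4u	/* tiny on purpose */
static_assert((LOG_CAP & (LOG_CAP - 1)) == 0, "LOG_CAP must be a power of two");

struct wp_log {
	uint64_t addr[LOG_CAP];
	unsigned int head;	/* next slot to write (producer: faulting thread) */
	unsigned int tail;	/* next slot to read (consumer: monitor) */
};

/* Returns false when the log is full; the writer would then have to wait. */
static bool wp_log_push(struct wp_log *log, uint64_t addr)
{
	if (log->head - log->tail == LOG_CAP)
		return false;
	log->addr[log->head++ % LOG_CAP] = addr;
	return true;
}

/* Returns false when there is nothing to collect. */
static bool wp_log_pop(struct wp_log *log, uint64_t *addr)
{
	if (log->head == log->tail)
		return false;
	*addr = log->addr[log->tail++ % LOG_CAP];
	return true;
}
```

The failing push is the point where a writing thread would have to stall until the monitor drains the log, which is the dependency described above.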
The other thing is that we can also make the log never fill up by making it a bitmap covering any registered ranges, but I don't know either whether it would be worth the effort.
Thanks,
On Feb 27, 2023, at 1:18 PM, Peter Xu peterx@redhat.com wrote:
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full.
Yes this is an interesting question to think about..
Keeping the data in the pgtable has one good thing that it doesn't need any complexity on maintaining the log, and no possibility of "log full".
I understand your concern, but I think that eventually it might be simpler to maintain, since the logic of how to process the log is moved to userspace.
At the same time, handling inputs from pagemap and uffd handlers and sync’ing them would not be too easy for userspace.
But yes, allocation on the heap for userfaultfd_wait_queue-like entries would be needed, and there are some issues of ordering the events (I think all #PF and other events should be ordered regardless) and how not to traverse all async-userfaultfd_wait_queue’s (except those that block if the log is full) when a wakeup is needed.
If a "log full" condition is possible, then the next question is whether we should let the worker wait for the monitor if the monitor is not fast enough to collect the data. That adds a slight dependency between the two threads, which I think can make the tracking harder or impossible in latency-sensitive workloads.
Again, I understand your concern. But this model that I propose is not new. It is used with PML (page-modification logging) and KVM, and IIRC there is a similar interface between KVM and QEMU to provide this information. There are endless other examples for similar producer-consumer mechanisms that might lead to stall in extreme cases.
The other thing is that we can also make the log never fill up by making it a bitmap covering any registered ranges, but I don't know either whether it would be worth the effort.
I do not see a benefit of half-log half-scan. It tries to take the data-structure of one format and combine it with another.
Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the threads of previous versions and I did not see userspace components that use the API to say something smart. Personally, I do not find the current API proposal to be very consistent and simple, and it seems to me that it lets pagemap do userfaultfd-related tasks, which might be considered inappropriate and non-intuitive.
If I derailed the discussion, I apologize.
On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
On Feb 27, 2023, at 1:18 PM, Peter Xu peterx@redhat.com wrote:
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full.
Yes this is an interesting question to think about..
Keeping the data in the pgtable has one good thing that it doesn't need any complexity on maintaining the log, and no possibility of "log full".
I understand your concern, but I think that eventually it might be simpler to maintain, since the logic of how to process the log is moved to userspace.
At the same time, handling inputs from pagemap and uffd handlers and sync’ing them would not be too easy for userspace.
I do not expect a common uffd-wp async user to provide a fault handler at all. In my imagination it's in most cases used standalone from other uffd modes; it means all the faults will still be handled by the kernel. Here we only leverage the accuracy of userfaultfd compared to soft-dirty, so these are not really "user" faults.
But yes, allocation on the heap for userfaultfd_wait_queue-like entries would be needed, and there are some issues of ordering the events (I think all #PF and other events should be ordered regardless) and how not to traverse all async-userfaultfd_wait_queue’s (except those that block if the log is full) when a wakeup is needed.
Will there be an ordering requirement for an async mode? Considering it should be async to whatever else, I would think it's not a problem, but maybe I missed something.
If a "log full" condition is possible, then the next question is whether we should let the worker wait for the monitor if the monitor is not fast enough to collect the data. That adds a slight dependency between the two threads, which I think can make the tracking harder or impossible in latency-sensitive workloads.
Again, I understand your concern. But this model that I propose is not new. It is used with PML (page-modification logging) and KVM, and IIRC there is a similar interface between KVM and QEMU to provide this information. There are endless other examples for similar producer-consumer mechanisms that might lead to stall in extreme cases.
Yes, I'm not against thinking of using similar structures here. It's just that it's definitely more complicated on the interface side; at least we need yet one more interface to set up the rings and define its interfaces.
Note that although Muhammad is defining another new interface here too for pagemap, I don't think it's strictly needed for uffd-wp async mode. One can use uffd-wp async mode with PM_UFFD_WP, which is already available with the current pagemap interface.
So what Muhammad is proposing here is two things to me: (1) uffd-wp async, plus (2) a new pagemap interface (which will work closely with (1) only if we need atomicity on get-dirty and reprotect).
Defining a new interface for uffd-wp async mode will be something extra, so IMHO besides the heap allocation on the rings, we also need to justify whether that is needed. That's why I think it's fine to go with what Muhammad proposed, because it's a minimum changeset at least for userfault to support an async mode, and anything else can be done on top if necessary.
Going a bit back to the "lead to stall in extreme cases" above, just also want to mention that the VM use case is slightly different - dirty tracking is only heavily used during migration afaict, and it's a short period. Not a lot of people will complain performance degrades during that period because that's just rare. And, even without the ring the perf is really bad during migration anyway... Especially when huge pages are used to back the guest RAM.
Here it's slightly different to me: it's about tracking dirty pages during any possible workload, and it can be monitored periodically and frequently. So IMHO stricter than a VM use case where migration is the only period to use it.
The other thing is that we can also make the log never fill up by making it a bitmap covering any registered ranges, but I don't know either whether it would be worth the effort.
I do not see a benefit of half-log half-scan. It tries to take the data-structure of one format and combine it with another.
What I'm saying here is not half-log / half-scan, but use a single bitmap to store what page is dirty, just like KVM_GET_DIRTY_LOG. I think it avoids any above "stall" issue.
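A toy version of the single-bitmap idea, to show why it cannot run full: each tracked page owns one bit, so recording a write never needs a new slot. The page shift, the range size and all names below are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define TRACKED_PAGES		128	/* assumption: size of the registered range */

struct written_bitmap {
	uint64_t base;				/* start address of the tracked range */
	uint64_t bits[TRACKED_PAGES / 64];	/* one bit per page */
};

/* Record a write; marking the same page twice costs nothing extra. */
static void mark_written(struct written_bitmap *b, uint64_t addr)
{
	uint64_t idx = (addr - b->base) >> SKETCH_PAGE_SHIFT;

	assert(addr >= b->base && idx < TRACKED_PAGES);
	b->bits[idx / 64] |= 1ULL << (idx % 64);
}

/* Report whether the page was written and reset its bit in one step. */
static bool test_and_clear_written(struct written_bitmap *b, uint64_t addr)
{
	uint64_t idx = (addr - b->base) >> SKETCH_PAGE_SHIFT;
	uint64_t mask = 1ULL << (idx % 64);
	bool was_set;

	assert(addr >= b->base && idx < TRACKED_PAGES);
	was_set = (b->bits[idx / 64] & mask) != 0;
	b->bits[idx / 64] &= ~mask;
	return was_set;
}
```

The trade-off compared with a log is that the monitor has to scan the bitmap to find the set bits, and the memory cost scales with the size of the registered range rather than with the number of writes.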
Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the threads of previous versions and I did not see userspace components that use the API to say something smart.
Actually similar here. :) So I'm probably not the best person to say what the API should look like.
What I know is I think the new pagemap interface is welcomed by CRIU developers, so it may be something good with/without userfaultfd getting involved already. I see this as "let's add one more bit for uffd-wp" in the new interface only.
Quoting some links I got from Muhammad before about CRIU usage:
https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
Personally, I do not find the current API proposal to be very consistent and simple, and it seems to me that it lets pagemap do userfaultfd-related tasks, which might be considered inappropriate and non-intuitive.
Yes, I agree. I just don't know what's the best way to avoid this.
The issue here, IIUC, is that Muhammad needs one operation to do what Windows does with the getWriteWatch() API. That means we need to combine GET and PROTECT in a single shot. If we want to use pagemap for GET, then I see no choice but to do PROTECT here as well.
I think it would be the same with soft-dirty if it were used: we would extend soft-dirty modifications from clear_refs to pagemap too, which I also don't think is as clean.
If I derailed the discussion, I apologize.
Not at all. I just wished you joined earlier!
On Feb 28, 2023, at 7:55 AM, Peter Xu peterx@redhat.com wrote:
On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
On Feb 27, 2023, at 1:18 PM, Peter Xu peterx@redhat.com wrote:
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full.
Yes this is an interesting question to think about..
Keeping the data in the pgtable has one good thing that it doesn't need any complexity on maintaining the log, and no possibility of "log full".
I understand your concern, but I think that eventually it might be simpler to maintain, since the logic of how to process the log is moved to userspace.
At the same time, handling inputs from pagemap and uffd handlers and sync’ing them would not be too easy for userspace.
I do not expect a common uffd-wp async user to provide a fault handler at all. In my imagination it's in most cases used standalone from other uffd modes; it means all the faults will still be handled by the kernel. Here we only leverage the accuracy of userfaultfd comparing to soft-dirty, so not really real "user"-faults.
If that is the only use-case, it might make sense. But I guess most users would most likely use some library (and not syscalls directly). So slightly complicating the API for better generality may be reasonable.
But yes, allocation on the heap for userfaultfd_wait_queue-like entries would be needed, and there are some issues of ordering the events (I think all #PF and other events should be ordered regardless) and how not to traverse all async-userfaultfd_wait_queue’s (except those that block if the log is full) when a wakeup is needed.
Will there be an ordering requirement for an async mode? Considering it should be async to whatever else, I would think it's not a problem, but maybe I missed something.
You may be right, but I am not sure. I am still not sure what use-cases are targeted in this patch-set. For CRIU checkpoint use-case (when the app is not running), I guess the current interface makes sense. But if there are use-cases in which you do care about UFFD-events, this can become an issue.
But even in some obvious use-cases, this might be the wrong interface for major performance issues. If we think about some incremental copying of modified pages (a-la pre-copy live-migration or to create point-in-time snapshots), it seems to me much more efficient for application to have a log than traversing all the page-tables.
If there's possible "log full" then the next question is whether we should let the worker wait the monitor if the monitor is not fast enough to collect those data. It adds some slight dependency on the two threads, I think it can make the tracking harder or impossible in latency sensitive workloads.
Again, I understand your concern. But this model that I propose is not new. It is used with PML (page-modification logging) and KVM, and IIRC there is a similar interface between KVM and QEMU to provide this information. There are endless other examples for similar producer-consumer mechanisms that might lead to stall in extreme cases.
Yes, I'm not against thinking of using similar structures here. It's just that it's definitely more complicated on the interface, at least we need yet one more interface to setup the rings and define its interfaces.
Note that although Muhammud is defining another new interface here too for pagemap, I don't think it's strictly needed for uffd-wp async mode. One can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap interface already.
So what Muhammud is proposing here are two things to me: (1) uffd-wp async, plus (2) a new pagemap interface (which will closely work with (1) only if we need atomicity on get-dirty and reprotect).
Defining new interface for uffd-wp async mode will be something extra, so IMHO besides the heap allocation on the rings, we need to also justify whether that is needed. That's why I think it's fine to go with what Muhammud proposed, because it's a minimum changeset at least for userfault to support an async mode, and anything else can be done on top if necessary.
Going a bit back to the "lead to stall in extreme cases" above, just also want to mention that the VM use case is slightly different - dirty tracking is only heavily used during migration afaict, and it's a short period. Not a lot of people will complain performance degrades during that period because that's just rare. And, even without the ring the perf is really bad during migration anyway... Especially when huge pages are used to back the guest RAM.
Here it's slightly different to me: it's about tracking dirty pages during any possible workload, and it can be monitored periodically and frequently. So IMHO stricter than a VM use case where migration is the only period to use it.
I still don’t get the use-cases. "monitored periodically and frequently” is not a use-case. And as I said before, actually, monitoring frequently is more performant with a log than with scanning all the page-tables.
The other thing is we can also make the log "never gonna full" by making it a bitmap covering any registered ranges, but I don't either know whether it'll be worth it for the effort.
I do not see a benefit of half-log half-scan. It tries to take the data-structure of one format and combine it with another.
What I'm saying here is not half-log / half-scan, but use a single bitmap to store what page is dirty, just like KVM_GET_DIRTY_LOG. I think it avoids any above "stall" issue.
Oh, I never went into the KVM details before - stupid me. If that’s what eventually was proven to work for KVM/QEMU, then it really sounds like the pagemap solution that Muhammad proposed.
But still not convoluting pagemap with userfaultfd (and especially uffd-wp) can be beneficial. Linus already threw some comments here and there about disliking uffd-wp, and I’m not sure adding uffd-wp specific stuff to pagemap would be welcomed.
Anyhow, thanks for all the explanations. Eventually, I understand that using bitmaps can be more efficient than a log if the bits are condensed.
On Tue, Feb 28, 2023 at 05:21:20PM +0000, Nadav Amit wrote:
On Feb 28, 2023, at 7:55 AM, Peter Xu peterx@redhat.com wrote:
On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
On Feb 27, 2023, at 1:18 PM, Peter Xu peterx@redhat.com wrote:
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
From my experience with UFFD, proper ordering of events is crucial, although it is not always done well. Therefore, we should aim for improvement, not regression. I believe that utilizing the pagemap-based mechanism for WP'ing might be a step in the wrong direction. I think that it would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the file descriptor unless the log is full.
Yes this is an interesting question to think about..
Keeping the data in the pgtable has one good thing that it doesn't need any complexity on maintaining the log, and no possibility of "log full".
I understand your concern, but I think that eventually it might be simpler to maintain, since the logic of how to process the log is moved to userspace.
At the same time, handling inputs from pagemap and uffd handlers and sync’ing them would not be too easy for userspace.
I do not expect a common uffd-wp async user to provide a fault handler at all. In my imagination it's in most cases used standalone from other uffd modes; it means all the faults will still be handled by the kernel. Here we only leverage the accuracy of userfaultfd comparing to soft-dirty, so not really real "user"-faults.
If that is the only use-case, it might make sense. But I guess most users would most likely use some library (and not syscalls directly). So slightly complicating the API for better generality may be reasonable.
But yes, allocation on the heap for userfaultfd_wait_queue-like entries would be needed, and there are some issues of ordering the events (I think all #PF and other events should be ordered regardless) and how not to traverse all async-userfaultfd_wait_queue’s (except those that block if the log is full) when a wakeup is needed.
Will there be an ordering requirement for an async mode? Considering it should be async to whatever else, I would think it's not a problem, but maybe I missed something.
You may be right, but I am not sure. I am still not sure what use-cases are targeted in this patch-set. For CRIU checkpoint use-case (when the app is not running), I guess the current interface makes sense. But if there are use-cases in which you do care about UFFD-events, this can become an issue.
But even in some obvious use-cases, this might be the wrong interface for major performance issues. If we think about some incremental copying of modified pages (a-la pre-copy live-migration or to create point-in-time snapshots), it seems to me much more efficient for application to have a log than traversing all the page-tables.
IMHO snapshots may not need a log at all - it needs CoW before the write happens. Nor is the case for swapping with userfaults, IIUC. IOW in those cases people don't care which page got dirtied, but care on data not being modified until the app allows it to.
But I get the point, and I agree collecting by scanning is slower.
If there's possible "log full" then the next question is whether we should let the worker wait the monitor if the monitor is not fast enough to collect those data. It adds some slight dependency on the two threads, I think it can make the tracking harder or impossible in latency sensitive workloads.
Again, I understand your concern. But this model that I propose is not new. It is used with PML (page-modification logging) and KVM, and IIRC there is a similar interface between KVM and QEMU to provide this information. There are endless other examples for similar producer-consumer mechanisms that might lead to stall in extreme cases.
Yes, I'm not against thinking of using similar structures here. It's just that it's definitely more complicated on the interface, at least we need yet one more interface to setup the rings and define its interfaces.
Note that although Muhammud is defining another new interface here too for pagemap, I don't think it's strictly needed for uffd-wp async mode. One can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap interface already.
So what Muhammud is proposing here are two things to me: (1) uffd-wp async, plus (2) a new pagemap interface (which will closely work with (1) only if we need atomicity on get-dirty and reprotect).
Defining new interface for uffd-wp async mode will be something extra, so IMHO besides the heap allocation on the rings, we need to also justify whether that is needed. That's why I think it's fine to go with what Muhammud proposed, because it's a minimum changeset at least for userfault to support an async mode, and anything else can be done on top if necessary.
Going a bit back to the "lead to stall in extreme cases" above, just also want to mention that the VM use case is slightly different - dirty tracking is only heavily used during migration afaict, and it's a short period. Not a lot of people will complain performance degrades during that period because that's just rare. And, even without the ring the perf is really bad during migration anyway... Especially when huge pages are used to back the guest RAM.
Here it's slightly different to me: it's about tracking dirty pages during any possible workload, and it can be monitored periodically and frequently. So IMHO stricter than a VM use case where migration is the only period to use it.
I still don’t get the use-cases. "monitored periodically and frequently” is not a use-case. And as I said before, actually, monitoring frequently is more performant with a log than with scanning all the page-tables.
Feel free to ignore this part if we're not talking about using a ring structure. My previous comment was mostly for that. Bitmaps won't have this issue. Here I see a bitmap as one way to implement a log, where it's recorded by one bit per page. My comment was that we should be careful on using rings.
Side note: actually kvm dirty ring is even trickier; see the soft-full (kvm_dirty_ring.soft_limit) besides the hard-full event to make sure hard-full won't really trigger (or we're prone to lose dirty bits). I don't think we'll have the same issue here so we can trigger hard-full, but it's still unwanted to halt the threads being tracked for dirty pages. I don't know whether there'll be other side effects by the ring, though..
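To make the soft-full versus hard-full distinction concrete, here is a minimal sketch of a ring that reports a soft-full status before it actually runs out of slots. This is not KVM's code: `struct dirty_ring`, `ring_push()`, `ring_pop()` and `RING_SIZE` are hypothetical names for illustration, and the ring is single-threaded with no memory barriers.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8	/* hard limit; illustrative value, power of two */

enum ring_status { RING_OK, RING_SOFT_FULL, RING_HARD_FULL };

/* Hypothetical ring: the producer writes at tail, the consumer reads at head. */
struct dirty_ring {
	uint64_t slots[RING_SIZE];
	size_t head;
	size_t tail;
	size_t soft_limit;	/* ask the consumer to drain once this many are used */
};

static size_t ring_used(const struct dirty_ring *r)
{
	return r->tail - r->head;
}

/*
 * Record one dirty page. RING_SOFT_FULL means the entry was stored but the
 * consumer should drain soon; RING_HARD_FULL means the entry was NOT stored
 * and the producer would have to wait (the "stall" discussed above).
 */
static enum ring_status ring_push(struct dirty_ring *r, uint64_t pfn)
{
	if (ring_used(r) == RING_SIZE)
		return RING_HARD_FULL;
	r->slots[r->tail++ % RING_SIZE] = pfn;
	return ring_used(r) >= r->soft_limit ? RING_SOFT_FULL : RING_OK;
}

/* Pop the oldest entry; returns 1 on success, 0 if the ring is empty. */
static int ring_pop(struct dirty_ring *r, uint64_t *pfn)
{
	if (!ring_used(r))
		return 0;
	*pfn = r->slots[r->head++ % RING_SIZE];
	return 1;
}
```

With a soft limit below the hard limit, the producer gets an early signal to kick the consumer, so the hard-full path (and the stall it implies) should only be hit when the consumer falls far behind.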
The other thing is we can also make the log "never gonna full" by making it a bitmap covering any registered ranges, but I don't either know whether it'll be worth it for the effort.
I do not see a benefit of half-log half-scan. It tries to take the data-structure of one format and combine it with another.
What I'm saying here is not half-log / half-scan, but use a single bitmap to store what page is dirty, just like KVM_GET_DIRTY_LOG. I think it avoids any above "stall" issue.
Oh, I never went into the KVM details before - stupid me. If that’s what eventually was proven to work for KVM/QEMU, then it really sounds like the pagemap solution that Muhammad proposed.
But still not convoluting pagemap with userfaultfd (and especially uffd-wp) can be beneficial. Linus already threw some comments here and there about disliking uffd-wp, and I’m not sure adding uffd-wp specific stuff to pagemap would be welcomed.
Yes I also don't know.. As I mentioned I'm not super happy with the interface either, but that's the simplest I can think of so far.
IOW, from an "userfaultfd-side reviewer" POV I'm fine if someone wants to leverage the concepts of uffd-wp and its internals using a separate but very light weighted patch just to impl async mode of uffd-wp. But I'm always open to any suggestions too. It's just that when there're multiple options and when we're not confident on either way, I normally prefer the simplest and cleanest (even if less efficient).
Anyhow, thanks for all the explanations. Eventually, I understand that using bitmaps can be more efficient than a log if the bits are condensed.
Note that I think what Muhammad (sorry, Muhammad! I think I spelled your name wrongly before starting from some email..) proposed is not a bitmap, but an array of ranges that can coalesce the result into very condensed form. Pros and cons.
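As a rough illustration of "an array of ranges that can coalesce the result", the sketch below folds per-page flag words into `page_region`-style entries. It is not the kernel implementation: `coalesce_pages()` is a hypothetical helper, the per-page `flags` array stands in for what a page-table walk would produce, and `len` is counted in pages here as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Same three fields as the struct page_region proposed in the patch. */
struct page_region {
	uint64_t start;
	uint64_t len;		/* in pages (assumption for this sketch) */
	uint64_t bitmap;
};

/*
 * Coalesce consecutive pages that carry identical, non-zero flags into one
 * region. Returns the number of regions written, at most vec_len. Pages that
 * would need a new region once vec is full are not reported.
 */
static size_t coalesce_pages(const uint64_t *flags, size_t nr_pages,
			     uint64_t base, uint64_t page_size,
			     struct page_region *vec, size_t vec_len)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		uint64_t addr = base + i * page_size;

		if (!flags[i])
			continue;	/* nothing to report for this page */

		/* Extend the previous region if adjacent and flags are equal. */
		if (n && vec[n - 1].bitmap == flags[i] &&
		    vec[n - 1].start + vec[n - 1].len * page_size == addr) {
			vec[n - 1].len++;
			continue;
		}

		if (n == vec_len)
			break;		/* output array is full */

		vec[n].start = addr;
		vec[n].len = 1;
		vec[n].bitmap = flags[i];
		n++;
	}
	return n;
}
```

The trade-off mentioned above shows up directly: long runs of equally-flagged pages collapse into one 24-byte entry, while an alternating dirty/clean pattern costs one entry per dirty page, where a bitmap would stay at one bit per page.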
Again, I can't comment much on that API, but since there're a bunch of other developers looking at that and they're also potential future users, I'll trust their judgement and just focus more on the other side of things.
Thanks,
On Feb 28, 2023, at 11:31 AM, Peter Xu peterx@redhat.com wrote:
Anyhow, thanks for all the explanations. Eventually, I understand that using bitmaps can be more efficient than a log if the bits are condensed.
Note that I think what Muhammad (sorry, Muhammad! I think I spelled your name wrongly before starting from some email..) proposed is not a bitmap, but an array of ranges that can coalesce the result into very condensed form. Pros and cons.
Again, I can't comment much on that API, but since there're a bunch of other developers looking at that and they're also potential future users, I'll trust their judgement and just focus more on the other side of things.
Thanks Peter for your patience.
I would just note that I understood that Muhammad did not propose a condensed bitmap, and that was a hint that handling a condensed bitmap (at least on x86) can be done rather efficiently. I am not sure about other representations.
Thanks for your explanations again, Peter.
Hi,
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
After Nadav's comment I've realized I missed the API part :)
A few quick notes for now:
* The arg struct is fixed, so it would be impossible to extend the API later. Following the clone3() example, I'd add 'size' field to the pagemap_scan_arg so that it would be possible to add new fields afterwards.
* Please make flags __u64, just in case
* Put size and flags at the beginning of the struct, e.g.
struct pagemap_scan_arg {
	size_t size;
	__u64 flags;
	/* all the rest */
};
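For readers unfamiliar with the size-field pattern, here is a userspace sketch of how a receiver can accept both older (shorter) and newer (longer) versions of such a struct. It only models the idea behind the kernel's copy_struct_from_user() helper; `struct scan_arg_v1` and `copy_versioned_arg()` are hypothetical names, and the fixed-width `size` field is an assumption of this sketch rather than what was proposed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "version 1" layout with size and flags leading the struct. */
struct scan_arg_v1 {
	uint64_t size;
	uint64_t flags;
	uint64_t start;
	uint64_t len;
};

/*
 * Copy a caller-supplied struct of usize bytes into the layout known here.
 *  - Shorter (older caller): missing trailing fields read as zero.
 *  - Longer (newer caller): accepted only if the unknown tail is all zero,
 *    otherwise -E2BIG, since the caller asked for something unsupported.
 */
static int copy_versioned_arg(struct scan_arg_v1 *dst, const void *src,
			      size_t usize)
{
	const unsigned char *p = src;
	size_t ksize = sizeof(*dst);

	if (usize < sizeof(uint64_t))
		return -EINVAL;	/* cannot even hold the size field */

	for (size_t i = ksize; i < usize; i++)
		if (p[i])
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, usize < ksize ? usize : ksize);
	return 0;
}
```

The point of putting `size` first is that the receiver can always find it, whatever else has been appended since the caller was compiled.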
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
#endif /* _UAPI_LINUX_FS_H */
2.30.2
On 2/20/23 6:26 PM, Mike Rapoport wrote:
Hi,
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear the info about page table entries. The following operations are supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which pages have been written-to.
- Find pages which have been written-to and write protect the pages (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
After Nadav's comment I've realized I missed the API part :)
A few quick notes for now:
- The arg struct is fixed, so it would be impossible to extend the API later. Following the clone3() example, I'd add 'size' field to the pagemap_scan_arg so that it would be possible to add new fields afterwards.
- Please make flags __u64, just in case
- Put size and flags at the beginning of the struct, e.g.
struct pagemap_scan_arg {
	size_t size;
	__u64 flags;
	/* all the rest */
};
Updated. Thank you so much!
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
#endif /* _UAPI_LINUX_FS_H */
2.30.2
New IOCTL and macros have been added in the kernel sources. Update the tools header file as well.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
---
 tools/include/uapi/linux/fs.h | 50 +++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
+
 #endif /* _UAPI_LINUX_FS_H */
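The four masks in pagemap_scan_arg are easiest to read as a per-page predicate. The sketch below is one plausible reading of the field comments above, not the kernel's code: `page_matches()` and `reported_bits()` are hypothetical helpers, and treating an empty anyof_mask as "no constraint" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in the patch above. */
#define PAGE_IS_WRITTEN	(1 << 0)
#define PAGE_IS_FILE	(1 << 1)
#define PAGE_IS_PRESENT	(1 << 2)
#define PAGE_IS_SWAPPED	(1 << 3)

/*
 * Decide whether a page with the given category bits is selected:
 *  - every bit in required must be set,
 *  - at least one bit in anyof must be set (skipped when anyof is 0,
 *    which is an assumption of this sketch),
 *  - no bit in excluded may be set.
 */
static bool page_matches(uint64_t bits, uint64_t required, uint64_t anyof,
			 uint64_t excluded)
{
	if ((bits & required) != required)
		return false;
	if (anyof && !(bits & anyof))
		return false;
	if (bits & excluded)
		return false;
	return true;
}

/* Only the bits named in return_mask end up in page_region.bitmap. */
static uint64_t reported_bits(uint64_t bits, uint64_t return_mask)
{
	return bits & return_mask;
}
```

Under this reading, "find written-to pages that are not file mapped" would be required_mask = PAGE_IS_WRITTEN, excluded_mask = PAGE_IS_FILE, with return_mask choosing which bits to report for each matching region.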
Add some explanation and method to use write-protection and written-to on memory range.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
---
 Documentation/admin-guide/mm/pagemap.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 6e2e416af783..1cb2189e9a0d 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -230,3 +230,27 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
 always 12 at most architectures). Since Linux 3.11 their meaning changes
 after first clear of soft-dirty bits. Since Linux 4.2 they are used for
 flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get and/or clear
+the info about page table entries. The following operations are supported in
+this IOCTL:
+- Get the information if the pages have been written-to (``PAGE_IS_WRITTEN``),
+  file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``) or swapped
+  (``PAGE_IS_SWAPPED``).
+- Write-protect the pages (``PAGEMAP_WP_ENGAGE``) to start finding which
+  pages have been written-to.
+- Find pages which have been written-to and write protect the pages
+  (atomic ``PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE``)
+
+To get information about which pages have been written-to and/or write protect
+the pages, following must be performed first in order:
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_ASYNC`` feature is set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+    through ``UFFDIO_REGISTER`` IOCTL.
+Then the any part of the registered memory or the whole memory region can be
+write protected using the ``UFFDIO_WRITEPROTECT`` IOCTL or ``PAGEMAP_SCAN``
+IOCTL.
On Thu, Feb 02, 2023 at 04:29:14PM +0500, Muhammad Usama Anjum wrote:
Add some explanation and method to use write-protection and written-to on memory range.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
 Documentation/admin-guide/mm/pagemap.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 6e2e416af783..1cb2189e9a0d 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -230,3 +230,27 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
 always 12 at most architectures). Since Linux 3.11 their meaning changes
 after first clear of soft-dirty bits. Since Linux 4.2 they are used for
 flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get and/or clear
+the info about page table entries. The following operations are supported in
+this IOCTL:
+- Get the information if the pages have been written-to (``PAGE_IS_WRITTEN``),
+  file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``) or swapped
+  (``PAGE_IS_SWAPPED``).
+- Write-protect the pages (``PAGEMAP_WP_ENGAGE``) to start finding which
+  pages have been written-to.
+- Find pages which have been written-to and write protect the pages
+  (atomic ``PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE``)
Could we extend this section a bit more? Some points for reference:
- The new struct you introduced, definitions of each of the fields, and generic use cases for each of the field/ops.
- It'll be nice to list the OPs the new interface supports (GET, WP_ENGAGE, GET+WP_ENGAGE).
- When should people use this rather than the old pagemap interface? What's the major problems to solve / what's the major difference? (Maybe nice to reference the Windows API too here)
+To get information about which pages have been written-to and/or write protect
+the pages, following must be performed first in order:
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_ASYNC`` feature is set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+    through ``UFFDIO_REGISTER`` IOCTL.
+Then the any part of the registered memory or the whole memory region can be
+write protected using the ``UFFDIO_WRITEPROTECT`` IOCTL or ``PAGEMAP_SCAN``
+IOCTL.
This part looks good.
Thanks,
On 2/10/23 12:26 AM, Peter Xu wrote:
On Thu, Feb 02, 2023 at 04:29:14PM +0500, Muhammad Usama Anjum wrote:
Add some explanation and method to use write-protection and written-to on memory range.
Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
 Documentation/admin-guide/mm/pagemap.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 6e2e416af783..1cb2189e9a0d 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -230,3 +230,27 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
 always 12 at most architectures). Since Linux 3.11 their meaning changes
 after first clear of soft-dirty bits. Since Linux 4.2 they are used for
 flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get and/or clear
+the info about page table entries. The following operations are supported in
+this IOCTL:
+- Get the information if the pages have been written-to (``PAGE_IS_WRITTEN``),
+  file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``) or swapped
+  (``PAGE_IS_SWAPPED``).
+- Write-protect the pages (``PAGEMAP_WP_ENGAGE``) to start finding which
+  pages have been written-to.
+- Find pages which have been written-to and write protect the pages
+  (atomic ``PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE``)
Could we extend this section a bit more? Some points for reference:
The new struct you introduced, definitions of each of the fields, and generic use cases for each of the field/ops.
It'll be nice to list the OPs the new interface supports (GET, WP_ENGAGE, GET+WP_ENGAGE).
When should people use this rather than the old pagemap interface? What's the major problems to solve / what's the major difference? (Maybe nice to reference the Windows API too here)
I'll update the documentation.
+To get information about which pages have been written-to and/or write protect
+the pages, following must be performed first in order:
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_ASYNC`` feature is set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+    through ``UFFDIO_REGISTER`` IOCTL.
+Then the any part of the registered memory or the whole memory region can be
+write protected using the ``UFFDIO_WRITEPROTECT`` IOCTL or ``PAGEMAP_SCAN``
+IOCTL.
This part looks good.
Thanks,
Add pagemap ioctl tests. Add several different types of tests to judge the correctness of the interface.

Signed-off-by: Muhammad Usama Anjum usama.anjum@collabora.com
---
Changes in v7:
- Add and update all test cases

Changes in v6:
- Rename variables

Changes in v4:
- Updated all the tests to conform to new IOCTL

Changes in v3:
- Add another test to do sanity of flags

Changes in v2:
- Update the tests to use the ioctl interface instead of syscall
TAP version 13
1..54
ok 1 sanity_tests_sd wrong flag specified
ok 2 sanity_tests_sd wrong mask specified
ok 3 sanity_tests_sd wrong return mask specified
ok 4 sanity_tests_sd mixture of correct and wrong flag
ok 5 sanity_tests_sd Clear area with larger vec size
ok 6 sanity_tests_sd Repeated pattern of dirty and non-dirty pages
ok 7 sanity_tests_sd Repeated pattern of dirty and non-dirty pages in parts
ok 8 sanity_tests_sd Two regions
ok 9 Page testing: all new pages must be soft dirty
ok 10 Page testing: all pages must not be soft dirty
ok 11 Page testing: all pages dirty other than first and the last one
ok 12 Page testing: only middle page dirty
ok 13 Page testing: only two middle pages dirty
ok 14 Page testing: only get 2 dirty pages and clear them as well
ok 15 Page testing: Range clear only
ok 16 Large Page testing: all new pages must be soft dirty
ok 17 Large Page testing: all pages must not be soft dirty
ok 18 Large Page testing: all pages dirty other than first and the last one
ok 19 Large Page testing: only middle page dirty
ok 20 Large Page testing: only two middle pages dirty
ok 21 Large Page testing: only get 2 dirty pages and clear them as well
ok 22 Large Page testing: Range clear only
ok 23 Huge page testing: all new pages must be soft dirty
ok 24 Huge page testing: all pages must not be soft dirty
ok 25 Huge page testing: all pages dirty other than first and the last one
ok 26 Huge page testing: only middle page dirty
ok 27 Huge page testing: only two middle pages dirty
ok 28 Huge page testing: only get 2 dirty pages and clear them as well
ok 29 Huge page testing: Range clear only
ok 30 hpage_unit_tests all new huge page must be dirty
ok 31 hpage_unit_tests all the huge page must not be dirty
ok 32 hpage_unit_tests all the huge page must be dirty and clear
ok 33 hpage_unit_tests only middle page dirty
ok 34 hpage_unit_tests clear first half of huge page
ok 35 hpage_unit_tests clear first half of huge page with limited buffer
ok 36 hpage_unit_tests clear second half huge page
ok 37 Test test_simple
ok 38 mprotect_tests Both pages dirty
ok 39 mprotect_tests Both pages are not soft dirty
ok 40 mprotect_tests Both pages dirty after remap and mprotect
ok 41 mprotect_tests Clear and make the pages dirty
ok 42 sanity_tests clear op can only be specified with PAGE_IS_WRITTEN
ok 43 sanity_tests required_mask specified
ok 44 sanity_tests anyof_mask specified
ok 45 sanity_tests excluded_mask specified
ok 46 sanity_tests required_mask and anyof_mask specified
ok 47 sanity_tests Get sd and present pages with anyof_mask
ok 48 sanity_tests Get all the pages with required_mask
ok 49 sanity_tests Get sd and present pages with required_mask and anyof_mask
ok 50 sanity_tests Don't get sd pages
ok 51 sanity_tests Don't get present pages
ok 52 sanity_tests Find dirty present pages with return mask
ok 53 sanity_tests Memory mapped file
ok 54 unmapped_region_tests Get status of pages
# Totals: pass:54 fail:0 xfail:0 xpass:0 skip:0 error:0
---
 tools/testing/selftests/vm/.gitignore      |   1 +
 tools/testing/selftests/vm/Makefile        |   5 +-
 tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++++
 3 files changed, 885 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 1f8c36a9fa10..9e7e0ae26582 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -17,6 +17,7 @@ mremap_dontunmap mremap_test on-fault-limit transhuge-stress +pagemap_ioctl protection_keys protection_keys_32 protection_keys_64 diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 89c14e41bd43..54c074440a1b 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -24,9 +24,8 @@ MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/p # things despite using incorrect values such as an *occasionally* incomplete # LDLIBS. MAKEFLAGS += --no-builtin-rules - CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/usr/include $(EXTRA_CFLAGS) $(KHDR_INCLUDES) -LDLIBS = -lrt -lpthread +LDLIBS = -lrt -lpthread -lm TEST_GEN_FILES = cow TEST_GEN_FILES += compaction_test TEST_GEN_FILES += gup_test @@ -52,6 +51,7 @@ TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd +TEST_GEN_PROGS += pagemap_ioctl TEST_GEN_PROGS += soft-dirty TEST_GEN_PROGS += split_huge_page_test TEST_GEN_FILES += ksm_tests @@ -103,6 +103,7 @@ $(OUTPUT)/cow: vm_util.c $(OUTPUT)/khugepaged: vm_util.c $(OUTPUT)/ksm_functional_tests: vm_util.c $(OUTPUT)/madv_populate: vm_util.c +$(OUTPUT)/pagemap_ioctl: vm_util.c $(OUTPUT)/soft-dirty: vm_util.c $(OUTPUT)/split_huge_page_test: vm_util.c $(OUTPUT)/userfaultfd: vm_util.c diff --git a/tools/testing/selftests/vm/pagemap_ioctl.c b/tools/testing/selftests/vm/pagemap_ioctl.c new file mode 100644 index 000000000000..09b676a626d8 --- /dev/null +++ b/tools/testing/selftests/vm/pagemap_ioctl.c @@ -0,0 +1,881 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <stdio.h> +#include <fcntl.h> +#include <string.h> +#include <sys/mman.h> +#include <errno.h> +#include <malloc.h> 
+#include "vm_util.h" +#include "../kselftest.h" +#include <linux/types.h> +#include <linux/userfaultfd.h> +#include <linux/fs.h> +#include <sys/ioctl.h> +#include <sys/stat.h> +#include <math.h> +#include <asm/unistd.h> + +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \ + PAGE_IS_PRESENT | PAGE_IS_SWAPPED) +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | \ + PAGE_IS_SWAPPED) + +#define TEST_ITERATIONS 10 +#define PAGEMAP "/proc/self/pagemap" +int pagemap_fd; +int uffd; +int page_size; +int hpage_size; + +static long pagemap_ioctl(void *start, int len, void *vec, int vec_len, int flag, + int max_pages, long required_mask, long anyof_mask, long excluded_mask, + long return_mask) +{ + struct pagemap_scan_arg arg; + int ret; + + arg.start = (uintptr_t)start; + arg.len = len; + arg.vec = (uintptr_t)vec; + arg.vec_len = vec_len; + arg.flags = flag; + arg.max_pages = max_pages; + arg.required_mask = required_mask; + arg.anyof_mask = anyof_mask; + arg.excluded_mask = excluded_mask; + arg.return_mask = return_mask; + + ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg); + + return ret; +} + +int init_uffd(void) +{ + struct uffdio_api uffdio_api; + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd == -1) + ksft_exit_fail_msg("uffd syscall failed\n"); + + uffdio_api.api = UFFD_API; + uffdio_api.features = UFFD_FEATURE_WP_ASYNC; + if (ioctl(uffd, UFFDIO_API, &uffdio_api)) + ksft_exit_fail_msg("UFFDIO_API\n"); + + if (uffdio_api.api != UFFD_API) + ksft_exit_fail_msg("UFFDIO_API error %llu\n", uffdio_api.api); + + return 0; +} + +int wp_init(void *lpBaseAddress, int dwRegionSize) +{ + struct uffdio_register uffdio_register; + struct uffdio_writeprotect wp; + + /* TODO: can it be avoided? Write protect doesn't engage on the pages if they aren't + * present already. The pages can be made present by writing to them. 
+ */ + memset(lpBaseAddress, -1, dwRegionSize); + + uffdio_register.range.start = (unsigned long)lpBaseAddress; + uffdio_register.range.len = dwRegionSize; + uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) + ksft_exit_fail_msg("ioctl(UFFDIO_REGISTER)\n"); + + if (!(uffdio_register.ioctls & UFFDIO_WRITEPROTECT)) + ksft_exit_fail_msg("ioctl set is incorrect\n"); + + if (rand() % 2) { + wp.range.start = (unsigned long)lpBaseAddress; + wp.range.len = dwRegionSize; + wp.mode = UFFDIO_WRITEPROTECT_MODE_WP; + + if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1) + ksft_exit_fail_msg("ioctl(UFFDIO_WRITEPROTECT)\n"); + } else { + if (pagemap_ioctl(lpBaseAddress, dwRegionSize, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0) < 0) + ksft_exit_fail_msg("error %d %d %s\n", 1, errno, strerror(errno)); + } + return 0; +} + +int wp_free(void *lpBaseAddress, int dwRegionSize) +{ + struct uffdio_register uffdio_register; + + uffdio_register.range.start = (unsigned long)lpBaseAddress; + uffdio_register.range.len = dwRegionSize; + uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; + if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) + ksft_exit_fail_msg("ioctl unregister failure\n"); + return 0; +} + +int clear_softdirty_wp(void *lpBaseAddress, int dwRegionSize) +{ + struct uffdio_writeprotect wp; + + if (rand() % 2) { + wp.range.start = (unsigned long)lpBaseAddress; + wp.range.len = dwRegionSize; + wp.mode = UFFDIO_WRITEPROTECT_MODE_WP; + + if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1) + ksft_exit_fail_msg("ioctl(UFFDIO_WRITEPROTECT)\n"); + } else { + if (pagemap_ioctl(lpBaseAddress, dwRegionSize, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0) < 0) + ksft_exit_fail_msg("error %d %d %s\n", 1, errno, strerror(errno)); + } + return 0; +} + +int sanity_tests_sd(void) +{ + char *mem, *m[2]; + int mem_size, vec_size, ret, ret2, ret3, i, num_pages = 10; + struct page_region *vec; + + vec_size = 100; + mem_size = num_pages * 
page_size; + + vec = malloc(sizeof(struct page_region) * vec_size); + if (!vec) + ksft_exit_fail_msg("error nomem\n"); + mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + wp_init(mem, mem_size); + + /* 1. wrong operation */ + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, -1, + 0, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) < 0, + "%s wrong flag specified\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 8, + 0, 0x1111, 0, 0, PAGE_IS_WRITTEN) < 0, + "%s wrong mask specified\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, + 0, PAGE_IS_WRITTEN, 0, 0, 0x1000) < 0, + "%s wrong return mask specified\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, + PAGEMAP_WP_ENGAGE | 0x32, + 0, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) < 0, + "%s mixture of correct and wrong flag\n", __func__); + + /* 2. Clear area with larger vec size */ + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + ksft_test_result(ret >= 0, "%s Clear area with larger vec size\n", __func__); + + /* 3. Repeated pattern of dirty and non-dirty pages */ + for (i = 0; i < mem_size; i += 2 * page_size) + mem[i]++; + + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == mem_size/(page_size * 2), + "%s Repeated pattern of dirty and non-dirty pages\n", __func__); + + /* 4. 
Repeated pattern of dirty and non-dirty pages in parts */ + ret = pagemap_ioctl(mem, mem_size, vec, num_pages/5, PAGEMAP_WP_ENGAGE, + num_pages/2 - 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ret2 = pagemap_ioctl(mem, mem_size, vec, 2, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (ret2 < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret2, errno, strerror(errno)); + + ret3 = pagemap_ioctl(mem, mem_size, vec, num_pages/2, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (ret3 < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret3, errno, strerror(errno)); + + ksft_test_result((ret + ret3) == num_pages/2 && ret2 == 2, + "%s Repeated pattern of dirty and non-dirty pages in parts\n", __func__); + + wp_free(mem, mem_size); + munmap(mem, mem_size); + + /* 5. Two regions */ + m[0] = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (m[0] == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + m[1] = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (m[1] == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + wp_init(m[0], mem_size); + wp_init(m[1], mem_size); + + memset(m[0], 'a', mem_size); + memset(m[1], 'b', mem_size); + + ret = pagemap_ioctl(m[0], mem_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ret = pagemap_ioctl(m[1], mem_size, vec, 1, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].len == mem_size/page_size, + "%s Two regions\n", __func__); + + wp_free(m[0], mem_size); + wp_free(m[1], mem_size); + munmap(m[0], mem_size); + munmap(m[1], mem_size); + + free(vec); + return 0; +} + +int base_tests(char *prefix, char *mem, int mem_size, int skip) +{ + int vec_size, ret, dirty, dirty2; + 
struct page_region *vec, *vec2; + + if (skip) { + ksft_test_result_skip("%s all new pages must be soft dirty\n", prefix); + ksft_test_result_skip("%s all pages must not be soft dirty\n", prefix); + ksft_test_result_skip("%s all pages dirty other than first and the last one\n", + prefix); + ksft_test_result_skip("%s only middle page dirty\n", prefix); + ksft_test_result_skip("%s only two middle pages dirty\n", prefix); + ksft_test_result_skip("%s only get 2 dirty pages and clear them as well\n", prefix); + ksft_test_result_skip("%s Range clear only\n", prefix); + return 0; + } + + vec_size = mem_size/page_size; + vec = malloc(sizeof(struct page_region) * vec_size); + vec2 = malloc(sizeof(struct page_region) * vec_size); + + /* 1. all new pages must not be soft dirty */ + dirty = pagemap_ioctl(mem, mem_size, vec, 1, PAGEMAP_WP_ENGAGE, vec_size - 2, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + dirty2 = pagemap_ioctl(mem, mem_size, vec2, 1, PAGEMAP_WP_ENGAGE, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (dirty2 < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty2, errno, strerror(errno)); + + ksft_test_result(dirty == 0 && dirty2 == 0, + "%s all new pages must be soft dirty\n", prefix); + + /* 2. all pages must not be soft dirty */ + dirty = pagemap_ioctl(mem, mem_size, vec, 1, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + ksft_test_result(dirty == 0, "%s all pages must not be soft dirty\n", prefix); + + /* 3.
all pages dirty other than first and the last one */ + memset(mem + page_size, 0, mem_size - (2 * page_size)); + + dirty = pagemap_ioctl(mem, mem_size, vec, 1, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + ksft_test_result(dirty == 1 && vec[0].len >= vec_size - 2 && vec[0].len <= vec_size, + "%s all pages dirty other than first and the last one\n", prefix); + + /* 4. only middle page dirty */ + clear_softdirty_wp(mem, mem_size); + mem[vec_size/2 * page_size]++; + + dirty = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, PAGE_IS_WRITTEN, 0, 0, + PAGE_IS_WRITTEN); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + ksft_test_result(dirty == 1 && vec[0].len >= 1, + "%s only middle page dirty\n", prefix); + + /* 5. only two middle pages dirty and walk over only middle pages */ + clear_softdirty_wp(mem, mem_size); + mem[vec_size/2 * page_size]++; + mem[(vec_size/2 + 1) * page_size]++; + + dirty = pagemap_ioctl(&mem[vec_size/2 * page_size], 2 * page_size, vec, 1, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + ksft_test_result(dirty == 1 && vec[0].start == (uintptr_t)(&mem[vec_size/2 * page_size]) && + vec[0].len == 2, + "%s only two middle pages dirty\n", prefix); + + /* 6. 
only get 2 dirty pages and clear them as well */ + memset(mem, -1, mem_size); + + /* get and clear second and third pages */ + ret = pagemap_ioctl(mem + page_size, 2 * page_size, vec, 1, PAGEMAP_WP_ENGAGE, + 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + dirty = pagemap_ioctl(mem, mem_size, vec2, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].len == 2 && + vec[0].start == (uintptr_t)(mem + page_size) && + dirty == 2 && vec2[0].len == 1 && vec2[0].start == (uintptr_t)mem && + vec2[1].len == vec_size - 3 && + vec2[1].start == (uintptr_t)(mem + 3 * page_size), + "%s only get 2 dirty pages and clear them as well\n", prefix); + + /* 7. Range clear only */ + memset(mem, -1, mem_size); + + dirty = pagemap_ioctl(mem, mem_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + dirty2 = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (dirty2 < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty2, errno, strerror(errno)); + + ksft_test_result(dirty == 0 && dirty2 == 0, "%s Range clear only\n", + prefix); + + free(vec); + free(vec2); + return 0; +} + +void *gethugepage(int map_size) +{ + int ret; + char *map; + + map = memalign(hpage_size, map_size); + if (!map) + ksft_exit_fail_msg("memalign failed %d %s\n", errno, strerror(errno)); + + ret = madvise(map, map_size, MADV_HUGEPAGE); + if (ret) + ksft_exit_fail_msg("madvise failed %d %d %s\n", ret, errno, strerror(errno)); + + wp_init(map, map_size); + + if (check_huge_anon(map, map_size/hpage_size, hpage_size)) + return map; + + free(map); + return NULL; + +} + +int hpage_unit_tests(void) +{ + char *map; + int ret; + size_t num_pages = 10; + int map_size = hpage_size 
* num_pages; + int vec_size = map_size/page_size; + struct page_region *vec, *vec2; + + vec = malloc(sizeof(struct page_region) * vec_size); + vec2 = malloc(sizeof(struct page_region) * vec_size); + if (!vec || !vec2) + ksft_exit_fail_msg("malloc failed\n"); + + map = gethugepage(map_size); + if (map) { + /* 1. all new huge page must not be dirty */ + ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 0, "%s all new huge page must be dirty\n", __func__); + + /* 2. all the huge page must not be dirty */ + ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 0, "%s all the huge page must not be dirty\n", __func__); + + /* 3. all the huge page must be dirty and clear dirty as well */ + memset(map, -1, map_size); + ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].start == (uintptr_t)map && + vec[0].len == vec_size && vec[0].bitmap == PAGE_IS_WRITTEN, + "%s all the huge page must be dirty and clear\n", __func__); + + /* 4. 
only middle page dirty */ + wp_free(map, map_size); + free(map); + map = gethugepage(map_size); + wp_init(map, map_size); + clear_softdirty_wp(map, map_size); + map[vec_size/2 * page_size]++; + + ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].len > 0, + "%s only middle page dirty\n", __func__); + + wp_free(map, map_size); + free(map); + } else { + ksft_test_result_skip("all new huge page must be dirty\n"); + ksft_test_result_skip("all the huge page must not be dirty\n"); + ksft_test_result_skip("all the huge page must be dirty and clear\n"); + ksft_test_result_skip("only middle page dirty\n"); + } + + /* 5. clear first half of huge page */ + map = gethugepage(map_size); + if (map) { + + memset(map, 0, map_size); + + ret = pagemap_ioctl(map, map_size/2, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].len == vec_size/2 && + vec[0].start == (uintptr_t)(map + map_size/2), + "%s clear first half of huge page\n", __func__); + wp_free(map, map_size); + free(map); + } else { + ksft_test_result_skip("clear first half of huge page\n"); + } + + /* 6. 
clear first half of huge page with limited buffer */ + map = gethugepage(map_size); + if (map) { + memset(map, 0, map_size); + + ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_WP_ENGAGE, + vec_size/2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].len == vec_size/2 && + vec[0].start == (uintptr_t)(map + map_size/2), + "%s clear first half of huge page with limited buffer\n", + __func__); + wp_free(map, map_size); + free(map); + } else { + ksft_test_result_skip("clear first half of huge page with limited buffer\n"); + } + + /* 7. clear second half of huge page */ + map = gethugepage(map_size); + if (map) { + memset(map, -1, map_size); + ret = pagemap_ioctl(map + map_size/2, map_size/2, NULL, 0, PAGEMAP_WP_ENGAGE, + 0, 0, 0, 0, 0); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec[0].len == vec_size/2, + "%s clear second half huge page\n", __func__); + wp_free(map, map_size); + free(map); + } else { + ksft_test_result_skip("clear second half huge page\n"); + } + + free(vec); + free(vec2); + return 0; +} + +int unmapped_region_tests(void) +{ + void *start = (void *)0x10000000; + int dirty, len = 0x00040000; + int vec_size = len / page_size; + struct page_region *vec = malloc(sizeof(struct page_region) * vec_size); + + /* 1. 
Get dirty pages */ + dirty = pagemap_ioctl(start, len, vec, vec_size, 0, 0, PAGEMAP_NON_WRITTEN_BITS, 0, 0, + PAGEMAP_NON_WRITTEN_BITS); + if (dirty < 0) + ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno)); + + ksft_test_result(dirty >= 0, "%s Get status of pages\n", __func__); + + free(vec); + return 0; +} + +static void test_simple(void) +{ + int i; + char *map; + struct page_region vec; + + map = aligned_alloc(page_size, page_size); + if (!map) + ksft_exit_fail_msg("aligned_alloc failed\n"); + wp_init(map, page_size); + + clear_softdirty_wp(map, page_size); + + for (i = 0 ; i < TEST_ITERATIONS; i++) { + if (pagemap_ioctl(map, page_size, &vec, 1, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) == 1) { + ksft_print_msg("dirty bit was 1, but should be 0 (i=%d)\n", i); + break; + } + + clear_softdirty_wp(map, page_size); + /* Write something to the page to get the dirty bit enabled on the page */ + map[0]++; + + if (pagemap_ioctl(map, page_size, &vec, 1, 0, 0, + PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) == 0) { + ksft_print_msg("dirty bit was 0, but should be 1 (i=%d)\n", i); + break; + } + + clear_softdirty_wp(map, page_size); + } + wp_free(map, page_size); + free(map); + + ksft_test_result(i == TEST_ITERATIONS, "Test %s\n", __func__); +} + +int sanity_tests(void) +{ + char *mem, *fmem; + int mem_size, vec_size, ret; + struct page_region *vec; + + /* 1. 
wrong operation */ + mem_size = 10 * page_size; + vec_size = mem_size / page_size; + + vec = malloc(sizeof(struct page_region) * vec_size); + mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED || !vec) + ksft_exit_fail_msg("error nomem\n"); + + wp_init(mem, mem_size); + + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0, + PAGEMAP_BITS_ALL, 0, 0, PAGEMAP_BITS_ALL) < 0, + "%s clear op can only be specified with PAGE_IS_WRITTEN\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + PAGEMAP_BITS_ALL, 0, 0, PAGEMAP_BITS_ALL) >= 0, + "%s required_mask specified\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + 0, PAGEMAP_BITS_ALL, 0, PAGEMAP_BITS_ALL) >= 0, + "%s anyof_mask specified\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + 0, 0, PAGEMAP_BITS_ALL, PAGEMAP_BITS_ALL) >= 0, + "%s excluded_mask specified\n", __func__); + ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + PAGEMAP_BITS_ALL, PAGEMAP_BITS_ALL, 0, + PAGEMAP_BITS_ALL) >= 0, + "%s required_mask and anyof_mask specified\n", __func__); + wp_free(mem, mem_size); + munmap(mem, mem_size); + + /* 2. Get sd and present pages with anyof_mask */ + mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + wp_init(mem, mem_size); + + memset(mem, 0, mem_size); + + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + 0, PAGEMAP_BITS_ALL, 0, PAGEMAP_BITS_ALL); + ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size && + vec[0].bitmap == (PAGE_IS_WRITTEN | PAGE_IS_PRESENT), + "%s Get sd and present pages with anyof_mask\n", __func__); + + /* 3.
Get sd and present pages with required_mask */ + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + PAGEMAP_BITS_ALL, 0, 0, PAGEMAP_BITS_ALL); + ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size && + vec[0].bitmap == (PAGE_IS_WRITTEN | PAGE_IS_PRESENT), + "%s Get all the pages with required_mask\n", __func__); + + /* 4. Get sd and present pages with required_mask and anyof_mask */ + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + PAGE_IS_WRITTEN, PAGE_IS_PRESENT, 0, PAGEMAP_BITS_ALL); + ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size && + vec[0].bitmap == (PAGE_IS_WRITTEN | PAGE_IS_PRESENT), + "%s Get sd and present pages with required_mask and anyof_mask\n", + __func__); + + /* 5. Don't get sd pages */ + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + 0, 0, PAGE_IS_WRITTEN, PAGEMAP_BITS_ALL); + ksft_test_result(ret == 0, "%s Don't get sd pages\n", __func__); + + /* 6. Don't get present pages */ + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + 0, 0, PAGE_IS_PRESENT, PAGEMAP_BITS_ALL); + ksft_test_result(ret == 0, "%s Don't get present pages\n", __func__); + + wp_free(mem, mem_size); + munmap(mem, mem_size); + + /* 8. Find dirty present pages with return mask */ + mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + wp_init(mem, mem_size); + + memset(mem, 0, mem_size); + + ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, + 0, PAGEMAP_BITS_ALL, 0, PAGE_IS_WRITTEN); + ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size && + vec[0].bitmap == PAGE_IS_WRITTEN, + "%s Find dirty present pages with return mask\n", __func__); + wp_free(mem, mem_size); + munmap(mem, mem_size); + + /* 9. 
Memory mapped file */ + int fd; + struct stat sbuf; + + fd = open(__FILE__, O_RDONLY); + if (fd < 0) { + ksft_test_result_skip("%s Memory mapped file\n", __func__); + goto free_vec_and_return; + } + + ret = stat(__FILE__, &sbuf); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + fmem = mmap(NULL, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0); + if (fmem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + ret = pagemap_ioctl(fmem, sbuf.st_size, vec, vec_size, 0, 0, + 0, PAGEMAP_NON_WRITTEN_BITS, 0, PAGEMAP_NON_WRITTEN_BITS); + + ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)fmem && + vec[0].len == ceilf((float)sbuf.st_size/page_size) && + vec[0].bitmap == PAGE_IS_FILE, + "%s Memory mapped file\n", __func__); + + munmap(fmem, sbuf.st_size); + close(fd); + +free_vec_and_return: + free(vec); + return 0; +} + +int mprotect_tests(void) +{ + int ret; + char *mem, *mem2; + struct page_region vec; + int pagemap_fd = open("/proc/self/pagemap", O_RDONLY); + + if (pagemap_fd < 0) { + fprintf(stderr, "open() failed\n"); + exit(1); + } + + /* 1. Map two pages */ + mem = mmap(0, 2 * page_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + wp_init(mem, 2 * page_size); + + /* Populate both pages. */ + memset(mem, 1, 2 * page_size); + + ret = pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN, + 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec.len == 2, "%s Both pages dirty\n", __func__); + + /* 2. Start softdirty tracking. Clear VM_SOFTDIRTY and clear the softdirty PTE bit.
*/ + ret = pagemap_ioctl(mem, 2 * page_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN, + 0, 0, PAGE_IS_WRITTEN) == 0, + "%s Both pages are not soft dirty\n", __func__); + + /* 3. Remap the second page */ + mem2 = mmap(mem + page_size, page_size, PROT_READ|PROT_WRITE, + MAP_PRIVATE|MAP_ANON|MAP_FIXED, -1, 0); + if (mem2 == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + wp_init(mem2, page_size); + + /* Protect + unprotect. */ + mprotect(mem, 2 * page_size, PROT_READ); + mprotect(mem, 2 * page_size, PROT_READ|PROT_WRITE); + + /* Modify both pages. */ + memset(mem, 2, 2 * page_size); + + ret = pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN, + 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec.len == 2, + "%s Both pages dirty after remap and mprotect\n", __func__); + + /* 4. 
Clear and make the pages dirty */ + ret = pagemap_ioctl(mem, 2 * page_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0, + 0, 0, 0, 0); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + memset(mem, 'A', 2 * page_size); + + ret = pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN, + 0, 0, PAGE_IS_WRITTEN); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && vec.len == 2, + "%s Clear and make the pages dirty\n", __func__); + + wp_free(mem, 2 * page_size); + munmap(mem, 2 * page_size); + return 0; +} + +int main(void) +{ + char *mem, *map; + int mem_size; + + ksft_print_header(); + ksft_set_plan(54); + + page_size = getpagesize(); + hpage_size = read_pmd_pagesize(); + + pagemap_fd = open(PAGEMAP, O_RDWR); + if (pagemap_fd < 0) + return -EINVAL; + + if (init_uffd()) + ksft_exit_fail_msg("uffd init failed\n"); + + /* + * Soft-dirty PTE bit tests + */ + + /* 1. Sanity testing */ + sanity_tests_sd(); + + /* 2. Normal page testing */ + mem_size = 10 * page_size; + mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + wp_init(mem, mem_size); + + base_tests("Page testing:", mem, mem_size, 0); + + wp_free(mem, mem_size); + munmap(mem, mem_size); + + /* 3. Large page testing */ + mem_size = 512 * 10 * page_size; + mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + wp_init(mem, mem_size); + + base_tests("Large Page testing:", mem, mem_size, 0); + + wp_free(mem, mem_size); + munmap(mem, mem_size); + + /* 4. Huge page testing */ + map = gethugepage(hpage_size); + if (map) { + base_tests("Huge page testing:", map, hpage_size, 0); + wp_free(map, hpage_size); + free(map); + } else { + base_tests("Huge page testing:", NULL, 0, 1); + } + + /* 6. 
Huge page tests */ + hpage_unit_tests(); + + /* 7. Iterative test */ + test_simple(); + + /* 8. Mprotect test */ + mprotect_tests(); + + /* + * Other PTE bit tests + */ + + /* 1. Sanity testing */ + sanity_tests(); + + /* 2. Unmapped address test */ + unmapped_region_tests(); + + close(pagemap_fd); + return ksft_exit_pass(); +}
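A note on the "Memory mapped file" check in sanity_tests(): it computes the expected number of pages as ceilf((float)sbuf.st_size/page_size). A float has a 24-bit significand, so a size above 16 MiB may be rounded before the division happens. The sketch below shows integer ceiling division as one possible alternative; pages_spanned() is a hypothetical helper name used only for illustration and is not part of this patch.

```c
#include <stddef.h>

/*
 * Number of pages needed to cover `size` bytes.
 *
 * Hypothetical helper, not part of the patch: it only illustrates
 * integer ceiling division as an alternative to ceilf() on a float.
 * Assumes page_size is non-zero.
 */
static size_t pages_spanned(size_t size, size_t page_size)
{
	/* Divide first, then round up, so size + page_size - 1 cannot overflow. */
	return size / page_size + (size % page_size != 0);
}
```

With a helper along these lines, the comparison against vec[0].len would stay in integer arithmetic, and the test itself would no longer need ceilf() from math.h.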
linux-kselftest-mirror@lists.linaro.org