On 4/7/23 1:00 AM, Michał Mirosław wrote:
On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
Hello,
Thank you so much for the review. Do you have any thoughts on the build error on the arc architecture? https://lore.kernel.org/all/e3c82373-256a-6297-bcb4-5e1179a2cbe2@collabora.c...
Maybe copy HPAGE_* defines from x86 and key on CONFIG_PGTABLE_LEVELS > 2? I don't know much about arc arch, though.
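Something like mirroring the x86 definitions, keyed on the pagetable levels, perhaps (very rough and untested sketch; I'm only assuming arc provides PMD_SHIFT in the >2-level configurations):

#if CONFIG_PGTABLE_LEVELS > 2
/* Derive the huge-page geometry from the PMD level, as x86 does. */
#define HPAGE_SHIFT	PMD_SHIFT
#define HPAGE_SIZE	(_AC(1, UL) << HPAGE_SHIFT)
#define HPAGE_MASK	(~(HPAGE_SIZE - 1))
#endif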
On 4/6/23 8:52 PM, Michał Mirosław wrote:
On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
[...]
+#define PM_SCAN_BITMAP(wt, file, present, swap) \
+	(wt | file << 1 | present << 2 | swap << 3)
Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not have to worry about operator precedence when passed a complex expression.
Like this?

#define PM_SCAN_BITMAP(wt, file, present, swap) \
	((wt) | (file << 1) | (present << 2) | (swap << 3))
The value would be: ( (wt) | ((file) << 1) | ... ) IOW, each parameter should have parentheses around its name.
Will do.
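So it will become:

#define PM_SCAN_BITMAP(wt, file, present, swap) \
	((wt) | ((file) << 1) | ((present) << 2) | ((swap) << 3))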
[...]
		cur->len += n_pages;
		p->found_pages += n_pages;
		if (p->max_pages && (p->found_pages == p->max_pages))
			return PM_SCAN_FOUND_MAX_PAGES;
		return 0;
	}

	if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
It looks that `if (p->vec_index < p->vec_len)` is enough here - if we have vec_len == 0 here, then we'd not fit the entry in the userspace buffer anyway. Am I missing something?
No. I explained it with a diagram last time: https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.c...
I'll add a concise comment here.
So it seems, but I think the code changed a bit and maybe could be simplified now? Since p->vec_len == 0 is currently not valid, the field could count only the entries available in p->vec[] -- IOW: not include p->cur in the count.
I see. But this won't work, as we need to count p->cur so that we don't go above the maximum count, p->vec_size.
BTW, `if (no space) return -ENOSPC` will avoid additional indentation for the non-merging case.
I'll update.
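Something like this, I think (sketch; just the inverted form of the existing condition):

	/* No space for a new entry: bail out early so the non-merging
	 * path below needs no extra indentation. */
	if (p->vec_index && (p->vec_index + 1) >= p->vec_len)
		return -ENOSPC;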
[...]
+static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
+				       struct page_region __user *vec,
+				       unsigned long *vec_index)
..._deposit() is used only in a single place - please inline.
It is already inline.
Sorry. I mean: please paste the code in place of the single call.
I've made it a separate function to make the code look better in the caller and logically easier to understand. This function is ugly. do_pagemap_scan() is also already a very long function with lots of things happening. If you still insist, I'll remove this function.
[...]
/*
 * Break the huge page into small pages if the WP operation needs to
 * be performed on a portion of the huge page or if the max_pages
 * limit would be exceeded.
BTW, could the `max_pages` limit be relaxed a bit (in that it would be possible to return more pages if they all merge into the last vector entry) so that it would not need to split an otherwise-matching huge page? It would remove the need for this special handling in the kernel and avoid splitting the page from this read-only-appearing ioctl.
No, this cannot be done. Otherwise we won't be able to emulate the Windows syscall GetWriteWatch(), which specifies the exact number of pages. Usually, either the user will not use THP or will not perform the operation on a partial huge page. So this part is only there to keep things correct for those users who do use THP and perform partial pagemap_scan operations.
I see that `GetWriteWatch` returns a list of pages, not ranges of pages. That makes sense (more or less). (BTW, it could be emulated in userspace by caching the last not-fully-consumed range.)
First of all, caching is avoided, as state would then need to be maintained. This would probably not be accepted in Wine upstream later. Secondly, even if we had a cache, a Get + WP operation would not be accurate when we ask for only N pages but it acts on N + X pages, where the X extra pages will not be consumed by the application at this time.
*/
if (is_written && PM_SCAN_OP_IS_WP(p) &&
    ((end - start < HPAGE_SIZE) ||
     (p->max_pages &&
      (p->max_pages - p->found_pages) < n_pages))) {
	split_huge_pmd(vma, pmd, start);
	goto process_smaller_pages;
}

if (p->max_pages &&
    p->found_pages + n_pages > p->max_pages)
	n_pages = p->max_pages - p->found_pages;

ret = pagemap_scan_output(is_written, is_file, is_present,
			  is_swap, p, start, n_pages);
if (ret < 0)
	return ret;
So let's simplify this:
if (p->max_pages && n_pages > max_pages - found_pages)
	n_pages = max_pages - found_pages;

if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
	split_thp();
	goto process_smaller_pages;
}
Clever!! This looks very sleek.
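So the hunk would become roughly (sketch using the existing variables):

if (p->max_pages && n_pages > p->max_pages - p->found_pages)
	n_pages = p->max_pages - p->found_pages;

if (is_written && PM_SCAN_OP_IS_WP(p) &&
    n_pages != HPAGE_SIZE / PAGE_SIZE) {
	split_huge_pmd(vma, pmd, start);
	goto process_smaller_pages;
}

ret = pagemap_scan_output(is_written, is_file, is_present,
			  is_swap, p, start, n_pages);
if (ret < 0)
	return ret;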
BTW, THP handling could be extracted to a function that would return -EAGAIN if it has split the page or it wasn't a THP -- and that would mean `goto process_smaller_pages`.
Other functions in this file handle THP in this same way. So it feels more intuitive to follow the same pattern in this file.
Why not propagate the error from uffd_wp_range()?
uffd_wp_range() returns its status in a long. We cannot return a long from this function. So instead of casting the long to int and returning it, I've used -EINVAL. Would the following be more suitable?
long ret2 = uffd_wp_range(vma, start, HPAGE_SIZE, true);

if (ret2 < 0)
	return (int)ret2;
I think it's ok, since negative values are expected to be error codes. And since you can't overflow an int with HPAGE_SIZE worth of pages, I wouldn't use `ret2` but would cast the return value and add a comment explaining why it's safe.
I'll update.
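Something like this then (sketch; reusing the existing int ret, as I understand your suggestion):

	/* The cast is safe: uffd_wp_range() operates on at most
	 * HPAGE_SIZE worth of pages here, which fits in an int. */
	ret = (int)uffd_wp_range(vma, start, HPAGE_SIZE, true);
	if (ret < 0)
		return ret;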
[...]
start = (unsigned long)untagged_addr(arg.start);
vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
Is the inner cast needed?
arg.vec remains 64-bit on 32-bit systems. So casting a 64-bit value directly to a struct page_region pointer errors out. So I've added a cast to unsigned long first, before casting to a pointer.
I see. So to convey the intention, `arg.start` and `arg.vec` should be cast to unsigned long, not the `untagged_addr()` return values.
I'll update.
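So it becomes (sketch):

start = untagged_addr((unsigned long)arg.start);
vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);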
ret = pagemap_scan_args_valid(&arg, start, vec);
if (ret)
	return ret;
end = start + arg.len;
p.max_pages = arg.max_pages;
p.found_pages = 0;
p.flags = arg.flags;
p.required_mask = arg.required_mask;
p.anyof_mask = arg.anyof_mask;
p.excluded_mask = arg.excluded_mask;
p.return_mask = arg.return_mask;
p.cur.len = 0;
p.cur.start = 0;
p.vec = NULL;
p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
Nit: parentheses are not needed here, please remove.
Will do.
/*
 * Allocate a smaller buffer to get output from inside the page walk
 * functions and walk the page range in PAGEMAP_WALK_SIZE chunks, as
 * we want to return the output to the user in compact form where no
 * two consecutive regions are contiguous and have the same flags.
 * So store the latest element in p.cur between different walks and
 * write p.cur to the user buffer at the end of the walk.
*/
p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
		      GFP_KERNEL);
if (!p.vec)
	return -ENOMEM;
walk_start = walk_end = start;
while (walk_end < end && !ret) {
The loop will stop if a previous iteration returned ENOSPC (and the error will be lost) - is it intended?
It is intentional. -ENOSPC means that the user buffer is full even though there was more memory to walk over. We don't treat this as an error. So when the buffer gets full, we stop walking further and return success.
Thanks. What's the difference between -ENOSPC and PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code flow).
-ENOSPC --> the user buffer has been filled completely
PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found; the user buffer may still have more space
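To make that concrete (rough sketch only, not the exact code), both of them just terminate the walk loop and neither is propagated to userspace as an error:

while (walk_end < end && !ret) {
	/* ... walk one PAGEMAP_WALK_SIZE chunk ... */
}

/* "Buffer full" and "max_pages reached" are not errors. */
if (ret == -ENOSPC || ret == PM_SCAN_FOUND_MAX_PAGES)
	ret = 0;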
[...]
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
#else /* CONFIG_USERFAULTFD */
+static inline long uffd_wp_range(struct mm_struct *dst_mm,
+				 struct vm_area_struct *vma,
+				 unsigned long start, unsigned long len,
+				 bool enable_wp)
+{
+	return 0;
+}
[...]
Shouldn't this part be in the patch introducing uffd_wp_range()?
We have not added uffd_wp_range() in the previous patches. We just need this stub in this patch for the case when CONFIG_USERFAULTFD isn't enabled.
I had this as a separate patch before this one. Mike asked me to merge it with this patch in order not to break bisectability. [1] https://lore.kernel.org/all/ZBK+86eMMazwfhdx@kernel.org
I would understand the reply [1] to mean that the uffd_wp_range() stub should go in the same patch where uffd_wp_range() is implemented. But uffd_wp_range() is already in (since f369b07c86140) so I don't see how having the stub in a separate commit sequenced before this one could break bisect?
Sorry, I misinterpreted it. I'll make it a separate patch.
Best Regards
Michał Mirosław