Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

7 Apr 2023

      On 4/7/23 12:34 PM, Michał Mirosław wrote:
...
On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
usama.anjum@collabora.com wrote:
...
On 4/7/23 1:00 AM, Michał Mirosław wrote:
...
On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
usama.anjum@collabora.com wrote:
[...]
...
...
...
...
...

          cur->len += n_pages;

          p->found_pages += n_pages;

          if (p->max_pages && (p->found_pages == p->max_pages))

                  return PM_SCAN_FOUND_MAX_PAGES;

          return 0;

  }

  if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {

It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
have vec_len == 0 here, then we'd not fit the entry in the userspace
buffer anyway. Am I missing something?
No. I'd explained it with diagram last time:
https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.c...
I'll add a concise comment here.
So it seems, but I think the code changed a bit and maybe could be
simplified now? Since p->vec_len == 0 is currently not valid, the
field could count only the entries available in p->vec[] -- IOW: not
include p->cur in the count.
I see. But this'll not work as we need to count p->cur to don't go above
the maximum count, p->vec_size.
You can subtract 1 from p->vec_size before the page walk to account
for the buffer in `cur`.
Yeah, it can be done. But I've thought about it. It would look more uglier.
I'll have to subtract 1 from vec_len in the start and then add 1 in the
end. The current algorithm seems simpler.
...
[...]
...
...
...
...
...
+static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,

                                 struct page_region __user *vec,

                                 unsigned long *vec_index)

..._deposit() is used only in single place - please inline.
It is already inline.
Sorry. I mean: please paste the code in place of the single call.
I've made it a separate function to make the code look better in the caller
function and logically easier to understand. This function is ugly.
do_pagemap_scan() is also already very long function with lots of things
happening. If you still insist, I'll remove this function.
Please do remove - it will make the copy to userspace code all neatly together.
Will remove it.
...
[...]
...
...
...
...
...

           */

          if (is_written && PM_SCAN_OP_IS_WP(p) &&

              ((end - start < HPAGE_SIZE) ||

               (p->max_pages &&

                (p->max_pages - p->found_pages) < n_pages))) {

                  split_huge_pmd(vma, pmd, start);

                  goto process_smaller_pages;

          }

          if (p->max_pages &&

              p->found_pages + n_pages > p->max_pages)

                  n_pages = p->max_pages - p->found_pages;

          ret = pagemap_scan_output(is_written, is_file, is_present,

                                    is_swap, p, start, n_pages);

          if (ret < 0)

                  return ret;

So let's simplify this:
if (p->max_pages && n_pages > max_pages - found_pages)
  n_pages = max_pages - found_pages;
if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
  split_thp();
  goto process_smaller_pages;
}
Clever!! This looks very sleek.
...
BTW, THP handling could be extracted to a function that would return
-EAGAIN if it has split the page or it wasn't a THP -- and that would
mean `goto process_smaller_pages`.
Other functions in this file handle the THP in this same way. So it feels
like more intuitive that we follow to same pattern in this file.
I'll leave it to you. Extracting THP support would avoid a goto and
#ifdef inside a function, though (and make the function smaller).
...
...
...
...
...

  /*

   * Allocate smaller buffer to get output from inside the page walk

   * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As

   * we want to return output to user in compact form where no two

   * consecutive regions should be continuous and have the same flags.

   * So store the latest element in p.cur between different walks and

   * store the p.cur at the end of the walk to the user buffer.

   */

  p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),

                        GFP_KERNEL);

  if (!p.vec)

          return -ENOMEM;

  walk_start = walk_end = start;

  while (walk_end < end && !ret) {

The loop will stop if a previous iteration returned ENOSPC (and the
error will be lost) - is it intended?
It is intentional. -ENOSPC means that the user buffer is full even though
there was more memory to walk over. We don't treat this error. So when
buffer gets full, we stop walking over further as user buffer has gotten
full and return as success.
Thanks. What's the difference between -ENOSPC and
PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
flow).
-ENOSPC --> user buffer has been filled completely
PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
                            still have more space
What is the difference in code behaviour when those two cases are
compared? (I'd expect none.)
There is difference:
We add data to user buffer. If it succeeds with return code 0, we engage
the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
to the user buffer.
...
Best Regards
Michał Mirosław
-- 
BR,
Muhammad Usama Anjum

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs