On Thu, 15 Jun 2023 at 15:58, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
I'll send next revision now. On 6/14/23 11:00 PM, Michał Mirosław wrote:
(A quick reply to answer open questions in case they help the next version.)
On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 6/14/23 8:14 PM, Michał Mirosław wrote:
On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
On 6/14/23 3:36 AM, Michał Mirosław wrote:
On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
[...]
> + if (cur_buf->bitmap == bitmap && > + cur_buf->start + cur_buf->len * PAGE_SIZE == addr) { > + cur_buf->len += n_pages; > + p->found_pages += n_pages; > + } else { > + if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len) > + return -ENOMEM;
Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel ran out of memory when allocating, not that there is no space in a user-provided buffer.
There are 3 kinds of return values here:
- PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the
page walk from next entry
- 0 ---> continue the page walk
- -ENOMEM --> Abort the page walk from current entry, user buffer is full
which is not error, but only a stop signal. This -ENOMEM is just differentiater from (1). This -ENOMEM is for internal use and isn't returned to user.
But why ENOSPC is not good here? I was used before, I think.
-ENOSPC is being returned in form of true error from pagemap_scan_hugetlb_entry(). So I'd to remove -ENOSPC from here as it wasn't true error here, it was only a way to abort the walk immediately. I'm liking the following erturn code from here now:
#define PM_SCAN_BUFFER_FULL (-256)
I guess this will be reworked anyway, but I'd prefer this didn't need custom errors etc. If we agree to decoupling the selection and GET output, it could be:
bool is_interesting_page(p, flags); // this one does the required/anyof/excluded match size_t output_range(p, start, len, flags); // this one fills the output vector and returns how many pages were fit
In this setup, `is_interesting_page() && (n_out = output_range()) < n_pages` means this is the final range, no more will fit. And if `n_out == 0` then no pages fit and no WP is needed (no other special cases).
Right now, pagemap_scan_output() performs the work of both of these two functions. The part can be broken into is_interesting_pages() and we can leave the remaining part as it is.
Saying that n_out < n_pages tells us the buffer is full covers one case. But there is case of maximum pages have been found and walk needs to be aborted.
This case is exactly what `n_out < n_pages` will cover (if scan_output uses max_pages properly to limit n_out). Isn't it that when the buffer is full we want to abort the scan always (with WP if `n_out > 0`)?
For flags name: PM_REQUIRE_WRITE_ACCESS? Or Is it intended to be checked only if doing WP (as the current name suggests) and so it would be redundant as WP currently requires `p->required_mask = PAGE_IS_WRITTEN`?
This is intended to indicate that if userfaultfd is needed. If PAGE_IS_WRITTEN is mentioned in any of mask, we need to check if userfaultfd has been initialized for this memory. I'll rename to PM_SCAN_REQUIRE_UFFD.
Why do we need that check? Wouldn't `is_written = false` work for vmas not registered via uffd?
UFFD_FEATURE_WP_ASYNC and UNPOPULATED needs to be set on the memory region for it to report correct written values on the memory region. Without UFFD WP ASYNC and UNPOUPULATED defined on the memory, we consider UFFD_WP state undefined. If user hasn't initialized memory with UFFD, he has no right to set is_written = false.
How about calculating `is_written = is_uffd_registered() && is_uffd_wp()`? This would enable a user to apply GET+WP for the whole address space of a process regardless of whether all of it is registered.
I wouldn't want to check if uffd is registered again and again. This is why we are doing it only once every walk in pagemap_scan_test_walk().
There is no need to do the checks repeatedly. If I understand the code correctly, uffd registration is per-vma, so it can be communicated from test_walk to entry/hole callbacks via a field in pagemap_scan_private.
While here, I wonder if we really need to fail the call if there are unknown bits in those masks set: if this bit set is expanded with another category flags, a newer userspace run on older kernel would get EINVAL even if the "treat unknown as 0" be what it requires. There is no simple way in the API to discover what bits the kernel supports. We could allow a no-op (no WP nor GET) call to help with that and then rejecting unknown bits would make sense.
I've not seen any examples of this. But I've seen examples of returning error if kernel doesn't support a feature. Each new feature comes with a kernel version, greater than this version support this feature. If user is trying to use advanced feature which isn't present in a kernel, we should return error and not proceed to confuse the user/kernel. In fact if we look at userfaultfd_api(), we return error immediately if feature has some bit set which kernel doesn't support.
I think we should have a way of detecting the supported flags if we don't want a forward compatibility policy for flags here. Maybe it would be enough to allow all the no-op combinations for this purpose?
Again I don't think UFFD is doing anything like this.
If it's cheap and easy to provide a user with a way to detect the supported features - why not do it?
Best Regards Michał Mirosław