Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
There is no functional error introduced by the aforementioned commit, but it opens up a time window where the recently faulted and locked pages might yet not be put back into the UNEVICTABLE_LRU, thus causing a subsequent and immediate PFN flag check for the UNEVICTABLE bit to trip on false-negative errors, as it happens with this test.
This patch fix the false negative by forcefully resorting to a code path that will call a CPU pagevec drain right after the fault but before the PFN flag check takes place, sorting out the race that way.
Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") Signed-off-by: Rafael Aquini aquini@redhat.com --- tools/testing/selftests/vm/mlock2-tests.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c index 637b6d0ac0d0..26dc320ca3c9 100644 --- a/tools/testing/selftests/vm/mlock2-tests.c +++ b/tools/testing/selftests/vm/mlock2-tests.c @@ -7,6 +7,7 @@ #include <sys/time.h> #include <sys/resource.h> #include <stdbool.h> +#include <sched.h> #include "mlock2.h"
#include "../kselftest.h" @@ -328,6 +329,22 @@ static int test_mlock_lock() return ret; }
+/* + * After commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") + * changes made by calls to mlock* family might not be immediately reflected + * on the LRUs, thus checking the PFN flags might race against pagevec drain. + * + * In order to sort out that race, and get the after fault checks consistent, + * the "quick and dirty" trick below is required in order to force a call to + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to + * the unevictable LRU, as expected by the checks in this selftest. + */ +static void force_lru_add_drain_all(void) +{ + sched_yield(); + system("echo 1 > /proc/sys/vm/compact_memory"); +} + static int onfault_check(char *map) { unsigned long page_size = getpagesize(); @@ -343,6 +360,9 @@ static int onfault_check(char *map) }
*map = 'a'; + + force_lru_add_drain_all(); + page1_flags = get_pageflags((unsigned long)map); page2_flags = get_pageflags((unsigned long)map + page_size);
@@ -465,6 +485,8 @@ static int test_lock_onfault_of_present() goto unmap; }
+ force_lru_add_drain_all(); + page1_flags = get_pageflags((unsigned long)map); page2_flags = get_pageflags((unsigned long)map + page_size); page1_flags = get_kpageflags(page1_flags & PFN_MASK);
On Sat, 21 Mar 2020 21:35:25 -0400 Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
There is no functional error introduced by the aforementioned commit, but it opens up a time window where the recently faulted and locked pages might yet not be put back into the UNEVICTABLE_LRU, thus causing a subsequent and immediate PFN flag check for the UNEVICTABLE bit to trip on false-negative errors, as it happens with this test.
This patch fix the false negative by forcefully resorting to a code path that will call a CPU pagevec drain right after the fault but before the PFN flag check takes place, sorting out the race that way.
+/*
- After commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
- changes made by calls to mlock* family might not be immediately reflected
- on the LRUs, thus checking the PFN flags might race against pagevec drain.
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
static int onfault_check(char *map) { unsigned long page_size = getpagesize(); @@ -343,6 +360,9 @@ static int onfault_check(char *map) } *map = 'a';
- force_lru_add_drain_all();
- page1_flags = get_pageflags((unsigned long)map); page2_flags = get_pageflags((unsigned long)map + page_size);
@@ -465,6 +485,8 @@ static int test_lock_onfault_of_present() goto unmap; }
- force_lru_add_drain_all();
- page1_flags = get_pageflags((unsigned long)map); page2_flags = get_pageflags((unsigned long)map + page_size); page1_flags = get_kpageflags(page1_flags & PFN_MASK);
On Sat, Mar 21, 2020 at 06:43:52PM -0700, Andrew Morton wrote:
On Sat, 21 Mar 2020 21:35:25 -0400 Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
There is no functional error introduced by the aforementioned commit, but it opens up a time window where the recently faulted and locked pages might yet not be put back into the UNEVICTABLE_LRU, thus causing a subsequent and immediate PFN flag check for the UNEVICTABLE bit to trip on false-negative errors, as it happens with this test.
This patch fix the false negative by forcefully resorting to a code path that will call a CPU pagevec drain right after the fault but before the PFN flag check takes place, sorting out the race that way.
+/*
- After commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
- changes made by calls to mlock* family might not be immediately reflected
- on the LRUs, thus checking the PFN flags might race against pagevec drain.
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
static int onfault_check(char *map) { unsigned long page_size = getpagesize(); @@ -343,6 +360,9 @@ static int onfault_check(char *map) } *map = 'a';
- force_lru_add_drain_all();
- page1_flags = get_pageflags((unsigned long)map); page2_flags = get_pageflags((unsigned long)map + page_size);
@@ -465,6 +485,8 @@ static int test_lock_onfault_of_present() goto unmap; }
- force_lru_add_drain_all();
- page1_flags = get_pageflags((unsigned long)map); page2_flags = get_pageflags((unsigned long)map + page_size); page1_flags = get_kpageflags(page1_flags & PFN_MASK);
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
On Sat, Mar 21, 2020 at 09:31:42PM -0700, Andrew Morton wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
Well,
I admit I wasn't taking the approach of changing the kernel because I was thinking it would require a partial, or even full, revert of commit 9c4e6b1a7027f, and that would be increasing complexity, but on a second thought, it seems that we might just be missing:
diff --git a/mm/swap.c b/mm/swap.c index cf39d24ada2a..b1601228ded4 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -473,6 +473,7 @@ void lru_cache_add_active_or_unevictable(struct page *page, __mod_zone_page_state(page_zone(page), NR_MLOCK, hpage_nr_pages(page)); count_vm_event(UNEVICTABLE_PGMLOCKED); + SetPageUnevictable(page); } lru_cache_add(page); }
I'll take a closer look into it, as well as test it properly, tomorrow. Thanks for the heads up, Andrew. -- Rafael
On Sat, Mar 21, 2020 at 10:41 PM Rafael Aquini aquini@redhat.com wrote:
On Sat, Mar 21, 2020 at 09:31:42PM -0700, Andrew Morton wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
sched_yield();
system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
Well,
I admit I wasn't taking the approach of changing the kernel because I was thinking it would require a partial, or even full, revert of commit 9c4e6b1a7027f, and that would be increasing complexity, but on a second thought, it seems that we might just be missing:
diff --git a/mm/swap.c b/mm/swap.c index cf39d24ada2a..b1601228ded4 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -473,6 +473,7 @@ void lru_cache_add_active_or_unevictable(struct page *page, __mod_zone_page_state(page_zone(page), NR_MLOCK, hpage_nr_pages(page)); count_vm_event(UNEVICTABLE_PGMLOCKED);
SetPageUnevictable(page);
No, this is not correct. Check __pagevec_lru_add_fn() for TestClearPageUnevictable().
As I mentioned in the other email, I think the better solution would be to introduce a sysctl to drain the pageves. That way there will not be any dependency on compaction as was in your original patch.
Shakeel
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
Agreed! Tests should be more tolerant for counters imprecision. Unevictable LRU is an optimization and transition to that list is a matter of an internal implementation detail.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
No, please don't. Pagevecs draining is by far not the only batching scheme we use and an interface like this would promise users to effectivelly force flushing all of them.
Can we simply update the test to be more tolerant to imprecisions instead?
On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
- In order to sort out that race, and get the after fault checks consistent,
- the "quick and dirty" trick below is required in order to force a call to
- lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
- the unevictable LRU, as expected by the checks in this selftest.
- */
+static void force_lru_add_drain_all(void) +{
- sched_yield();
- system("echo 1 > /proc/sys/vm/compact_memory");
+}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
Agreed! Tests should be more tolerant for counters imprecision. Unevictable LRU is an optimization and transition to that list is a matter of an internal implementation detail.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
No, please don't. Pagevecs draining is by far not the only batching scheme we use and an interface like this would promise users to effectivelly force flushing all of them.
Can we simply update the test to be more tolerant to imprecisions instead?
I don't think, thouhg, that this particular test case can be entirely reduced as "counter imprecison".
The reason I think this is a different beast, is that having the page being flagged as PG_unevictable is expected part of the aftermath of a mlock* call. This selftest is, IMO, correctly verifying that fact, as it checks the functionality correctness.
The problem boils down to the fact that the page would immediately be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f changed that by "delaying" that flag setting.
As I mentioned, too, there's nothing wrong with the delayed setting of PG_unevictable, we just need to compensate for that fact in this test, which is what this patch is suggesting to do.
-- Michal Hocko SUSE Labs
On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
> + * In order to sort out that race, and get the after fault checks consistent, > + * the "quick and dirty" trick below is required in order to force a call to > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to > + * the unevictable LRU, as expected by the checks in this selftest. > + */ > +static void force_lru_add_drain_all(void) > +{ > + sched_yield(); > + system("echo 1 > /proc/sys/vm/compact_memory"); > +}
What is the sched_yield() for?
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
Agreed! Tests should be more tolerant for counters imprecision. Unevictable LRU is an optimization and transition to that list is a matter of an internal implementation detail.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
No, please don't. Pagevecs draining is by far not the only batching scheme we use and an interface like this would promise users to effectivelly force flushing all of them.
Can we simply update the test to be more tolerant to imprecisions instead?
I don't think, thouhg, that this particular test case can be entirely reduced as "counter imprecison".
The reason I think this is a different beast, is that having the page being flagged as PG_unevictable is expected part of the aftermath of a mlock* call. This selftest is, IMO, correctly verifying that fact, as it checks the functionality correctness.
The problem boils down to the fact that the page would immediately be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f changed that by "delaying" that flag setting.
As I've tried to explain in other email in this email thread. The test was exploiting a certain user visible side effect. The unevictable flag or the placement on the unevictable LRU list is are not really needed for the user contract correctness. That means that the test is not really correct. Working around that by trying to enforce kernel to comply with the test expectations is just plain wrong at least for two reasons 1) you cannot expect or event do not want userspace to do the same because the behavior might change in the future 2) the test is not really testing for correctness in the first place.
On Mon, Mar 23, 2020 at 03:51:06PM +0100, Michal Hocko wrote:
On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
> > + * In order to sort out that race, and get the after fault checks consistent, > > + * the "quick and dirty" trick below is required in order to force a call to > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to > > + * the unevictable LRU, as expected by the checks in this selftest. > > + */ > > +static void force_lru_add_drain_all(void) > > +{ > > + sched_yield(); > > + system("echo 1 > /proc/sys/vm/compact_memory"); > > +} > > What is the sched_yield() for? >
Mostly it's there to provide a sleeping gap after the fault, whithout actually adding an arbitrary value with usleep().
It's not a hard requirement, but, in some of the tests I performed (whithout that sleeping gap) I would still see around 1% chance of hitting the false-negative. After adding it I could not hit the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
Agreed! Tests should be more tolerant for counters imprecision. Unevictable LRU is an optimization and transition to that list is a matter of an internal implementation detail.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
No, please don't. Pagevecs draining is by far not the only batching scheme we use and an interface like this would promise users to effectivelly force flushing all of them.
Can we simply update the test to be more tolerant to imprecisions instead?
I don't think, thouhg, that this particular test case can be entirely reduced as "counter imprecison".
The reason I think this is a different beast, is that having the page being flagged as PG_unevictable is expected part of the aftermath of a mlock* call. This selftest is, IMO, correctly verifying that fact, as it checks the functionality correctness.
The problem boils down to the fact that the page would immediately be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f changed that by "delaying" that flag setting.
As I've tried to explain in other email in this email thread. The test was exploiting a certain user visible side effect. The unevictable flag or the placement on the unevictable LRU list is are not really needed for the user contract correctness. That means that the test is not really correct. Working around that by trying to enforce kernel to comply with the test expectations is just plain wrong at least for two reasons 1) you cannot expect or event do not want userspace to do the same because the behavior might change in the future 2) the test is not really testing for correctness in the first place.
Sorry, Michal, it seems we keep going back and forth (I just replied to your comment on the other thread)
The selftest also checks the kernel visible effect, via /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
As I mentioned before, I think it is a reasonable check, given this is a kernel selftest, although we need to compensate it for the differences between its expectations and what the kernel is doing currently.
-- Rafael
On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 03:51:06PM +0100, Michal Hocko wrote:
On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote:
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote:
> > > + * In order to sort out that race, and get the after fault checks consistent, > > > + * the "quick and dirty" trick below is required in order to force a call to > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to > > > + * the unevictable LRU, as expected by the checks in this selftest. > > > + */ > > > +static void force_lru_add_drain_all(void) > > > +{ > > > + sched_yield(); > > > + system("echo 1 > /proc/sys/vm/compact_memory"); > > > +} > > > > What is the sched_yield() for? > > > > Mostly it's there to provide a sleeping gap after the fault, whithout > actually adding an arbitrary value with usleep(). > > It's not a hard requirement, but, in some of the tests I performed > (whithout that sleeping gap) I would still see around 1% chance > of hitting the false-negative. After adding it I could not hit > the issue anymore.
It's concerning that such deep machinery as pagevec draining is visible to userspace.
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
Agreed! Tests should be more tolerant for counters imprecision. Unevictable LRU is an optimization and transition to that list is a matter of an internal implementation detail.
I suppose that for consistency and correctness we should perform a drain prior to each read from /proc/*/pagemap. Presumably this would be far too expensive.
Is there any other way? One such might be to make the MLOCK_ONFAULT pages bypass the lru_add_pvecs?
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
No, please don't. Pagevecs draining is by far not the only batching scheme we use and an interface like this would promise users to effectivelly force flushing all of them.
Can we simply update the test to be more tolerant to imprecisions instead?
I don't think, thouhg, that this particular test case can be entirely reduced as "counter imprecison".
The reason I think this is a different beast, is that having the page being flagged as PG_unevictable is expected part of the aftermath of a mlock* call. This selftest is, IMO, correctly verifying that fact, as it checks the functionality correctness.
The problem boils down to the fact that the page would immediately be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f changed that by "delaying" that flag setting.
As I've tried to explain in other email in this email thread. The test was exploiting a certain user visible side effect. The unevictable flag or the placement on the unevictable LRU list is are not really needed for the user contract correctness. That means that the test is not really correct. Working around that by trying to enforce kernel to comply with the test expectations is just plain wrong at least for two reasons 1) you cannot expect or event do not want userspace to do the same because the behavior might change in the future 2) the test is not really testing for correctness in the first place.
Sorry, Michal, it seems we keep going back and forth (I just replied to your comment on the other thread)
The selftest also checks the kernel visible effect, via /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
I really fail to see your point. Even if you are right that the self test is somehow evaluating the kernel implementation which I am not sure is the scope of the selft thest but anyway. The mere fact that the kernel test fails on a perfectly valid change should just suggest that the test is leading to false positives and therefore should be fixed. Your proposed fix is simply suboptimal because it relies on yet another side effect which might change anytime in the future and still lead to a correctly behaving kernel. See my point?
On Mon, Mar 23, 2020 at 04:12:56PM +0100, Michal Hocko wrote:
On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 03:51:06PM +0100, Michal Hocko wrote:
On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton akpm@linux-foundation.org wrote: > > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini aquini@redhat.com wrote: > > > > > + * In order to sort out that race, and get the after fault checks consistent, > > > > + * the "quick and dirty" trick below is required in order to force a call to > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to > > > > + * the unevictable LRU, as expected by the checks in this selftest. > > > > + */ > > > > +static void force_lru_add_drain_all(void) > > > > +{ > > > > + sched_yield(); > > > > + system("echo 1 > /proc/sys/vm/compact_memory"); > > > > +} > > > > > > What is the sched_yield() for? > > > > > > > Mostly it's there to provide a sleeping gap after the fault, whithout > > actually adding an arbitrary value with usleep(). > > > > It's not a hard requirement, but, in some of the tests I performed > > (whithout that sleeping gap) I would still see around 1% chance > > of hitting the false-negative. After adding it I could not hit > > the issue anymore. > > It's concerning that such deep machinery as pagevec draining is visible > to userspace. >
We already have other examples like memcg stats where the optimizations like batching per-cpu stats collection exposes differences to the userspace. I would not be that worried here.
Agreed! Tests should be more tolerant for counters imprecision. Unevictable LRU is an optimization and transition to that list is a matter of an internal implementation detail.
> I suppose that for consistency and correctness we should perform a > drain prior to each read from /proc/*/pagemap. Presumably this would > be far too expensive. > > Is there any other way? One such might be to make the MLOCK_ONFAULT > pages bypass the lru_add_pvecs? >
I would rather prefer to have something similar to /proc/sys/vm/stat_refresh which drains the pagevecs.
No, please don't. Pagevecs draining is by far not the only batching scheme we use and an interface like this would promise users to effectivelly force flushing all of them.
Can we simply update the test to be more tolerant to imprecisions instead?
I don't think, thouhg, that this particular test case can be entirely reduced as "counter imprecison".
The reason I think this is a different beast, is that having the page being flagged as PG_unevictable is expected part of the aftermath of a mlock* call. This selftest is, IMO, correctly verifying that fact, as it checks the functionality correctness.
The problem boils down to the fact that the page would immediately be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f changed that by "delaying" that flag setting.
As I've tried to explain in other email in this email thread. The test was exploiting a certain user visible side effect. The unevictable flag or the placement on the unevictable LRU list is are not really needed for the user contract correctness. That means that the test is not really correct. Working around that by trying to enforce kernel to comply with the test expectations is just plain wrong at least for two reasons 1) you cannot expect or event do not want userspace to do the same because the behavior might change in the future 2) the test is not really testing for correctness in the first place.
Sorry, Michal, it seems we keep going back and forth (I just replied to your comment on the other thread)
The selftest also checks the kernel visible effect, via /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
I really fail to see your point. Even if you are right that the self test is somehow evaluating the kernel implementation which I am not sure is the scope of the selft thest but anyway. The mere fact that the kernel test fails on a perfectly valid change should just suggest that the test is leading to false positives and therefore should be fixed. Your proposed fix is simply suboptimal because it relies on yet another side effect which might change anytime in the future and still lead to a correctly behaving kernel. See my point?
OK, I concede your point on the bogusness of checking the page flags in this particular test and expect certain valuse there, given that no other selftest seems to be doing that level of inner kenrel detail scrutiny.
I'll repost this fix suggestion getting rif of those related checkpoints.
Cheers! -- Rafael
On Mon 23-03-20 11:41:59, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 04:12:56PM +0100, Michal Hocko wrote:
On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
[...]
The selftest also checks the kernel visible effect, via /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
I really fail to see your point. Even if you are right that the self test is somehow evaluating the kernel implementation which I am not sure is the scope of the selft thest but anyway. The mere fact that the kernel test fails on a perfectly valid change should just suggest that the test is leading to false positives and therefore should be fixed. Your proposed fix is simply suboptimal because it relies on yet another side effect which might change anytime in the future and still lead to a correctly behaving kernel. See my point?
OK, I concede your point on the bogusness of checking the page flags in this particular test and expect certain valuse there, given that no other selftest seems to be doing that level of inner kenrel detail scrutiny.
I'll repost this fix suggestion getting rif of those related checkpoints.
Here is what I have after I had to context switch to something else before finishing it. Feel free to reuse if you feel like. It is likely to not even compile.
diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c index 637b6d0ac0d0..9fb1638a4d33 100644 --- a/tools/testing/selftests/vm/mlock2-tests.c +++ b/tools/testing/selftests/vm/mlock2-tests.c @@ -67,59 +67,6 @@ static int get_vm_area(unsigned long addr, struct vm_boundaries *area) return ret; }
-static uint64_t get_pageflags(unsigned long addr) -{ - FILE *file; - uint64_t pfn; - unsigned long offset; - - file = fopen("/proc/self/pagemap", "r"); - if (!file) { - perror("fopen pagemap"); - _exit(1); - } - - offset = addr / getpagesize() * sizeof(pfn); - - if (fseek(file, offset, SEEK_SET)) { - perror("fseek pagemap"); - _exit(1); - } - - if (fread(&pfn, sizeof(pfn), 1, file) != 1) { - perror("fread pagemap"); - _exit(1); - } - - fclose(file); - return pfn; -} - -static uint64_t get_kpageflags(unsigned long pfn) -{ - uint64_t flags; - FILE *file; - - file = fopen("/proc/kpageflags", "r"); - if (!file) { - perror("fopen kpageflags"); - _exit(1); - } - - if (fseek(file, pfn * sizeof(flags), SEEK_SET)) { - perror("fseek kpageflags"); - _exit(1); - } - - if (fread(&flags, sizeof(flags), 1, file) != 1) { - perror("fread kpageflags"); - _exit(1); - } - - fclose(file); - return flags; -} - #define VMFLAGS "VmFlags:"
static bool is_vmflag_set(unsigned long addr, const char *vmflag) @@ -159,19 +106,13 @@ static bool is_vmflag_set(unsigned long addr, const char *vmflag) #define RSS "Rss:" #define LOCKED "lo"
-static bool is_vma_lock_on_fault(unsigned long addr) +static unsigned long get_value_for_name(unsigned long addr, const char *name) { - bool ret = false; - bool locked; - FILE *smaps = NULL; - unsigned long vma_size, vma_rss; char *line = NULL; - char *value; size_t size = 0; - - locked = is_vmflag_set(addr, LOCKED); - if (!locked) - goto out; + char *value_ptr; + FILE *smaps = NULL; + unsigned long value = -1UL;
smaps = seek_to_smaps_entry(addr); if (!smaps) { @@ -179,44 +120,43 @@ static bool is_vma_lock_on_fault(unsigned long addr) goto out; }
- while (getline(&line, &size, smaps) > 0) { - if (!strstr(line, SIZE)) { + while (getline(&line, &size, file) > 0) { + if (!strstr(line, name)) { free(line); line = NULL; size = 0; continue; }
- value = line + strlen(SIZE); - if (sscanf(value, "%lu kB", &vma_size) < 1) { + value_ptr = line + strlen(name); + if (sscanf(value_ptr, "%lu kB", &value) < 1) { printf("Unable to parse smaps entry for Size\n"); goto out; } break; }
- while (getline(&line, &size, smaps) > 0) { - if (!strstr(line, RSS)) { - free(line); - line = NULL; - size = 0; - continue; - } - - value = line + strlen(RSS); - if (sscanf(value, "%lu kB", &vma_rss) < 1) { - printf("Unable to parse smaps entry for Rss\n"); - goto out; - } - break; - } - - ret = locked && (vma_rss < vma_size); out: - free(line); if (smaps) fclose(smaps); - return ret; + free(line); + return value; +} + +static bool is_vma_lock_on_fault(unsigned long addr) +{ + bool locked; + unsigned long vma_size, vma_rss; + + locked = is_vmflag_set(addr, LOCKED); + if (!locked) + return false + + vma_size = get_value_for_name(addr, SIZE); + vma_rss = get_value_for_name(addr, RSS); + + /* only one page is faulted in */ + return (vma_rss < vma_size); }
#define PRESENT_BIT 0x8000000000000000ULL @@ -225,40 +165,18 @@ static bool is_vma_lock_on_fault(unsigned long addr)
static int lock_check(char *map) { - unsigned long page_size = getpagesize(); - uint64_t page1_flags, page2_flags; - - page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - - /* Both pages should be present */ - if (((page1_flags & PRESENT_BIT) == 0) || - ((page2_flags & PRESENT_BIT) == 0)) { - printf("Failed to make both pages present\n"); - return 1; - } - - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - page2_flags = get_kpageflags(page2_flags & PFN_MASK); - - /* Both pages should be unevictable */ - if (((page1_flags & UNEVICTABLE_BIT) == 0) || - ((page2_flags & UNEVICTABLE_BIT) == 0)) { - printf("Failed to make both pages unevictable\n"); - return 1; - } + bool locked; + FILE *smaps = NULL; + unsigned long vma_size, vma_rss;
- if (!is_vmflag_set((unsigned long)map, LOCKED)) { - printf("VMA flag %s is missing on page 1\n", LOCKED); - return 1; - } + locked = is_vmflag_set(addr, LOCKED); + if (!locked) + return false;
- if (!is_vmflag_set((unsigned long)map + page_size, LOCKED)) { - printf("VMA flag %s is missing on page 2\n", LOCKED); - return 1; - } + vma_size = get_value_for_name(SIZE, smaps); + vma_rss = get_value_for_name(RSS, smaps);
- return 0; + return (vma_rss == vma_size); }
static int unlock_lock_check(char *map) @@ -266,16 +184,6 @@ static int unlock_lock_check(char *map) unsigned long page_size = getpagesize(); uint64_t page1_flags, page2_flags;
- page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - page2_flags = get_kpageflags(page2_flags & PFN_MASK); - - if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) { - printf("A page is still marked unevictable after unlock\n"); - return 1; - } - if (is_vmflag_set((unsigned long)map, LOCKED)) { printf("VMA flag %s is present on page 1 after unlock\n", LOCKED); return 1; @@ -333,36 +241,7 @@ static int onfault_check(char *map) unsigned long page_size = getpagesize(); uint64_t page1_flags, page2_flags;
- page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - - /* Neither page should be present */ - if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) { - printf("Pages were made present by MLOCK_ONFAULT\n"); - return 1; - } - *map = 'a'; - page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - - /* Only page 1 should be present */ - if ((page1_flags & PRESENT_BIT) == 0) { - printf("Page 1 is not present after fault\n"); - return 1; - } else if (page2_flags & PRESENT_BIT) { - printf("Page 2 was made present\n"); - return 1; - } - - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - - /* Page 1 should be unevictable */ - if ((page1_flags & UNEVICTABLE_BIT) == 0) { - printf("Failed to make faulted page unevictable\n"); - return 1; - } - if (!is_vma_lock_on_fault((unsigned long)map)) { printf("VMA is not marked for lock on fault\n"); return 1; @@ -381,14 +260,6 @@ static int unlock_onfault_check(char *map) unsigned long page_size = getpagesize(); uint64_t page1_flags;
- page1_flags = get_pageflags((unsigned long)map); - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - - if (page1_flags & UNEVICTABLE_BIT) { - printf("Page 1 is still marked unevictable after unlock\n"); - return 1; - } - if (is_vma_lock_on_fault((unsigned long)map) || is_vma_lock_on_fault((unsigned long)map + page_size)) { printf("VMA is still lock on fault after unlock\n"); @@ -465,17 +336,6 @@ static int test_lock_onfault_of_present() goto unmap; }
- page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - page2_flags = get_kpageflags(page2_flags & PFN_MASK); - - /* Page 1 should be unevictable */ - if ((page1_flags & UNEVICTABLE_BIT) == 0) { - printf("Failed to make present page unevictable\n"); - goto unmap; - } - if (!is_vma_lock_on_fault((unsigned long)map) || !is_vma_lock_on_fault((unsigned long)map + page_size)) { printf("VMA with present pages is not marked lock on fault\n");
On Mon, Mar 23, 2020 at 04:51:11PM +0100, Michal Hocko wrote:
On Mon 23-03-20 11:41:59, Rafael Aquini wrote:
On Mon, Mar 23, 2020 at 04:12:56PM +0100, Michal Hocko wrote:
On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
[...]
The selftest also checks the kernel visible effect, via /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
I really fail to see your point. Even if you are right that the self test is somehow evaluating the kernel implementation which I am not sure is the scope of the selft thest but anyway. The mere fact that the kernel test fails on a perfectly valid change should just suggest that the test is leading to false positives and therefore should be fixed. Your proposed fix is simply suboptimal because it relies on yet another side effect which might change anytime in the future and still lead to a correctly behaving kernel. See my point?
OK, I concede your point on the bogusness of checking the page flags in this particular test and expect certain valuse there, given that no other selftest seems to be doing that level of inner kenrel detail scrutiny.
I'll repost this fix suggestion getting rif of those related checkpoints.
Here is what I have after I had to context switch to something else before finishing it. Feel free to reuse if you feel like. It is likely to not even compile.
I'm OK with it, if you want to go ahead and do the kill.
Thanks -- Rafael
[Cc Eric - the email thread starts http://lkml.kernel.org/r/20200322013525.1095493-1-aquini@redhat.com]
On Mon 23-03-20 11:54:49, Rafael Aquini wrote: [...]
I'm OK with it, if you want to go ahead and do the kill.
See the patch below
From 07c08f387d036c70239d4060ffd30534cf77a0a5 Mon Sep 17 00:00:00 2001
From: Michal Hocko mhocko@suse.com Date: Mon, 23 Mar 2020 17:33:46 +0100 Subject: [PATCH] selftests: vm: drop dependencies on page flags from mlock2 tests
It was noticed that mlock2 tests are failing after 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") because the patch has changed the timing on when the page is added to the unevictable LRU list and thus gains the unevictable page flag.
The test was just too dependent on the implementation details which were true at the time when it was introduced. Page flags and the timing when they are set is something no userspace should ever depend on. The test should be testing only for the user observable contract of the tested syscalls. Those are defined pretty well for the mlock and there are other means for testing them. In fact this is already done and testing for page flags can be safely dropped to achieve the aimed purpose. Present bits can be checked by /proc/<pid>/smaps RSS field and the locking state by VmFlags although I would argue that Locked: field would be more appropriate.
Drop all the page flag machinery and considerably simplify the test. This should be more robust for future kernel changes while checking the promised contract is still valid.
Reported-by: Rafael Aquini aquini@redhat.com Signed-off-by: Michal Hocko mhocko@suse.com --- tools/testing/selftests/vm/mlock2-tests.c | 233 ++++------------------ 1 file changed, 37 insertions(+), 196 deletions(-)
diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c index 637b6d0ac0d0..11b2301f3aa3 100644 --- a/tools/testing/selftests/vm/mlock2-tests.c +++ b/tools/testing/selftests/vm/mlock2-tests.c @@ -67,59 +67,6 @@ static int get_vm_area(unsigned long addr, struct vm_boundaries *area) return ret; }
-static uint64_t get_pageflags(unsigned long addr) -{ - FILE *file; - uint64_t pfn; - unsigned long offset; - - file = fopen("/proc/self/pagemap", "r"); - if (!file) { - perror("fopen pagemap"); - _exit(1); - } - - offset = addr / getpagesize() * sizeof(pfn); - - if (fseek(file, offset, SEEK_SET)) { - perror("fseek pagemap"); - _exit(1); - } - - if (fread(&pfn, sizeof(pfn), 1, file) != 1) { - perror("fread pagemap"); - _exit(1); - } - - fclose(file); - return pfn; -} - -static uint64_t get_kpageflags(unsigned long pfn) -{ - uint64_t flags; - FILE *file; - - file = fopen("/proc/kpageflags", "r"); - if (!file) { - perror("fopen kpageflags"); - _exit(1); - } - - if (fseek(file, pfn * sizeof(flags), SEEK_SET)) { - perror("fseek kpageflags"); - _exit(1); - } - - if (fread(&flags, sizeof(flags), 1, file) != 1) { - perror("fread kpageflags"); - _exit(1); - } - - fclose(file); - return flags; -} - #define VMFLAGS "VmFlags:"
static bool is_vmflag_set(unsigned long addr, const char *vmflag) @@ -159,19 +106,13 @@ static bool is_vmflag_set(unsigned long addr, const char *vmflag) #define RSS "Rss:" #define LOCKED "lo"
-static bool is_vma_lock_on_fault(unsigned long addr) +static unsigned long get_value_for_name(unsigned long addr, const char *name) { - bool ret = false; - bool locked; - FILE *smaps = NULL; - unsigned long vma_size, vma_rss; char *line = NULL; - char *value; size_t size = 0; - - locked = is_vmflag_set(addr, LOCKED); - if (!locked) - goto out; + char *value_ptr; + FILE *smaps = NULL; + unsigned long value = -1UL;
smaps = seek_to_smaps_entry(addr); if (!smaps) { @@ -180,112 +121,70 @@ static bool is_vma_lock_on_fault(unsigned long addr) }
while (getline(&line, &size, smaps) > 0) { - if (!strstr(line, SIZE)) { + if (!strstr(line, name)) { free(line); line = NULL; size = 0; continue; }
- value = line + strlen(SIZE); - if (sscanf(value, "%lu kB", &vma_size) < 1) { + value_ptr = line + strlen(name); + if (sscanf(value_ptr, "%lu kB", &value) < 1) { printf("Unable to parse smaps entry for Size\n"); goto out; } break; }
- while (getline(&line, &size, smaps) > 0) { - if (!strstr(line, RSS)) { - free(line); - line = NULL; - size = 0; - continue; - } - - value = line + strlen(RSS); - if (sscanf(value, "%lu kB", &vma_rss) < 1) { - printf("Unable to parse smaps entry for Rss\n"); - goto out; - } - break; - } - - ret = locked && (vma_rss < vma_size); out: - free(line); if (smaps) fclose(smaps); - return ret; + free(line); + return value; }
-#define PRESENT_BIT 0x8000000000000000ULL -#define PFN_MASK 0x007FFFFFFFFFFFFFULL -#define UNEVICTABLE_BIT (1UL << 18) - -static int lock_check(char *map) +static bool is_vma_lock_on_fault(unsigned long addr) { - unsigned long page_size = getpagesize(); - uint64_t page1_flags, page2_flags; + bool locked; + unsigned long vma_size, vma_rss;
- page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); + locked = is_vmflag_set(addr, LOCKED); + if (!locked) + return false;
- /* Both pages should be present */ - if (((page1_flags & PRESENT_BIT) == 0) || - ((page2_flags & PRESENT_BIT) == 0)) { - printf("Failed to make both pages present\n"); - return 1; - } + vma_size = get_value_for_name(addr, SIZE); + vma_rss = get_value_for_name(addr, RSS);
- page1_flags = get_kpageflags(page1_flags & PFN_MASK); - page2_flags = get_kpageflags(page2_flags & PFN_MASK); + /* only one page is faulted in */ + return (vma_rss < vma_size); +}
- /* Both pages should be unevictable */ - if (((page1_flags & UNEVICTABLE_BIT) == 0) || - ((page2_flags & UNEVICTABLE_BIT) == 0)) { - printf("Failed to make both pages unevictable\n"); - return 1; - } +#define PRESENT_BIT 0x8000000000000000ULL +#define PFN_MASK 0x007FFFFFFFFFFFFFULL +#define UNEVICTABLE_BIT (1UL << 18)
- if (!is_vmflag_set((unsigned long)map, LOCKED)) { - printf("VMA flag %s is missing on page 1\n", LOCKED); - return 1; - } +static int lock_check(unsigned long addr) +{ + bool locked; + unsigned long vma_size, vma_rss;
- if (!is_vmflag_set((unsigned long)map + page_size, LOCKED)) { - printf("VMA flag %s is missing on page 2\n", LOCKED); - return 1; - } + locked = is_vmflag_set(addr, LOCKED); + if (!locked) + return false;
- return 0; + vma_size = get_value_for_name(addr, SIZE); + vma_rss = get_value_for_name(addr, RSS); + + return (vma_rss == vma_size); }
static int unlock_lock_check(char *map) { - unsigned long page_size = getpagesize(); - uint64_t page1_flags, page2_flags; - - page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - page2_flags = get_kpageflags(page2_flags & PFN_MASK); - - if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) { - printf("A page is still marked unevictable after unlock\n"); - return 1; - } - if (is_vmflag_set((unsigned long)map, LOCKED)) { printf("VMA flag %s is present on page 1 after unlock\n", LOCKED); return 1; }
- if (is_vmflag_set((unsigned long)map + page_size, LOCKED)) { - printf("VMA flag %s is present on page 2 after unlock\n", LOCKED); - return 1; - } - return 0; }
@@ -311,7 +210,7 @@ static int test_mlock_lock() goto unmap; }
- if (lock_check(map)) + if (!lock_check((unsigned long)map)) goto unmap;
/* Now unlock and recheck attributes */ @@ -330,64 +229,18 @@ static int test_mlock_lock()
static int onfault_check(char *map) { - unsigned long page_size = getpagesize(); - uint64_t page1_flags, page2_flags; - - page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - - /* Neither page should be present */ - if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) { - printf("Pages were made present by MLOCK_ONFAULT\n"); - return 1; - } - *map = 'a'; - page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - - /* Only page 1 should be present */ - if ((page1_flags & PRESENT_BIT) == 0) { - printf("Page 1 is not present after fault\n"); - return 1; - } else if (page2_flags & PRESENT_BIT) { - printf("Page 2 was made present\n"); - return 1; - } - - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - - /* Page 1 should be unevictable */ - if ((page1_flags & UNEVICTABLE_BIT) == 0) { - printf("Failed to make faulted page unevictable\n"); - return 1; - } - if (!is_vma_lock_on_fault((unsigned long)map)) { printf("VMA is not marked for lock on fault\n"); return 1; }
- if (!is_vma_lock_on_fault((unsigned long)map + page_size)) { - printf("VMA is not marked for lock on fault\n"); - return 1; - } - return 0; }
static int unlock_onfault_check(char *map) { unsigned long page_size = getpagesize(); - uint64_t page1_flags; - - page1_flags = get_pageflags((unsigned long)map); - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - - if (page1_flags & UNEVICTABLE_BIT) { - printf("Page 1 is still marked unevictable after unlock\n"); - return 1; - }
if (is_vma_lock_on_fault((unsigned long)map) || is_vma_lock_on_fault((unsigned long)map + page_size)) { @@ -445,7 +298,6 @@ static int test_lock_onfault_of_present() char *map; int ret = 1; unsigned long page_size = getpagesize(); - uint64_t page1_flags, page2_flags;
map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); @@ -465,17 +317,6 @@ static int test_lock_onfault_of_present() goto unmap; }
- page1_flags = get_pageflags((unsigned long)map); - page2_flags = get_pageflags((unsigned long)map + page_size); - page1_flags = get_kpageflags(page1_flags & PFN_MASK); - page2_flags = get_kpageflags(page2_flags & PFN_MASK); - - /* Page 1 should be unevictable */ - if ((page1_flags & UNEVICTABLE_BIT) == 0) { - printf("Failed to make present page unevictable\n"); - goto unmap; - } - if (!is_vma_lock_on_fault((unsigned long)map) || !is_vma_lock_on_fault((unsigned long)map + page_size)) { printf("VMA with present pages is not marked lock on fault\n"); @@ -507,7 +348,7 @@ static int test_munlockall() goto out; }
- if (lock_check(map)) + if (!lock_check((unsigned long)map)) goto unmap;
if (munlockall()) { @@ -549,7 +390,7 @@ static int test_munlockall() goto out; }
- if (lock_check(map)) + if (!lock_check((unsigned long)map)) goto unmap;
if (munlockall()) {
On Tue, Mar 24, 2020 at 04:42:18PM +0100, Michal Hocko wrote:
[Cc Eric - the email thread starts http://lkml.kernel.org/r/20200322013525.1095493-1-aquini@redhat.com]
On Mon 23-03-20 11:54:49, Rafael Aquini wrote: [...]
I'm OK with it, if you want to go ahead and do the kill.
See the patch below
From 07c08f387d036c70239d4060ffd30534cf77a0a5 Mon Sep 17 00:00:00 2001 From: Michal Hocko mhocko@suse.com Date: Mon, 23 Mar 2020 17:33:46 +0100 Subject: [PATCH] selftests: vm: drop dependencies on page flags from mlock2 tests
It was noticed that mlock2 tests are failing after 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") because the patch has changed the timing on when the page is added to the unevictable LRU list and thus gains the unevictable page flag.
The test was just too dependent on the implementation details which were true at the time when it was introduced. Page flags and the timing when they are set is something no userspace should ever depend on. The test should be testing only for the user observable contract of the tested syscalls. Those are defined pretty well for the mlock and there are other means for testing them. In fact this is already done and testing for page flags can be safely dropped to achieve the aimed purpose. Present bits can be checked by /proc/<pid>/smaps RSS field and the locking state by VmFlags although I would argue that Locked: field would be more appropriate.
Drop all the page flag machinery and considerably simplify the test. This should be more robust for future kernel changes while checking the promised contract is still valid.
Reported-by: Rafael Aquini aquini@redhat.com Signed-off-by: Michal Hocko mhocko@suse.com
tools/testing/selftests/vm/mlock2-tests.c | 233 ++++------------------ 1 file changed, 37 insertions(+), 196 deletions(-)
diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c index 637b6d0ac0d0..11b2301f3aa3 100644 --- a/tools/testing/selftests/vm/mlock2-tests.c +++ b/tools/testing/selftests/vm/mlock2-tests.c @@ -67,59 +67,6 @@ static int get_vm_area(unsigned long addr, struct vm_boundaries *area) return ret; } -static uint64_t get_pageflags(unsigned long addr) -{
- FILE *file;
- uint64_t pfn;
- unsigned long offset;
- file = fopen("/proc/self/pagemap", "r");
- if (!file) {
perror("fopen pagemap");
_exit(1);
- }
- offset = addr / getpagesize() * sizeof(pfn);
- if (fseek(file, offset, SEEK_SET)) {
perror("fseek pagemap");
_exit(1);
- }
- if (fread(&pfn, sizeof(pfn), 1, file) != 1) {
perror("fread pagemap");
_exit(1);
- }
- fclose(file);
- return pfn;
-}
-static uint64_t get_kpageflags(unsigned long pfn) -{
- uint64_t flags;
- FILE *file;
- file = fopen("/proc/kpageflags", "r");
- if (!file) {
perror("fopen kpageflags");
_exit(1);
- }
- if (fseek(file, pfn * sizeof(flags), SEEK_SET)) {
perror("fseek kpageflags");
_exit(1);
- }
- if (fread(&flags, sizeof(flags), 1, file) != 1) {
perror("fread kpageflags");
_exit(1);
- }
- fclose(file);
- return flags;
-}
#define VMFLAGS "VmFlags:" static bool is_vmflag_set(unsigned long addr, const char *vmflag) @@ -159,19 +106,13 @@ static bool is_vmflag_set(unsigned long addr, const char *vmflag) #define RSS "Rss:" #define LOCKED "lo" -static bool is_vma_lock_on_fault(unsigned long addr) +static unsigned long get_value_for_name(unsigned long addr, const char *name) {
- bool ret = false;
- bool locked;
- FILE *smaps = NULL;
- unsigned long vma_size, vma_rss; char *line = NULL;
- char *value; size_t size = 0;
- locked = is_vmflag_set(addr, LOCKED);
- if (!locked)
goto out;
- char *value_ptr;
- FILE *smaps = NULL;
- unsigned long value = -1UL;
smaps = seek_to_smaps_entry(addr); if (!smaps) { @@ -180,112 +121,70 @@ static bool is_vma_lock_on_fault(unsigned long addr) } while (getline(&line, &size, smaps) > 0) {
if (!strstr(line, SIZE)) {
}if (!strstr(line, name)) { free(line); line = NULL; size = 0; continue;
value = line + strlen(SIZE);
if (sscanf(value, "%lu kB", &vma_size) < 1) {
value_ptr = line + strlen(name);
} break; }if (sscanf(value_ptr, "%lu kB", &value) < 1) { printf("Unable to parse smaps entry for Size\n"); goto out;
- while (getline(&line, &size, smaps) > 0) {
if (!strstr(line, RSS)) {
free(line);
line = NULL;
size = 0;
continue;
}
value = line + strlen(RSS);
if (sscanf(value, "%lu kB", &vma_rss) < 1) {
printf("Unable to parse smaps entry for Rss\n");
goto out;
}
break;
- }
- ret = locked && (vma_rss < vma_size);
out:
- free(line); if (smaps) fclose(smaps);
- return ret;
- free(line);
- return value;
} -#define PRESENT_BIT 0x8000000000000000ULL -#define PFN_MASK 0x007FFFFFFFFFFFFFULL -#define UNEVICTABLE_BIT (1UL << 18)
-static int lock_check(char *map) +static bool is_vma_lock_on_fault(unsigned long addr) {
- unsigned long page_size = getpagesize();
- uint64_t page1_flags, page2_flags;
- bool locked;
- unsigned long vma_size, vma_rss;
- page1_flags = get_pageflags((unsigned long)map);
- page2_flags = get_pageflags((unsigned long)map + page_size);
- locked = is_vmflag_set(addr, LOCKED);
- if (!locked)
return false;
- /* Both pages should be present */
- if (((page1_flags & PRESENT_BIT) == 0) ||
((page2_flags & PRESENT_BIT) == 0)) {
printf("Failed to make both pages present\n");
return 1;
- }
- vma_size = get_value_for_name(addr, SIZE);
- vma_rss = get_value_for_name(addr, RSS);
- page1_flags = get_kpageflags(page1_flags & PFN_MASK);
- page2_flags = get_kpageflags(page2_flags & PFN_MASK);
- /* only one page is faulted in */
- return (vma_rss < vma_size);
+}
- /* Both pages should be unevictable */
- if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
((page2_flags & UNEVICTABLE_BIT) == 0)) {
printf("Failed to make both pages unevictable\n");
return 1;
- }
+#define PRESENT_BIT 0x8000000000000000ULL +#define PFN_MASK 0x007FFFFFFFFFFFFFULL +#define UNEVICTABLE_BIT (1UL << 18)
- if (!is_vmflag_set((unsigned long)map, LOCKED)) {
printf("VMA flag %s is missing on page 1\n", LOCKED);
return 1;
- }
+static int lock_check(unsigned long addr) +{
- bool locked;
- unsigned long vma_size, vma_rss;
- if (!is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
printf("VMA flag %s is missing on page 2\n", LOCKED);
return 1;
- }
- locked = is_vmflag_set(addr, LOCKED);
- if (!locked)
return false;
- return 0;
- vma_size = get_value_for_name(addr, SIZE);
- vma_rss = get_value_for_name(addr, RSS);
- return (vma_rss == vma_size);
} static int unlock_lock_check(char *map) {
- unsigned long page_size = getpagesize();
- uint64_t page1_flags, page2_flags;
- page1_flags = get_pageflags((unsigned long)map);
- page2_flags = get_pageflags((unsigned long)map + page_size);
- page1_flags = get_kpageflags(page1_flags & PFN_MASK);
- page2_flags = get_kpageflags(page2_flags & PFN_MASK);
- if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) {
printf("A page is still marked unevictable after unlock\n");
return 1;
- }
- if (is_vmflag_set((unsigned long)map, LOCKED)) { printf("VMA flag %s is present on page 1 after unlock\n", LOCKED); return 1; }
- if (is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
printf("VMA flag %s is present on page 2 after unlock\n", LOCKED);
return 1;
- }
- return 0;
} @@ -311,7 +210,7 @@ static int test_mlock_lock() goto unmap; }
- if (lock_check(map))
- if (!lock_check((unsigned long)map)) goto unmap;
/* Now unlock and recheck attributes */ @@ -330,64 +229,18 @@ static int test_mlock_lock() static int onfault_check(char *map) {
- unsigned long page_size = getpagesize();
- uint64_t page1_flags, page2_flags;
- page1_flags = get_pageflags((unsigned long)map);
- page2_flags = get_pageflags((unsigned long)map + page_size);
- /* Neither page should be present */
- if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) {
printf("Pages were made present by MLOCK_ONFAULT\n");
return 1;
- }
- *map = 'a';
- page1_flags = get_pageflags((unsigned long)map);
- page2_flags = get_pageflags((unsigned long)map + page_size);
- /* Only page 1 should be present */
- if ((page1_flags & PRESENT_BIT) == 0) {
printf("Page 1 is not present after fault\n");
return 1;
- } else if (page2_flags & PRESENT_BIT) {
printf("Page 2 was made present\n");
return 1;
- }
- page1_flags = get_kpageflags(page1_flags & PFN_MASK);
- /* Page 1 should be unevictable */
- if ((page1_flags & UNEVICTABLE_BIT) == 0) {
printf("Failed to make faulted page unevictable\n");
return 1;
- }
- if (!is_vma_lock_on_fault((unsigned long)map)) { printf("VMA is not marked for lock on fault\n"); return 1; }
- if (!is_vma_lock_on_fault((unsigned long)map + page_size)) {
printf("VMA is not marked for lock on fault\n");
return 1;
- }
- return 0;
} static int unlock_onfault_check(char *map) { unsigned long page_size = getpagesize();
- uint64_t page1_flags;
- page1_flags = get_pageflags((unsigned long)map);
- page1_flags = get_kpageflags(page1_flags & PFN_MASK);
- if (page1_flags & UNEVICTABLE_BIT) {
printf("Page 1 is still marked unevictable after unlock\n");
return 1;
- }
if (is_vma_lock_on_fault((unsigned long)map) || is_vma_lock_on_fault((unsigned long)map + page_size)) { @@ -445,7 +298,6 @@ static int test_lock_onfault_of_present() char *map; int ret = 1; unsigned long page_size = getpagesize();
- uint64_t page1_flags, page2_flags;
map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); @@ -465,17 +317,6 @@ static int test_lock_onfault_of_present() goto unmap; }
- page1_flags = get_pageflags((unsigned long)map);
- page2_flags = get_pageflags((unsigned long)map + page_size);
- page1_flags = get_kpageflags(page1_flags & PFN_MASK);
- page2_flags = get_kpageflags(page2_flags & PFN_MASK);
- /* Page 1 should be unevictable */
- if ((page1_flags & UNEVICTABLE_BIT) == 0) {
printf("Failed to make present page unevictable\n");
goto unmap;
- }
- if (!is_vma_lock_on_fault((unsigned long)map) || !is_vma_lock_on_fault((unsigned long)map + page_size)) { printf("VMA with present pages is not marked lock on fault\n");
@@ -507,7 +348,7 @@ static int test_munlockall() goto out; }
- if (lock_check(map))
- if (!lock_check((unsigned long)map)) goto unmap;
if (munlockall()) { @@ -549,7 +390,7 @@ static int test_munlockall() goto out; }
- if (lock_check(map))
- if (!lock_check((unsigned long)map)) goto unmap;
if (munlockall()) { -- 2.25.1
-- Michal Hocko SUSE Labs
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini aquini@redhat.com wrote:
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
I'll add Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") and cc:stable to this. Any kernel which has 9c4e6b1a7027f will benefit from this change.
We're getting quite late in the cycle now so I think I'll hold off merging this up until post-5.7-rc1. It will still get into the stable trees, but a bit later.
On Wed, Mar 25, 2020 at 05:49:49PM -0700, Andrew Morton wrote:
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini aquini@redhat.com wrote:
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
I'll add Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") and cc:stable to this. Any kernel which has 9c4e6b1a7027f will benefit from this change.
We're getting quite late in the cycle now so I think I'll hold off merging this up until post-5.7-rc1. It will still get into the stable trees, but a bit later.
Thank you, Andrew
-- Rafael
On Wed 25-03-20 17:49:49, Andrew Morton wrote:
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini aquini@redhat.com wrote:
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
I'll add Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
Wouldn't be this misleading? It would suggest that this commit is somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add tests for lock on fault") to be a better fit.
and cc:stable to this. Any kernel which has 9c4e6b1a7027f will benefit from this change.
OK, makes sense with Cc: stable # with 9c4e6b1a7027f applied
We're getting quite late in the cycle now so I think I'll hold off merging this up until post-5.7-rc1. It will still get into the stable trees, but a bit later.
No problem. Nothing really urget. Coincidentaly we've just had a report from our partner about this very specific test failure for our enterprise kernel as well. I will just backport the patch as it seems nobody really objects to it.
Thanks!
On Thu, 26 Mar 2020 07:49:09 +0100 Michal Hocko mhocko@kernel.org wrote:
On Wed 25-03-20 17:49:49, Andrew Morton wrote:
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini aquini@redhat.com wrote:
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
I'll add Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
Wouldn't be this misleading? It would suggest that this commit is somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add tests for lock on fault") to be a better fit.
Yes, it's a bit misleading.
Or maybe not. b3b0d09c7a23 was merged in 2015 and worked OK (afaik) until 9c4e6b1a7027f came along in 2020. So arguably, 2020's 9c4e6b1a7027f was correct but incomplete, since it failed to repair the impact upon the test code.
I don't think that kernels prior to 2020's 9c4e6b1a7027f require this change(?), so we only need to backport this into 5.6.x, which is what my proposed Fixes: and cc:stable expresses.
and cc:stable to this. Any kernel which has 9c4e6b1a7027f will benefit from this change.
OK, makes sense with Cc: stable # with 9c4e6b1a7027f applied
We're getting quite late in the cycle now so I think I'll hold off merging this up until post-5.7-rc1. It will still get into the stable trees, but a bit later.
No problem. Nothing really urget. Coincidentaly we've just had a report from our partner about this very specific test failure for our enterprise kernel as well.
I assume that your kernel has 2020's 9c4e6b1a7027f?
I will just backport the patch as it seems nobody really objects to it.
On Thu, Mar 26, 2020 at 12:58:09PM -0700, Andrew Morton wrote:
On Thu, 26 Mar 2020 07:49:09 +0100 Michal Hocko mhocko@kernel.org wrote:
On Wed 25-03-20 17:49:49, Andrew Morton wrote:
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini aquini@redhat.com wrote:
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
I'll add Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
Wouldn't be this misleading? It would suggest that this commit is somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add tests for lock on fault") to be a better fit.
Yes, it's a bit misleading.
Or maybe not. b3b0d09c7a23 was merged in 2015 and worked OK (afaik) until 9c4e6b1a7027f came along in 2020. So arguably, 2020's 9c4e6b1a7027f was correct but incomplete, since it failed to repair the impact upon the test code.
That's correct, and it was my reading, as well.
I don't think that kernels prior to 2020's 9c4e6b1a7027f require this change(?), so we only need to backport this into 5.6.x, which is what my proposed Fixes: and cc:stable expresses.
Agreed
-- Rafael
On Thu, Mar 26, 2020 at 12:58:09PM -0700, Andrew Morton wrote:
On Thu, 26 Mar 2020 07:49:09 +0100 Michal Hocko mhocko@kernel.org wrote:
On Wed 25-03-20 17:49:49, Andrew Morton wrote:
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini aquini@redhat.com wrote:
Thanks Michal!
Acked-by: Rafael Aquini aquini@redhat.com
I'll add Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
Wouldn't be this misleading? It would suggest that this commit is somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add tests for lock on fault") to be a better fit.
Yes, it's a bit misleading.
Or maybe not. b3b0d09c7a23 was merged in 2015 and worked OK (afaik) until 9c4e6b1a7027f came along in 2020. So arguably, 2020's 9c4e6b1a7027f was correct but incomplete, since it failed to repair the impact upon the test code.
I don't think that kernels prior to 2020's 9c4e6b1a7027f require this change(?), so we only need to backport this into 5.6.x, which is what
Only one note here: 9c4e6b1a7027f is from 2018, and went merged on 4.16, AFAICS.
-- Rafael
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
mlock* syscalls do not provide any guarantee that the pages will be in unevictable LRU, only that the pages will not be paged-out. The test is checking something very internal to the kernel and this is expected to break.
There is no functional error introduced by the aforementioned commit, but it opens up a time window where the recently faulted and locked pages might yet not be put back into the UNEVICTABLE_LRU, thus causing a subsequent and immediate PFN flag check for the UNEVICTABLE bit to trip on false-negative errors, as it happens with this test.
This patch fix the false negative by forcefully resorting to a code path that will call a CPU pagevec drain right after the fault but before the PFN flag check takes place, sorting out the race that way.
Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
This is fixing the actual test and not about fixing the mentioned patch. So, this Fixes line is not needed.
On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
mlock* syscalls do not provide any guarantee that the pages will be in unevictable LRU, only that the pages will not be paged-out. The test is checking something very internal to the kernel and this is expected to break.
It was a check expected to be satisfied before the commit, though. Getting the mlocked pages inserted directly into the unevictable LRU, skipping the pagevec, was established behavior before the aforementioned commit, and even though one could argue userspace should not be aware, or care, about such inner kernel circles the program in question is not an ordinary userspace app, but a kernel selftest that is supposed to check for the functionality correctness.
There is no functional error introduced by the aforementioned commit, but it opens up a time window where the recently faulted and locked pages might yet not be put back into the UNEVICTABLE_LRU, thus causing a subsequent and immediate PFN flag check for the UNEVICTABLE bit to trip on false-negative errors, as it happens with this test.
This patch fix the false negative by forcefully resorting to a code path that will call a CPU pagevec drain right after the fault but before the PFN flag check takes place, sorting out the race that way.
Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
This is fixing the actual test and not about fixing the mentioned patch. So, this Fixes line is not needed.
If one bisects the kernel looking for the patch that causes the selftest to fail that commit is going to show up as the issue, thus the reference.
On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
mlock* syscalls do not provide any guarantee that the pages will be in unevictable LRU, only that the pages will not be paged-out. The test is checking something very internal to the kernel and this is expected to break.
It was a check expected to be satisfied before the commit, though. Getting the mlocked pages inserted directly into the unevictable LRU, skipping the pagevec, was established behavior before the aforementioned commit, and even though one could argue userspace should not be aware, or care, about such inner kernel circles the program in question is not an ordinary userspace app, but a kernel selftest that is supposed to check for the functionality correctness.
But mlock (in neither mode) is reall forced to put pages to the UNEVICTABLE_LRU for correctness. If the test is really depending on it then the selftest is bogus. A real MCL_ONFAULT test should focus on the user observable contract of this api. And that is that a new mapping doesn't fault in the page during the mlock call but the memory is locked after the memory is faulted in. You can use different methods to observe locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
On Mon, Mar 23, 2020 at 03:29:41PM +0100, Michal Hocko wrote:
On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
mlock* syscalls do not provide any guarantee that the pages will be in unevictable LRU, only that the pages will not be paged-out. The test is checking something very internal to the kernel and this is expected to break.
It was a check expected to be satisfied before the commit, though. Getting the mlocked pages inserted directly into the unevictable LRU, skipping the pagevec, was established behavior before the aforementioned commit, and even though one could argue userspace should not be aware, or care, about such inner kernel circles the program in question is not an ordinary userspace app, but a kernel selftest that is supposed to check for the functionality correctness.
But mlock (in neither mode) is reall forced to put pages to the UNEVICTABLE_LRU for correctness. If the test is really depending on it then the selftest is bogus. A real MCL_ONFAULT test should focus on the user observable contract of this api. And that is that a new mapping doesn't fault in the page during the mlock call but the memory is locked after the memory is faulted in. You can use different methods to observe locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
Again, I don't think the test is bogus, although it's (now) expecting something that is not guaranteed after the referred commit. The check for PG_unevictable set on the page backing up the mapping seems reasonable, as the flag is supposed to be there, if everything went on fine after the mlock call.
-- Rafael
On Mon 23-03-20 15:29:43, Michal Hocko wrote:
On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
mlock* syscalls do not provide any guarantee that the pages will be in unevictable LRU, only that the pages will not be paged-out. The test is checking something very internal to the kernel and this is expected to break.
It was a check expected to be satisfied before the commit, though. Getting the mlocked pages inserted directly into the unevictable LRU, skipping the pagevec, was established behavior before the aforementioned commit, and even though one could argue userspace should not be aware, or care, about such inner kernel circles the program in question is not an ordinary userspace app, but a kernel selftest that is supposed to check for the functionality correctness.
But mlock (in neither mode) is reall forced to put pages to the
ble I meant to say "is not really forced"
UNEVICTABLE_LRU for correctness. If the test is really depending on it then the selftest is bogus. A real MCL_ONFAULT test should focus on the user observable contract of this api. And that is that a new mapping doesn't fault in the page during the mlock call but the memory is locked after the memory is faulted in. You can use different methods to observe locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
I have just checked the testcase and I believe it is really dubious to check for page flags. Those are really an internal implementation detail. All the available information is available in the /proc/<pid>/smaps file which is already parsed in the test so the test is easily fixable.
On Mon, Mar 23, 2020 at 04:01:34PM +0100, Michal Hocko wrote:
On Mon 23-03-20 15:29:43, Michal Hocko wrote:
On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini aquini@redhat.com wrote:
Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs") break this test expectations on the behavior of mlock syscall family immediately inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is passed to the syscall as part of its flag-set.
mlock* syscalls do not provide any guarantee that the pages will be in unevictable LRU, only that the pages will not be paged-out. The test is checking something very internal to the kernel and this is expected to break.
It was a check expected to be satisfied before the commit, though. Getting the mlocked pages inserted directly into the unevictable LRU, skipping the pagevec, was established behavior before the aforementioned commit, and even though one could argue userspace should not be aware, or care, about such inner kernel circles the program in question is not an ordinary userspace app, but a kernel selftest that is supposed to check for the functionality correctness.
But mlock (in neither mode) is reall forced to put pages to the
ble I meant to say "is not really forced"
UNEVICTABLE_LRU for correctness. If the test is really depending on it then the selftest is bogus. A real MCL_ONFAULT test should focus on the user observable contract of this api. And that is that a new mapping doesn't fault in the page during the mlock call but the memory is locked after the memory is faulted in. You can use different methods to observe locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
I have just checked the testcase and I believe it is really dubious to check for page flags. Those are really an internal implementation detail. All the available information is available in the /proc/<pid>/smaps file which is already parsed in the test so the test is easily fixable.
That surely is another take for this test. I still think the test is reasonable when it checks what was expected to be there, from the kernel standpoint, and just needs to be adjusted for the current state of affairs.
-- Rafael
linux-kselftest-mirror@lists.linaro.org