Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures

13 Jan 2021


      On Fri, Dec 18, 2020 at 8:14 AM Michal Hocko mhocko@suse.com wrote:
...
On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
...
On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko mhocko@suse.com wrote:
...
On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
[...]
...
+#define PINNABLE_MIGRATE_MAX 10
+#define PINNABLE_ISOLATE_MAX 100
Why would we need to limit the isolation retries. Those should always be
temporary failure unless I am missing something.
Actually, during development, I was retrying isolate errors
infinitely, but during testing found a hung where when FOLL_TOUCH
without FOLL_WRITE is passed (fault in kernel without write flag), the
zero page is faulted. The isolation of the zero page was failing every
time, therefore the process was hanging.
Why would you migrate zero page in the first place? Simply instantiate
it.
This is exactly the idea behind FOLL_WRITE; it causes zero pages to be
created in the right zone right away, and no migration is necessary.
...
...
Since then, I fixed this problem by adding FOLL_WRITE unconditionally
to FOLL_LONGTERM, but I was worried about other possible bugs that
would cause hangs, so decided to limit isolation errors. If you think
it its not necessary, I can unlimit isolate retires.
It should have a really good reason to exist. Worries about some corner
cases is definitely not a reason to put some awkward retry mechanism.
My historical experience is that these things are extremely hard to get
rid of later.
...
...
I am not sure about the
PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
already implements its retry logic why do you want to count retries on
top of that? I do agree that the existing logic is suboptimal because
True, but again, just recently, I worked on a race bug where pages can
end up in per-cpu list after lru_add_drain_all() but before isolation,
so I think retry is necessary.
There are ways to make sure pages are not ending on pcp list. Have a
look at how hotplug does that.
Sounds good to me, I will remove PINNABLE_MIGRATE_MAX, and leave only
PINNABLE_ISOLATE_MAX for transient isolation errors.
...
...
...
the migration failure might be ephemeral or permanent but that should be
IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
failures that are permanent - e.g. any potential pre-existing long term
pin - if that is possible at all. If not what would cause permanent
migration failure? OOM?
Yes, OOM is the main cause for migration failures.
Then you can treat ENOMEM as a permanent failure.
Sounds good.

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures, honor failures