On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
On Wed 02-12-20 10:14:41, David Hildenbrand wrote:
On 01.12.20 18:51, Minchan Kim wrote:
There is special HW that requires bulk allocation of high-order pages. For example, 4800 * order-4 pages as a minimum; sometimes it requires more.
To meet the requirement, one option is to reserve a 300M CMA area and request the whole 300M as one contiguous allocation. However, that doesn't work if even one of the pages in the range is long-term pinned, directly or indirectly. The other option is to ask for a higher-order
My latest understanding is that pages in the CMA area are never long-term pinned.
https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/
"gup already tries to deal with long term pins on CMA regions and migrate to a non CMA region. Have a look at __gup_longterm_locked."
We should rather identify the ways in which that is still possible and get rid of them.
Now, short-term pinning and the PCP are other issues where alloc_contig_range() could be improved (e.g., in contrast to a FAST mode, a HARD mode that temporarily disables the PCP, ...).
Agreed!
size (e.g., 2M) than the requested order (64K), repeatedly, until the driver gathers the necessary amount of memory. Basically, this approach makes the allocation very slow due to cma_alloc's slowness, and it can get stuck on one of the pageblocks if it encounters an unmigratable page.
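For illustration, the repeated-allocation approach looks roughly like this on the driver side (a sketch only; the helper and driver names are made up, and cma_alloc() is assumed to take its current count/align/no_warn arguments):

#define NR_CHUNKS	4800
#define CHUNK_ORDER	4

static struct page *chunks[NR_CHUNKS];

static int gather_chunks(struct cma *cma)
{
	size_t i;

	for (i = 0; i < NR_CHUNKS; i++) {
		/* each call may stall migrating a pinned pageblock */
		chunks[i] = cma_alloc(cma, 1 << CHUNK_ORDER,
				      CHUNK_ORDER, false);
		if (!chunks[i])
			return -ENOMEM;
	}
	return 0;
}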
To solve the issue, this patch introduces cma_alloc_bulk.
int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
		   unsigned int order, size_t nr_requests,
		   struct page **page_array, size_t *nr_allocated);
Most parameters are the same as cma_alloc's, but it additionally takes a page array to store the allocated memory. What's different from cma_alloc is that it will skip, without waiting or stopping, any pageblock that contains an unmovable page, so the API keeps scanning other pageblocks to find pages of the requested order.
cma_alloc_bulk is a best-effort approach in that, unlike cma_alloc, it skips pageblocks that contain unmovable pages. It doesn't need to be perfect from the beginning at the cost of performance. Thus, the API takes a "bool fast" parameter, which is propagated into alloc_contig_range to skip the high-overhead operations that raise the CMA allocation success ratio (e.g., migration retries, PCP and LRU draining per pageblock), trading a lower success ratio for speed. If the caller couldn't allocate enough, they can call it again with fast == false to increase the success ratio, if they are willing to pay the overhead.
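A hypothetical two-pass caller, following the semantics above (the buffer and counters are made up; only cma_alloc_bulk() is from this patch):

	size_t want = 4800, got = 0;
	struct page *pages[4800];

	/* pass 1: fast, best effort; skips pageblocks with unmovable pages */
	cma_alloc_bulk(cma, 4, true, 4, want, pages, &got);

	if (got < want) {
		size_t more = 0;

		/* pass 2: pay the drain/retry overhead for the remainder */
		cma_alloc_bulk(cma, 4, false, 4, want - got,
			       pages + got, &more);
		got += more;
	}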
Just so I understand what the idea is:
alloc_contig_range() sometimes fails on CMA regions when trying to allocate big chunks (e.g., 300M). Instead of tackling that issue, you allocate plenty of small chunks and make these small allocations fail faster / make the allocations less reliable. Correct?
I don't really have a strong opinion on that. Giving up fast rather than trying for longer sounds like a useful thing to have - but I wonder if it's strictly necessary for the use case you describe.
I'd like to hear Michal's opinion on that.
Well, what I can see is that this new interface is an antipattern to our allocation routines. We tend to control allocations by gfp mask, yet you are introducing a bool parameter to make something faster... What that really means is rather arbitrary. Would it make more sense to teach cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp. GFP_RETRY_MAYFAIL instead?
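For concreteness, a gfp-based variant might look like this (an illustration of the suggestion only, not an actual proposal; want/pages/got are placeholders):

int cma_alloc_bulk(struct cma *cma, unsigned int align, gfp_t gfp,
		   unsigned int order, size_t nr_requests,
		   struct page **page_array, size_t *nr_allocated);

/* fail fast: no heavy draining or migration retries */
cma_alloc_bulk(cma, 4, GFP_KERNEL | __GFP_NORETRY, 4, want, pages, &got);

/* try harder, accepting the overhead */
cma_alloc_bulk(cma, 4, GFP_KERNEL | __GFP_RETRY_MAYFAIL, 4, want, pages, &got);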
If we use cma_alloc, that interface requires "allocate one big memory chunk". IOW, the return value is just a struct page, and it is expected that the page is one big contiguous block of memory; it can't have a hole in the range. However, the idea here is that we ask for much smaller chunks rather than one big contiguous block, so we can skip pages that happen to be pinned (long-term, short-term, whatever) and search for other pages in the CMA area to avoid a long stall. Thus, it couldn't work with the existing cma_alloc API via a simple gfp_mask.
I am not deeply familiar with the cma allocator, so sorry for a potentially stupid question. Why does a bulk interface perform better than repeated calls to cma_alloc? Is this because a failure would help to move on to the next pfn range, while a repeated call would have to deal with the same range?
Yub, true, along with other overheads (e.g., migration retries, waiting for writeback, PCP/LRU draining IPIs).
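Roughly (a simplified sketch, not the actual patch; cma_start/cma_end are placeholder bounds, the other names come from the proposed signature):

unsigned long pfn, nr_pages = 1UL << order;

for (pfn = cma_start; pfn + nr_pages <= cma_end; pfn += nr_pages) {
	if (*nr_allocated == nr_requests)
		break;
	/*
	 * In fast mode a failure returns quickly (no migration
	 * retries, no PCP/LRU drain IPIs) and the scan moves on to
	 * the next range instead of hammering the same one.
	 */
	if (alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_CMA,
			       GFP_KERNEL) == 0)
		page_array[(*nr_allocated)++] = pfn_to_page(pfn);
}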
Signed-off-by: Minchan Kim <minchan@kernel.org>
 include/linux/cma.h |   5 ++
 include/linux/gfp.h |   2 +
 mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c     |  19 ++++---
 4 files changed, 140 insertions(+), 12 deletions(-)
--
Michal Hocko
SUSE Labs