On Wed, Dec 02, 2020 at 04:49:15PM +0100, Michal Hocko wrote:
On Wed 02-12-20 10:14:41, David Hildenbrand wrote:
On 01.12.20 18:51, Minchan Kim wrote:
There is special HW that requires bulk allocation of high-order pages. For example, 4800 * order-4 pages as a minimum; sometimes it requires more.
To meet the requirement, one option is to reserve a 300M CMA area and request the whole 300M as one contiguous allocation. However, that doesn't work if even one of the pages in the range is long-term pinned, directly or indirectly. The other option is to ask for a higher-order
My latest understanding is that pages in the CMA area are never long-term pinned.
https://lore.kernel.org/lkml/20201123090129.GD27488@dhcp22.suse.cz/
"gup already tries to deal with long term pins on CMA regions and migrate to a non CMA region. Have a look at __gup_longterm_locked."
We should rather identify the ways in which that is still possible and get rid of them.
Now, short-term pinning and the PCP are other issues where alloc_contig_range() could be improved (e.g., in contrast to a FAST mode, a HARD mode that temporarily disables the PCP, ...).
Agreed!
size (e.g., 2M) than the requested order (64K), repeatedly, until the driver gathers the necessary amount of memory. Basically, this approach makes the allocation very slow due to cma_alloc's slowness, and it can get stuck on one of the pageblocks if it encounters an unmigratable page.
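For illustration, the repeated-allocation approach looks roughly like this on the driver side (a sketch only; the helper and driver names are made up, and cma_alloc() is assumed to take its current count/align/no_warn arguments):

#define NR_CHUNKS	4800
#define CHUNK_ORDER	4

static struct page *chunks[NR_CHUNKS];

static int gather_chunks(struct cma *cma)
{
	size_t i;

	for (i = 0; i < NR_CHUNKS; i++) {
		/* each call may stall migrating a pinned pageblock */
		chunks[i] = cma_alloc(cma, 1 << CHUNK_ORDER,
				      CHUNK_ORDER, false);
		if (!chunks[i])
			return -ENOMEM;
	}
	return 0;
}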
To solve the issue, this patch introduces cma_alloc_bulk.
int cma_alloc_bulk(struct cma *cma, unsigned int align, bool fast,
		   unsigned int order, size_t nr_requests,
		   struct page **page_array, size_t *nr_allocated);
Most parameters are the same as cma_alloc's, but it additionally takes a page array to store the allocated memory. What's different from cma_alloc is that it will skip, without waiting or stopping, any pageblock that contains an unmovable page, so the API keeps scanning other pageblocks to find pages of the requested order.
cma_alloc_bulk is a best-effort approach in that, unlike cma_alloc, it skips pageblocks that contain unmovable pages. It doesn't need to be perfect from the beginning at the cost of performance. Thus, the API takes a "bool fast" parameter, which is propagated into alloc_contig_range to skip the high-overhead operations that raise the CMA allocation success ratio (e.g., migration retries, PCP and LRU draining per pageblock), trading a lower success ratio for speed. If the caller couldn't allocate enough, they can call it again with fast == false to increase the success ratio, if they are willing to pay the overhead.
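A hypothetical two-pass caller, following the semantics above (the buffer and counters are made up; only cma_alloc_bulk() is from this patch):

	size_t want = 4800, got = 0;
	struct page *pages[4800];

	/* pass 1: fast, best effort; skips pageblocks with unmovable pages */
	cma_alloc_bulk(cma, 4, true, 4, want, pages, &got);

	if (got < want) {
		size_t more = 0;

		/* pass 2: pay the drain/retry overhead for the remainder */
		cma_alloc_bulk(cma, 4, false, 4, want - got,
			       pages + got, &more);
		got += more;
	}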
Just so I understand what the idea is:
alloc_contig_range() sometimes fails on CMA regions when trying to allocate big chunks (e.g., 300M). Instead of tackling that issue, you allocate plenty of small chunks and make these small allocations fail faster / make the allocations less reliable. Correct?
I don't really have a strong opinion on that. Giving up fast rather than trying for longer sounds like a useful thing to have - but I wonder if it's strictly necessary for the use case you describe.
I'd like to hear Michal's opinion on that.
Well, what I can see is that this new interface is an antipattern to our allocation routines. We tend to control allocations by gfp mask, yet you are introducing a bool parameter to make something faster... What that really means is rather arbitrary. Would it make more sense to teach cma_alloc resp. alloc_contig_range to recognize GFP_NOWAIT, GFP_NORETRY resp. GFP_RETRY_MAYFAIL instead?
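For concreteness, a gfp-based variant might look like this (an illustration of the suggestion only, not an actual proposal; want/pages/got are placeholders):

int cma_alloc_bulk(struct cma *cma, unsigned int align, gfp_t gfp,
		   unsigned int order, size_t nr_requests,
		   struct page **page_array, size_t *nr_allocated);

/* fail fast: no heavy draining or migration retries */
cma_alloc_bulk(cma, 4, GFP_KERNEL | __GFP_NORETRY, 4, want, pages, &got);

/* try harder, accepting the overhead */
cma_alloc_bulk(cma, 4, GFP_KERNEL | __GFP_RETRY_MAYFAIL, 4, want, pages, &got);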
If we use cma_alloc, that interface requires "allocate one big memory chunk". IOW, the return value is just a struct page, and it is expected that the page is one big contiguous block of memory; it can't have a hole in the range. However, the idea here is that we ask for much smaller chunks rather than one big contiguous block, so we can skip pages that happen to be pinned (long-term, short-term, whatever) and search for other pages in the CMA area to avoid a long stall. Thus, it couldn't work with the existing cma_alloc API via a simple gfp_mask.
I am not deeply familiar with the cma allocator, so sorry for a potentially stupid question. Why does a bulk interface perform better than repeated calls to cma_alloc? Is this because a failure would help to move on to the next pfn range, while a repeated call would have to deal with the same range?
Yub, true, along with other overheads (e.g., migration retries, waiting for writeback, PCP/LRU draining IPIs).
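Roughly (a simplified sketch, not the actual patch; cma_start/cma_end are placeholder bounds, the other names come from the proposed signature):

unsigned long pfn, nr_pages = 1UL << order;

for (pfn = cma_start; pfn + nr_pages <= cma_end; pfn += nr_pages) {
	if (*nr_allocated == nr_requests)
		break;
	/*
	 * In fast mode a failure returns quickly (no migration
	 * retries, no PCP/LRU drain IPIs) and the scan moves on to
	 * the next range instead of hammering the same one.
	 */
	if (alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_CMA,
			       GFP_KERNEL) == 0)
		page_array[(*nr_allocated)++] = pfn_to_page(pfn);
}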
Signed-off-by: Minchan Kim <minchan@kernel.org>
 include/linux/cma.h |   5 ++
 include/linux/gfp.h |   2 +
 mm/cma.c            | 126 ++++++++++++++++++++++++++++++++++++++++++--
 mm/page_alloc.c     |  19 ++++---
 4 files changed, 140 insertions(+), 12 deletions(-)
--
Michal Hocko
SUSE Labs