Welcome everyone once again,
This is yet another release of the Contiguous Memory Allocator patches. This version is mainly a result of the discussions at the Kernel Summit in Prague. The main change is a completely different code base for the migration feature of CMA: it now shares code with the memory compaction subsystem instead of the memory hotplug code used before. This code has been kindly provided by Michal Nazarewicz. There are also a few fixes here and there; see the changelog for the details.
Please note that this patch series is aimed at starting further discussion. There are still a few issues that need to be resolved before CMA is really ready. The most pressing problem is the issue with movable pages that causes migration to fail from time to time. Our investigation points to the conclusion that these rare pages cannot be migrated because there are pending IO operations on them.
The ARM integration code has not been changed since the last version; it implements all the ideas that were discussed during the Linaro Sprint meeting. Here are the details:
This version provides a solution for complete integration of CMA into the DMA-mapping subsystem on the ARM architecture. The issue caused by double mapping of DMA pages and possible aliasing in the coherent memory mapping has finally been resolved, both for the GFP_ATOMIC case (allocations come from the coherent memory pool) and the non-GFP_ATOMIC case (allocations come from CMA-managed areas).
For coherent, nommu, ARMv4 and ARMv5 systems the current DMA-mapping implementation has been kept.
For ARMv6+ systems, CMA has been enabled and a special pool of coherent memory for atomic allocations has been created. The size of this pool defaults to DEFAULT_CONSISTENT_DMA_SIZE/8, but can be changed with the coherent_pool kernel parameter (if really required).
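For example, a board that needs a bigger atomic pool could boot with something like the following on the kernel command line (the 2M value here is purely illustrative, not a recommendation):

    coherent_pool=2M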
All atomic allocations are served from this pool. I did a little simplification here, because there is no separate pool for writecombine memory - such requests are also served from the coherent pool. I don't think this simplification is a problem - I found no driver that uses dma_alloc_writecombine() with the GFP_ATOMIC flag.
All non-atomic allocations are served from the CMA area. The kernel mapping is updated to reflect the required memory attribute changes. This is possible because, during early boot, all CMA areas are remapped with 4KiB pages in kernel low memory.
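From the driver's point of view nothing changes and the standard DMA-mapping calls are used. Just as an illustrative sketch (the sizes are arbitrary, error handling is minimal and "dev" is assumed to be a valid, already probed device), the two cases look like this:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Illustrative sketch only: sizes are arbitrary and error handling is
 * minimal; "dev" is assumed to be a valid, already probed device.
 */
static int cma_example_allocations(struct device *dev)
{
	dma_addr_t dma_handle;
	void *vaddr;

	/* process context: expected to be satisfied from a CMA area,
	 * migrating movable pages out of the way if necessary */
	vaddr = dma_alloc_coherent(dev, 1024 * 1024, &dma_handle, GFP_KERNEL);
	if (!vaddr)
		return -ENOMEM;
	dma_free_coherent(dev, 1024 * 1024, vaddr, dma_handle);

	/* atomic context: expected to be served from the small coherent
	 * pool described above */
	vaddr = dma_alloc_coherent(dev, 4096, &dma_handle, GFP_ATOMIC);
	if (!vaddr)
		return -ENOMEM;
	dma_free_coherent(dev, 4096, vaddr, dma_handle);

	return 0;
}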
This version has been tested on the Samsung S5PC110-based Goni machine and the Exynos4 UniversalC210 board with various V4L2 multimedia drivers.
Coherent atomic allocations have been tested by manually enabling DMA bounce for the s3c-sdhci device.
All patches are prepared for Linux Kernel v3.2-rc2.
A few words for those who see CMA for the first time:
The Contiguous Memory Allocator (CMA) makes it possible for device drivers to allocate big contiguous chunks of memory after the system has booted.
The main difference from similar frameworks is that CMA allows the memory region reserved for big chunk allocations to be transparently reused as system memory, so no memory is wasted when no big chunk is allocated. Once an allocation request is issued, the framework migrates system pages away to create the required big chunk of physically contiguous memory.
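At the lowest level allocations are done in units of pages. Just as a rough sketch of how a bus or DMA-mapping backend could use the core helpers (the names come from this patchset, see the changelog entry about cm_alloc/cm_free being renamed; the exact signatures below are an assumption and may still change):

#include <linux/dma-contiguous.h>

/*
 * Hypothetical wrappers, for illustration only: "count" is the number of
 * pages and "order" the requested alignment expressed as a page order.
 */
static struct page *cma_grab_block(struct device *dev, int count,
				   unsigned int order)
{
	return dma_alloc_from_contiguous(dev, count, order);
}

static void cma_put_block(struct device *dev, struct page *pages, int count)
{
	dma_release_from_contiguous(dev, pages, count);
}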
For more information you can refer to the LWN articles http://lwn.net/Articles/447405/ and http://lwn.net/Articles/450286/, as well as to the links to previous versions of the CMA framework below.
The CMA framework has been initially developed by Michal Nazarewicz at Samsung Poland R&D Center. Since version 9, I've taken over the development, because Michal has left the company.
TODO (optional):
- implement support for contiguous memory areas placed in HIGHMEM zone
- resolve issue with movable pages with pending IO operations
Best regards
Marek Szyprowski
Samsung Poland R&D Center
Links to previous versions of the patchset:
v16: http://www.spinics.net/lists/linux-mm/msg25066.html
v15: http://www.spinics.net/lists/linux-mm/msg23365.html
v14: http://www.spinics.net/lists/linux-media/msg36536.html
v13: (internal, intentionally not released)
v12: http://www.spinics.net/lists/linux-media/msg35674.html
v11: http://www.spinics.net/lists/linux-mm/msg21868.html
v10: http://www.spinics.net/lists/linux-mm/msg20761.html
 v9: http://article.gmane.org/gmane.linux.kernel.mm/60787
 v8: http://article.gmane.org/gmane.linux.kernel.mm/56855
 v7: http://article.gmane.org/gmane.linux.kernel.mm/55626
 v6: http://article.gmane.org/gmane.linux.kernel.mm/55626
 v5: (intentionally left out as CMA v5 was identical to CMA v4)
 v4: http://article.gmane.org/gmane.linux.kernel.mm/52010
 v3: http://article.gmane.org/gmane.linux.kernel.mm/51573
 v2: http://article.gmane.org/gmane.linux.kernel.mm/50986
 v1: http://article.gmane.org/gmane.linux.kernel.mm/50669
Changelog:
v17: 1. Replaced the whole CMA core memory migration code with new code kindly provided by Michal Nazarewicz. The new code is based on the memory compaction framework rather than memory hotplug, as it was before. This change has been suggested by Mel Gorman.
2. Addressed most of the comments from Andrew Morton and Mel Gorman in the rest of the CMA code.
3. Fixed broken initialization on ARM systems with DMA zone enabled.
4. Rebased onto v3.2-rc2 kernel.
v16: 1. merged a fixup from Michal Nazarewicz to address comments from Dave Hansen about checking if pfns belong to the same memory zone
2. merged a fix from Michal Nazarewicz for incorrect handling of pages which belong to a page block that is in MIGRATE_ISOLATE state; in very rare cases the migrate type of a page block might have been changed from MIGRATE_CMA to MIGRATE_MOVABLE because of this bug
3. moved some common code to include/asm-generic
4. added support for x86 DMA-mapping framework for pci-dma hardware, CMA can be now even more widely tested on KVM/QEMU and a lot of common x86 boxes
5. rebased onto next-20111005 kernel tree, which includes changes in ARM DMA-mapping subsystem (CONSISTENT_DMA_SIZE removal)
6. removed patch for CMA s5p-fimc device private regions (served only as example) and provided the one that matches real life case - s5p-mfc device
v15: 1. fixed calculation of the total memory after activating CMA area (was broken from v12)
2. more code cleanup in drivers/base/dma-contiguous.c
3. added address limit for default CMA area
4. rewrote ARM DMA integration:
   - removed "ARM: DMA: steal memory for DMA coherent mappings" patch
   - kept current DMA mapping implementation for coherent, nommu and ARMv4/ARMv5 systems
   - enabled CMA for all ARMv6+ systems
   - added separate, small pool for coherent atomic allocations, defaults to CONSISTENT_DMA_SIZE/8, but can be changed with kernel parameter coherent_pool=[size]
v14: 1. Merged with "ARM: DMA: steal memory for DMA coherent mappings" patch, added support for GFP_ATOMIC allocations.
2. Added checks for NULL device pointer
v13: (internal, intentionally not released)
v12: 1. Fixed 2 nasty bugs in dma-contiguous allocator:
   - alignment argument was not passed correctly
   - range for dma_release_from_contiguous was not checked correctly
2. Added support for architecture specific dma_contiguous_early_fixup() function
3. CMA and DMA-mapping integration for the ARM architecture has been rewritten to take care of the memory aliasing issue that might happen on newer ARM CPUs (mapping the same pages with different cache attributes is forbidden). TODO: add support for GFP_ATOMIC allocations based on the "ARM: DMA: steal memory for DMA coherent mappings" patch and implement support for contiguous memory areas that are placed in the HIGHMEM zone
v11: 1. Removed genalloc usage and replaced it with direct calls to bitmap_* functions, dropped patches that are not needed anymore (genalloc extensions)
2. Moved all contiguous area management code from mm/cma.c to drivers/base/dma-contiguous.c
3. Renamed cm_alloc/free to dma_alloc/release_from_contiguous
4. Introduced global, system wide (default) contiguous area configured with kernel config and kernel cmdline parameters
5. Simplified initialization to just one function: dma_declare_contiguous()
6. Added example of device private memory contiguous area
v10: 1. Rebased onto 3.0-rc2 and resolved all conflicts
2. Simplified CMA to be just a pure memory allocator, for use with platform/bus specific subsystems, like dma-mapping. Removed all device specific functions and calls.
3. Integrated with ARM DMA-mapping subsystem.
4. Code cleanup here and there.
5. Removed private context support.
v9: 1. Rebased onto 2.6.39-rc1 and resolved all conflicts
2. Fixed a bunch of nasty bugs that happened when the allocation failed (mainly kernel oops due to NULL ptr dereference).
3. Introduced testing code: cma-regions compatibility layer and videobuf2-cma memory allocator module.
v8: 1. The alloc_contig_range() function has now been separated from CMA and put in page_allocator.c. This function tries to migrate all LRU pages in specified range and then allocate the range using alloc_contig_freed_pages().
2. Support for MIGRATE_CMA has been separated from the CMA code. I have not tested if CMA works with ZONE_MOVABLE but I see no reasons why it shouldn't.
3. I have added a @private argument when creating CMA contexts so that one can reserve memory and not share it with the rest of the system. This way, CMA acts only as allocation algorithm.
v7: 1. A lot of functionality that handled driver->allocator_context mapping has been removed from the patchset. This is not to say that this code is not needed, it's just not worth posting everything in one patchset.
Currently, CMA is "just" an allocator. It uses its own migratetype (MIGRATE_CMA) for defining ranges of pageblocks which behave just like ZONE_MOVABLE but, unlike the latter, can be put in arbitrary places.
2. The migration code that was introduced in the previous version actually started working.
v6: 1. Most importantly, v6 introduces support for memory migration. The implementation is not yet complete though.
Migration support means that when CMA is not using memory reserved for it, the page allocator can allocate pages from it. When CMA wants to use the memory, the pages have to be moved and/or evicted to make room for CMA.
To make it possible it must be guaranteed that only movable and reclaimable pages are allocated in CMA controlled regions. This is done by introducing a MIGRATE_CMA migrate type that guarantees exactly that.
Some of the migration code is "borrowed" from Kamezawa Hiroyuki's alloc_contig_pages() implementation. The main difference is that, thanks to the MIGRATE_CMA migrate type, CMA assumes the memory it controls is always movable or reclaimable, so it makes allocation decisions regardless of whether some pages are actually allocated, and migrates them if needed.
The most interesting patches from the patchset that implement the functionality are:
  09/13: mm: alloc_contig_free_pages() added
  10/13: mm: MIGRATE_CMA migration type added
  11/13: mm: MIGRATE_CMA isolation functions added
  12/13: mm: cma: Migration support added [wip]
Currently, kernel panics in some situations which I am trying to investigate.
2. cma_pin() and cma_unpin() functions have been added (after a conversation with Johan Mossberg). The idea is that whenever hardware does not use the memory (no transaction is in progress) the chunk can be moved around. This would allow defragmentation to be implemented if desired. No defragmentation algorithm is provided at this time.
3. Sysfs support has been replaced with debugfs. I always felt unsure about the sysfs interface and when Greg KH pointed it out I finally got to rewrite it to debugfs.
v5: (intentionally left out as CMA v5 was identical to CMA v4)
v4: 1. The "asterisk" flag has been removed in favour of requiring that platform will provide a "*=<regions>" rule in the map attribute.
2. The terminology has been changed slightly, renaming "kind" of memory to "type" of memory. In the previous revisions the documentation indicated that device drivers define memory kinds; now it says that they define memory types.
v3: 1. The command line parameters have been removed (and moved to a separate patch, the fourth one). As a consequence, the cma_set_defaults() function has been changed -- it no longer accepts a string with list of regions but an array of regions.
2. The "asterisk" attribute has been removed. Now, each region has an "asterisk" flag which lets one specify whether this region should be considered an "asterisk" region.
3. SysFS support has been moved to a separate patch (the third one in the series) and now also includes list of regions.
v2: 1. The "cma_map" command line parameter has been removed. In exchange, a SysFS entry has been created under kernel/mm/contiguous.
The intended way of specifying the attributes is a cma_set_defaults() function called by platform initialisation code. "regions" attribute (the string specified by "cma" command line parameter) can be overwritten with command line parameter; the other attributes can be changed during run-time using the SysFS entries.
2. The behaviour of the "map" attribute has been modified slightly. Currently, if no rule matches given device it is assigned regions specified by the "asterisk" attribute. It is by default built from the region names given in "regions" attribute.
3. Devices can register private regions as well as regions that can be shared but are not reserved using standard CMA mechanisms. A private region has no name and can be accessed only by devices that have the pointer to it.
4. The way allocators are registered has changed. Currently, a cma_allocator_register() function is used for that purpose. Moreover, allocators are attached to regions the first time memory is registered from the region or when allocator is registered which means that allocators can be dynamic modules that are loaded after the kernel booted (of course, it won't be possible to allocate a chunk of memory from a region if allocator is not loaded).
5. Index of new functions:
static inline dma_addr_t __must_check
cma_alloc_from(const char *regions, size_t size, dma_addr_t alignment)

static inline int
cma_info_about(struct cma_info *info, const char *regions)

int __must_check cma_region_register(struct cma_region *reg);

dma_addr_t __must_check
cma_alloc_from_region(struct cma_region *reg, size_t size, dma_addr_t alignment);

static inline dma_addr_t __must_check
cma_alloc_from(const char *regions, size_t size, dma_addr_t alignment);

int cma_allocator_register(struct cma_allocator *alloc);
Patches in this patchset:
Marek Szyprowski (4):
  drivers: add Contiguous Memory Allocator
  X86: integrate CMA with DMA-mapping subsystem
  ARM: integrate CMA with DMA-mapping subsystem
  ARM: Samsung: use CMA for 2 memory banks for s5p-mfc device
Michal Nazarewicz (7):
  mm: page_alloc: handle MIGRATE_ISOLATE in free_pcppages_bulk()
  mm: compaction: introduce isolate_{free,migrate}pages_range().
  mm: mmzone: introduce zone_pfn_same_memmap()
  mm: compaction: export some of the functions
  mm: page_alloc: introduce alloc_contig_range()
  mm: mmzone: MIGRATE_CMA migration type added
  mm: page_isolation: MIGRATE_CMA isolation functions added
 Documentation/kernel-parameters.txt   |    9 +
 arch/Kconfig                          |    3 +
 arch/arm/Kconfig                      |    2 +
 arch/arm/include/asm/dma-contiguous.h |   16 ++
 arch/arm/include/asm/mach/map.h       |    1 +
 arch/arm/kernel/setup.c               |    8 +-
 arch/arm/mm/dma-mapping.c             |  368 +++++++++++++++++++++++++------
 arch/arm/mm/init.c                    |   20 ++-
 arch/arm/mm/mm.h                      |    3 +
 arch/arm/mm/mmu.c                     |   29 ++-
 arch/arm/plat-s5p/dev-mfc.c           |   51 +----
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/dma-contiguous.h |   13 +
 arch/x86/include/asm/dma-mapping.h    |    4 +
 arch/x86/kernel/pci-dma.c             |   18 ++-
 arch/x86/kernel/pci-nommu.c           |    8 +-
 arch/x86/kernel/setup.c               |    2 +
 drivers/base/Kconfig                  |   89 ++++++++
 drivers/base/Makefile                 |    1 +
 drivers/base/dma-contiguous.c         |  396 +++++++++++++++++++++++++++++++++
 include/asm-generic/dma-contiguous.h  |   27 +++
 include/linux/device.h                |    4 +
 include/linux/dma-contiguous.h        |  110 +++++++++
 include/linux/mmzone.h                |   57 ++++-
 include/linux/page-isolation.h        |   27 ++-
 mm/Kconfig                            |    2 +-
 mm/Makefile                           |    3 +-
 mm/compaction.c                       |  230 +++++++++++--------
 mm/internal.h                         |   35 +++
 mm/memory-failure.c                   |    2 +-
 mm/memory_hotplug.c                   |    6 +-
 mm/page_alloc.c                       |  315 ++++++++++++++++++++++++--
 mm/page_isolation.c                   |   15 +-
 33 files changed, 1591 insertions(+), 284 deletions(-)
 create mode 100644 arch/arm/include/asm/dma-contiguous.h
 create mode 100644 arch/x86/include/asm/dma-contiguous.h
 create mode 100644 drivers/base/dma-contiguous.c
 create mode 100644 include/asm-generic/dma-contiguous.h
 create mode 100644 include/linux/dma-contiguous.h
From: Michal Nazarewicz mina86@mina86.com
If page is on PCP list while pageblock it belongs to gets isolated, the page's private still holds the old migrate type. This means that free_pcppages_bulk() will put the page on a freelist of the old migrate type instead of MIGRATE_ISOLATE.
This commit changes that by explicitly checking whether page's pageblock's migrate type is MIGRATE_ISOLATE and if it is, overwrites page's private data.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 mm/page_alloc.c | 12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..58d1a2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -628,6 +628,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
+
+			/*
+			 * When page is isolated in set_migratetype_isolate()
+			 * function it's page_private is not changed since the
+			 * function has no way of knowing if it can touch it.
+			 * This means that when a page is on PCP list, it's
+			 * page_private no longer matches the desired migrate
+			 * type.
+			 */
+			if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
+				set_page_private(page, MIGRATE_ISOLATE);
+
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, page_private(page));
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
On Fri, Nov 18, 2011 at 05:43:08PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com
If page is on PCP list while pageblock it belongs to gets isolated, the page's private still holds the old migrate type. This means that free_pcppages_bulk() will put the page on a freelist of the old migrate type instead of MIGRATE_ISOLATE.
This commit changes that by explicitly checking whether page's pageblock's migrate type is MIGRATE_ISOLATE and if it is, overwrites page's private data.
Signed-off-by: Michal Nazarewicz mina86@mina86.com Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
 mm/page_alloc.c | 12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..58d1a2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -628,6 +628,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
+
+			/*
+			 * When page is isolated in set_migratetype_isolate()
+			 * function it's page_private is not changed since the
+			 * function has no way of knowing if it can touch it.
+			 * This means that when a page is on PCP list, it's
+			 * page_private no longer matches the desired migrate
+			 * type.
+			 */
+			if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
+				set_page_private(page, MIGRATE_ISOLATE);
How much of a problem is this in practice? It's adding overhead to the free path for what should be a very rare case which is undesirable. I know we are already calling get_pageblock_migrate() when freeing pages but it should be unnecessary to call it again. I'd go as far to say that it would be preferable to drain the per-CPU lists after you set pageblocks MIGRATE_ISOLATE. The IPIs also have overhead but it will be incurred for the rare rather than the common case.
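Roughly something like this (untested sketch, the helper name is made up; the point is a one-off drain when the range is isolated instead of a check on every free):

#include <linux/gfp.h>
#include <linux/page-isolation.h>

/* untested sketch: isolate the pageblocks first, then flush the per-CPU
 * lists once so no page with a stale migratetype stays on a PCP list */
static int cma_isolate_range(unsigned long start_pfn, unsigned long end_pfn)
{
	int ret;

	ret = start_isolate_page_range(start_pfn, end_pfn);
	if (ret)
		return ret;

	/* one round of IPIs here, instead of overhead in free_pcppages_bulk() */
	drain_all_pages();

	return 0;
}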
On Fri, Nov 18, 2011 at 05:43:08PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..58d1a2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -628,6 +628,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
+
+			/*
+			 * When page is isolated in set_migratetype_isolate()
+			 * function it's page_private is not changed since the
+			 * function has no way of knowing if it can touch it.
+			 * This means that when a page is on PCP list, it's
+			 * page_private no longer matches the desired migrate
+			 * type.
+			 */
+			if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
+				set_page_private(page, MIGRATE_ISOLATE);
On Mon, 12 Dec 2011 14:42:35 +0100, Mel Gorman mel@csn.ul.ie wrote:
How much of a problem is this in practice?
IIRC, this led to allocations being made from an area marked as isolated, or some such.
[...] I'd go as far to say that it would be preferable to drain the per-CPU lists after you set pageblocks MIGRATE_ISOLATE. The IPIs also have overhead but it will be incurred for the rare rather than the common case.
I'll look into that.
On Mon, Dec 12, 2011 at 03:23:02PM +0100, Michal Nazarewicz wrote:
On Fri, Nov 18, 2011 at 05:43:08PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..58d1a2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -628,6 +628,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
+
+			/*
+			 * When page is isolated in set_migratetype_isolate()
+			 * function it's page_private is not changed since the
+			 * function has no way of knowing if it can touch it.
+			 * This means that when a page is on PCP list, it's
+			 * page_private no longer matches the desired migrate
+			 * type.
+			 */
+			if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE)
+				set_page_private(page, MIGRATE_ISOLATE);
On Mon, 12 Dec 2011 14:42:35 +0100, Mel Gorman mel@csn.ul.ie wrote:
How much of a problem is this in practice?
IIRC, this led to allocations being made from an area marked as isolated, or some such.
And I believe that nothing prevents that from happening. I was just wondering how common it was in practice. Draining the per-cpu lists should work as a substitute either way.
From: Michal Nazarewicz mina86@mina86.com
This commit introduces isolate_freepages_range() and isolate_migratepages_range() functions. The first one replaces isolate_freepages_block() and the second one extracts functionality from isolate_migratepages().
They are more generic and instead of operating on pageblocks operate on PFN ranges.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 mm/compaction.c | 170 ++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 111 insertions(+), 59 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c index 899d956..6afae0e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -54,51 +54,64 @@ static unsigned long release_freepages(struct list_head *freelist) return count; }
-/* Isolate free pages onto a private freelist. Must hold zone->lock */ -static unsigned long isolate_freepages_block(struct zone *zone, - unsigned long blockpfn, - struct list_head *freelist) +/** + * isolate_freepages_range() - isolate free pages, must hold zone->lock. + * @zone: Zone pages are in. + * @start: The first PFN to start isolating. + * @end: The one-past-last PFN. + * @freelist: A list to save isolated pages to. + * + * If @freelist is not provided, holes in range (either non-free pages + * or invalid PFNs) are considered an error and function undos its + * actions and returns zero. + * + * If @freelist is provided, function will simply skip non-free and + * missing pages and put only the ones isolated on the list. + * + * Returns number of isolated pages. This may be more then end-start + * if end fell in a middle of a free page. + */ +static unsigned long +isolate_freepages_range(struct zone *zone, + unsigned long start, unsigned long end, + struct list_head *freelist) { - unsigned long zone_end_pfn, end_pfn; - int nr_scanned = 0, total_isolated = 0; - struct page *cursor; - - /* Get the last PFN we should scan for free pages at */ - zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages; - end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn); + unsigned long nr_scanned = 0, total_isolated = 0; + unsigned long pfn = start; + struct page *page;
- /* Find the first usable PFN in the block to initialse page cursor */ - for (; blockpfn < end_pfn; blockpfn++) { - if (pfn_valid_within(blockpfn)) - break; - } - cursor = pfn_to_page(blockpfn); + VM_BUG_ON(!pfn_valid(pfn)); + page = pfn_to_page(pfn);
/* Isolate free pages. This assumes the block is valid */ - for (; blockpfn < end_pfn; blockpfn++, cursor++) { - int isolated, i; - struct page *page = cursor; - - if (!pfn_valid_within(blockpfn)) - continue; - nr_scanned++; - - if (!PageBuddy(page)) - continue; + while (pfn < end) { + unsigned isolated = 1, i; + + if (!pfn_valid_within(pfn)) + goto skip; + ++nr_scanned; + + if (!PageBuddy(page)) { +skip: + if (freelist) + goto next; + for (; start < pfn; ++start) + __free_page(pfn_to_page(pfn)); + return 0; + }
/* Found a free page, break it into order-0 pages */ isolated = split_free_page(page); total_isolated += isolated; - for (i = 0; i < isolated; i++) { - list_add(&page->lru, freelist); - page++; + if (freelist) { + struct page *p = page; + for (i = isolated; i; --i, ++p) + list_add(&p->lru, freelist); }
- /* If a page was split, advance to the end of it */ - if (isolated) { - blockpfn += isolated - 1; - cursor += isolated - 1; - } +next: + pfn += isolated; + page += isolated; }
trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated); @@ -135,7 +148,7 @@ static void isolate_freepages(struct zone *zone, struct compact_control *cc) { struct page *page; - unsigned long high_pfn, low_pfn, pfn; + unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn; unsigned long flags; int nr_freepages = cc->nr_freepages; struct list_head *freelist = &cc->freepages; @@ -155,6 +168,8 @@ static void isolate_freepages(struct zone *zone, */ high_pfn = min(low_pfn, pfn);
+ zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages; + /* * Isolate free pages until enough are available to migrate the * pages on cc->migratepages. We stop searching if the migrate @@ -191,7 +206,9 @@ static void isolate_freepages(struct zone *zone, isolated = 0; spin_lock_irqsave(&zone->lock, flags); if (suitable_migration_target(page)) { - isolated = isolate_freepages_block(zone, pfn, freelist); + end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn); + isolated = isolate_freepages_range(zone, pfn, + end_pfn, freelist); nr_freepages += isolated; } spin_unlock_irqrestore(&zone->lock, flags); @@ -250,31 +267,34 @@ typedef enum { ISOLATE_SUCCESS, /* Pages isolated, migrate */ } isolate_migrate_t;
-/* - * Isolate all pages that can be migrated from the block pointed to by - * the migrate scanner within compact_control. +/** + * isolate_migratepages_range() - isolate all migrate-able pages in range. + * @zone: Zone pages are in. + * @cc: Compaction control structure. + * @low_pfn: The first PFN of the range. + * @end_pfn: The one-past-the-last PFN of the range. + * + * Isolate all pages that can be migrated from the range specified by + * [low_pfn, end_pfn). Returns zero if there is a fatal signal + * pending), otherwise PFN of the first page that was not scanned + * (which may be both less, equal to or more then end_pfn). + * + * Assumes that cc->migratepages is empty and cc->nr_migratepages is + * zero. + * + * Other then cc->migratepages and cc->nr_migratetypes this function + * does not modify any cc's fields, ie. it does not modify (or read + * for that matter) cc->migrate_pfn. */ -static isolate_migrate_t isolate_migratepages(struct zone *zone, - struct compact_control *cc) +static unsigned long +isolate_migratepages_range(struct zone *zone, struct compact_control *cc, + unsigned long low_pfn, unsigned long end_pfn) { - unsigned long low_pfn, end_pfn; unsigned long last_pageblock_nr = 0, pageblock_nr; unsigned long nr_scanned = 0, nr_isolated = 0; struct list_head *migratelist = &cc->migratepages; isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
- /* Do not scan outside zone boundaries */ - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn); - - /* Only scan within a pageblock boundary */ - end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages); - - /* Do not cross the free scanner or scan within a memory hole */ - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) { - cc->migrate_pfn = end_pfn; - return ISOLATE_NONE; - } - /* * Ensure that there are not too many pages isolated from the LRU * list by either parallel reclaimers or compaction. If there are, @@ -283,12 +303,12 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, while (unlikely(too_many_isolated(zone))) { /* async migration should just abort */ if (!cc->sync) - return ISOLATE_ABORT; + return 0;
congestion_wait(BLK_RW_ASYNC, HZ/10);
if (fatal_signal_pending(current)) - return ISOLATE_ABORT; + return 0; }
/* Time to isolate some pages for migration */ @@ -365,17 +385,49 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, nr_isolated++;
/* Avoid isolating too much */ - if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) + if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) { + ++low_pfn; break; + } }
acct_isolated(zone, cc);
spin_unlock_irq(&zone->lru_lock); - cc->migrate_pfn = low_pfn;
trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
+ return low_pfn; +} + +/* + * Isolate all pages that can be migrated from the block pointed to by + * the migrate scanner within compact_control. + */ +static isolate_migrate_t isolate_migratepages(struct zone *zone, + struct compact_control *cc) +{ + unsigned long low_pfn, end_pfn; + + /* Do not scan outside zone boundaries */ + low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn); + + /* Only scan within a pageblock boundary */ + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages); + + /* Do not cross the free scanner or scan within a memory hole */ + if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) { + cc->migrate_pfn = end_pfn; + return ISOLATE_NONE; + } + + /* Perform the isolation */ + low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn); + if (!low_pfn) + return ISOLATE_ABORT; + + cc->migrate_pfn = low_pfn; + return ISOLATE_SUCCESS; }
On Fri, Nov 18, 2011 at 05:43:09PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com
This commit introduces isolate_freepages_range() and isolate_migratepages_range() functions. The first one replaces isolate_freepages_block() and the second one extracts functionality from isolate_migratepages().
They are more generic and instead of operating on pageblocks operate on PFN ranges.
Signed-off-by: Michal Nazarewicz mina86@mina86.com Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
mm/compaction.c | 170 ++++++++++++++++++++++++++++++++++++------------------- 1 files changed, 111 insertions(+), 59 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c index 899d956..6afae0e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -54,51 +54,64 @@ static unsigned long release_freepages(struct list_head *freelist) return count; } -/* Isolate free pages onto a private freelist. Must hold zone->lock */ -static unsigned long isolate_freepages_block(struct zone *zone,
unsigned long blockpfn,
struct list_head *freelist)
+/**
- isolate_freepages_range() - isolate free pages, must hold zone->lock.
- @zone: Zone pages are in.
- @start: The first PFN to start isolating.
- @end: The one-past-last PFN.
- @freelist: A list to save isolated pages to.
- If @freelist is not provided, holes in range (either non-free pages
- or invalid PFNs) are considered an error and function undos its
- actions and returns zero.
- If @freelist is provided, function will simply skip non-free and
- missing pages and put only the ones isolated on the list.
- Returns number of isolated pages. This may be more then end-start
- if end fell in a middle of a free page.
- */
+static unsigned long +isolate_freepages_range(struct zone *zone,
unsigned long start, unsigned long end,
struct list_head *freelist)
Use start_pfn and end_pfn to keep it consistent with the rest of compaction.c.
{
- unsigned long zone_end_pfn, end_pfn;
- int nr_scanned = 0, total_isolated = 0;
- struct page *cursor;
- /* Get the last PFN we should scan for free pages at */
- zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
- end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
- unsigned long nr_scanned = 0, total_isolated = 0;
- unsigned long pfn = start;
- struct page *page;
- /* Find the first usable PFN in the block to initialse page cursor */
- for (; blockpfn < end_pfn; blockpfn++) {
if (pfn_valid_within(blockpfn))
break;
- }
- cursor = pfn_to_page(blockpfn);
- VM_BUG_ON(!pfn_valid(pfn));
- page = pfn_to_page(pfn);
/* Isolate free pages. This assumes the block is valid */
- for (; blockpfn < end_pfn; blockpfn++, cursor++) {
int isolated, i;
struct page *page = cursor;
if (!pfn_valid_within(blockpfn))
continue;
nr_scanned++;
if (!PageBuddy(page))
continue;
- while (pfn < end) {
unsigned isolated = 1, i;
Do not use implicit types. These are unsigned ints, call them unsigned ints.
if (!pfn_valid_within(pfn))
goto skip;
The flow of this function in general with gotos of skipped and next is confusing in comparison to the existing function. For example, if this PFN is not valid, and no freelist is provided, then we call __free_page() on a PFN that is known to be invalid.
++nr_scanned;
if (!PageBuddy(page)) {
+skip:
if (freelist)
goto next;
for (; start < pfn; ++start)
__free_page(pfn_to_page(pfn));
return 0;
}
So if a PFN is valid and !PageBuddy and no freelist is provided, we call __free_page() on it regardless of reference count. That does not sound safe.
/* Found a free page, break it into order-0 pages */ isolated = split_free_page(page); total_isolated += isolated;
for (i = 0; i < isolated; i++) {
list_add(&page->lru, freelist);
page++;
if (freelist) {
struct page *p = page;
for (i = isolated; i; --i, ++p)
}list_add(&p->lru, freelist);
/* If a page was split, advance to the end of it */
if (isolated) {
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+next:
pfn += isolated;
page += isolated;
The name isolated is now confusing because it can mean either pages isolated or pages scanned depending on context. Your patch appears to be doing a lot more than is necessary to convert isolate_freepages_block into isolate_freepages_range and at this point, it's unclear why you did that.
} trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated); @@ -135,7 +148,7 @@ static void isolate_freepages(struct zone *zone, struct compact_control *cc) { struct page *page;
- unsigned long high_pfn, low_pfn, pfn;
- unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn; unsigned long flags; int nr_freepages = cc->nr_freepages; struct list_head *freelist = &cc->freepages;
@@ -155,6 +168,8 @@ static void isolate_freepages(struct zone *zone, */ high_pfn = min(low_pfn, pfn);
- zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
- /*
- Isolate free pages until enough are available to migrate the
- pages on cc->migratepages. We stop searching if the migrate
@@ -191,7 +206,9 @@ static void isolate_freepages(struct zone *zone, isolated = 0; spin_lock_irqsave(&zone->lock, flags); if (suitable_migration_target(page)) {
isolated = isolate_freepages_block(zone, pfn, freelist);
end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
isolated = isolate_freepages_range(zone, pfn,
} spin_unlock_irqrestore(&zone->lock, flags);end_pfn, freelist); nr_freepages += isolated;
@@ -250,31 +267,34 @@ typedef enum { ISOLATE_SUCCESS, /* Pages isolated, migrate */ } isolate_migrate_t; -/*
- Isolate all pages that can be migrated from the block pointed to by
- the migrate scanner within compact_control.
+/**
- isolate_migratepages_range() - isolate all migrate-able pages in range.
- @zone: Zone pages are in.
- @cc: Compaction control structure.
- @low_pfn: The first PFN of the range.
- @end_pfn: The one-past-the-last PFN of the range.
- Isolate all pages that can be migrated from the range specified by
- [low_pfn, end_pfn). Returns zero if there is a fatal signal
- pending), otherwise PFN of the first page that was not scanned
- (which may be both less, equal to or more then end_pfn).
- Assumes that cc->migratepages is empty and cc->nr_migratepages is
- zero.
- Other then cc->migratepages and cc->nr_migratetypes this function
- does not modify any cc's fields, ie. it does not modify (or read
*/
- for that matter) cc->migrate_pfn.
-static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
+static unsigned long +isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
unsigned long low_pfn, unsigned long end_pfn)
{
- unsigned long low_pfn, end_pfn; unsigned long last_pageblock_nr = 0, pageblock_nr; unsigned long nr_scanned = 0, nr_isolated = 0; struct list_head *migratelist = &cc->migratepages; isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
- /* Do not scan outside zone boundaries */
- low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
- /* Only scan within a pageblock boundary */
- end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
- /* Do not cross the free scanner or scan within a memory hole */
- if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
cc->migrate_pfn = end_pfn;
return ISOLATE_NONE;
- }
- /*
- Ensure that there are not too many pages isolated from the LRU
- list by either parallel reclaimers or compaction. If there are,
@@ -283,12 +303,12 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, while (unlikely(too_many_isolated(zone))) { /* async migration should just abort */ if (!cc->sync)
return ISOLATE_ABORT;
return 0;
congestion_wait(BLK_RW_ASYNC, HZ/10); if (fatal_signal_pending(current))
return ISOLATE_ABORT;
}return 0;
/* Time to isolate some pages for migration */ @@ -365,17 +385,49 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, nr_isolated++; /* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn; break;
}}
This change is unrelated to the rest of the path. I recognise that incrementing low_pfn would prevent an already isolated PFN being scanned the next time but it should be a separate patch.
acct_isolated(zone, cc); spin_unlock_irq(&zone->lru_lock);
- cc->migrate_pfn = low_pfn;
trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
- return low_pfn;
+}
+/*
- Isolate all pages that can be migrated from the block pointed to by
- the migrate scanner within compact_control.
- */
+static isolate_migrate_t isolate_migratepages(struct zone *zone,
struct compact_control *cc)
+{
- unsigned long low_pfn, end_pfn;
- /* Do not scan outside zone boundaries */
- low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
- /* Only scan within a pageblock boundary */
- end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
- /* Do not cross the free scanner or scan within a memory hole */
- if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
cc->migrate_pfn = end_pfn;
return ISOLATE_NONE;
- }
- /* Perform the isolation */
- low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
- if (!low_pfn)
return ISOLATE_ABORT;
- cc->migrate_pfn = low_pfn;
- return ISOLATE_SUCCESS;
} -- 1.7.1.569.g6f426
On Fri, Nov 18, 2011 at 05:43:09PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com diff --git a/mm/compaction.c b/mm/compaction.c index 899d956..6afae0e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -54,51 +54,64 @@ static unsigned long release_freepages(struct list_head *freelist) return count; }
-/* Isolate free pages onto a private freelist. Must hold zone->lock */ -static unsigned long isolate_freepages_block(struct zone *zone,
unsigned long blockpfn,
struct list_head *freelist)
+/**
- isolate_freepages_range() - isolate free pages, must hold zone->lock.
- @zone: Zone pages are in.
- @start: The first PFN to start isolating.
- @end: The one-past-last PFN.
- @freelist: A list to save isolated pages to.
- If @freelist is not provided, holes in range (either non-free pages
- or invalid PFNs) are considered an error and function undos its
- actions and returns zero.
- If @freelist is provided, function will simply skip non-free and
- missing pages and put only the ones isolated on the list.
- Returns number of isolated pages. This may be more then end-start
- if end fell in a middle of a free page.
- */
+static unsigned long +isolate_freepages_range(struct zone *zone,
unsigned long start, unsigned long end,
struct list_head *freelist)
On Mon, 12 Dec 2011 15:07:28 +0100, Mel Gorman mel@csn.ul.ie wrote:
Use start_pfn and end_pfn to keep it consistent with the rest of compaction.c.
Will do.
{
- unsigned long zone_end_pfn, end_pfn;
- int nr_scanned = 0, total_isolated = 0;
- struct page *cursor;
- /* Get the last PFN we should scan for free pages at */
- zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
- end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
- unsigned long nr_scanned = 0, total_isolated = 0;
- unsigned long pfn = start;
- struct page *page;
- /* Find the first usable PFN in the block to initialse page cursor */
- for (; blockpfn < end_pfn; blockpfn++) {
if (pfn_valid_within(blockpfn))
break;
- }
- cursor = pfn_to_page(blockpfn);
VM_BUG_ON(!pfn_valid(pfn));
page = pfn_to_page(pfn);
/* Isolate free pages. This assumes the block is valid */
- for (; blockpfn < end_pfn; blockpfn++, cursor++) {
int isolated, i;
struct page *page = cursor;
if (!pfn_valid_within(blockpfn))
continue;
nr_scanned++;
if (!PageBuddy(page))
continue;
- while (pfn < end) {
unsigned isolated = 1, i;
Do not use implicit types. These are unsigned ints, call them unsigned ints.
Will do.
if (!pfn_valid_within(pfn))
goto skip;
The flow of this function in general with gotos of skipped and next is confusing in comparison to the existing function. For example, if this PFN is not valid, and no freelist is provided, then we call __free_page() on a PFN that is known to be invalid.
++nr_scanned;
if (!PageBuddy(page)) {
+skip:
if (freelist)
goto next;
for (; start < pfn; ++start)
__free_page(pfn_to_page(pfn));
return 0;
}
So if a PFN is valid and !PageBuddy and no freelist is provided, we call __free_page() on it regardless of reference count. That does not sound safe.
Sorry about that. It's a bug in the code which was caught later on. The code should read “__free_page(pfn_to_page(start))”.
/* Found a free page, break it into order-0 pages */ isolated = split_free_page(page); total_isolated += isolated;
for (i = 0; i < isolated; i++) {
list_add(&page->lru, freelist);
page++;
if (freelist) {
struct page *p = page;
for (i = isolated; i; --i, ++p)
}list_add(&p->lru, freelist);
/* If a page was split, advance to the end of it */
if (isolated) {
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+next:
pfn += isolated;
page += isolated;
The name isolated is now confusing because it can mean either pages isolated or pages scanned depending on context. Your patch appears to be doing a lot more than is necessary to convert isolate_freepages_block into isolate_freepages_range and at this point, it's unclear why you did that.
When CMA uses this function, it requires all pages in the range to be valid and free. (Both conditions should be met but you never know.) This change adds a second way isolate_freepages_range() works, which is when freelist is not specified, abort on invalid or non-free page, but continue as usual if freelist is provided.
I can try and restructure this function a bit so that there are fewer “gotos”, but without the above change, CMA won't really be able to use it effectively (it would have to provide a freelist and then validate if pages on it are added in order).
}
trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
On Mon, Dec 12, 2011 at 04:22:39PM +0100, Michal Nazarewicz wrote:
<SNIP>
if (!pfn_valid_within(pfn))
goto skip;
The flow of this function in general with gotos of skipped and next is confusing in comparison to the existing function. For example, if this PFN is not valid, and no freelist is provided, then we call __free_page() on a PFN that is known to be invalid.
++nr_scanned;
if (!PageBuddy(page)) {
+skip:
if (freelist)
goto next;
for (; start < pfn; ++start)
__free_page(pfn_to_page(pfn));
return 0;
}
So if a PFN is valid and !PageBuddy and no freelist is provided, we call __free_page() on it regardless of reference count. That does not sound safe.
Sorry about that. It's a bug in the code which was caught later on. The code should read “__free_page(pfn_to_page(start))”.
That will call free on valid PFNs but why is it safe to call __free_page() at all? You say later that CMA requires that all pages in the range be valid but if the pages are in use, that does not mean that calling __free_page() is safe. I suspect you have not seen a problem because the pages in the range were free as expected and not in use because of MIGRATE_ISOLATE.
/* Found a free page, break it into order-0 pages */ isolated = split_free_page(page); total_isolated += isolated;
for (i = 0; i < isolated; i++) {
list_add(&page->lru, freelist);
page++;
if (freelist) {
struct page *p = page;
for (i = isolated; i; --i, ++p)
}list_add(&p->lru, freelist);
/* If a page was split, advance to the end of it */
if (isolated) {
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+next:
pfn += isolated;
page += isolated;
The name isolated is now confusing because it can mean either pages isolated or pages scanned depending on context. Your patch appears to be doing a lot more than is necessary to convert isolate_freepages_block into isolate_freepages_range and at this point, it's unclear why you did that.
When CMA uses this function, it requires all pages in the range to be valid and free. (Both conditions should be met but you never know.)
It seems racy but I guess you are depending on MIGRATE_ISOLATE to keep things sane which is fine. However, I strongly suspect that if there is a race and a page is in use, then you will need to retry the migration step. Calling __free_page does not look right because something still has a reference to the page.
This change adds a second way isolate_freepages_range() works, which is when freelist is not specified, abort on invalid or non-free page, but continue as usual if freelist is provided.
Ok, I think you should be able to do that by not calling split_free_page or adding to the list if !freelist with a comment explaining why the pages are left on the buddy lists for the caller to figure out. Bail if a page-in-use is found and have the caller check that the return value of isolate_freepages_block == end_pfn - start_pfn.
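In other words, something roughly along these lines (completely untested, just to show the flow I mean; strict callers would check that the return value equals end_pfn - start_pfn):

/* untested sketch of the flow described above; zone->lock held by caller */
static unsigned long
isolate_freepages_range(struct zone *zone, unsigned long start_pfn,
			unsigned long end_pfn, struct list_head *freelist)
{
	unsigned long pfn = start_pfn, total_isolated = 0;

	while (pfn < end_pfn) {
		struct page *page;
		unsigned int nr, i;

		if (!pfn_valid_within(pfn) ||
		    !PageBuddy(page = pfn_to_page(pfn))) {
			if (!freelist)
				break;	/* strict mode: bail, caller checks */
			pfn++;
			continue;
		}

		if (freelist) {
			/* break into order-0 pages and hand them to the caller */
			nr = split_free_page(page);
			for (i = 0; i < nr; i++, page++)
				list_add(&page->lru, freelist);
		} else {
			/*
			 * Leave the page on the buddy lists; the caller only
			 * needs to know whether the whole range was free.
			 */
			nr = 1 << page_order(page);
		}

		total_isolated += nr;
		pfn += nr ? nr : 1;
	}

	return total_isolated;
}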
I can try and restructure this function a bit so that there are fewer “gotos”, but without the above change, CMA won't really be able to use it effectively (it would have to provide a freelist and then validate if pages on it are added in order).
Please do and double check that __free_page logic too.
On Mon, 12 Dec 2011 17:30:52 +0100, Mel Gorman mel@csn.ul.ie wrote:
On Mon, Dec 12, 2011 at 04:22:39PM +0100, Michal Nazarewicz wrote:
<SNIP>
if (!pfn_valid_within(pfn))
goto skip;
The flow of this function in general with gotos of skipped and next is confusing in comparison to the existing function. For example, if this PFN is not valid, and no freelist is provided, then we call __free_page() on a PFN that is known to be invalid.
++nr_scanned;
if (!PageBuddy(page)) {
+skip:
if (freelist)
goto next;
for (; start < pfn; ++start)
__free_page(pfn_to_page(pfn));
return 0;
}
So if a PFN is valid and !PageBuddy and no freelist is provided, we call __free_page() on it regardless of reference count. That does not sound safe.
Sorry about that. It's a bug in the code which was caught later on. The code should read “__free_page(pfn_to_page(start))”.
On Mon, 12 Dec 2011 17:30:52 +0100, Mel Gorman mel@csn.ul.ie wrote:
That will call free on valid PFNs but why is it safe to call __free_page() at all? You say later that CMA requires that all pages in the range be valid but if the pages are in use, that does not mean that calling __free_page() is safe. I suspect you have not seen a problem because the pages in the range were free as expected and not in use because of MIGRATE_ISOLATE.
All pages from [start, pfn) have passed through the loop body which means that they are valid and they have been removed from buddy (for caller's use). Also, because of split_free_page(), all of those pages have been split into 0-order pages. Therefore, in error recovery, to undo what the loop has done so far, we give them back to buddy by calling __free_page() on each 0-order page.
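Put differently, the (corrected) recovery path boils down to something like this hypothetical helper (the name is made up; the body matches the fixed loop):

/* give pages in [start, pfn) - already split to order-0 and taken from
 * buddy - back to the page allocator */
static void release_isolated_range(unsigned long start, unsigned long pfn)
{
	while (start < pfn)
		__free_page(pfn_to_page(start++));
}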
/* Found a free page, break it into order-0 pages */ isolated = split_free_page(page); total_isolated += isolated;
for (i = 0; i < isolated; i++) {
list_add(&page->lru, freelist);
page++;
if (freelist) {
struct page *p = page;
for (i = isolated; i; --i, ++p)
}list_add(&p->lru, freelist);
/* If a page was split, advance to the end of it */
if (isolated) {
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+next:
pfn += isolated;
page += isolated;
The name isolated is now confusing because it can mean either pages isolated or pages scanned depending on context. Your patch appears to be doing a lot more than is necessary to convert isolate_freepages_block into isolate_freepages_range and at this point, it's unclear why you did that.
When CMA uses this function, it requires all pages in the range to be valid and free. (Both conditions should be met but you never know.)
To be clear, I meant that the CMA expects pages to be in buddy when the function is called but after the function finishes, all the pages in the range are removed from buddy. This, among other things, is why the call to split_free_page() is necessary.
It seems racy but I guess you are depending on MIGRATE_ISOLATE to keep things sane which is fine. However, I strongly suspect that if there is a race and a page is in use, then you will need to retry the migration step. Calling __free_page does not look right because something still has a reference to the page.
This change adds a second way isolate_freepages_range() works, which is when freelist is not specified, abort on invalid or non-free page, but continue as usual if freelist is provided.
Ok, I think you should be able to do that by not calling split_free_page or adding to the list if !freelist with a comment explaining why the pages are left on the buddy lists for the caller to figure out. Bail if a page-in-use is found and have the caller check that the return value of isolate_freepages_block == end_pfn - start_pfn.
On Mon, Dec 12, 2011 at 05:46:13PM +0100, Michal Nazarewicz wrote:
On Mon, 12 Dec 2011 17:30:52 +0100, Mel Gorman mel@csn.ul.ie wrote:
On Mon, Dec 12, 2011 at 04:22:39PM +0100, Michal Nazarewicz wrote:
<SNIP>
if (!pfn_valid_within(pfn))
goto skip;
The flow of this function in general with gotos of skipped and next is confusing in comparison to the existing function. For example, if this PFN is not valid, and no freelist is provided, then we call __free_page() on a PFN that is known to be invalid.
++nr_scanned;
if (!PageBuddy(page)) {
+skip:
if (freelist)
goto next;
for (; start < pfn; ++start)
__free_page(pfn_to_page(pfn));
return 0;
}
So if a PFN is valid and !PageBuddy and no freelist is provided, we call __free_page() on it regardless of reference count. That does not sound safe.
Sorry about that. It's a bug in the code which was caught later on. The code should read “__free_page(pfn_to_page(start))”.
On Mon, 12 Dec 2011 17:30:52 +0100, Mel Gorman mel@csn.ul.ie wrote:
That will call free on valid PFNs but why is it safe to call __free_page() at all? You say later that CMA requires that all pages in the range be valid but if the pages are in use, that does not mean that calling __free_page() is safe. I suspect you have not seen a problem because the pages in the range were free as expected and not in use because of MIGRATE_ISOLATE.
All pages from [start, pfn) have passed through the loop body which means that they are valid and they have been removed from buddy (for caller's use). Also, because of split_free_page(), all of those pages have been split into 0-order pages.
Ah, I see. Even though you are not putting the pages on a freelist, the function still returns with an elevated reference count and it's up to the caller to find them again.
Therefore, in error recovery, to undo what the loop has done so far, we give them back to buddy by calling __free_page() on each 0-order page.
Ok.
/* Found a free page, break it into order-0 pages */ isolated = split_free_page(page); total_isolated += isolated;
for (i = 0; i < isolated; i++) {
list_add(&page->lru, freelist);
page++;
if (freelist) {
struct page *p = page;
for (i = isolated; i; --i, ++p)
}list_add(&p->lru, freelist);
/* If a page was split, advance to the end of it */
if (isolated) {
blockpfn += isolated - 1;
cursor += isolated - 1;
}
+next:
pfn += isolated;
page += isolated;
The name isolated is now confusing because it can mean either pages isolated or pages scanned depending on context. Your patch appears to be doing a lot more than is necessary to convert isolate_freepages_block into isolate_freepages_range and at this point, it's unclear why you did that.
When CMA uses this function, it requires all pages in the range to be valid and free. (Both conditions should be met but you never know.)
To be clear, I meant that the CMA expects pages to be in buddy when the function is called but after the function finishes, all the pages in the range are removed from buddy. This, among other things, is why the call to split_free_page() is necessary.
Understood.
From: Michal Nazarewicz mina86@mina86.com
This commit introduces zone_pfn_same_memmap() function which checkes whether two PFNs within the same zone have struct pages within the same memmap. This check is needed because in general pointer arithmetic on struct pages may lead to invalid pointers.
On memory models that are not affected, zone_pfn_same_memmap() is defined as always returning true so the call should be optimised at compile time.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 include/linux/mmzone.h |   16 ++++++++++++++++
 mm/compaction.c        |    5 ++++-
 2 files changed, 20 insertions(+), 1 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 188cb2f..84e07d0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,6 +1166,22 @@ static inline int memmap_valid_within(unsigned long pfn,
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
 
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Both PFNs must be from the same zone! If this function returns
+ * true, pfn_to_page(pfn1) + (pfn2 - pfn1) == pfn_to_page(pfn2).
+ */
+static inline bool zone_pfn_same_memmap(unsigned long pfn1, unsigned long pfn2)
+{
+	return pfn_to_section_nr(pfn1) == pfn_to_section_nr(pfn2);
+}
+
+#else
+
+#define zone_pfn_same_memmap(pfn1, pfn2) (true)
+
+#endif
+
 #endif /* !__GENERATING_BOUNDS.H */
 #endif /* !__ASSEMBLY__ */
 #endif /* _LINUX_MMZONE_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 6afae0e..09c9702 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -111,7 +111,10 @@ skip:
 
 next:
 		pfn += isolated;
-		page += isolated;
+		if (zone_pfn_same_memmap(pfn - isolated, pfn))
+			page += isolated;
+		else
+			page = pfn_to_page(pfn);
 	}
 
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
On Fri, Nov 18, 2011 at 05:43:10PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com
This commit introduces zone_pfn_same_memmap() function which checkes
s/checkes/checks/
whether two PFNs within the same zone have struct pages within the same memmap.
s/memmap/same sparsemem section/
This check is needed because in general pointer arithmetic on struct pages may lead to invalid pointers.
On memory models that are not affected, zone_pfn_same_memmap() is defined as always returning true so the call should be optimised at compile time.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com

 include/linux/mmzone.h |   16 ++++++++++++++++
 mm/compaction.c        |    5 ++++-
 2 files changed, 20 insertions(+), 1 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 188cb2f..84e07d0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,6 +1166,22 @@ static inline int memmap_valid_within(unsigned long pfn,
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
 
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Both PFNs must be from the same zone! If this function returns

from the same sparsemem section, not the same zone.

+ * true, pfn_to_page(pfn1) + (pfn2 - pfn1) == pfn_to_page(pfn2).
+ */
+static inline bool zone_pfn_same_memmap(unsigned long pfn1, unsigned long pfn2)
+{
+	return pfn_to_section_nr(pfn1) == pfn_to_section_nr(pfn2);
+}
+
+#else
+
+#define zone_pfn_same_memmap(pfn1, pfn2) (true)
+
+#endif
+
 #endif /* !__GENERATING_BOUNDS.H */
 #endif /* !__ASSEMBLY__ */
 #endif /* _LINUX_MMZONE_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 6afae0e..09c9702 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -111,7 +111,10 @@ skip:
 
 next:
 		pfn += isolated;
-		page += isolated;
+		if (zone_pfn_same_memmap(pfn - isolated, pfn))
+			page += isolated;
+		else
+			page = pfn_to_page(pfn);
 	}
Is this necessary?
We are isolating pages, the largest of which is a MAX_ORDER_NR_PAGES page. Sections are never smaller than MAX_ORDER_NR_PAGES, so the end of the free range of pages should never be in another section. That should mean that the PFN walk will always consider the first PFN of every section, and you can implement a simpler check than zone_pfn_same_memmap() based on pfn & PAGE_SECTION_MASK and contain it within mm/compaction.c.
That said, everywhere else managed to avoid checks like this by always scanning in units of pageblocks. Maybe this should be structured the same way to guarantee that pfn_valid() is called at least once per pageblock (even though only once per MAX_ORDER_NR_PAGES is necessary).
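For concreteness, the kind of local check suggested here might look roughly like the sketch below; the helper name is made up, PAGE_SECTION_MASK is the existing SPARSEMEM macro, and this is only meant to be kept inside mm/compaction.c.

/* Illustrative sketch of the suggested local check: with SPARSEMEM and
 * no VMEMMAP the memmap is only guaranteed to be contiguous within one
 * section, so re-resolve the struct page whenever the walk crosses a
 * section boundary. */
static inline bool same_section(unsigned long pfn1, unsigned long pfn2)
{
	return (pfn1 & PAGE_SECTION_MASK) == (pfn2 & PAGE_SECTION_MASK);
}

/* ... and inside the free-page scanner loop: */
	pfn += isolated;
	if (same_section(pfn - isolated, pfn))
		page += isolated;
	else
		page = pfn_to_page(pfn);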
On Fri, Nov 18, 2011 at 05:43:10PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com

diff --git a/mm/compaction.c b/mm/compaction.c
index 6afae0e..09c9702 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -111,7 +111,10 @@ skip:
 next:
 		pfn += isolated;
-		page += isolated;
+		if (zone_pfn_same_memmap(pfn - isolated, pfn))
+			page += isolated;
+		else
+			page = pfn_to_page(pfn);
 	}
On Mon, 12 Dec 2011 15:19:53 +0100, Mel Gorman mel@csn.ul.ie wrote:
Is this necessary?
We are isolating pages, the largest of which is a MAX_ORDER_NR_PAGES page. [...]
This is not true for CMA.
That said, everywhere else managed to avoid checks like this by always scanning in units of pageblocks. Maybe this should be structured the same way to guarantee that pfn_valid() is called at least once per pageblock (even though only once per MAX_ORDER_NR_PAGES is necessary).
I'll look into that.
On Mon, Dec 12, 2011 at 03:35:03PM +0100, Michal Nazarewicz wrote:
On Fri, Nov 18, 2011 at 05:43:10PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com

diff --git a/mm/compaction.c b/mm/compaction.c
index 6afae0e..09c9702 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -111,7 +111,10 @@ skip:
 next:
 		pfn += isolated;
-		page += isolated;
+		if (zone_pfn_same_memmap(pfn - isolated, pfn))
+			page += isolated;
+		else
+			page = pfn_to_page(pfn);
 	}
On Mon, 12 Dec 2011 15:19:53 +0100, Mel Gorman mel@csn.ul.ie wrote:
Is this necessary?
We are isolating pages, the largest of which is a MAX_ORDER_NR_PAGES page. [...]
This is not true for CMA.
To be clear, I'm referring to a single page being isolated here. It may or may not be a high-order page but it's still going to be less than MAX_ORDER_NR_PAGES, so you should be able to check when a new block is entered and pfn_to_page() is necessary.
On Fri, Nov 18, 2011 at 05:43:10PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com

diff --git a/mm/compaction.c b/mm/compaction.c
index 6afae0e..09c9702 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -111,7 +111,10 @@ skip:
 next:
 		pfn += isolated;
-		page += isolated;
+		if (zone_pfn_same_memmap(pfn - isolated, pfn))
+			page += isolated;
+		else
+			page = pfn_to_page(pfn);
 	}
On Mon, 12 Dec 2011 15:19:53 +0100, Mel Gorman mel@csn.ul.ie wrote:
Is this necessary?
We are isolating pages, the largest of which is a MAX_ORDER_NR_PAGES page. [...]
On Mon, 12 Dec 2011 15:40:30 +0100, Mel Gorman mel@csn.ul.ie wrote:
To be clear, I'm referring to a single page being isolated here. It may or may not be a high-order page but it's still going to be less than MAX_ORDER_NR_PAGES, so you should be able to check when a new block is entered and pfn_to_page() is necessary.
Do you mean something like:
if (same pageblock) just do arithmetic; else use pfn_to_page;
?
I've discussed it with Dave and he suggested that approach as an optimisation, since in some configurations zone_pfn_same_memmap() is always true and thus the compiler will strip the else part, whereas a same-pageblock test will be false on occasion regardless of kernel configuration.
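A toy, userspace illustration of that argument (not kernel code; the constant predicate below stands in for a flat-memmap configuration, and PAGEBLOCK_NR_PAGES is just a stand-in value):

/* Toy userspace demo: when the "same memmap" predicate is a compile-time
 * constant the compiler can delete the slow branch entirely, while a
 * pageblock-boundary test remains a real runtime branch on every
 * configuration. */
#include <stdbool.h>
#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* stand-in value */

static inline bool same_memmap(unsigned long a, unsigned long b)
{
	(void)a; (void)b;
	return true;	/* flat memmap: always true, the else path is dead code */
}

static inline bool same_pageblock(unsigned long a, unsigned long b)
{
	return a / PAGEBLOCK_NR_PAGES == b / PAGEBLOCK_NR_PAGES;
}

int main(void)
{
	unsigned long pfn, slow_a = 0, slow_b = 0;

	for (pfn = 1; pfn < 4096; pfn++) {
		if (!same_memmap(pfn - 1, pfn))
			slow_a++;	/* never taken, can be optimised out */
		if (!same_pageblock(pfn - 1, pfn))
			slow_b++;	/* taken once per pageblock boundary */
	}
	printf("slow path A: %lu, slow path B: %lu\n", slow_a, slow_b);
	return 0;
}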
On Mon, Dec 12, 2011 at 03:51:55PM +0100, Michal Nazarewicz wrote:
On Fri, Nov 18, 2011 at 05:43:10PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com

diff --git a/mm/compaction.c b/mm/compaction.c
index 6afae0e..09c9702 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -111,7 +111,10 @@ skip:
 next:
 		pfn += isolated;
-		page += isolated;
+		if (zone_pfn_same_memmap(pfn - isolated, pfn))
+			page += isolated;
+		else
+			page = pfn_to_page(pfn);
 	}
On Mon, 12 Dec 2011 15:19:53 +0100, Mel Gorman mel@csn.ul.ie wrote:
Is this necessary?
We are isolating pages, the largest of which is a MAX_ORDER_NR_PAGES page. [...]
On Mon, 12 Dec 2011 15:40:30 +0100, Mel Gorman mel@csn.ul.ie wrote:
To be clear, I'm referring to a single page being isolated here. It may or may not be a high-order page but it's still going to be less than MAX_ORDER_NR_PAGES, so you should be able to check when a new block is entered and pfn_to_page() is necessary.
Do you mean something like:
if (same pageblock) just do arithmetic; else use pfn_to_page;
?

Something like the following untested snippet:

	/*
	 * Resolve pfn_to_page every MAX_ORDER_NR_PAGES to handle the case
	 * where memmap is not contiguous such as with SPARSEMEM memory
	 * model without VMEMMAP
	 */
	pfn += isolated;
	page += isolated;
	if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0)
		page = pfn_to_page(pfn);

That would be closer to what other PFN walkers do.
I've discussed it with Dave and he suggested that approach as an optimisation, since in some configurations zone_pfn_same_memmap() is always true and thus the compiler will strip the else part, whereas a same-pageblock test will be false on occasion regardless of kernel configuration.
Ok, while I recognise it's an optimisation, it's a very small optimisation, and I'm not keen on introducing something new for CMA when this has been coped with in the past by always walking PFNs in pageblock-sized ranges, with pfn_valid() checks where necessary.
See setup_zone_migrate_reserve() as one example: it calls pfn_to_page() only once per pageblock and uses pageblock_is_reserved() to examine the pages within a pageblock. Still, if you really want the helper, at least keep it in compaction.c, as there should be no need to have it in mmzone.h.
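Condensed, the pattern being pointed to looks roughly like this (function and variable names are illustrative; see setup_zone_migrate_reserve() in mm/page_alloc.c for the real thing):

/* Illustrative pageblock-stride walk in the style of
 * setup_zone_migrate_reserve(): pfn_to_page() is resolved only once per
 * pageblock and pages inside the block are examined via helpers rather
 * than by struct page pointer arithmetic across blocks. */
static void walk_zone_pageblocks(struct zone *zone)
{
	unsigned long pfn, start_pfn, end_pfn;
	struct page *page;

	start_pfn = zone->zone_start_pfn;
	end_pfn = start_pfn + zone->spanned_pages;

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);	/* once per pageblock */

		if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE)
			continue;

		/* ... examine pages within the block via helpers ... */
	}
}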
From: Michal Nazarewicz mina86@mina86.com
This commit exports some of the functions from the compaction.c file by adding their declarations to the internal.h header file, so that other mm-related code can use them.
This forces compaction.c to always be compiled (as opposed to being compiled only if CONFIG_COMPACTION is defined), but to avoid introducing code that the user did not ask for, part of compaction.c is now wrapped in an #ifdef.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 mm/Makefile     |    3 +-
 mm/compaction.c |  112 +++++++++++++++++++++++--------------------------------
 mm/internal.h   |   35 +++++++++++++++++
 3 files changed, 83 insertions(+), 67 deletions(-)
diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..24ed801 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \ - $(mmu-y) + $(mmu-y) compaction.o obj-y += init-mm.o
ifdef CONFIG_NO_BOOTMEM @@ -32,7 +32,6 @@ obj-$(CONFIG_NUMA) += mempolicy.o obj-$(CONFIG_SPARSEMEM) += sparse.o obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_SLOB) += slob.o -obj-$(CONFIG_COMPACTION) += compaction.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o diff --git a/mm/compaction.c b/mm/compaction.c index 09c9702..e71ceaf 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -19,28 +19,7 @@ #define CREATE_TRACE_POINTS #include <trace/events/compaction.h>
-/* - * compact_control is used to track pages being migrated and the free pages - * they are being migrated to during memory compaction. The free_pfn starts - * at the end of a zone and migrate_pfn begins at the start. Movable pages - * are moved to the end of a zone during a compaction run and the run - * completes when free_pfn <= migrate_pfn - */ -struct compact_control { - struct list_head freepages; /* List of free pages to migrate to */ - struct list_head migratepages; /* List of pages being migrated */ - unsigned long nr_freepages; /* Number of isolated free pages */ - unsigned long nr_migratepages; /* Number of pages to migrate */ - unsigned long free_pfn; /* isolate_freepages search base */ - unsigned long migrate_pfn; /* isolate_migratepages search base */ - bool sync; /* Synchronous migration */ - - unsigned int order; /* order a direct compactor needs */ - int migratetype; /* MOVABLE, RECLAIMABLE etc */ - struct zone *zone; -}; - -static unsigned long release_freepages(struct list_head *freelist) +unsigned long release_freepages(struct list_head *freelist) { struct page *page, *next; unsigned long count = 0; @@ -71,7 +50,7 @@ static unsigned long release_freepages(struct list_head *freelist) * Returns number of isolated pages. This may be more then end-start * if end fell in a middle of a free page. */ -static unsigned long +unsigned long isolate_freepages_range(struct zone *zone, unsigned long start, unsigned long end, struct list_head *freelist) @@ -263,13 +242,6 @@ static bool too_many_isolated(struct zone *zone) return isolated > (inactive + active) / 2; }
-/* possible outcome of isolate_migratepages */ -typedef enum { - ISOLATE_ABORT, /* Abort compaction now */ - ISOLATE_NONE, /* No pages isolated, continue scanning */ - ISOLATE_SUCCESS, /* Pages isolated, migrate */ -} isolate_migrate_t; - /** * isolate_migratepages_range() - isolate all migrate-able pages in range. * @zone: Zone pages are in. @@ -289,7 +261,7 @@ typedef enum { * does not modify any cc's fields, ie. it does not modify (or read * for that matter) cc->migrate_pfn. */ -static unsigned long +unsigned long isolate_migratepages_range(struct zone *zone, struct compact_control *cc, unsigned long low_pfn, unsigned long end_pfn) { @@ -404,43 +376,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, }
/* - * Isolate all pages that can be migrated from the block pointed to by - * the migrate scanner within compact_control. - */ -static isolate_migrate_t isolate_migratepages(struct zone *zone, - struct compact_control *cc) -{ - unsigned long low_pfn, end_pfn; - - /* Do not scan outside zone boundaries */ - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn); - - /* Only scan within a pageblock boundary */ - end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages); - - /* Do not cross the free scanner or scan within a memory hole */ - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) { - cc->migrate_pfn = end_pfn; - return ISOLATE_NONE; - } - - /* Perform the isolation */ - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn); - if (!low_pfn) - return ISOLATE_ABORT; - - cc->migrate_pfn = low_pfn; - - return ISOLATE_SUCCESS; -} - -/* * This is a migrate-callback that "allocates" freepages by taking pages * from the isolated freelists in the block we are migrating to. */ -static struct page *compaction_alloc(struct page *migratepage, - unsigned long data, - int **result) +struct page *compaction_alloc(struct page *migratepage, unsigned long data, + int **result) { struct compact_control *cc = (struct compact_control *)data; struct page *freepage; @@ -460,6 +400,8 @@ static struct page *compaction_alloc(struct page *migratepage, return freepage; }
+#ifdef CONFIG_COMPACTION + /* * We cannot control nr_migratepages and nr_freepages fully when migration is * running as migrate_pages() has no knowledge of compact_control. When @@ -480,6 +422,44 @@ static void update_nr_listpages(struct compact_control *cc) cc->nr_freepages = nr_freepages; }
+/* possible outcome of isolate_migratepages */ +typedef enum { + ISOLATE_ABORT, /* Abort compaction now */ + ISOLATE_NONE, /* No pages isolated, continue scanning */ + ISOLATE_SUCCESS, /* Pages isolated, migrate */ +} isolate_migrate_t; + +/* + * Isolate all pages that can be migrated from the block pointed to by + * the migrate scanner within compact_control. + */ +static isolate_migrate_t isolate_migratepages(struct zone *zone, + struct compact_control *cc) +{ + unsigned long low_pfn, end_pfn; + + /* Do not scan outside zone boundaries */ + low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn); + + /* Only scan within a pageblock boundary */ + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages); + + /* Do not cross the free scanner or scan within a memory hole */ + if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) { + cc->migrate_pfn = end_pfn; + return ISOLATE_NONE; + } + + /* Perform the isolation */ + low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn); + if (!low_pfn) + return ISOLATE_ABORT; + + cc->migrate_pfn = low_pfn; + + return ISOLATE_SUCCESS; +} + static int compact_finished(struct zone *zone, struct compact_control *cc) { @@ -796,3 +776,5 @@ void compaction_unregister_node(struct node *node) return sysdev_remove_file(&node->sysdev, &attr_compact); } #endif /* CONFIG_SYSFS && CONFIG_NUMA */ + +#endif /* CONFIG_COMPACTION */ diff --git a/mm/internal.h b/mm/internal.h index 2189af4..67ec1d3 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -100,6 +100,41 @@ extern void prep_compound_page(struct page *page, unsigned long order); extern bool is_free_buddy_page(struct page *page); #endif
+/* + * in mm/compaction.c + */ +/* + * compact_control is used to track pages being migrated and the free pages + * they are being migrated to during memory compaction. The free_pfn starts + * at the end of a zone and migrate_pfn begins at the start. Movable pages + * are moved to the end of a zone during a compaction run and the run + * completes when free_pfn <= migrate_pfn + */ +struct compact_control { + struct list_head freepages; /* List of free pages to migrate to */ + struct list_head migratepages; /* List of pages being migrated */ + unsigned long nr_freepages; /* Number of isolated free pages */ + unsigned long nr_migratepages; /* Number of pages to migrate */ + unsigned long free_pfn; /* isolate_freepages search base */ + unsigned long migrate_pfn; /* isolate_migratepages search base */ + bool sync; /* Synchronous migration */ + + unsigned int order; /* order a direct compactor needs */ + int migratetype; /* MOVABLE, RECLAIMABLE etc */ + struct zone *zone; +}; + +unsigned long release_freepages(struct list_head *freelist); +unsigned long +isolate_freepages_range(struct zone *zone, + unsigned long start, unsigned long end, + struct list_head *freelist); +unsigned long +isolate_migratepages_range(struct zone *zone, struct compact_control *cc, + unsigned long low_pfn, unsigned long end_pfn); +struct page *compaction_alloc(struct page *migratepage, unsigned long data, + int **result); +
/* * function for dealing with page's order in buddy system.
On Fri, Nov 18, 2011 at 05:43:11PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com
This commit exports some of the functions from the compaction.c file by adding their declarations to the internal.h header file, so that other mm-related code can use them.
This forces compaction.c to always be compiled (as opposed to being compiled only if CONFIG_COMPACTION is defined), but to avoid introducing code that the user did not ask for, part of compaction.c is now wrapped in an #ifdef.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com

 mm/Makefile     |    3 +-
 mm/compaction.c |  112 +++++++++++++++++++++++--------------------------------
 mm/internal.h   |   35 +++++++++++++++++
 3 files changed, 83 insertions(+), 67 deletions(-)
diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..24ed801 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \
$(mmu-y)
$(mmu-y) compaction.o
That should be
compaction.o $(mmu-y)
for consistency.
Overall, this patch implies that CMA is always compiled in. Why not just make CMA depend on COMPACTION to keep things simpler? For example, if you enable CMA and do not enable COMPACTION, you lose things like the vmstat counters that can aid debugging. In fact, as parts of compaction.c are using defines like COMPACTBLOCKS, I'm not even sure compaction.c can compile without CONFIG_COMPACTION because of the vmstat stuff.
On Mon, 12 Dec 2011 15:29:07 +0100, Mel Gorman mel@csn.ul.ie wrote:
On Fri, Nov 18, 2011 at 05:43:11PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com
This commit exports some of the functions from the compaction.c file by adding their declarations to the internal.h header file, so that other mm-related code can use them.
This forces compaction.c to always be compiled (as opposed to being compiled only if CONFIG_COMPACTION is defined), but to avoid introducing code that the user did not ask for, part of compaction.c is now wrapped in an #ifdef.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com

 mm/Makefile     |    3 +-
 mm/compaction.c |  112 +++++++++++++++++++++++--------------------------------
 mm/internal.h   |   35 +++++++++++++++++
 3 files changed, 83 insertions(+), 67 deletions(-)
diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..24ed801 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \
$(mmu-y)
$(mmu-y) compaction.o
That should be
compaction.o $(mmu-y)
for consistency.
Overall, this patch implies that CMA is always compiled in.
Not really. But yes, it produces some bloat when neither CMA nor compaction is compiled in. I assume that the linker will be able to deal with that (since the functions are not EXPORT_SYMBOL'ed).
Note also that the vast majority of compaction.c is #ifdef'd out, though, so only a handful of functions are compiled.
Why not just make CMA depend on COMPACTION to keep things simpler?
I could imagine that someone would want to have CMA but not compaction, hence I decided to give that choice.
For example, if you enable CMA and do not enable COMPACTION, you lose things like the vmstat counters that can aid debugging. In fact, as parts of compaction.c are using defines like COMPACTBLOCKS, I'm not even sure compaction.c can compile without CONFIG_COMPACTION because of the vmstat stuff.
On Mon, Dec 12, 2011 at 03:41:04PM +0100, Michal Nazarewicz wrote:
On Mon, 12 Dec 2011 15:29:07 +0100, Mel Gorman mel@csn.ul.ie wrote:
On Fri, Nov 18, 2011 at 05:43:11PM +0100, Marek Szyprowski wrote:
From: Michal Nazarewicz mina86@mina86.com
This commit exports some of the functions from the compaction.c file by adding their declarations to the internal.h header file, so that other mm-related code can use them.
This forces compaction.c to always be compiled (as opposed to being compiled only if CONFIG_COMPACTION is defined), but to avoid introducing code that the user did not ask for, part of compaction.c is now wrapped in an #ifdef.
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com

 mm/Makefile     |    3 +-
 mm/compaction.c |  112 +++++++++++++++++++++++--------------------------------
 mm/internal.h   |   35 +++++++++++++++++
 3 files changed, 83 insertions(+), 67 deletions(-)
diff --git a/mm/Makefile b/mm/Makefile index 50ec00e..24ed801 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -13,7 +13,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o mmu_context.o percpu.o \
$(mmu-y)
$(mmu-y) compaction.o
That should be
compaction.o $(mmu-y)
for consistency.
Overall, this patch implies that CMA is always compiled in.
Not really. But yes, it produces some bloat when neither CMA nor compaction is compiled in. I assume that the linker will be able to deal with that (since the functions are not EXPORT_SYMBOL'ed).
The bloat exists either way. I don't believe the linker strips it out so overall it would make more sense to depend on compaction to keep the vmstat counters for debugging reasons if nothing else. It's not something I feel very strongly about though.
On Mon, 12 Dec 2011 15:29:07 +0100, Mel Gorman mel@csn.ul.ie wrote:
Overall, this patch implies that CMA is always compiled in.
On Mon, Dec 12, 2011 at 03:41:04PM +0100, Michal Nazarewicz wrote:
Not really. But yes, it produces some bloat when neither CMA nor compaction is compiled in. I assume that the linker will be able to deal with that (since the functions are not EXPORT_SYMBOL'ed).
On Mon, 12 Dec 2011 16:40:15 +0100, Mel Gorman mel@csn.ul.ie wrote:
The bloat exists either way. I don't believe the linker strips it out so overall it would make more sense to depend on compaction to keep the vmstat counters for debugging reasons if nothing else. It's not something I feel very strongly about though.
OK, I'll do that then.
On Monday 12 December 2011, Mel Gorman wrote:
The bloat exists either way. I don't believe the linker strips it out so overall it would make more sense to depend on compaction to keep the vmstat counters for debugging reasons if nothing else. It's not something I feel very strongly about though.
There were some previous attempts to use -ffunction-sections together with the linker's --gc-sections to strip unused functions out of the kernel, but I think they were never merged because of regressions.
Arnd
From: Michal Nazarewicz mina86@mina86.com
This commit adds the alloc_contig_range() function, which tries to allocate a given range of pages. It migrates any already-allocated pages that fall within the range, thus freeing them. Once all pages in the range are free, they are removed from the buddy system and thus allocated for the caller to use.
__alloc_contig_migrate_range() borrows some code from KAMEZAWA Hiroyuki's __alloc_contig_pages().
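A minimal caller sketch for the interface added here; the wrapper and the PFN values are made up, and note that a later patch in this series adds a migratetype argument to alloc_contig_range(), so the final signature differs.

#include <linux/page-isolation.h>

/* Illustrative caller of the interface added by this patch. */
static int grab_contig_chunk(unsigned long base_pfn)
{
	unsigned long nr_pages = 4096;	/* 16 MiB with 4 KiB pages */
	int ret;

	/* Both ends must lie in one zone, and nobody else may be
	 * changing the migratetype of the pageblocks in the range. */
	ret = alloc_contig_range(base_pfn, base_pfn + nr_pages);
	if (ret)
		return ret;		/* e.g. -EBUSY or -EINTR */

	/* ... use pfn_to_page(base_pfn) as a physically contiguous
	 * buffer of nr_pages pages ... */

	free_contig_range(base_pfn, nr_pages);
	return 0;
}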
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 include/linux/page-isolation.h |    3 +
 mm/page_alloc.c                |  223 ++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+), 0 deletions(-)
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 051c1b1..d305080 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -33,5 +33,8 @@ test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn); extern int set_migratetype_isolate(struct page *page); extern void unset_migratetype_isolate(struct page *page);
+/* The below functions must be run on a range from a single zone. */ +int alloc_contig_range(unsigned long start, unsigned long end); +void free_contig_range(unsigned long pfn, unsigned nr_pages);
#endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 58d1a2e..b7aac26 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -57,6 +57,7 @@ #include <linux/ftrace_event.h> #include <linux/memcontrol.h> #include <linux/prefetch.h> +#include <linux/migrate.h>
#include <asm/tlbflush.h> #include <asm/div64.h> @@ -5687,6 +5688,228 @@ out: spin_unlock_irqrestore(&zone->lock, flags); }
+static unsigned long pfn_align_to_maxpage_down(unsigned long pfn) +{ + return pfn & ~(MAX_ORDER_NR_PAGES - 1); +} + +static unsigned long pfn_align_to_maxpage_up(unsigned long pfn) +{ + return ALIGN(pfn, MAX_ORDER_NR_PAGES); +} + +static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) +{ + /* This function is based on compact_zone() from compaction.c. */ + + unsigned long pfn = start; + int ret = -EBUSY; + unsigned tries = 0; + + struct compact_control cc = { + .nr_migratepages = 0, + .nr_freepages = 0, + .order = -1, + .zone = page_zone(pfn_to_page(start)), + .sync = true, + .migrate_pfn = 0, + }; + INIT_LIST_HEAD(&cc.freepages); + INIT_LIST_HEAD(&cc.migratepages); + + /* + * Compaction code uses two pointers, a "migrate_pfn" and + * "free_pfn". The first one goes from the beginning, and the + * second one from the end of the zone. Once they meet, + * compaction stops. + * + * This is good behaviour if you want to migrate all pages to + * the end of the zone put that's not what we want. We want + * to migrate pages from [start, end) *anywhere* inside the + * zone. To do that, we set "migrate_pfn" to the beginning of + * the zone and never touch it again so compaction will finish + * only when the "free_pfn" pointer reaches beginning of the + * zone. + */ + + cc.migrate_pfn = cc.zone->zone_start_pfn; + cc.free_pfn = cc.migrate_pfn + cc.zone->spanned_pages; + cc.free_pfn &= ~(pageblock_nr_pages-1); + + migrate_prep_local(); + + while (pfn < end || cc.nr_migratepages) { + /* Abort on signal */ + if (fatal_signal_pending(current)) { + ret = -EINTR; + goto done; + } + + /* Are there any pages left? */ + if (cc.nr_freepages && cc.free_pfn <= cc.migrate_pfn) { + /* We've run out of pages to migrate to. */ + ret = -EBUSY; + goto done; + } + + /* Get some pages to migrate. */ + if (list_empty(&cc.migratepages)) { + cc.nr_migratepages = 0; + pfn = isolate_migratepages_range(cc.zone, &cc, + pfn, end); + if (!pfn) { + ret = -EINTR; + goto done; + } + tries = 0; + } + + /* Try to migrate. */ + ret = migrate_pages(&cc.migratepages, compaction_alloc, + (unsigned long)&cc, false, cc.sync); + + /* Migrated all of them? Great! */ + if (list_empty(&cc.migratepages)) + continue; + + /* Try five times. */ + if (++tries == 5) { + ret = ret < 0 ? ret : -EBUSY; + goto done; + } + + /* Before each time drain everything and reschedule. */ + lru_add_drain_all(); + drain_all_pages(); + cond_resched(); + } + ret = 0; + +done: + /* Make sure all pages are isolated. */ + if (!ret) { + lru_add_drain_all(); + drain_all_pages(); + if (WARN_ON(test_pages_isolated(start, end))) + ret = -EBUSY; + } + + /* Release pages */ + putback_lru_pages(&cc.migratepages); + cc.nr_freepages -= release_freepages(&cc.freepages); + VM_BUG_ON(cc.nr_freepages != 0); + + return ret; +} + +/** + * alloc_contig_range() -- tries to allocate given range of pages + * @start: start PFN to allocate + * @end: one-past-the-last PFN to allocate + * + * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES + * aligned, hovewer it's callers responsibility to guarantee that we + * are the only thread that changes migrate type of pageblocks the + * pages fall in. + * + * Returns zero on success or negative error code. On success all + * pages which PFN is in (start, end) are allocated for the caller and + * need to be freed with free_contig_range(). 
+ */ +int alloc_contig_range(unsigned long start, unsigned long end) +{ + unsigned long outer_start, outer_end; + int ret; + + /* + * What we do here is we mark all pageblocks in range as + * MIGRATE_ISOLATE. Because of the way page allocator work, we + * align the range to MAX_ORDER pages so that page allocator + * won't try to merge buddies from different pageblocks and + * change MIGRATE_ISOLATE to some other migration type. + * + * Once the pageblocks are marked as MIGRATE_ISOLATE, we + * migrate the pages from an unaligned range (ie. pages that + * we are interested in). This will put all the pages in + * range back to page allocator as MIGRATE_ISOLATE. + * + * When this is done, we take the pages in range from page + * allocator removing them from the buddy system. This way + * page allocator will never consider using them. + * + * This lets us mark the pageblocks back as + * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the + * MAX_ORDER aligned range but not in the unaligned, original + * range are put back to page allocator so that buddy can use + * them. + */ + + ret = start_isolate_page_range(pfn_align_to_maxpage_down(start), + pfn_align_to_maxpage_up(end)); + if (ret) + goto done; + + ret = __alloc_contig_migrate_range(start, end); + if (ret) + goto done; + + /* + * Pages from [start, end) are within a MAX_ORDER_NR_PAGES + * aligned blocks that are marked as MIGRATE_ISOLATE. What's + * more, all pages in [start, end) are free in page allocator. + * What we are going to do is to allocate all pages from + * [start, end) (that is remove them from page allocater). + * + * The only problem is that pages at the beginning and at the + * end of interesting range may be not aligned with pages that + * page allocator holds, ie. they can be part of higher order + * pages. Because of this, we reserve the bigger range and + * once this is done free the pages we are not interested in. + */ + + ret = 0; + while (!PageBuddy(pfn_to_page(start & (~0UL << ret)))) + if (WARN_ON(++ret >= MAX_ORDER)) { + ret = -EINVAL; + goto done; + } + + outer_start = start & (~0UL << ret); + outer_end = isolate_freepages_range(page_zone(pfn_to_page(outer_start)), + outer_start, end, NULL); + if (!outer_end) { + ret = -EBUSY; + goto done; + } + outer_end += outer_start; + + /* Free head and tail (if any) */ + if (start != outer_start) + free_contig_range(outer_start, start - outer_start); + if (end != outer_end) + free_contig_range(end, outer_end - end); + + ret = 0; +done: + undo_isolate_page_range(pfn_align_to_maxpage_down(start), + pfn_align_to_maxpage_up(end)); + return ret; +} + +void free_contig_range(unsigned long pfn, unsigned nr_pages) +{ + struct page *page = pfn_to_page(pfn); + + while (nr_pages--) { + __free_page(page); + ++pfn; + if (likely(zone_pfn_same_memmap(pfn - 1, pfn))) + ++page; + else + page = pfn_to_page(pfn); + } +} + #ifdef CONFIG_MEMORY_HOTREMOVE /* * All pages in the range must be isolated before calling this.
From: Michal Nazarewicz mina86@mina86.com
The MIGRATE_CMA migration type has two main characteristics: (i) only movable pages can be allocated from MIGRATE_CMA pageblocks, and (ii) the page allocator will never change the migration type of MIGRATE_CMA pageblocks.
This guarantees (to some degree) that a page in a MIGRATE_CMA pageblock can always be migrated somewhere else (unless there's no memory left in the system).
It is designed to be used for allocating big chunks (e.g. 10 MiB) of physically contiguous memory. Once a driver requests contiguous memory, pages from MIGRATE_CMA pageblocks may be migrated away to create a contiguous block.
To minimise the number of migrations, the MIGRATE_CMA migration type is the last type tried when the page allocator falls back to migration types other than the one requested.
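Restated in code, the two rules above boil down to something like the helper below, built on the is_migrate_cma() macro this patch introduces; the wrapper itself is illustrative, not part of the patch.

/* Illustrative only: may an allocation of 'alloc_mt' be satisfied from
 * a pageblock whose migratetype is 'block_mt' under the CMA rules? */
static bool may_use_pageblock(int alloc_mt, int block_mt)
{
	if (!is_migrate_cma(block_mt))
		return true;		/* normal fallback rules apply */

	/* Rule (i): only movable allocations may use MIGRATE_CMA blocks.
	 * Rule (ii) -- never changing a CMA block's migratetype -- is
	 * enforced separately in __rmqueue_fallback() below. */
	return alloc_mt == MIGRATE_MOVABLE;
}

This matches the fallbacks[] change in the diff below, where MIGRATE_CMA comes after MIGRATE_RECLAIMABLE and MIGRATE_UNMOVABLE in the MIGRATE_MOVABLE row, i.e. it really is the last resort before MIGRATE_RESERVE.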
Signed-off-by: Michal Nazarewicz mina86@mina86.com
[m.szyprowski: removed CONFIG_CMA_MIGRATE_TYPE]
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
---
 include/linux/mmzone.h         |   41 ++++++++++++++++----
 include/linux/page-isolation.h |    3 ++
 mm/Kconfig                     |    2 +-
 mm/compaction.c                |   11 +++++--
 mm/page_alloc.c                |   68 ++++++++++++++++++++++++++++---------
 5 files changed, 98 insertions(+), 27 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 84e07d0..604a235 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -35,13 +35,35 @@ */ #define PAGE_ALLOC_COSTLY_ORDER 3
-#define MIGRATE_UNMOVABLE 0 -#define MIGRATE_RECLAIMABLE 1 -#define MIGRATE_MOVABLE 2 -#define MIGRATE_PCPTYPES 3 /* the number of types on the pcp lists */ -#define MIGRATE_RESERVE 3 -#define MIGRATE_ISOLATE 4 /* can't allocate from here */ -#define MIGRATE_TYPES 5 +enum { + MIGRATE_UNMOVABLE, + MIGRATE_RECLAIMABLE, + MIGRATE_MOVABLE, + MIGRATE_PCPTYPES, /* the number of types on the pcp lists */ + MIGRATE_RESERVE = MIGRATE_PCPTYPES, + /* + * MIGRATE_CMA migration type is designed to mimic the way + * ZONE_MOVABLE works. Only movable pages can be allocated + * from MIGRATE_CMA pageblocks and page allocator never + * implicitly change migration type of MIGRATE_CMA pageblock. + * + * The way to use it is to change migratetype of a range of + * pageblocks to MIGRATE_CMA which can be done by + * __free_pageblock_cma() function. What is important though + * is that a range of pageblocks must be aligned to + * MAX_ORDER_NR_PAGES should biggest page be bigger then + * a single pageblock. + */ + MIGRATE_CMA, + MIGRATE_ISOLATE, /* can't allocate from here */ + MIGRATE_TYPES +}; + +#ifdef CONFIG_CMA +# define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA) +#else +# define is_migrate_cma(migratetype) false +#endif
#define for_each_migratetype_order(order, type) \ for (order = 0; order < MAX_ORDER; order++) \ @@ -54,6 +76,11 @@ static inline int get_pageblock_migratetype(struct page *page) return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end); }
+static inline bool is_pageblock_cma(struct page *page) +{ + return is_migrate_cma(get_pageblock_migratetype(page)); +} + struct free_area { struct list_head free_list[MIGRATE_TYPES]; unsigned long nr_free; diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index d305080..af650db 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -37,4 +37,7 @@ extern void unset_migratetype_isolate(struct page *page); int alloc_contig_range(unsigned long start, unsigned long end); void free_contig_range(unsigned long pfn, unsigned nr_pages);
+/* CMA stuff */ +extern void init_cma_reserved_pageblock(struct page *page); + #endif diff --git a/mm/Kconfig b/mm/Kconfig index 011b110..e080cac 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -192,7 +192,7 @@ config COMPACTION config MIGRATION bool "Page migration" def_bool y - depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION + depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA help Allows the migration of the physical location of pages of processes while the virtual addresses are not changed. This is useful in diff --git a/mm/compaction.c b/mm/compaction.c index e71ceaf..3e07341 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -19,6 +19,11 @@ #define CREATE_TRACE_POINTS #include <trace/events/compaction.h>
+static inline bool is_migrate_cma_or_movable(int migratetype) +{ + return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE; +} + unsigned long release_freepages(struct list_head *freelist) { struct page *page, *next; @@ -114,8 +119,8 @@ static bool suitable_migration_target(struct page *page) if (PageBuddy(page) && page_order(page) >= pageblock_order) return true;
- /* If the block is MIGRATE_MOVABLE, allow migration */ - if (migratetype == MIGRATE_MOVABLE) + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ + if (is_migrate_cma_or_movable(migratetype)) return true;
/* Otherwise skip the block */ @@ -324,7 +329,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc, */ pageblock_nr = low_pfn >> pageblock_order; if (!cc->sync && last_pageblock_nr != pageblock_nr && - get_pageblock_migratetype(page) != MIGRATE_MOVABLE) { + is_migrate_cma_or_movable(get_pageblock_migratetype(page))) { low_pfn += pageblock_nr_pages; low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1; last_pageblock_nr = pageblock_nr; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b7aac26..9ec73f4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -733,6 +733,26 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order) } }
+#ifdef CONFIG_CMA +/* + * Free whole pageblock and set it's migration type to MIGRATE_CMA. + */ +void __init init_cma_reserved_pageblock(struct page *page) +{ + unsigned i = pageblock_nr_pages; + struct page *p = page; + + do { + __ClearPageReserved(p); + set_page_count(p, 0); + } while (++p, --i); + + set_page_refcounted(page); + set_pageblock_migratetype(page, MIGRATE_CMA); + __free_pages(page, pageblock_order); + totalram_pages += pageblock_nr_pages; +} +#endif
/* * The order of subdivision here is critical for the IO subsystem. @@ -841,11 +861,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, * This array describes the order lists are fallen back to when * the free lists for the desirable migrate type are depleted */ -static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = { +static int fallbacks[MIGRATE_PCPTYPES][4] = { [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, - [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE }, - [MIGRATE_RESERVE] = { MIGRATE_RESERVE, MIGRATE_RESERVE, MIGRATE_RESERVE }, /* Never used */ + [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_CMA , MIGRATE_RESERVE }, };
/* @@ -940,12 +959,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) /* Find the largest possible block of pages in the other list */ for (current_order = MAX_ORDER-1; current_order >= order; --current_order) { - for (i = 0; i < MIGRATE_TYPES - 1; i++) { + for (i = 0; i < ARRAY_SIZE(fallbacks[0]); i++) { migratetype = fallbacks[start_migratetype][i];
/* MIGRATE_RESERVE handled later if necessary */ if (migratetype == MIGRATE_RESERVE) - continue; + break;
area = &(zone->free_area[current_order]); if (list_empty(&area->free_list[migratetype])) @@ -960,11 +979,18 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) * pages to the preferred allocation list. If falling * back for a reclaimable kernel allocation, be more * aggressive about taking ownership of free pages + * + * On the other hand, never change migration + * type of MIGRATE_CMA pageblocks nor move CMA + * pages on different free lists. We don't + * want unmovable pages to be allocated from + * MIGRATE_CMA areas. */ - if (unlikely(current_order >= (pageblock_order >> 1)) || - start_migratetype == MIGRATE_RECLAIMABLE || - page_group_by_mobility_disabled) { - unsigned long pages; + if (!is_pageblock_cma(page) && + (unlikely(current_order >= pageblock_order / 2) || + start_migratetype == MIGRATE_RECLAIMABLE || + page_group_by_mobility_disabled)) { + int pages; pages = move_freepages_block(zone, page, start_migratetype);
@@ -982,11 +1008,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) rmv_page_order(page);
/* Take ownership for orders >= pageblock_order */ - if (current_order >= pageblock_order) + if (current_order >= pageblock_order && + !is_pageblock_cma(page)) change_pageblock_range(page, current_order, start_migratetype);
- expand(zone, page, order, current_order, area, migratetype); + expand(zone, page, order, current_order, area, + is_migrate_cma(start_migratetype) + ? start_migratetype : migratetype);
trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, migratetype); @@ -1058,7 +1087,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, list_add(&page->lru, list); else list_add_tail(&page->lru, list); - set_page_private(page, migratetype); + if (is_pageblock_cma(page)) + set_page_private(page, MIGRATE_CMA); + else + set_page_private(page, migratetype); list = &page->lru; } __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); @@ -1289,8 +1321,12 @@ int split_free_page(struct page *page)
if (order >= pageblock_order - 1) { struct page *endpage = page + (1 << order) - 1; - for (; page < endpage; page += pageblock_nr_pages) - set_pageblock_migratetype(page, MIGRATE_MOVABLE); + for (; page < endpage; page += pageblock_nr_pages) { + int mt = get_pageblock_migratetype(page); + if (mt != MIGRATE_ISOLATE && !is_migrate_cma(mt)) + set_pageblock_migratetype(page, + MIGRATE_MOVABLE); + } }
return 1 << order; @@ -5574,8 +5610,8 @@ __count_immobile_pages(struct zone *zone, struct page *page, int count) */ if (zone_idx(zone) == ZONE_MOVABLE) return true; - - if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE) + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE || + is_pageblock_cma(page)) return true;
pfn = page_to_pfn(page);
From: Michal Nazarewicz mina86@mina86.com
This commit changes various functions that switch pages and pageblocks between the MIGRATE_ISOLATE and MIGRATE_MOVABLE migrate types so that they can also work with the MIGRATE_CMA migrate type.
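For orientation, the call pattern that the new migratetype argument enables looks roughly like this (mirroring what alloc_contig_range() does for CMA; the wrapper and error handling are condensed for illustration):

/* Illustrative, condensed CMA-side usage of the reworked API. */
static int isolate_cma_range(unsigned long start_pfn, unsigned long end_pfn)
{
	int ret;

	/* Pageblocks are switched to MIGRATE_ISOLATE; MIGRATE_CMA is what
	 * they are restored to on error and by the undo call below. */
	ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_CMA);
	if (ret)
		return ret;

	/* ... migrate movable pages away, pull free pages out of buddy ... */

	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_CMA);
	return 0;
}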
Signed-off-by: Michal Nazarewicz mina86@mina86.com
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
---
 include/linux/page-isolation.h |   23 ++++++++++++-----------
 mm/memory-failure.c            |    2 +-
 mm/memory_hotplug.c            |    6 +++---
 mm/page_alloc.c                |   18 ++++++++++++------
 mm/page_isolation.c            |   15 ++++++++-------
 5 files changed, 36 insertions(+), 28 deletions(-)
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index af650db..c8004dc 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -3,7 +3,7 @@
/* * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE. - * If specified range includes migrate types other than MOVABLE, + * If specified range includes migrate types other than MOVABLE or CMA, * this will fail with -EBUSY. * * For isolating all pages in the range finally, the caller have to @@ -11,30 +11,31 @@ * test it. */ extern int -start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn); +start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, + unsigned migratetype);
/* * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. * target range is [start_pfn, end_pfn) */ extern int -undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn); +undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, + unsigned migratetype);
/* - * test all pages in [start_pfn, end_pfn)are isolated or not. + * Test all pages in [start_pfn, end_pfn) are isolated or not. */ -extern int -test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn); +int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn);
/* - * Internal funcs.Changes pageblock's migrate type. - * Please use make_pagetype_isolated()/make_pagetype_movable(). + * Internal functions. Changes pageblock's migrate type. */ -extern int set_migratetype_isolate(struct page *page); -extern void unset_migratetype_isolate(struct page *page); +int set_migratetype_isolate(struct page *page); +void unset_migratetype_isolate(struct page *page, unsigned migratetype);
/* The below functions must be run on a range from a single zone. */ -int alloc_contig_range(unsigned long start, unsigned long end); +int alloc_contig_range(unsigned long start, unsigned long end, + unsigned migratetype); void free_contig_range(unsigned long pfn, unsigned nr_pages);
/* CMA stuff */ diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 06d3479..3cf809d 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1400,7 +1400,7 @@ static int get_any_page(struct page *p, unsigned long pfn, int flags) /* Not a free page */ ret = 1; } - unset_migratetype_isolate(p); + unset_migratetype_isolate(p, MIGRATE_MOVABLE); unlock_memory_hotplug(); return ret; } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2168489..2a44d55 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -891,7 +891,7 @@ static int __ref offline_pages(unsigned long start_pfn, nr_pages = end_pfn - start_pfn;
/* set above range as isolated */ - ret = start_isolate_page_range(start_pfn, end_pfn); + ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); if (ret) goto out;
@@ -956,7 +956,7 @@ repeat: We cannot do rollback at this point. */ offline_isolated_pages(start_pfn, end_pfn); /* reset pagetype flags and makes migrate type to be MOVABLE */ - undo_isolate_page_range(start_pfn, end_pfn); + undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); /* removal success */ zone->present_pages -= offlined_pages; zone->zone_pgdat->node_present_pages -= offlined_pages; @@ -981,7 +981,7 @@ failed_removal: start_pfn, end_pfn); memory_notify(MEM_CANCEL_OFFLINE, &arg); /* pushback to free area */ - undo_isolate_page_range(start_pfn, end_pfn); + undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
out: unlock_memory_hotplug(); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9ec73f4..714b1c1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5710,7 +5710,7 @@ out: return ret; }
-void unset_migratetype_isolate(struct page *page) +void unset_migratetype_isolate(struct page *page, unsigned migratetype) { struct zone *zone; unsigned long flags; @@ -5718,8 +5718,8 @@ void unset_migratetype_isolate(struct page *page) spin_lock_irqsave(&zone->lock, flags); if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) goto out; - set_pageblock_migratetype(page, MIGRATE_MOVABLE); - move_freepages_block(zone, page, MIGRATE_MOVABLE); + set_pageblock_migratetype(page, migratetype); + move_freepages_block(zone, page, migratetype); out: spin_unlock_irqrestore(&zone->lock, flags); } @@ -5842,6 +5842,10 @@ done: * alloc_contig_range() -- tries to allocate given range of pages * @start: start PFN to allocate * @end: one-past-the-last PFN to allocate + * @migratetype: migratetype of the underlaying pageblocks (either + * #MIGRATE_MOVABLE or #MIGRATE_CMA). All pageblocks + * in range must have the same migratetype and it must + * be either of the two. * * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES * aligned, hovewer it's callers responsibility to guarantee that we @@ -5852,7 +5856,8 @@ done: * pages which PFN is in (start, end) are allocated for the caller and * need to be freed with free_contig_range(). */ -int alloc_contig_range(unsigned long start, unsigned long end) +int alloc_contig_range(unsigned long start, unsigned long end, + unsigned migratetype) { unsigned long outer_start, outer_end; int ret; @@ -5881,7 +5886,8 @@ int alloc_contig_range(unsigned long start, unsigned long end) */
ret = start_isolate_page_range(pfn_align_to_maxpage_down(start), - pfn_align_to_maxpage_up(end)); + pfn_align_to_maxpage_up(end), + migratetype); if (ret) goto done;
@@ -5928,7 +5934,7 @@ int alloc_contig_range(unsigned long start, unsigned long end) ret = 0; done: undo_isolate_page_range(pfn_align_to_maxpage_down(start), - pfn_align_to_maxpage_up(end)); + pfn_align_to_maxpage_up(end), migratetype); return ret; }
diff --git a/mm/page_isolation.c b/mm/page_isolation.c index 4ae42bb..c9f0477 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -24,6 +24,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * to be MIGRATE_ISOLATE. * @start_pfn: The lower PFN of the range to be isolated. * @end_pfn: The upper PFN of the range to be isolated. + * @migratetype: migrate type to set in error recovery. * * Making page-allocation-type to be MIGRATE_ISOLATE means free pages in * the range will never be allocated. Any free pages and pages freed in the @@ -32,8 +33,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages) * start_pfn/end_pfn must be aligned to pageblock_order. * Returns 0 on success and -EBUSY if any part of range cannot be isolated. */ -int -start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn) +int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, + unsigned migratetype) { unsigned long pfn; unsigned long undo_pfn; @@ -56,7 +57,7 @@ undo: for (pfn = start_pfn; pfn < undo_pfn; pfn += pageblock_nr_pages) - unset_migratetype_isolate(pfn_to_page(pfn)); + unset_migratetype_isolate(pfn_to_page(pfn), migratetype);
return -EBUSY; } @@ -64,8 +65,8 @@ undo: /* * Make isolated pages available again. */ -int -undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn) +int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, + unsigned migratetype) { unsigned long pfn; struct page *page; @@ -77,7 +78,7 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn) page = __first_valid_page(pfn, pageblock_nr_pages); if (!page || get_pageblock_migratetype(page) != MIGRATE_ISOLATE) continue; - unset_migratetype_isolate(page); + unset_migratetype_isolate(page, migratetype); } return 0; } @@ -86,7 +87,7 @@ undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn) * all pages in [start_pfn...end_pfn) must be in the same zone. * zone->lock must be held before call this. * - * Returns 1 if all pages in the range is isolated. + * Returns 1 if all pages in the range are isolated. */ static int __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn)
The Contiguous Memory Allocator is a set of helper functions for the DMA mapping framework that improves allocation of contiguous memory chunks.
CMA grabs memory on system boot, marks it with the MIGRATE_CMA migrate type and gives it back to the system. The kernel is allowed to allocate movable pages within CMA's managed memory, so that it can be used, for example, for the page cache when the DMA mapping framework does not use it. On a dma_alloc_from_contiguous() request such pages are migrated out of the CMA area in order to free the required contiguous block and fulfill the request. This makes it possible to allocate large contiguous chunks of memory at any time, assuming that there is enough free memory available in the system.
This code is heavily based on earlier works by Michal Nazarewicz.
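To give an idea of how this ends up being used: typically a driver reaches CMA indirectly through the DMA mapping API rather than calling it directly, but the underlying calls added here look roughly like the sketch below (buffer size, alignment and the wrapper functions are made up).

#include <linux/dma-contiguous.h>

/* Illustrative direct use of the allocator added by this patch. */
static struct page *grab_big_buffer(struct device *dev)
{
	int count = 2048;		/* 8 MiB with 4 KiB pages */
	unsigned int align = 8;		/* capped at CONFIG_CMA_ALIGNMENT */
	struct page *pages;

	pages = dma_alloc_from_contiguous(dev, count, align);
	if (!pages)
		return NULL;

	/* ... hand page_to_phys(pages) to the hardware ... */
	return pages;
}

static void release_big_buffer(struct device *dev, struct page *pages)
{
	dma_release_from_contiguous(dev, pages, 2048);
}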
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
Signed-off-by: Kyungmin Park kyungmin.park@samsung.com
CC: Michal Nazarewicz mina86@mina86.com
---
 Documentation/kernel-parameters.txt  |    5 +
 arch/Kconfig                         |    3 +
 drivers/base/Kconfig                 |   89 ++++++++
 drivers/base/Makefile                |    1 +
 drivers/base/dma-contiguous.c        |  396 ++++++++++++++++++++++++++++++++++
 include/asm-generic/dma-contiguous.h |   27 +++
 include/linux/device.h               |    4 +
 include/linux/dma-contiguous.h       |  110 ++++++++++
 8 files changed, 635 insertions(+), 0 deletions(-)
 create mode 100644 drivers/base/dma-contiguous.c
 create mode 100644 include/asm-generic/dma-contiguous.h
 create mode 100644 include/linux/dma-contiguous.h
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index a0c5c5f..164024e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -503,6 +503,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted. Also note the kernel might malfunction if you disable some critical bits.
+ cma=nn[MG] [ARM,KNL] + Sets the size of kernel global memory area for contiguous + memory allocations. For more information, see + include/linux/dma-contiguous.h + cmo_free_hint= [PPC] Format: { yes | no } Specify whether pages are marked as being inactive when they are freed. This is used in CMO environments diff --git a/arch/Kconfig b/arch/Kconfig index 4b0669c..a3b39a2 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -124,6 +124,9 @@ config HAVE_ARCH_TRACEHOOK config HAVE_DMA_ATTRS bool
+config HAVE_DMA_CONTIGUOUS + bool + config USE_GENERIC_SMP_HELPERS bool
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 21cf46f..99f5fad 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -174,4 +174,93 @@ config SYS_HYPERVISOR
source "drivers/base/regmap/Kconfig"
+config CMA + bool "Contiguous Memory Allocator (EXPERIMENTAL)" + depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK && EXPERIMENTAL + select MIGRATION + help + This enables the Contiguous Memory Allocator which allows drivers + to allocate big physically-contiguous blocks of memory for use with + hardware components that do not support I/O map nor scatter-gather. + + For more information see <include/linux/dma-contiguous.h>. + If unsure, say "n". + +if CMA + +config CMA_DEBUG + bool "CMA debug messages (DEVELOPMENT)" + depends on DEBUG_KERNEL + help + Turns on debug messages in CMA. This produces KERN_DEBUG + messages for every CMA call as well as various messages while + processing calls such as dma_alloc_from_contiguous(). + This option does not affect warning and error messages. + +comment "Default contiguous memory area size:" + +config CMA_SIZE_MBYTES + int "Size in Mega Bytes" + depends on !CMA_SIZE_SEL_PERCENTAGE + default 16 + help + Defines the size (in MiB) of the default memory area for Contiguous + Memory Allocator. + +config CMA_SIZE_PERCENTAGE + int "Percentage of total memory" + depends on !CMA_SIZE_SEL_MBYTES + default 10 + help + Defines the size of the default memory area for Contiguous Memory + Allocator as a percentage of the total memory in the system. + +choice + prompt "Selected region size" + default CMA_SIZE_SEL_ABSOLUTE + +config CMA_SIZE_SEL_MBYTES + bool "Use mega bytes value only" + +config CMA_SIZE_SEL_PERCENTAGE + bool "Use percentage value only" + +config CMA_SIZE_SEL_MIN + bool "Use lower value (minimum)" + +config CMA_SIZE_SEL_MAX + bool "Use higher value (maximum)" + +endchoice + +config CMA_ALIGNMENT + int "Maximum PAGE_SIZE order of alignment for contiguous buffers" + range 4 9 + default 8 + help + DMA mapping framework by default aligns all buffers to the smallest + PAGE_SIZE order which is greater than or equal to the requested buffer + size. This works well for buffers up to a few hundreds kilobytes, but + for larger buffers it just a memory waste. With this parameter you can + specify the maximum PAGE_SIZE order for contiguous buffers. Larger + buffers will be aligned only to this specified order. The order is + expressed as a power of two multiplied by the PAGE_SIZE. + + For example, if your system defaults to 4KiB pages, the order value + of 8 means that the buffers will be aligned up to 1MiB only. + + If unsure, leave the default value "8". + +config CMA_AREAS + int "Maximum count of the CMA device-private areas" + default 7 + help + CMA allows to create CMA areas for particular devices. This parameter + sets the maximum number of such device private CMA areas in the + system. + + If unsure, leave the default value "7". + +endif + endmenu diff --git a/drivers/base/Makefile b/drivers/base/Makefile index 99a375a..794546f 100644 --- a/drivers/base/Makefile +++ b/drivers/base/Makefile @@ -5,6 +5,7 @@ obj-y := core.o sys.o bus.o dd.o syscore.o \ cpu.o firmware.o init.o map.o devres.o \ attribute_container.o transport_class.o obj-$(CONFIG_DEVTMPFS) += devtmpfs.o +obj-$(CONFIG_CMA) += dma-contiguous.o obj-y += power/ obj-$(CONFIG_HAS_DMA) += dma-mapping.o obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c new file mode 100644 index 0000000..06ceab8 --- /dev/null +++ b/drivers/base/dma-contiguous.c @@ -0,0 +1,396 @@ +/* + * Contiguous Memory Allocator for DMA mapping framework + * Copyright (c) 2010-2011 by Samsung Electronics. 
+ * Written by: + * Marek Szyprowski m.szyprowski@samsung.com + * Michal Nazarewicz mina86@mina86.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License or (at your optional) any later version of the license. + */ + +#define pr_fmt(fmt) "cma: " fmt + +#ifdef CONFIG_CMA_DEBUG +#ifndef DEBUG +# define DEBUG +#endif +#endif + +#include <asm/page.h> +#include <asm/dma-contiguous.h> + +#include <linux/memblock.h> +#include <linux/err.h> +#include <linux/mm.h> +#include <linux/mutex.h> +#include <linux/page-isolation.h> +#include <linux/slab.h> +#include <linux/swap.h> +#include <linux/mm_types.h> +#include <linux/dma-contiguous.h> + +#ifndef SZ_1M +#define SZ_1M (1 << 20) +#endif + +struct cma { + unsigned long base_pfn; + unsigned long count; + unsigned long *bitmap; +}; + +struct cma *dma_contiguous_default_area; + +#ifdef CONFIG_CMA_SIZE_MBYTES +#define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES +#else +#define CMA_SIZE_MBYTES 0 +#endif + +#ifdef CONFIG_CMA_SIZE_PERCENTAGE +#define CMA_SIZE_PERCENTAGE CONFIG_CMA_SIZE_PERCENTAGE +#else +#define CMA_SIZE_PERCENTAGE 0 +#endif + +/* + * Default global CMA area size can be defined in kernel's .config. + * This is usefull mainly for distro maintainers to create a kernel + * that works correctly for most supported systems. + * The size can be set in bytes or as a percentage of the total memory + * in the system. + * + * Users, who want to set the size of global CMA area for their system + * should use cma= kernel parameter. + */ +static unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M; +static unsigned long size_percent = CMA_SIZE_PERCENTAGE; +static long size_cmdline = -1; + +static int __init early_cma(char *p) +{ + pr_debug("%s(%s)\n", __func__, p); + size_cmdline = memparse(p, &p); + return 0; +} +early_param("cma", early_cma); + +static unsigned long __init cma_early_get_total_pages(void) +{ + struct memblock_region *reg; + unsigned long total_pages = 0; + + /* + * We cannot use memblock_phys_mem_size() here, because + * memblock_analyze() has not been called yet. + */ + for_each_memblock(memory, reg) + total_pages += memblock_region_memory_end_pfn(reg) - + memblock_region_memory_base_pfn(reg); + return total_pages; +} + +/** + * dma_contiguous_reserve() - reserve area for contiguous memory handling + * @limit: End address of the reserved memory (optional, 0 for any). + * + * This funtion reserves memory from early allocator. It should be + * called by arch specific code once the early allocator (memblock or bootmem) + * has been activated and all other subsystems have already allocated/reserved + * memory. 
+ */ +void __init dma_contiguous_reserve(phys_addr_t limit) +{ + unsigned long selected_size = 0; + unsigned long total_pages; + + pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit); + + total_pages = cma_early_get_total_pages(); + size_percent *= (total_pages << PAGE_SHIFT) / 100; + + pr_debug("%s: total available: %ld MiB, size absolute: %ld MiB, size percentage: %ld MiB\n", + __func__, (total_pages << PAGE_SHIFT) / SZ_1M, + size_bytes / SZ_1M, size_percent / SZ_1M); + +#ifdef CONFIG_CMA_SIZE_SEL_MBYTES + selected_size = size_bytes; +#elif defined(CONFIG_CMA_SIZE_SEL_PERCENTAGE) + selected_size = size_percent; +#elif defined(CONFIG_CMA_SIZE_SEL_MIN) + selected_size = min(size_bytes, size_percent); +#elif defined(CONFIG_CMA_SIZE_SEL_MAX) + selected_size = max(size_bytes, size_percent); +#endif + + if (size_cmdline != -1) + selected_size = size_cmdline; + + if (!selected_size) + return; + + pr_debug("%s: reserving %ld MiB for global area\n", __func__, + selected_size / SZ_1M); + + dma_declare_contiguous(NULL, selected_size, 0, limit); +}; + +static DEFINE_MUTEX(cma_mutex); + +static int cma_activate_area(unsigned long base_pfn, unsigned long count) +{ + unsigned long pfn = base_pfn; + unsigned i = count >> pageblock_order; + struct zone *zone; + + WARN_ON_ONCE(!pfn_valid(pfn)); + zone = page_zone(pfn_to_page(pfn)); + + do { + unsigned j; + base_pfn = pfn; + for (j = pageblock_nr_pages; j; --j, pfn++) { + WARN_ON_ONCE(!pfn_valid(pfn)); + if (page_zone(pfn_to_page(pfn)) != zone) + return -EINVAL; + } + init_cma_reserved_pageblock(pfn_to_page(base_pfn)); + } while (--i); + return 0; +} + +static struct cma *cma_create_area(unsigned long base_pfn, + unsigned long count) +{ + int bitmap_size = BITS_TO_LONGS(count) * sizeof(long); + struct cma *cma; + int ret = -ENOMEM; + + pr_debug("%s(base %08lx, count %lx)\n", __func__, base_pfn, count); + + cma = kmalloc(sizeof *cma, GFP_KERNEL); + if (!cma) + return ERR_PTR(-ENOMEM); + + cma->base_pfn = base_pfn; + cma->count = count; + cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL); + + if (!cma->bitmap) + goto no_mem; + + ret = cma_activate_area(base_pfn, count); + if (ret) + goto error; + + pr_debug("%s: returned %p\n", __func__, (void *)cma); + return cma; + +error: + kfree(cma->bitmap); +no_mem: + kfree(cma); + return ERR_PTR(ret); +} + +static struct cma_reserved { + phys_addr_t start; + unsigned long size; + struct device *dev; +} cma_reserved[MAX_CMA_AREAS] __initdata; +static unsigned cma_reserved_count __initdata; + +static int __init cma_init_reserved_areas(void) +{ + struct cma_reserved *r = cma_reserved; + unsigned i = cma_reserved_count; + + pr_debug("%s()\n", __func__); + + for (; i; --i, ++r) { + struct cma *cma; + cma = cma_create_area(PFN_DOWN(r->start), + r->size >> PAGE_SHIFT); + if (!IS_ERR(cma)) + dev_set_cma_area(r->dev, cma); + } + return 0; +} +core_initcall(cma_init_reserved_areas); + +/** + * dma_declare_contiguous() - reserve area for contiguous memory handling + * for particular device + * @dev: Pointer to device structure. + * @size: Size of the reserved memory. + * @start: Start address of the reserved memory (optional, 0 for any). + * @limit: End address of the reserved memory (optional, 0 for any). + * + * This funtion reserves memory for specified device. It should be + * called by board specific code when early allocator (memblock or bootmem) + * is still activate. 
+ */ +int __init dma_declare_contiguous(struct device *dev, unsigned long size, + phys_addr_t base, phys_addr_t limit) +{ + struct cma_reserved *r = &cma_reserved[cma_reserved_count]; + unsigned long alignment; + + pr_debug("%s(size %lx, base %08lx, limit %08lx)\n", __func__, + (unsigned long)size, (unsigned long)base, + (unsigned long)limit); + + /* Sanity checks */ + if (cma_reserved_count == ARRAY_SIZE(cma_reserved)) { + pr_err("Not enough slots for CMA reserved regions!\n"); + return -ENOSPC; + } + + if (!size) + return -EINVAL; + + /* Sanitise input arguments */ + alignment = PAGE_SIZE << max(MAX_ORDER, pageblock_order); + base = ALIGN(base, alignment); + size = ALIGN(size, alignment); + limit = ALIGN(limit, alignment); + + /* Reserve memory */ + if (base) { + if (memblock_is_region_reserved(base, size) || + memblock_reserve(base, size) < 0) { + base = -EBUSY; + goto err; + } + } else { + /* + * Use __memblock_alloc_base() since + * memblock_alloc_base() panic()s. + */ + phys_addr_t addr = __memblock_alloc_base(size, alignment, limit); + if (!addr) { + base = -ENOMEM; + goto err; + } else if (addr + size > ~(unsigned long)0) { + memblock_free(addr, size); + base = -EINVAL; + goto err; + } else { + base = addr; + } + } + + /* + * Each reserved area must be initialised later, when more kernel + * subsystems (like slab allocator) are available. + */ + r->start = base; + r->size = size; + r->dev = dev; + cma_reserved_count++; + pr_info("CMA: reserved %ld MiB at %08lx\n", size / SZ_1M, + (unsigned long)base); + + /* + * Architecture specific contiguous memory fixup. + */ + dma_contiguous_early_fixup(base, size); + return 0; +err: + pr_err("CMA: failed to reserve %ld MiB\n", size / SZ_1M); + return base; +} + +/** + * dma_alloc_from_contiguous() - allocate pages from contiguous area + * @dev: Pointer to device for which the allocation is performed. + * @count: Requested number of pages. + * @align: Requested alignment of pages (in PAGE_SIZE order). + * + * This funtion allocates memory buffer for specified device. It uses + * device specific contiguous memory area if available or the default + * global one. Requires architecture specific get_dev_cma_area() helper + * function. + */ +struct page *dma_alloc_from_contiguous(struct device *dev, int count, + unsigned int align) +{ + struct cma *cma = dev_get_cma_area(dev); + unsigned long pfn, pageno; + int ret; + + if (!cma) + return NULL; + + if (align > CONFIG_CMA_ALIGNMENT) + align = CONFIG_CMA_ALIGNMENT; + + pr_debug("%s(cma %p, count %d, align %d)\n", __func__, (void *)cma, + count, align); + + if (!count) + return NULL; + + mutex_lock(&cma_mutex); + + pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count, 0, count, + (1 << align) - 1); + if (pageno >= cma->count) { + ret = -ENOMEM; + goto error; + } + bitmap_set(cma->bitmap, pageno, count); + + pfn = cma->base_pfn + pageno; + ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA); + if (ret) + goto free; + + mutex_unlock(&cma_mutex); + + pr_debug("%s(): returned %p\n", __func__, pfn_to_page(pfn)); + return pfn_to_page(pfn); +free: + bitmap_clear(cma->bitmap, pageno, count); +error: + mutex_unlock(&cma_mutex); + return NULL; +} + +/** + * dma_release_from_contiguous() - release allocated pages + * @dev: Pointer to device for which the pages were allocated. + * @pages: Allocated pages. + * @count: Number of allocated pages. + * + * This funtion releases memory allocated by dma_alloc_from_contiguous(). 
+ * It return 0 when provided pages doen't belongs to contiguous area and + * 1 on success. + */ +int dma_release_from_contiguous(struct device *dev, struct page *pages, + int count) +{ + struct cma *cma = dev_get_cma_area(dev); + unsigned long pfn; + + if (!cma || !pages) + return 0; + + pr_debug("%s(page %p)\n", __func__, (void *)pages); + + pfn = page_to_pfn(pages); + + if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count) + return 0; + + mutex_lock(&cma_mutex); + + bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count); + free_contig_range(pfn, count); + + mutex_unlock(&cma_mutex); + return 1; +} diff --git a/include/asm-generic/dma-contiguous.h b/include/asm-generic/dma-contiguous.h new file mode 100644 index 0000000..bf2bccc --- /dev/null +++ b/include/asm-generic/dma-contiguous.h @@ -0,0 +1,27 @@ +#ifndef ASM_DMA_CONTIGUOUS_H +#define ASM_DMA_CONTIGUOUS_H + +#ifdef __KERNEL__ + +#include <linux/device.h> +#include <linux/dma-contiguous.h> + +#ifdef CONFIG_CMA + +static inline struct cma *dev_get_cma_area(struct device *dev) +{ + if (dev && dev->cma_area) + return dev->cma_area; + return dma_contiguous_default_area; +} + +static inline void dev_set_cma_area(struct device *dev, struct cma *cma) +{ + if (dev) + dev->cma_area = cma; + dma_contiguous_default_area = cma; +} + +#endif +#endif +#endif diff --git a/include/linux/device.h b/include/linux/device.h index ffbcf95..54a2295 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -594,6 +594,10 @@ struct device {
struct dma_coherent_mem *dma_mem; /* internal for coherent mem override */ +#ifdef CONFIG_CMA + struct cma *cma_area; /* contiguous memory area for dma + allocations */ +#endif /* arch specific additions */ struct dev_archdata archdata;
diff --git a/include/linux/dma-contiguous.h b/include/linux/dma-contiguous.h new file mode 100644 index 0000000..ffb4b40 --- /dev/null +++ b/include/linux/dma-contiguous.h @@ -0,0 +1,110 @@ +#ifndef __LINUX_CMA_H +#define __LINUX_CMA_H + +/* + * Contiguous Memory Allocator for DMA mapping framework + * Copyright (c) 2010-2011 by Samsung Electronics. + * Written by: + * Marek Szyprowski m.szyprowski@samsung.com + * Michal Nazarewicz mina86@mina86.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License or (at your optional) any later version of the license. + */ + +/* + * Contiguous Memory Allocator + * + * The Contiguous Memory Allocator (CMA) makes it possible to + * allocate big contiguous chunks of memory after the system has + * booted. + * + * Why is it needed? + * + * Various devices on embedded systems have no scatter-getter and/or + * IO map support and require contiguous blocks of memory to + * operate. They include devices such as cameras, hardware video + * coders, etc. + * + * Such devices often require big memory buffers (a full HD frame + * is, for instance, more then 2 mega pixels large, i.e. more than 6 + * MB of memory), which makes mechanisms such as kmalloc() or + * alloc_page() ineffective. + * + * At the same time, a solution where a big memory region is + * reserved for a device is suboptimal since often more memory is + * reserved then strictly required and, moreover, the memory is + * inaccessible to page system even if device drivers don't use it. + * + * CMA tries to solve this issue by operating on memory regions + * where only movable pages can be allocated from. This way, kernel + * can use the memory for pagecache and when device driver requests + * it, allocated pages can be migrated. + * + * Driver usage + * + * CMA should not be used by the device drivers directly. It is + * only a helper framework for dma-mapping subsystem. + * + * For more information, see kernel-docs in drivers/base/dma-contiguous.c + */ + +#ifdef __KERNEL__ + +struct cma; +struct page; +struct device; + +#ifdef CONFIG_CMA + +/* + * There is always at least global CMA area and a few optional device + * private areas configured in kernel .config. + */ +#define MAX_CMA_AREAS (1 + CONFIG_CMA_AREAS) + +extern struct cma *dma_contiguous_default_area; + +void dma_contiguous_reserve(phys_addr_t addr_limit); +int dma_declare_contiguous(struct device *dev, unsigned long size, + phys_addr_t base, phys_addr_t limit); + +struct page *dma_alloc_from_contiguous(struct device *dev, int count, + unsigned int order); +int dma_release_from_contiguous(struct device *dev, struct page *pages, + int count); + +#else + +#define MAX_CMA_AREAS (0) + +static inline void dma_contiguous_reserve(phys_addr_t limit) { } + +static inline +int dma_declare_contiguous(struct device *dev, unsigned long size, + phys_addr_t base, phys_addr_t limit) +{ + return -ENOSYS; +} + +static inline +struct page *dma_alloc_from_contiguous(struct device *dev, int count, + unsigned int order) +{ + return NULL; +} + +static inline +int dma_release_from_contiguous(struct device *dev, struct page *pages, + int count) +{ + return 0; +} + +#endif + +#endif + +#endif
This patch adds support for CMA to the dma-mapping subsystem for the x86 architecture, which uses the common pci-dma/pci-nommu implementation. This makes it possible to test CMA on KVM/QEMU and on many common x86 boxes.
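To illustrate what this means for driver code, here is a minimal sketch (not part of the patch; the device pointer, function names and buffer size are placeholders): the driver keeps using the regular DMA API, and non-atomic coherent allocations are transparently backed by CMA when dma_alloc_from_contiguous() succeeds, falling back to the page allocator otherwise.

#include <linux/dma-mapping.h>

/* Hypothetical driver code; 'dev' is the driver's struct device. */
static void *buf;
static dma_addr_t buf_dma;

static int example_alloc(struct device *dev)
{
	/* 4 MiB, GFP_KERNEL: may be served from CMA, otherwise from alloc_pages() */
	buf = dma_alloc_coherent(dev, 4 << 20, &buf_dma, GFP_KERNEL);
	return buf ? 0 : -ENOMEM;
}

static void example_free(struct device *dev)
{
	dma_free_coherent(dev, 4 << 20, buf, buf_dma);
}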
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com CC: Michal Nazarewicz mina86@mina86.com --- arch/x86/Kconfig | 1 + arch/x86/include/asm/dma-contiguous.h | 13 +++++++++++++ arch/x86/include/asm/dma-mapping.h | 4 ++++ arch/x86/kernel/pci-dma.c | 18 ++++++++++++++++-- arch/x86/kernel/pci-nommu.c | 8 +------- arch/x86/kernel/setup.c | 2 ++ 6 files changed, 37 insertions(+), 9 deletions(-) create mode 100644 arch/x86/include/asm/dma-contiguous.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cb9a104..57a7177 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -29,6 +29,7 @@ config X86 select ARCH_WANT_OPTIONAL_GPIOLIB select ARCH_WANT_FRAME_POINTERS select HAVE_DMA_ATTRS + select HAVE_DMA_CONTIGUOUS if !SWIOTLB select HAVE_KRETPROBES select HAVE_OPTPROBES select HAVE_FTRACE_MCOUNT_RECORD diff --git a/arch/x86/include/asm/dma-contiguous.h b/arch/x86/include/asm/dma-contiguous.h new file mode 100644 index 0000000..8fb117d --- /dev/null +++ b/arch/x86/include/asm/dma-contiguous.h @@ -0,0 +1,13 @@ +#ifndef ASMX86_DMA_CONTIGUOUS_H +#define ASMX86_DMA_CONTIGUOUS_H + +#ifdef __KERNEL__ + +#include <linux/device.h> +#include <linux/dma-contiguous.h> +#include <asm-generic/dma-contiguous.h> + +static inline void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size) { } + +#endif +#endif diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h index ed3065f..90ac6f0 100644 --- a/arch/x86/include/asm/dma-mapping.h +++ b/arch/x86/include/asm/dma-mapping.h @@ -13,6 +13,7 @@ #include <asm/io.h> #include <asm/swiotlb.h> #include <asm-generic/dma-coherent.h> +#include <linux/dma-contiguous.h>
#ifdef CONFIG_ISA # define ISA_DMA_BIT_MASK DMA_BIT_MASK(24) @@ -61,6 +62,9 @@ extern int dma_set_mask(struct device *dev, u64 mask); extern void *dma_generic_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_addr, gfp_t flag);
+extern void dma_generic_free_coherent(struct device *dev, size_t size, + void *vaddr, dma_addr_t dma_addr); + static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size) { if (!dev->dma_mask) diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index 80dc793..f4abafc 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -90,14 +90,18 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_addr, gfp_t flag) { unsigned long dma_mask; - struct page *page; + struct page *page = NULL; + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; dma_addr_t addr;
dma_mask = dma_alloc_coherent_mask(dev, flag);
flag |= __GFP_ZERO; again: - page = alloc_pages_node(dev_to_node(dev), flag, get_order(size)); + if (!(flag & GFP_ATOMIC)) + page = dma_alloc_from_contiguous(dev, count, get_order(size)); + if (!page) + page = alloc_pages_node(dev_to_node(dev), flag, get_order(size)); if (!page) return NULL;
@@ -117,6 +121,16 @@ again: return page_address(page); }
+void dma_generic_free_coherent(struct device *dev, size_t size, void *vaddr, + dma_addr_t dma_addr) +{ + unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; + struct page *page = virt_to_page(vaddr); + + if (!dma_release_from_contiguous(dev, page, count)) + free_pages((unsigned long)vaddr, get_order(size)); +} + /* * See <Documentation/x86/x86_64/boot-options.txt> for the iommu kernel * parameter documentation. diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c index 3af4af8..656566f 100644 --- a/arch/x86/kernel/pci-nommu.c +++ b/arch/x86/kernel/pci-nommu.c @@ -74,12 +74,6 @@ static int nommu_map_sg(struct device *hwdev, struct scatterlist *sg, return nents; }
-static void nommu_free_coherent(struct device *dev, size_t size, void *vaddr, - dma_addr_t dma_addr) -{ - free_pages((unsigned long)vaddr, get_order(size)); -} - static void nommu_sync_single_for_device(struct device *dev, dma_addr_t addr, size_t size, enum dma_data_direction dir) @@ -97,7 +91,7 @@ static void nommu_sync_sg_for_device(struct device *dev,
struct dma_map_ops nommu_dma_ops = { .alloc_coherent = dma_generic_alloc_coherent, - .free_coherent = nommu_free_coherent, + .free_coherent = dma_generic_free_coherent, .map_sg = nommu_map_sg, .map_page = nommu_map_page, .sync_single_for_device = nommu_sync_single_for_device, diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index afaf384..6c9efb5 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -50,6 +50,7 @@ #include <asm/pci-direct.h> #include <linux/init_ohci1394_dma.h> #include <linux/kvm_para.h> +#include <linux/dma-contiguous.h>
#include <linux/errno.h> #include <linux/kernel.h> @@ -944,6 +945,7 @@ void __init setup_arch(char **cmdline_p) } #endif memblock.current_limit = get_max_mapped(); + dma_contiguous_reserve(0);
/* * NOTE: On x86-32, only from this point on, fixmaps are ready for use.
This patch adds support for CMA to the dma-mapping subsystem for the ARM architecture. By default a global CMA area is used, but specific devices are allowed to have their own private memory areas if required (these can be created with the dma_declare_contiguous() function during board initialization).
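As a rough illustration (not part of the patch; the device name and size are made up), a board file could reserve such a private area like this:

#include <linux/platform_device.h>
#include <linux/dma-contiguous.h>

/* Hypothetical platform device defined elsewhere in the board code. */
extern struct platform_device example_video_device;

void __init example_board_reserve(void)
{
	/*
	 * Reserve a 16 MiB device-private CMA area anywhere in memory
	 * (base = 0 and limit = 0 mean no placement constraints).
	 */
	if (dma_declare_contiguous(&example_video_device.dev, 16 << 20, 0, 0))
		pr_err("example: CMA area reservation failed\n");
}

DMA API allocations made for that device are then served from its private area instead of the global one.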
Contiguous memory areas reserved for DMA are remapped with 2-level page tables at boot time. Once a buffer is requested, the low-memory kernel mapping is updated to match the requested memory access type.
GFP_ATOMIC allocations are served from a special pool which is created early during boot. This way, remapping of page attributes is not needed at allocation time.
CMA has been enabled unconditionally for ARMv6+ systems.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com CC: Michal Nazarewicz mina86@mina86.com --- Documentation/kernel-parameters.txt | 4 + arch/arm/Kconfig | 2 + arch/arm/include/asm/dma-contiguous.h | 16 ++ arch/arm/include/asm/mach/map.h | 1 + arch/arm/kernel/setup.c | 8 +- arch/arm/mm/dma-mapping.c | 368 +++++++++++++++++++++++++++------ arch/arm/mm/init.c | 20 ++- arch/arm/mm/mm.h | 3 + arch/arm/mm/mmu.c | 29 ++- 9 files changed, 364 insertions(+), 87 deletions(-) create mode 100644 arch/arm/include/asm/dma-contiguous.h
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 164024e..6685e63 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -515,6 +515,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. a hypervisor. Default: yes
+ coherent_pool=nn[KMG] [ARM,KNL] + Sets the size of memory pool for coherent, atomic dma + allocations if Contiguous Memory Allocator (CMA) is used. + code_bytes [X86] How many bytes of object code to print in an oops report. Range: 0 - 8192 diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 44789ef..53c96d9 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -4,6 +4,8 @@ config ARM select HAVE_AOUT select HAVE_DMA_API_DEBUG select HAVE_IDE if PCI || ISA || PCMCIA + select HAVE_DMA_CONTIGUOUS if (CPU_V6 || CPU_V6K || CPU_V7) + select CMA if (CPU_V6 || CPU_V6K || CPU_V7) select HAVE_MEMBLOCK select RTC_LIB select SYS_SUPPORTS_APM_EMULATION diff --git a/arch/arm/include/asm/dma-contiguous.h b/arch/arm/include/asm/dma-contiguous.h new file mode 100644 index 0000000..c7ba05e --- /dev/null +++ b/arch/arm/include/asm/dma-contiguous.h @@ -0,0 +1,16 @@ +#ifndef ASMARM_DMA_CONTIGUOUS_H +#define ASMARM_DMA_CONTIGUOUS_H + +#ifdef __KERNEL__ + +#include <linux/device.h> +#include <linux/dma-contiguous.h> +#include <asm-generic/dma-contiguous.h> + +#ifdef CONFIG_CMA + +void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size); + +#endif +#endif +#endif diff --git a/arch/arm/include/asm/mach/map.h b/arch/arm/include/asm/mach/map.h index b36f365..a6efcdd 100644 --- a/arch/arm/include/asm/mach/map.h +++ b/arch/arm/include/asm/mach/map.h @@ -30,6 +30,7 @@ struct map_desc { #define MT_MEMORY_DTCM 12 #define MT_MEMORY_ITCM 13 #define MT_MEMORY_SO 14 +#define MT_MEMORY_DMA_READY 15
#ifdef CONFIG_MMU extern void iotable_init(struct map_desc *, int); diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c index 7e7977a..4c82ad3 100644 --- a/arch/arm/kernel/setup.c +++ b/arch/arm/kernel/setup.c @@ -78,6 +78,7 @@ __setup("fpe=", fpe_setup); extern void paging_init(struct machine_desc *desc); extern void sanity_check_meminfo(void); extern void reboot_setup(char *str); +extern void setup_dma_zone(struct machine_desc *desc);
unsigned int processor_id; EXPORT_SYMBOL(processor_id); @@ -917,6 +918,7 @@ void __init setup_arch(char **cmdline_p) parse_early_param();
sanity_check_meminfo(); + setup_dma_zone(mdesc); arm_memblock_init(&meminfo, mdesc);
paging_init(mdesc); @@ -932,12 +934,6 @@ void __init setup_arch(char **cmdline_p)
tcm_init();
-#ifdef CONFIG_ZONE_DMA - if (mdesc->dma_zone_size) { - extern unsigned long arm_dma_zone_size; - arm_dma_zone_size = mdesc->dma_zone_size; - } -#endif #ifdef CONFIG_MULTI_IRQ_HANDLER handle_arch_irq = mdesc->handle_irq; #endif diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index e4e7f6c..c63174c 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -17,7 +17,9 @@ #include <linux/init.h> #include <linux/device.h> #include <linux/dma-mapping.h> +#include <linux/dma-contiguous.h> #include <linux/highmem.h> +#include <linux/memblock.h> #include <linux/slab.h>
#include <asm/memory.h> @@ -26,6 +28,8 @@ #include <asm/tlbflush.h> #include <asm/sizes.h> #include <asm/mach/arch.h> +#include <asm/mach/map.h> +#include <asm/dma-contiguous.h>
#include "mm.h"
@@ -56,6 +60,19 @@ static u64 get_coherent_dma_mask(struct device *dev) return mask; }
+static void __dma_clear_buffer(struct page *page, size_t size) +{ + void *ptr; + /* + * Ensure that the allocated pages are zeroed, and that any data + * lurking in the kernel direct-mapped region is invalidated. + */ + ptr = page_address(page); + memset(ptr, 0, size); + dmac_flush_range(ptr, ptr + size); + outer_flush_range(__pa(ptr), __pa(ptr) + size); +} + /* * Allocate a DMA buffer for 'dev' of size 'size' using the * specified gfp mask. Note that 'size' must be page aligned. @@ -64,23 +81,6 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf { unsigned long order = get_order(size); struct page *page, *p, *e; - void *ptr; - u64 mask = get_coherent_dma_mask(dev); - -#ifdef CONFIG_DMA_API_DEBUG - u64 limit = (mask + 1) & ~mask; - if (limit && size >= limit) { - dev_warn(dev, "coherent allocation too big (requested %#x mask %#llx)\n", - size, mask); - return NULL; - } -#endif - - if (!mask) - return NULL; - - if (mask < 0xffffffffULL) - gfp |= GFP_DMA;
page = alloc_pages(gfp, order); if (!page) @@ -93,14 +93,7 @@ static struct page *__dma_alloc_buffer(struct device *dev, size_t size, gfp_t gf for (p = page + (size >> PAGE_SHIFT), e = page + (1 << order); p < e; p++) __free_page(p);
- /* - * Ensure that the allocated pages are zeroed, and that any data - * lurking in the kernel direct-mapped region is invalidated. - */ - ptr = page_address(page); - memset(ptr, 0, size); - dmac_flush_range(ptr, ptr + size); - outer_flush_range(__pa(ptr), __pa(ptr) + size); + __dma_clear_buffer(page, size);
return page; } @@ -170,6 +163,9 @@ static int __init consistent_init(void) unsigned long base = consistent_base; unsigned long num_ptes = (CONSISTENT_END - base) >> PGDIR_SHIFT;
+ if (cpu_architecture() >= CPU_ARCH_ARMv6) + return 0; + consistent_pte = kmalloc(num_ptes * sizeof(pte_t), GFP_KERNEL); if (!consistent_pte) { pr_err("%s: no memory\n", __func__); @@ -210,9 +206,101 @@ static int __init consistent_init(void)
return ret; } - core_initcall(consistent_init);
+static void *__alloc_from_contiguous(struct device *dev, size_t size, + pgprot_t prot, struct page **ret_page); + +static struct arm_vmregion_head coherent_head = { + .vm_lock = __SPIN_LOCK_UNLOCKED(&coherent_head.vm_lock), + .vm_list = LIST_HEAD_INIT(coherent_head.vm_list), +}; + +size_t coherent_pool_size = DEFAULT_CONSISTENT_DMA_SIZE / 8; + +static int __init early_coherent_pool(char *p) +{ + coherent_pool_size = memparse(p, &p); + return 0; +} +early_param("coherent_pool", early_coherent_pool); + +/* + * Initialise the coherent pool for atomic allocations. + */ +static int __init coherent_init(void) +{ + pgprot_t prot = pgprot_dmacoherent(pgprot_kernel); + size_t size = coherent_pool_size; + struct page *page; + void *ptr; + + if (cpu_architecture() < CPU_ARCH_ARMv6) + return 0; + + ptr = __alloc_from_contiguous(NULL, size, prot, &page); + if (ptr) { + coherent_head.vm_start = (unsigned long) ptr; + coherent_head.vm_end = (unsigned long) ptr + size; + printk(KERN_INFO "DMA: preallocated %u KiB pool for atomic coherent allocations\n", + (unsigned)size / 1024); + return 0; + } + printk(KERN_ERR "DMA: failed to allocate %u KiB pool for atomic coherent allocation\n", + (unsigned)size / 1024); + return -ENOMEM; +} +/* + * CMA is activated by core_initcall, so we must be called after it + */ +postcore_initcall(coherent_init); + +struct dma_contig_early_reserve { + phys_addr_t base; + unsigned long size; +}; + +static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata; + +static int dma_mmu_remap_num __initdata; + +void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size) +{ + dma_mmu_remap[dma_mmu_remap_num].base = base; + dma_mmu_remap[dma_mmu_remap_num].size = size; + dma_mmu_remap_num++; +} + +void __init dma_contiguous_remap(void) +{ + int i; + for (i = 0; i < dma_mmu_remap_num; i++) { + phys_addr_t start = dma_mmu_remap[i].base; + phys_addr_t end = start + dma_mmu_remap[i].size; + struct map_desc map; + unsigned long addr; + + if (end > arm_lowmem_limit) + end = arm_lowmem_limit; + if (start >= end) + return; + + map.pfn = __phys_to_pfn(start); + map.virtual = __phys_to_virt(start); + map.length = end - start; + map.type = MT_MEMORY_DMA_READY; + + /* + * Clear previous low-memory mapping + */ + for (addr = __phys_to_virt(start); addr < __phys_to_virt(end); + addr += PGDIR_SIZE) + pmd_clear(pmd_off_k(addr)); + + iotable_init(&map, 1); + } +} + static void * __dma_alloc_remap(struct page *page, size_t size, gfp_t gfp, pgprot_t prot) { @@ -318,36 +406,186 @@ static void __dma_free_remap(void *cpu_addr, size_t size) arm_vmregion_free(&consistent_head, c); }
+static int __dma_update_pte(pte_t *pte, pgtable_t token, unsigned long addr, + void *data) +{ + struct page *page = virt_to_page(addr); + pgprot_t prot = *(pgprot_t *)data; + + set_pte_ext(pte, mk_pte(page, prot), 0); + return 0; +} + +static void __dma_remap(struct page *page, size_t size, pgprot_t prot) +{ + unsigned long start = (unsigned long) page_address(page); + unsigned end = start + size; + + apply_to_page_range(&init_mm, start, size, __dma_update_pte, &prot); + dsb(); + flush_tlb_kernel_range(start, end); +} + +static void *__alloc_remap_buffer(struct device *dev, size_t size, gfp_t gfp, + pgprot_t prot, struct page **ret_page) +{ + struct page *page; + void *ptr; + page = __dma_alloc_buffer(dev, size, gfp); + if (!page) + return NULL; + + ptr = __dma_alloc_remap(page, size, gfp, prot); + if (!ptr) { + __dma_free_buffer(page, size); + return NULL; + } + + *ret_page = page; + return ptr; +} + +static void *__alloc_from_pool(struct device *dev, size_t size, + struct page **ret_page) +{ + struct arm_vmregion *c; + size_t align; + + if (!coherent_head.vm_start) { + printk(KERN_ERR "%s: coherent pool not initialised!\n", + __func__); + dump_stack(); + return NULL; + } + + /* + * Align the region allocation - allocations from pool are rather + * small, so align them to their order in pages, minimum is a page + * size. This helps reduce fragmentation of the DMA space. + */ + align = PAGE_SIZE << get_order(size); + c = arm_vmregion_alloc(&coherent_head, align, size, 0); + if (c) { + void *ptr = (void *)c->vm_start; + struct page *page = virt_to_page(ptr); + *ret_page = page; + return ptr; + } + return NULL; +} + +static int __free_from_pool(void *cpu_addr, size_t size) +{ + unsigned long start = (unsigned long)cpu_addr; + unsigned long end = start + size; + struct arm_vmregion *c; + + if (start < coherent_head.vm_start || end > coherent_head.vm_end) + return 0; + + c = arm_vmregion_find_remove(&coherent_head, (unsigned long)start); + + if ((c->vm_end - c->vm_start) != size) { + printk(KERN_ERR "%s: freeing wrong coherent size (%ld != %d)\n", + __func__, c->vm_end - c->vm_start, size); + dump_stack(); + size = c->vm_end - c->vm_start; + } + + arm_vmregion_free(&coherent_head, c); + return 1; +} + +static void *__alloc_from_contiguous(struct device *dev, size_t size, + pgprot_t prot, struct page **ret_page) +{ + unsigned long order = get_order(size); + size_t count = size >> PAGE_SHIFT; + struct page *page; + + page = dma_alloc_from_contiguous(dev, count, order); + if (!page) + return NULL; + + __dma_clear_buffer(page, size); + __dma_remap(page, size, prot); + + *ret_page = page; + return page_address(page); +} + +static void __free_from_contiguous(struct device *dev, struct page *page, + size_t size) +{ + __dma_remap(page, size, pgprot_kernel); + dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT); +} + +#define nommu() 0 + #else /* !CONFIG_MMU */
-#define __dma_alloc_remap(page, size, gfp, prot) page_address(page) -#define __dma_free_remap(addr, size) do { } while (0) +#define nommu() 1 + +#define __alloc_remap_buffer(dev, size, gfp, prot, ret) NULL +#define __alloc_from_pool(dev, size, ret_page) NULL +#define __alloc_from_contiguous(dev, size, prot, ret) NULL +#define __free_from_pool(cpu_addr, size) 0 +#define __free_from_contiguous(dev, page, size) do { } while (0) +#define __dma_free_remap(cpu_addr, size) do { } while (0)
#endif /* CONFIG_MMU */
-static void * -__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp, - pgprot_t prot) +static void *__alloc_simple_buffer(struct device *dev, size_t size, gfp_t gfp, + struct page **ret_page) { struct page *page; + page = __dma_alloc_buffer(dev, size, gfp); + if (!page) + return NULL; + + *ret_page = page; + return page_address(page); +} + + + +static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, + gfp_t gfp, pgprot_t prot) +{ + u64 mask = get_coherent_dma_mask(dev); + struct page *page; void *addr;
- *handle = ~0; - size = PAGE_ALIGN(size); +#ifdef CONFIG_DMA_API_DEBUG + u64 limit = (mask + 1) & ~mask; + if (limit && size >= limit) { + dev_warn(dev, "coherent allocation too big (requested %#x mask %#llx)\n", + size, mask); + return NULL; + } +#endif
- page = __dma_alloc_buffer(dev, size, gfp); - if (!page) + if (!mask) return NULL;
- if (!arch_is_coherent()) - addr = __dma_alloc_remap(page, size, gfp, prot); + if (mask < 0xffffffffULL) + gfp |= GFP_DMA; + + *handle = ~0; + size = PAGE_ALIGN(size); + + if (arch_is_coherent() || nommu()) + addr = __alloc_simple_buffer(dev, size, gfp, &page); + else if (cpu_architecture() < CPU_ARCH_ARMv6) + addr = __alloc_remap_buffer(dev, size, gfp, prot, &page); + else if (gfp & GFP_ATOMIC) + addr = __alloc_from_pool(dev, size, &page); else - addr = page_address(page); + addr = __alloc_from_contiguous(dev, size, prot, &page);
if (addr) *handle = pfn_to_dma(dev, page_to_pfn(page)); - else - __dma_free_buffer(page, size);
return addr; } @@ -356,8 +594,8 @@ __dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp, * Allocate DMA-coherent memory space and return both the kernel remapped * virtual and bus address for that space. */ -void * -dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp) +void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, + gfp_t gfp) { void *memory;
@@ -386,25 +624,11 @@ static int dma_mmap(struct device *dev, struct vm_area_struct *vma, { int ret = -ENXIO; #ifdef CONFIG_MMU - unsigned long user_size, kern_size; - struct arm_vmregion *c; - - user_size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; - - c = arm_vmregion_find(&consistent_head, (unsigned long)cpu_addr); - if (c) { - unsigned long off = vma->vm_pgoff; - - kern_size = (c->vm_end - c->vm_start) >> PAGE_SHIFT; - - if (off < kern_size && - user_size <= (kern_size - off)) { - ret = remap_pfn_range(vma, vma->vm_start, - page_to_pfn(c->vm_pages) + off, - user_size << PAGE_SHIFT, - vma->vm_page_prot); - } - } + unsigned long pfn = dma_to_pfn(dev, dma_addr); + ret = remap_pfn_range(vma, vma->vm_start, + pfn + vma->vm_pgoff, + vma->vm_end - vma->vm_start, + vma->vm_page_prot); #endif /* CONFIG_MMU */
return ret; @@ -426,23 +650,33 @@ int dma_mmap_writecombine(struct device *dev, struct vm_area_struct *vma, } EXPORT_SYMBOL(dma_mmap_writecombine);
+ /* - * free a page as defined by the above mapping. - * Must not be called with IRQs disabled. + * Free a buffer as defined by the above mapping. */ void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle) { - WARN_ON(irqs_disabled()); + struct page *page = pfn_to_page(dma_to_pfn(dev, handle));
if (dma_release_from_coherent(dev, get_order(size), cpu_addr)) return;
size = PAGE_ALIGN(size);
- if (!arch_is_coherent()) + if (arch_is_coherent() || nommu()) { + __dma_free_buffer(page, size); + } else if (cpu_architecture() < CPU_ARCH_ARMv6) { __dma_free_remap(cpu_addr, size); - - __dma_free_buffer(pfn_to_page(dma_to_pfn(dev, handle)), size); + __dma_free_buffer(page, size); + } else { + if (__free_from_pool(cpu_addr, size)) + return; + /* + * Non-atomic allocations cannot be freed with IRQs disabled + */ + WARN_ON(irqs_disabled()); + __free_from_contiguous(dev, page, size); + } } EXPORT_SYMBOL(dma_free_coherent);
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index fbdd12e..adddcde 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -21,6 +21,7 @@ #include <linux/gfp.h> #include <linux/memblock.h> #include <linux/sort.h> +#include <linux/dma-contiguous.h>
#include <asm/mach-types.h> #include <asm/prom.h> @@ -238,6 +239,17 @@ static void __init arm_adjust_dma_zone(unsigned long *size, unsigned long *hole, } #endif
+void __init setup_dma_zone(struct machine_desc *mdesc) +{ +#ifdef CONFIG_ZONE_DMA + if (mdesc->dma_zone_size) { + arm_dma_zone_size = mdesc->dma_zone_size; + arm_dma_limit = PHYS_OFFSET + arm_dma_zone_size - 1; + } else + arm_dma_limit = 0xffffffff; +#endif +} + static void __init arm_bootmem_free(unsigned long min, unsigned long max_low, unsigned long max_high) { @@ -285,12 +297,9 @@ static void __init arm_bootmem_free(unsigned long min, unsigned long max_low, * Adjust the sizes according to any special requirements for * this machine type. */ - if (arm_dma_zone_size) { + if (arm_dma_zone_size) arm_adjust_dma_zone(zone_size, zhole_size, arm_dma_zone_size >> PAGE_SHIFT); - arm_dma_limit = PHYS_OFFSET + arm_dma_zone_size - 1; - } else - arm_dma_limit = 0xffffffff; #endif
free_area_init_node(0, zone_size, min, zhole_size); @@ -371,6 +380,9 @@ void __init arm_memblock_init(struct meminfo *mi, struct machine_desc *mdesc) if (mdesc->reserve) mdesc->reserve();
+ /* reserve memory for DMA contigouos allocations */ + dma_contiguous_reserve(arm_dma_limit); + memblock_analyze(); memblock_dump_all(); } diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h index ad7cce3..fa95d9b 100644 --- a/arch/arm/mm/mm.h +++ b/arch/arm/mm/mm.h @@ -29,5 +29,8 @@ extern u32 arm_dma_limit; #define arm_dma_limit ((u32)~0) #endif
+extern phys_addr_t arm_lowmem_limit; + void __init bootmem_init(void); void arm_mm_memblock_reserve(void); +void dma_contiguous_remap(void); diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index dc8c550..9796cf1 100644 --- a/arch/arm/mm/mmu.c +++ b/arch/arm/mm/mmu.c @@ -281,6 +281,11 @@ static struct mem_type mem_types[] = { PMD_SECT_UNCACHED | PMD_SECT_XN, .domain = DOMAIN_KERNEL, }, + [MT_MEMORY_DMA_READY] = { + .prot_pte = L_PTE_PRESENT | L_PTE_YOUNG | L_PTE_DIRTY, + .prot_l1 = PMD_TYPE_TABLE, + .domain = DOMAIN_KERNEL, + }, };
const struct mem_type *get_mem_type(unsigned int type) @@ -422,6 +427,7 @@ static void __init build_mem_type_table(void) if (arch_is_coherent() && cpu_is_xsc3()) { mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; + mem_types[MT_MEMORY_DMA_READY].prot_pte |= L_PTE_SHARED; mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; } @@ -451,6 +457,7 @@ static void __init build_mem_type_table(void) mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED; mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S; mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED; + mem_types[MT_MEMORY_DMA_READY].prot_pte |= L_PTE_SHARED; mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S; mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED; } @@ -490,6 +497,7 @@ static void __init build_mem_type_table(void) mem_types[MT_HIGH_VECTORS].prot_l1 |= ecc_mask; mem_types[MT_MEMORY].prot_sect |= ecc_mask | cp->pmd; mem_types[MT_MEMORY].prot_pte |= kern_pgprot; + mem_types[MT_MEMORY_DMA_READY].prot_pte |= kern_pgprot; mem_types[MT_MEMORY_NONCACHED].prot_sect |= ecc_mask; mem_types[MT_ROM].prot_sect |= cp->pmd;
@@ -569,7 +577,7 @@ static void __init alloc_init_section(pud_t *pud, unsigned long addr, * L1 entries, whereas PGDs refer to a group of L1 entries making * up one logical pointer to an L2 table. */ - if (((addr | end | phys) & ~SECTION_MASK) == 0) { + if (type->prot_sect && ((addr | end | phys) & ~SECTION_MASK) == 0) { pmd_t *p = pmd;
if (addr & SECTION_SIZE) @@ -765,7 +773,7 @@ static int __init early_vmalloc(char *arg) } early_param("vmalloc", early_vmalloc);
-static phys_addr_t lowmem_limit __initdata = 0; +phys_addr_t arm_lowmem_limit __initdata = 0;
void __init sanity_check_meminfo(void) { @@ -834,8 +842,8 @@ void __init sanity_check_meminfo(void) bank->size = newsize; } #endif - if (!bank->highmem && bank->start + bank->size > lowmem_limit) - lowmem_limit = bank->start + bank->size; + if (!bank->highmem && bank->start + bank->size > arm_lowmem_limit) + arm_lowmem_limit = bank->start + bank->size;
j++; } @@ -860,7 +868,7 @@ void __init sanity_check_meminfo(void) } #endif meminfo.nr_banks = j; - memblock_set_current_limit(lowmem_limit); + memblock_set_current_limit(arm_lowmem_limit); }
static inline void prepare_page_table(void) @@ -885,8 +893,8 @@ static inline void prepare_page_table(void) * Find the end of the first block of lowmem. */ end = memblock.memory.regions[0].base + memblock.memory.regions[0].size; - if (end >= lowmem_limit) - end = lowmem_limit; + if (end >= arm_lowmem_limit) + end = arm_lowmem_limit;
/* * Clear out all the kernel space mappings, except for the first @@ -1020,8 +1028,8 @@ static void __init map_lowmem(void) phys_addr_t end = start + reg->size; struct map_desc map;
- if (end > lowmem_limit) - end = lowmem_limit; + if (end > arm_lowmem_limit) + end = arm_lowmem_limit; if (start >= end) break;
@@ -1042,11 +1050,12 @@ void __init paging_init(struct machine_desc *mdesc) { void *zero_page;
- memblock_set_current_limit(lowmem_limit); + memblock_set_current_limit(arm_lowmem_limit);
build_mem_type_table(); prepare_page_table(); map_lowmem(); + dma_contiguous_remap(); devicemaps_init(mdesc); kmap_init();
Replace the custom memory bank initialization, which used memblock_reserve() and dma_declare_coherent_memory(), with a single call to CMA's dma_declare_contiguous() per memory bank.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Kyungmin Park kyungmin.park@samsung.com --- arch/arm/plat-s5p/dev-mfc.c | 51 ++++++------------------------------------- 1 files changed, 7 insertions(+), 44 deletions(-)
diff --git a/arch/arm/plat-s5p/dev-mfc.c b/arch/arm/plat-s5p/dev-mfc.c index a30d36b..fcb8400 100644 --- a/arch/arm/plat-s5p/dev-mfc.c +++ b/arch/arm/plat-s5p/dev-mfc.c @@ -14,6 +14,7 @@ #include <linux/interrupt.h> #include <linux/platform_device.h> #include <linux/dma-mapping.h> +#include <linux/dma-contiguous.h> #include <linux/memblock.h> #include <linux/ioport.h>
@@ -22,52 +23,14 @@ #include <plat/irqs.h> #include <plat/mfc.h>
-struct s5p_mfc_reserved_mem { - phys_addr_t base; - unsigned long size; - struct device *dev; -}; - -static struct s5p_mfc_reserved_mem s5p_mfc_mem[2] __initdata; - void __init s5p_mfc_reserve_mem(phys_addr_t rbase, unsigned int rsize, phys_addr_t lbase, unsigned int lsize) { - int i; - - s5p_mfc_mem[0].dev = &s5p_device_mfc_r.dev; - s5p_mfc_mem[0].base = rbase; - s5p_mfc_mem[0].size = rsize; - - s5p_mfc_mem[1].dev = &s5p_device_mfc_l.dev; - s5p_mfc_mem[1].base = lbase; - s5p_mfc_mem[1].size = lsize; - - for (i = 0; i < ARRAY_SIZE(s5p_mfc_mem); i++) { - struct s5p_mfc_reserved_mem *area = &s5p_mfc_mem[i]; - if (memblock_remove(area->base, area->size)) { - printk(KERN_ERR "Failed to reserve memory for MFC device (%ld bytes at 0x%08lx)\n", - area->size, (unsigned long) area->base); - area->base = 0; - } - } -} - -static int __init s5p_mfc_memory_init(void) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(s5p_mfc_mem); i++) { - struct s5p_mfc_reserved_mem *area = &s5p_mfc_mem[i]; - if (!area->base) - continue; + if (dma_declare_contiguous(&s5p_device_mfc_r.dev, rsize, rbase, 0)) + printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n", + rsize, (unsigned long) rbase);
- if (dma_declare_coherent_memory(area->dev, area->base, - area->base, area->size, - DMA_MEMORY_MAP | DMA_MEMORY_EXCLUSIVE) == 0) - printk(KERN_ERR "Failed to declare coherent memory for MFC device (%ld bytes at 0x%08lx)\n", - area->size, (unsigned long) area->base); - } - return 0; + if (dma_declare_contiguous(&s5p_device_mfc_l.dev, lsize, lbase, 0)) + printk(KERN_ERR "Failed to reserve memory for MFC device (%u bytes at 0x%08lx)\n", + rsize, (unsigned long) rbase); } -device_initcall(s5p_mfc_memory_init);
On Fri, Nov 18, 2011 at 8:43 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
Welcome everyone once again,
Please notice that this patch series is aimed to start further discussion. There are still few issues that need to be resolved before CMA will be really ready. The most hot problem is the issue with movable pages that causes migration to fail from time to time. Our investigation leads us to the point that these rare pages cannot be migrated because there are some pending io operations on them.
I am running a simple test that allocates contiguous regions while simultaneously writing a log to a file on an SD card. I can reproduce this migration failure 100% of the time with it. When I tracked the pages that failed to migrate, I found them on the buffer_head LRU list with a reference held on the buffer_head in the page, which causes drop_buffers() to fail.
So, I guess my question is: until all the migration failures are tracked down and fixed, is there a plan to retry the contiguous allocation from a new range in the CMA region?
~ sandeep
On Fri, 18 Nov 2011 22:20:48 +0100, sandeep patil psandeep.s@gmail.com wrote:
I am running a simple test that allocates contiguous regions while simultaneously writing a log to a file on an SD card. I can reproduce this migration failure 100% of the time with it. When I tracked the pages that failed to migrate, I found them on the buffer_head LRU list with a reference held on the buffer_head in the page, which causes drop_buffers() to fail.
So, I guess my question is: until all the migration failures are tracked down and fixed, is there a plan to retry the contiguous allocation from a new range in the CMA region?
No. The current CMA implementation will stick to the same range of pages on subsequent allocations of the same size as well.
2011/11/18 Michal Nazarewicz mina86@mina86.com:
On Fri, 18 Nov 2011 22:20:48 +0100, sandeep patil psandeep.s@gmail.com wrote:
I am running a simple test that allocates contiguous regions while simultaneously writing a log to a file on an SD card. I can reproduce this migration failure 100% of the time with it. When I tracked the pages that failed to migrate, I found them on the buffer_head LRU list with a reference held on the buffer_head in the page, which causes drop_buffers() to fail.
So, I guess my question is: until all the migration failures are tracked down and fixed, is there a plan to retry the contiguous allocation from a new range in the CMA region?
No. The current CMA implementation will stick to the same range of pages on subsequent allocations of the same size as well.
Doesn't that mean that the drivers that fail to allocate from the contiguous DMA region will fail, if the migration fails?
~ sandeep
On Fri, 18 Nov 2011 22:20:48 +0100, sandeep patil wrote:
So, I guess my question is: until all the migration failures are tracked down and fixed, is there a plan to retry the contiguous allocation from a new range in the CMA region?
2011/11/18 Michal Nazarewicz mina86@mina86.com:
No. The current CMA implementation will stick to the same range of pages on subsequent allocations of the same size as well.
On Sat, 19 Nov 2011 00:30:49 +0100, sandeep patil psandeep.s@gmail.com wrote:
Doesn't that mean that the drivers that fail to allocate from the contiguous DMA region will fail, if the migration fails?
Yes.
I have some ideas about how that could be mitigated. The easiest would be to try another region to allocate from on failure. A more complicated option would be to wait for the pending I/O transfer to finish. I'll try to work on it during the upcoming week.
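For reference, the first idea could look roughly like the sketch below. This is only an illustration of the approach, not code from the series; it assumes the same struct cma bookkeeping as in dma_alloc_from_contiguous() and that cma_mutex is held by the caller.

/*
 * Sketch only: a retry variant of the range search used by
 * dma_alloc_from_contiguous().  'cma', 'count' and 'align' have the same
 * meaning as in the posted patch.
 */
static struct page *cma_alloc_retry(struct cma *cma, int count,
				    unsigned int align)
{
	unsigned long start = 0, pageno, pfn;

	for (;;) {
		pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
						    start, count,
						    (1 << align) - 1);
		if (pageno >= cma->count)
			return NULL;		/* no candidate range left */

		bitmap_set(cma->bitmap, pageno, count);
		pfn = cma->base_pfn + pageno;
		if (!alloc_contig_range(pfn, pfn + count, MIGRATE_CMA))
			return pfn_to_page(pfn);

		/* Migration failed; release this range and try the next one. */
		bitmap_clear(cma->bitmap, pageno, count);
		start = pageno + count;
	}
}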
This is a quick-and-dirty patch and hack to solve some memory allocation issues that appeared in CMA v17 after switching the migration code from hotplug to memory compaction. In particular, the issue with the watermark adjustment needs a real fix instead of simply disabling the code.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com ---
Hello,
This patch fixes the issues that have been reported recently. It should be considered only as a temporary solution until a new version of CMA patches is ready.
Best regards -- Marek Szyprowski Samsung Poland R&D Center
--- mm/compaction.c | 5 ++++- mm/page_alloc.c | 12 +++++++++--- 2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c index 3e07341..41976f8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -79,8 +79,9 @@ isolate_freepages_range(struct zone *zone, skip: if (freelist) goto next; +failed: for (; start < pfn; ++start) - __free_page(pfn_to_page(pfn)); + __free_page(pfn_to_page(start)); return 0; }
@@ -91,6 +92,8 @@ skip: struct page *p = page; for (i = isolated; i; --i, ++p) list_add(&p->lru, freelist); + } else if (!isolated) { + goto failed; }
next: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 714b1c1..b4a46c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1303,12 +1303,12 @@ int split_free_page(struct page *page)
zone = page_zone(page); order = page_order(page); - +#if 0 /* Obey watermarks as if the page was being allocated */ watermark = low_wmark_pages(zone) + (1 << order); if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) return 0; - +#endif /* Remove page from free list */ list_del(&page->lru); zone->free_area[order].nr_free--; @@ -5734,6 +5734,12 @@ static unsigned long pfn_align_to_maxpage_up(unsigned long pfn) return ALIGN(pfn, MAX_ORDER_NR_PAGES); }
+static struct page * +cma_migrate_alloc(struct page *page, unsigned long private, int **x) +{ + return alloc_page(GFP_HIGHUSER_MOVABLE); +} + static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) { /* This function is based on compact_zone() from compaction.c. */ @@ -5801,7 +5807,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) }
/* Try to migrate. */ - ret = migrate_pages(&cc.migratepages, compaction_alloc, + ret = migrate_pages(&cc.migratepages, cma_migrate_alloc, (unsigned long)&cc, false, cc.sync);
/* Migrated all of them? Great! */
On Fri, 25 Nov 2011 17:43:07 +0100, Marek Szyprowski m.szyprowski@samsung.com wrote:
This is a quick-and-dirty patch and hack to solve some memory allocation issues that appeared in CMA v17 after switching the migration code from hotplug to memory compaction. In particular, the issue with the watermark adjustment needs a real fix instead of simply disabling the code.
Signed-off-by: Marek Szyprowski m.szyprowski@samsung.com
Hello,
This patch fixes the issues that have been reported recently. It should be considered only as a temporary solution until a new version of CMA patches is ready.
Best regards
Marek Szyprowski Samsung Poland R&D Center
mm/compaction.c | 5 ++++- mm/page_alloc.c | 12 +++++++++--- 2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c index 3e07341..41976f8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -79,8 +79,9 @@ isolate_freepages_range(struct zone *zone, skip: if (freelist) goto next; +failed: for (; start < pfn; ++start)
- __free_page(pfn_to_page(pfn)); + __free_page(pfn_to_page(start)); return 0; }
Yeah, my mistake, sorry about that. ;)
@@ -91,6 +92,8 @@ skip: struct page *p = page; for (i = isolated; i; --i, ++p) list_add(&p->lru, freelist);
} else if (!isolated) { goto failed; }
next: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 714b1c1..b4a46c7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1303,12 +1303,12 @@ int split_free_page(struct page *page) zone = page_zone(page); order = page_order(page);
+#if 0 /* Obey watermarks as if the page was being allocated */ watermark = low_wmark_pages(zone) + (1 << order); if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) return 0;
+#endif /* Remove page from free list */ list_del(&page->lru); zone->free_area[order].nr_free--;
Come to think of it, this watermark check seems a little meaningless in the case of CMA. With CMA, the pages that we are splitting here have migrate type ISOLATE, so they aren't “free” at all: the buddy allocator will never use them for allocation. That means that we don't really allocate any pages; we just want to split them into order-0 pages.
Also, if we bail out now, it's a huge waste of time and effort.
So, if the watermarks need to be checked, they should be checked somewhere before we do the migration and the rest of the work. This may be due to my ignorance, but I don't know whether we really need the watermark check at all if we decide to use plain alloc_page() as the allocator for migrate_pages() rather than compaction_alloc().
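One way to express that observation in split_free_page() could be the following sketch (purely illustrative, and different from the #if 0 hack above): keep the watermark check, but skip it for pageblocks that have already been isolated for a CMA allocation.

	/*
	 * Sketch: pages in a MIGRATE_ISOLATE pageblock are already out of
	 * the buddy free lists, so splitting them does not consume memory
	 * that the watermarks are meant to protect.
	 */
	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) {
		/* Obey watermarks as if the page was being allocated */
		watermark = low_wmark_pages(zone) + (1 << order);
		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
			return 0;
	}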
@@ -5734,6 +5734,12 @@ static unsigned long pfn_align_to_maxpage_up(unsigned long pfn) return ALIGN(pfn, MAX_ORDER_NR_PAGES); } +static struct page * +cma_migrate_alloc(struct page *page, unsigned long private, int **x) +{
+ return alloc_page(GFP_HIGHUSER_MOVABLE);
+}
static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) { /* This function is based on compact_zone() from compaction.c. */ @@ -5801,7 +5807,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) } /* Try to migrate. */
- ret = migrate_pages(&cc.migratepages, compaction_alloc, + ret = migrate_pages(&cc.migratepages, cma_migrate_alloc, (unsigned long)&cc, false, cc.sync);
/* Migrated all of them? Great! */
Yep, that makes sense to me. compaction_alloc() takes only free pages (i.e. pages from the buddy system) from the given zone. This means that no pages get discarded or swapped to disk. That is fine for compaction, since it is opportunistic in nature. We, however, want pages to be discarded or swapped out if that's the only way to get pages to migrate to.
Of course, with this change the “(unsigned long)&cc” argument can be safely replaced with “NULL”, and the “cc.nr_freepages -= release_freepages(&cc.freepages);” statement at the end of the function (not visible in this patch), together with the line that follows it, can be removed.
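Put together, the simplified call site would look roughly like this (a sketch of the suggestion, not a tested change): with a plain page allocator there is no compact_control state to pass, so the private argument becomes 0.

	ret = migrate_pages(&cc.migratepages, cma_migrate_alloc,
			    0, false, cc.sync);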