Hello,
The Contiguous Memory Allocator is very sensitive to migration failures of individual pages. A single page whose migration permanently fails can break a large contiguous allocation and cause the failure of a multimedia device driver.
One of the known issues with migration of CMA pages is migrating anonymous user pages on which someone has called get_user_pages(). That call takes a reference to the given user pages to let the kernel operate directly on the page content; it is usually used to prevent the pages from being swapped out and to do direct DMA to/from userspace.
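For reference, the typical long-term pinning pattern in a driver looks more or less like the sketch below (the function names are made up just for illustration, this is not code from this series). As long as the driver holds the references, the pinned pages cannot be migrated out of a CMA region:

#include <linux/mm.h>
#include <linux/sched.h>

/* illustrative only: a driver pinning a user buffer for DMA */
static long my_dev_pin_user_buf(unsigned long uaddr, unsigned long nr_pages,
				struct page **pages)
{
	long pinned;

	down_read(&current->mm->mmap_sem);
	pinned = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
				nr_pages, 1 /* write */, 0 /* force */,
				pages, NULL);
	up_read(&current->mm->mmap_sem);

	return pinned;	/* pages stay pinned until released below */
}

static void my_dev_unpin_user_buf(struct page **pages, long nr_pages)
{
	long i;

	for (i = 0; i < nr_pages; i++) {
		set_page_dirty_lock(pages[i]);	/* the device wrote to it */
		put_page(pages[i]);
	}
}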
Solving this issue requires preventing pages placed in CMA regions from being pinned for a long time. Our idea is to migrate the anonymous page content before pinning the page in get_user_pages(). This cannot be done automatically, as the get_user_pages() interface is used very often for operations that usually last only a short period of time (for example the exec syscall). We have added a new flag indicating that the given get_user_pages() call will hold the pages for a long time, so the migration workaround is suitable in such cases.
The proposed extension is used by V4L2/VideoBuf2 (drivers/media/v4l2-core/videobuf2-dma-contig.c), but that is not the only place that might benefit from it; any driver that does DMA to/from userspace with get_user_pages() is a candidate. This one is provided to demonstrate the use case.
I would like to hear some comments on the presented approach. What do you think about it? Is there a chance to get such a workaround merged into mainline at some point?
Best regards
Marek Szyprowski
Samsung Poland R&D Center
Patch summary:
Marek Szyprowski (5):
  mm: introduce migrate_replace_page() for migrating page to the given target
  mm: get_user_pages: use static inline
  mm: get_user_pages: use NON-MOVABLE pages when FOLL_DURABLE flag is set
  mm: get_user_pages: migrate out CMA pages when FOLL_DURABLE flag is set
  media: vb2: use FOLL_DURABLE and __get_user_pages() to avoid CMA migration issues
 drivers/media/v4l2-core/videobuf2-dma-contig.c |    8 +-
 include/linux/highmem.h                        |   12 ++-
 include/linux/migrate.h                        |    5 +
 include/linux/mm.h                             |   76 ++++++++++++-
 mm/internal.h                                  |   12 +++
 mm/memory.c                                    |  136 +++++++++++-------------
 mm/migrate.c                                   |   59 ++++++++++
 7 files changed, 225 insertions(+), 83 deletions(-)
Introduce the migrate_replace_page() function for migrating a single page to the given target page.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 include/linux/migrate.h |    5 ++++
 mm/migrate.c            |   59 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..3a8a6c1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -35,6 +35,8 @@ enum migrate_reason {
#ifdef CONFIG_MIGRATION
+extern int migrate_replace_page(struct page *oldpage, struct page *newpage);
+
 extern void putback_lru_pages(struct list_head *l);
 extern void putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
@@ -57,6 +59,9 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				   struct page *newpage, struct page *page);
 #else
+static inline int migrate_replace_page(struct page *oldpage,
+				       struct page *newpage) { return -ENOSYS; }
+
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..a2a6950 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1067,6 +1067,65 @@ out:
 	return rc;
 }
+/*
+ * migrate_replace_page
+ *
+ * The function takes one single page and a target page (newpage) and
+ * tries to migrate data to the target page. The caller must ensure that
+ * the source page is locked with one additional get_page() call, which
+ * will be freed during the migration. The caller also must release newpage
+ * if migration fails, otherwise the ownership of the newpage is taken.
+ * Source page is released if migration succeeds.
+ *
+ * Return: error code or 0 on success.
+ */
+int migrate_replace_page(struct page *page, struct page *newpage)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	int ret = -EAGAIN;
+	int pass;
+
+	migrate_prep();
+
+	spin_lock_irqsave(&zone->lru_lock, flags);
+
+	if (PageLRU(page) &&
+	    __isolate_lru_page(page, ISOLATE_UNEVICTABLE) == 0) {
+		struct lruvec *lruvec = mem_cgroup_page_lruvec(page, zone);
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	} else {
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+		return -EAGAIN;
+	}
+
+	/* page is now isolated, so release additional reference */
+	put_page(page);
+
+	for (pass = 0; pass < 10 && ret != 0; pass++) {
+		cond_resched();
+
+		if (page_count(page) == 1) {
+			/* page was freed from under us, so we are done */
+			ret = 0;
+			break;
+		}
+		ret = __unmap_and_move(page, newpage, 1, MIGRATE_SYNC);
+	}
+
+	if (ret == 0) {
+		/* take ownership of newpage and add it to lru */
+		putback_lru_page(newpage);
+	} else {
+		/* restore additional reference to the oldpage */
+		get_page(page);
+	}
+
+	putback_lru_page(page);
+	return ret;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Move a list of individual pages
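For clarity, a rough sketch of how a caller is expected to use this, following the contract described in the comment above (the helper name is made up; patch 4 of this series does essentially the same thing in migrate_replace_cma_page(), using the mm-internal get_page_foll() where plain get_page() stands in here):

#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only. The caller is assumed to already hold one
 * extra reference on @page (e.g. taken by get_user_pages()).
 */
static struct page *replace_pinned_page(struct page *page)
{
	struct page *newpage = alloc_page(GFP_HIGHUSER);

	if (!newpage)
		return page;		/* keep the old page on failure */

	/* extra reference so newpage survives the migration itself */
	get_page(newpage);

	if (migrate_replace_page(page, newpage) == 0)
		return newpage;		/* ownership of newpage taken */

	/* migration failed: drop both references we took on newpage */
	put_page(newpage);
	__free_page(newpage);
	return page;
}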
__get_user_pages() is already an exported function, so get_user_pages() can easily be inlined into its callers.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 include/linux/mm.h |   74 +++++++++++++++++++++++++++++++++++++++++++++++++---
 mm/memory.c        |   69 ------------------------------------------
 2 files changed, 70 insertions(+), 73 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7acc9dc..9806e54 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1019,10 +1019,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		      unsigned long start, unsigned long nr_pages,
 		      unsigned int foll_flags, struct page **pages,
 		      struct vm_area_struct **vmas, int *nonblocking);
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-			unsigned long start, unsigned long nr_pages,
-			int write, int force, struct page **pages,
-			struct vm_area_struct **vmas);
+
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct kvec;
@@ -1642,6 +1639,75 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
 			       unsigned long size, pte_fn_t fn, void *data);
+/*
+ * get_user_pages() - pin user pages in memory
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @write:	whether pages will be written to by the caller
+ * @force:	whether to force write access even if user mapping is
+ *		readonly. This will result in the page being COWed even
+ *		in MAP_SHARED mappings. You do not want this.
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long. Or NULL, if caller
+ *		only intends to ensure the pages are faulted in.
+ * @vmas:	array of pointers to vmas corresponding to each page.
+ *		Or NULL if the caller does not require them.
+ *
+ * Returns number of pages pinned. This may be fewer than the number
+ * requested. If nr_pages is 0 or negative, returns 0. If no pages
+ * were pinned, returns -errno. Each page returned must be released
+ * with a put_page() call when it is finished with. vmas will only
+ * remain valid while mmap_sem is held.
+ *
+ * Must be called with mmap_sem held for read or write.
+ *
+ * get_user_pages walks a process's page tables and takes a reference to
+ * each struct page that each user address corresponds to at a given
+ * instant. That is, it takes the page that would be accessed if a user
+ * thread accesses the given user virtual address at that instant.
+ *
+ * This does not guarantee that the page exists in the user mappings when
+ * get_user_pages returns, and there may even be a completely different
+ * page there in some cases (eg. if mmapped pagecache has been invalidated
+ * and subsequently re faulted). However it does guarantee that the page
+ * won't be freed completely. And mostly callers simply care that the page
+ * contains data that was valid *at some point in time*. Typically, an IO
+ * or similar operation cannot guarantee anything stronger anyway because
+ * locks can't be held over the syscall boundary.
+ *
+ * If write=0, the page must not be written to. If the page is written to,
+ * set_page_dirty (or set_page_dirty_lock, as appropriate) must be called
+ * after the page is finished with, and before put_page is called.
+ *
+ * get_user_pages is typically used for fewer-copy IO operations, to get a
+ * handle on the memory by some means other than accesses via the user virtual
+ * addresses. The pages may be submitted for DMA to devices or accessed via
+ * their kernel linear mapping (via the kmap APIs). Care should be taken to
+ * use the correct cache flushing APIs.
+ *
+ * See also get_user_pages_fast, for performance critical applications.
+ */
+static inline long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages, int write,
+		int force, struct page **pages,
+		struct vm_area_struct **vmas)
+{
+	int flags = FOLL_TOUCH;
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
+				NULL);
+}
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..42dfd8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1961,75 +1961,6 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	return 0;
 }

-/*
- * get_user_pages() - pin user pages in memory
- * @tsk:	the task_struct to use for page fault accounting, or
- *		NULL if faults are not to be recorded.
- * @mm:		mm_struct of target mm
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @write:	whether pages will be written to by the caller
- * @force:	whether to force write access even if user mapping is
- *		readonly. This will result in the page being COWed even
- *		in MAP_SHARED mappings. You do not want this.
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long. Or NULL, if caller
- *		only intends to ensure the pages are faulted in.
- * @vmas:	array of pointers to vmas corresponding to each page.
- *		Or NULL if the caller does not require them.
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno. Each page returned must be released
- * with a put_page() call when it is finished with. vmas will only
- * remain valid while mmap_sem is held.
- *
- * Must be called with mmap_sem held for read or write.
- *
- * get_user_pages walks a process's page tables and takes a reference to
- * each struct page that each user address corresponds to at a given
- * instant. That is, it takes the page that would be accessed if a user
- * thread accesses the given user virtual address at that instant.
- *
- * This does not guarantee that the page exists in the user mappings when
- * get_user_pages returns, and there may even be a completely different
- * page there in some cases (eg. if mmapped pagecache has been invalidated
- * and subsequently re faulted). However it does guarantee that the page
- * won't be freed completely. And mostly callers simply care that the page
- * contains data that was valid *at some point in time*. Typically, an IO
- * or similar operation cannot guarantee anything stronger anyway because
- * locks can't be held over the syscall boundary.
- *
- * If write=0, the page must not be written to. If the page is written to,
- * set_page_dirty (or set_page_dirty_lock, as appropriate) must be called
- * after the page is finished with, and before put_page is called.
- *
- * get_user_pages is typically used for fewer-copy IO operations, to get a
- * handle on the memory by some means other than accesses via the user virtual
- * addresses. The pages may be submitted for DMA to devices or accessed via
- * their kernel linear mapping (via the kmap APIs). Care should be taken to
- * use the correct cache flushing APIs.
- *
- * See also get_user_pages_fast, for performance critical applications.
- */
-long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages, int write,
-		int force, struct page **pages, struct vm_area_struct **vmas)
-{
-	int flags = FOLL_TOUCH;
-
-	if (pages)
-		flags |= FOLL_GET;
-	if (write)
-		flags |= FOLL_WRITE;
-	if (force)
-		flags |= FOLL_FORCE;
-
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
-				NULL);
-}
-EXPORT_SYMBOL(get_user_pages);
-
 /**
  * get_dump_page() - pin user page in memory while writing it to core dump
  * @addr: user address
Ensure that newly allocated pages, which are faulted in with the FOLL_DURABLE flag set, come from non-movable pageblocks, to work around migration failures with the Contiguous Memory Allocator.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 include/linux/highmem.h |   12 ++++++++++--
 include/linux/mm.h      |    2 ++
 mm/memory.c             |   24 ++++++++++++++++++------
 3 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 7fb31da..cf0b9d8 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -168,7 +168,8 @@ __alloc_zeroed_user_highpage(gfp_t movableflags,
 #endif

 /**
- * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
+ * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for
+ *			a VMA that the caller knows can move
  * @vma: The VMA the page is to be allocated for
  * @vaddr: The virtual address the page will be inserted into
  *
@@ -177,11 +178,18 @@ __alloc_zeroed_user_highpage(gfp_t movableflags,
  */
 static inline struct page *
 alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
-					unsigned long vaddr)
+				   unsigned long vaddr)
 {
 	return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
 }

+static inline struct page *
+alloc_zeroed_user_highpage(gfp_t gfp, struct vm_area_struct *vma,
+			   unsigned long vaddr)
+{
+	return __alloc_zeroed_user_highpage(gfp, vma, vaddr);
+}
+
 static inline void clear_highpage(struct page *page)
 {
 	void *kaddr = kmap_atomic(page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9806e54..c11f58f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -165,6 +165,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_RETRY_NOWAIT	0x10	/* Don't drop mmap_sem and wait when retrying */
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
+#define FAULT_FLAG_NO_CMA	0x80	/* don't use CMA pages */

 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1633,6 +1634,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_DURABLE	0x800	/* get the page reference for a long time */

 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
index 42dfd8e..2b9c2dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1816,6 +1816,9 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			int ret;
 			unsigned int fault_flags = 0;

+			if (gup_flags & FOLL_DURABLE)
+				fault_flags = FAULT_FLAG_NO_CMA;
+
 			/* For mlock, just skip the stack guard page. */
 			if (foll_flags & FOLL_MLOCK) {
 				if (stack_guard_page(vma, start))
@@ -2495,7 +2498,7 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte)
+		spinlock_t *ptl, pte_t orig_pte, unsigned int flags)
 	__releases(ptl)
 {
 	struct page *old_page, *new_page = NULL;
@@ -2505,6 +2508,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *dirty_page = NULL;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;

 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
@@ -2668,11 +2675,11 @@ gotten:
 		goto oom;

 	if (is_zero_pfn(pte_pfn(orig_pte))) {
-		new_page = alloc_zeroed_user_highpage_movable(vma, address);
+		new_page = alloc_zeroed_user_highpage(gfp, vma, address);
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		new_page = alloc_page_vma(gfp, vma, address);
 		if (!new_page)
 			goto oom;
 		cow_user_page(new_page, old_page, address, vma);
@@ -3032,7 +3039,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}

 	if (flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte, flags);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
@@ -3187,6 +3194,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct vm_fault vmf;
 	int ret;
 	int page_mkwrite = 0;
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;
+

 	/*
	 * If we do COW later, allocate page befor taking lock_page()
@@ -3197,7 +3209,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;

-		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		cow_page = alloc_page_vma(gfp, vma, address);
 		if (!cow_page)
 			return VM_FAULT_OOM;

@@ -3614,7 +3626,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	if (flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
 			return do_wp_page(mm, vma, address,
-					pte, pmd, ptl, entry);
+					pte, pmd, ptl, entry, flags);
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
2013/03/05 15:57, Marek Szyprowski wrote:
Ensure that newly allocated pages, which are faulted in with the FOLL_DURABLE flag set, come from non-movable pageblocks, to work around migration failures with the Contiguous Memory Allocator.
In your idea, all users who need non-movable pageblocks have to set gup_flags themselves. That's not good.

So how about preparing a "get_user_pages_non_movable()"? The idea is based on the following idea from Lin Feng: https://lkml.org/lkml/2013/2/21/123
int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages, int write,
		int force, struct page **pages, struct vm_area_struct **vmas)
{
	int flags = FOLL_TOUCH | FOLL_DURABLE;

	if (pages)
		flags |= FOLL_GET;
	if (write)
		flags |= FOLL_WRITE;
	if (force)
		flags |= FOLL_FORCE;

	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
				NULL);
}
@@ -165,6 +165,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
+#define FAULT_FLAG_NO_CMA	0x80	/* don't use CMA pages */
How about FAULT_FLAG_NO_MIGRATABLE? I want to use it not only for CMA but also for memory hotplug.
@@ -2505,6 +2508,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;
Please remove the IS_ENABLED(CONFIG_CMA) check.
@@ -3187,6 +3194,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	int page_mkwrite = 0;
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;
Please remove the IS_ENABLED(CONFIG_CMA) check.
Hi Marek,
On 03/05/2013 02:57 PM, Marek Szyprowski wrote:
@@ -2505,6 +2508,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;
Here you just simply strip the __GFP_MOVABLE flag; IIUC it will break the page migration policy, because "GFP_MOVABLE is not only a zone specifier but also an allocation policy".

Another problem is that you add a new flag to direct the page allocation: do we also have to handle hugepages or THP, as Mel once mentioned?
thanks, linfeng
Hi Marek,
On 03/05/2013 02:57 PM, Marek Szyprowski wrote:
Ensure that newly allocated pages, which are faulted in with the FOLL_DURABLE flag set, come from non-movable pageblocks, to work around migration failures with the Contiguous Memory Allocator.
snip
@@ -2505,6 +2508,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;
snip
@@ -3187,6 +3194,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret;
 	int page_mkwrite = 0;
+	gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+	if (IS_ENABLED(CONFIG_CMA) && (flags & FAULT_FLAG_NO_CMA))
+		gfp &= ~__GFP_MOVABLE;
Since the GUP'd unmovable pages are only a corner case among all kinds of page faults, I'm afraid that adding special-treatment code to the generic page-fault core is not really necessary or worthwhile. But I'm not sure whether the performance impact is large enough to be worried about.
thanks, linfeng
Hi Marek,
It has been a long time since this patch set was sent. I'm pushing the memory hot-remove work forward, and I think I need your [patch 3/5] to fix a problem I met.
We have sent a similar patch before. But I think yours may be better. :) https://lkml.org/lkml/2013/2/21/126
So would you please update and resend your patch again ? Or do you have your own plan to push it ?
Thanks. :)
On 03/05/2013 02:57 PM, Marek Szyprowski wrote:
Ensure that newly allocated pages, which are faulted in with the FOLL_DURABLE flag set, come from non-movable pageblocks, to work around migration failures with the Contiguous Memory Allocator.
Hello,
On 5/6/2013 9:19 AM, Tang Chen wrote:
Hi Marek,
It has been a long time since this patch set was sent. I'm pushing the memory hot-remove work forward, and I think I need your [patch 3/5] to fix a problem I met.
We have sent a similar patch before. But I think yours may be better. :) https://lkml.org/lkml/2013/2/21/126
So would you please update and resend your patch again ? Or do you have your own plan to push it ?
I don't think that there was any conclusion after my patch, so I really see no point in submitting it again now. If you need it for Your patchset, You can include it directly. Just please keep my signed-off-by tag.
Best regards
Hi Marek,
On 05/07/2013 06:47 PM, Marek Szyprowski wrote:
I don't think that there was any conclusion after my patch, so I really see no point in submitting it again now. If you need it for Your patchset, You can include it directly. Just please keep my signed-off-by tag.
That's very kind of you. I'll keep you as the Author and your signed-off-by tag if I use your patches, and will cc you.
Thanks. :)
When __get_user_pages() is called with the FOLL_DURABLE flag, ensure that no page in a CMA pageblock gets pinned. This works around the permanent migration failures caused by get_user_pages() calls that hold page references for a long period of time.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 mm/internal.h |   12 ++++++++++++
 mm/memory.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h
index 8562de0..a290d04 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,6 +105,18 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 extern bool is_free_buddy_page(struct page *page);
 #endif

+#ifdef CONFIG_CMA
+static inline int is_cma_page(struct page *page)
+{
+	unsigned mt = get_pageblock_migratetype(page);
+	if (mt == MIGRATE_ISOLATE || mt == MIGRATE_CMA)
+		return true;
+	return false;
+}
+#else
+#define is_cma_page(page) 0
+#endif
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA

 /*
diff --git a/mm/memory.c b/mm/memory.c
index 2b9c2dd..f81b273 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1650,6 +1650,45 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
 }

 /**
+ * replace_cma_page() - migrate page out of CMA page blocks
+ * @page: source page to be migrated
+ *
+ * Returns either the old page (if migration was not possible) or the pointer
+ * to the newly allocated page (with additional reference taken).
+ *
+ * get_user_pages() might take a reference to a page for a long period of time,
+ * what prevent such page from migration. This is fatal to the preffered usage
+ * pattern of CMA pageblocks. This function replaces the given user page with
+ * a new one allocated from NON-MOVABLE pageblock, so locking CMA page can be
+ * avoided.
+ */
+static inline struct page *migrate_replace_cma_page(struct page *page)
+{
+	struct page *newpage = alloc_page(GFP_HIGHUSER);
+
+	if (!newpage)
+		goto out;
+
+	/*
+	 * Take additional reference to the new page to ensure it won't get
+	 * freed after migration procedure end.
+	 */
+	get_page_foll(newpage);
+
+	if (migrate_replace_page(page, newpage) == 0)
+		return newpage;
+
+	put_page(newpage);
+	__free_page(newpage);
+out:
+	/*
+	 * Migration errors in case of get_user_pages() might not
+	 * be fatal to CMA itself, so better don't fail here.
+	 */
+	return page;
+}
+
+/**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
  * @mm:		mm_struct of target mm
@@ -1884,6 +1923,10 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			}
 			if (IS_ERR(page))
 				return i ? i : PTR_ERR(page);
+
+			if ((gup_flags & FOLL_DURABLE) && is_cma_page(page))
+				page = migrate_replace_cma_page(page);
+
 			if (pages) {
 				pages[i] = page;
2013/03/05 15:57, Marek Szyprowski wrote:
When __get_user_pages() is called with the FOLL_DURABLE flag, ensure that no page in a CMA pageblock gets pinned. This works around the permanent migration failures caused by get_user_pages() calls that hold page references for a long period of time.
@@ -1884,6 +1923,10 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (IS_ERR(page))
 				return i ? i : PTR_ERR(page);
+
+			if ((gup_flags & FOLL_DURABLE) && is_cma_page(page))
+				page = migrate_replace_cma_page(page);
I might be misreading, but if FOLL_DURABLE is set, the page is always allocated as non-movable, right? If so, when does this situation occur?
Thanks, Yasuaki Ishimatsu
V4L2 devices usually grab additional references to user pages for a very long period of time, which causes permanent migration failures if the given page has been allocated from a CMA pageblock. By setting the FOLL_DURABLE flag, videobuf2 instructs __get_user_pages() to migrate user pages out of CMA pageblocks before pinning them with an additional reference.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 drivers/media/v4l2-core/videobuf2-dma-contig.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf2-dma-contig.c b/drivers/media/v4l2-core/videobuf2-dma-contig.c
index 10beaee..70649ab 100644
--- a/drivers/media/v4l2-core/videobuf2-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf2-dma-contig.c
@@ -443,9 +443,13 @@ static int vb2_dc_get_user_pages(unsigned long start, struct page **pages,
 		}
 	} else {
 		int n;
+		int flags = FOLL_TOUCH | FOLL_GET | FOLL_FORCE | FOLL_DURABLE;

-		n = get_user_pages(current, current->mm, start & PAGE_MASK,
-				   n_pages, write, 1, pages, NULL);
+		if (write)
+			flags |= FOLL_WRITE;
+
+		n = __get_user_pages(current, current->mm, start & PAGE_MASK,
+				     n_pages, flags, pages, NULL, NULL);
 		/* negative error means that no page was pinned */
 		n = max(n, 0);
 		if (n != n_pages) {
On Tuesday 05 March 2013, Marek Szyprowski wrote:
Solving this issue requires preventing pages placed in CMA regions from being pinned for a long time. Our idea is to migrate the anonymous page content before pinning the page in get_user_pages(). This cannot be done automatically, as the get_user_pages() interface is used very often for operations that usually last only a short period of time (for example the exec syscall). We have added a new flag indicating that the given get_user_pages() call will hold the pages for a long time, so the migration workaround is suitable in such cases.
Can you explain the tradeoff here? I would have expected that the default should be to migrate pages out, and annotate the instances that we know are performance critical and short-lived. That would at least appear more reliable to me.
Arnd
Hello,
On 3/5/2013 9:50 AM, Arnd Bergmann wrote:
On Tuesday 05 March 2013, Marek Szyprowski wrote:
Solving this issue requires preventing pages placed in CMA regions from being pinned for a long time. Our idea is to migrate the anonymous page content before pinning the page in get_user_pages(). This cannot be done automatically, as the get_user_pages() interface is used very often for operations that usually last only a short period of time (for example the exec syscall). We have added a new flag indicating that the given get_user_pages() call will hold the pages for a long time, so the migration workaround is suitable in such cases.
Can you explain the tradeoff here? I would have expected that the default should be to migrate pages out, and annotate the instances that we know are performance critical and short-lived. That would at least appear more reliable to me.
The problem is that the opposite approach is imho easier. get_user_pages() is used in quite a lot of places (I was quite surprised when I've added some debug to it and saw the logs) and it seems to be easier to identify places where references are kept for significant amount of time. Usually such places are in the device drivers. In our case only videobuf2 and some closed-source driver were causing the real migration problems, so I decided to leave the default approach unchanged.
If we use this workaround for every get_user_pages() call, we will sooner or later end up with most of the anonymous pages migrated to non-movable pageblocks, which makes the whole CMA approach a bit pointless.
Best regards
On Tuesday 05 March 2013, Marek Szyprowski wrote:
On 3/5/2013 9:50 AM, Arnd Bergmann wrote:
On Tuesday 05 March 2013, Marek Szyprowski wrote:
The problem is that the opposite approach is imho easier.
I can understand that, yes ;-)
get_user_pages() is used in quite a lot of places (I was quite surprised when I've added some debug to it and saw the logs) and it seems to be easier to identify places where references are kept for significant amount of time. Usually such places are in the device drivers. In our case only videobuf2 and some closed-source driver were causing the real migration problems, so I decided to leave the default approach unchanged.
If we use this workaround for every get_user_pages() call, we will sooner or later end up with most of the anonymous pages migrated to non-movable pageblocks, which makes the whole CMA approach a bit pointless.
But you said that most users are in device drivers, and I would expect drivers not to touch that many pages.
We already have two interfaces: the generic get_user_pages and the "fast" version "get_user_pages_fast" that has a number of restrictions. We could add another such restriction to get_user_pages_fast(), which is that it must not hold the page reference count for an extended time because it will not migrate pages out.
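To make the distinction concrete, a short-lived user looks something like this sketch (illustrative only; the function name is made up and it assumes the copied range does not cross a page boundary):

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/* made-up example of a reference held only across a quick copy */
static int copy_in_one_user_page(unsigned long uaddr, void *dst, size_t len)
{
	struct page *page;
	void *kaddr;
	int ret;

	ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, 0 /* read-only */,
				  &page);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	kaddr = kmap_atomic(page);
	memcpy(dst, kaddr + (uaddr & ~PAGE_MASK), len);
	kunmap_atomic(kaddr);

	put_page(page);		/* reference dropped right away */
	return 0;
}

Such callers obviously would not need any migration workaround.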
I would assume that most of the in-kernel users of get_user_pages() that are called a lot either already use get_user_pages_fast, or can be easily converted to it.
Arnd
On Tue, Mar 5, 2013 at 7:57 AM, Marek Szyprowski m.szyprowski@samsung.com wrote:
I would like to hear some comments on the presented approach. What do you think about it? Is there a chance to get such a workaround merged into mainline at some point?
Imo a neat trick to make CMA work together with long-term gup'ed userspace memory in buffer objects, but it doesn't really address the bigger issue that such userspace pinning kills all the nice features page migration allows, e.g. if your IOMMU supports huge pages and you need those to hit some performance targets, but not for correctness, since you can fall back to normal pages.
For the userptr support we're playing around with in drm/i915 we've opted to fix this with the mmu_notifier. That allows us to evict buffers and unbind the mappings when the vm wants to move a page. There's still the issue that we can't unbind it right away, but the usual retry loop for referenced pages in the migration code should handle that like any other short-lived locked pages for I/O. I see two issues with that approach though:
- Needs buffer eviction support. Not really a problem for drm/i915, a bit of a challenge for v4l ;-)
- The mmu notifiers aren't really designed to keep track of a lot of tiny ranges in different mms. At least the simplistic approach currently used in the i915 patches to register a new mmu_notifier for each buffer object sucks performance-wise.
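The shape of it is roughly the following (names are made up and the unbind is stubbed out, this is not the actual i915 code; note the one-notifier-per-object registration, which is exactly the part that doesn't scale):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_userptr_bo {
	struct mmu_notifier	mn;
	unsigned long		start;	/* userptr range backing the BO */
	unsigned long		end;
};

/* driver-specific: evict the BO and put_page() its backing pages */
static void my_bo_unbind(struct my_userptr_bo *bo)
{
}

static void my_bo_invalidate_range_start(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
{
	struct my_userptr_bo *bo = container_of(mn, struct my_userptr_bo, mn);

	/* only react if the invalidated range overlaps our BO */
	if (end <= bo->start || start >= bo->end)
		return;

	/* drop the pins so migration can proceed; rebind on next use */
	my_bo_unbind(bo);
}

static const struct mmu_notifier_ops my_bo_mn_ops = {
	.invalidate_range_start	= my_bo_invalidate_range_start,
};

static int my_bo_register_mn(struct my_userptr_bo *bo)
{
	bo->mn.ops = &my_bo_mn_ops;
	return mmu_notifier_register(&bo->mn, current->mm);
}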
For performance reasons we want to also use get_user_pages_fast, so I don't think mixing that together with the "please migrate out of CMA" trick here is a good thing.
Current drm/i915 wip patch is at: https://patchwork.kernel.org/patch/1748601/
Just my 2 cents on this entire issue.
Cheers, Daniel
2013/03/05 15:57, Marek Szyprowski wrote:
I would like to hear some comments on the presented approach. What do you think about it? Is there a chance to get such a workaround merged into mainline at some point?
I'm interested in your idea since it seems to solve my issue: https://lkml.org/lkml/2012/11/29/69

So I want to apply your idea to memory hotplug.
Thanks, Yasuaki Ishimatsu
Hello,
On Tue, Mar 5, 2013 at 3:57 PM, Marek Szyprowski m.szyprowski@samsung.com wrote:
To solving this issue requires preventing locking of the pages, which are placed in CMA regions, for a long time. Our idea is to migrate anonymous page content before locking the page in get_user_pages(). This cannot be done automatically, as get_user_pages() interface is used very often for various operations, which usually last for a short period of time (like for example exec syscall). We have added a new flag indicating that the given get_user_space() call will grab pages for a long time, thus it is suitable to use the migration workaround in such cases.
The proposed extensions is used by V4L2/VideoBuf2 (drivers/media/v4l2-core/videobuf2-dma-contig.c), but that is not the only place which might benefit from it, like any driver which use DMA to userspace with get_user_pages(). This one is provided to demonstrate the use case.
I would like to hear some comments on the presented approach. What do you think about it? Is there a chance to get such workaround merged at some point to mainline?
I discussed a similar patch from the memory-hotplug guys with Mel. Look at http://marc.info/?l=linux-mm&m=136014458829566&w=2
The concern is that we end up forcing FOLL_DURABLE/GUP_NM on all drivers and subsystems to make sure CMA/memory-hotplug works well.
You mentioned that a driver which grabs a page for a long time should use the FOLL_DURABLE flag, but "for a long time" is very ambiguous. For example, there is a driver which does:
get_user_pages() -> some operation -> put_page()
Can you make sure that "some operation" is really always fast? For example, what if it depends on some other event which is normally very fast but quite slow once a week, or it tries to do a dynamic memory allocation while memory pressure is severe?
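To make the ambiguity concrete, here is a purely illustrative sketch of that pattern (3.9-era get_user_pages() arguments; the helper name and wait condition are invented):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/types.h>
#include <linux/wait.h>

/*
 * Illustrative only: the page stays pinned for however long "some
 * operation" takes, which cannot be bounded at the GUP call site.
 */
static int pin_and_wait(unsigned long uaddr, wait_queue_head_t *wq, bool *done)
{
	struct mm_struct *mm = current->mm;
	struct page *page;
	long pinned;
	int ret;

	/* Pin one user page for writing (3.9-era get_user_pages() arguments). */
	down_read(&mm->mmap_sem);
	pinned = get_user_pages(current, mm, uaddr, 1, 1 /* write */,
				0 /* force */, &page, NULL);
	up_read(&mm->mmap_sem);
	if (pinned < 1)
		return pinned < 0 ? (int)pinned : -EFAULT;

	/*
	 * "Some operation": usually fast, but it may sleep on an external
	 * event or allocate memory under pressure, so the pin can last far
	 * longer than expected when the page was grabbed.
	 */
	ret = wait_event_interruptible(*wq, *done);

	put_page(page);		/* unpin */
	return ret;
}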
To get this working 100% of the time, we would eventually need to convert every GUP user to GUP_NM or your FOLL_DURABLE (whatever it ends up being called), but the concern Mel pointed out is that this could cause a lowmem exhaustion problem.
At the moment there are other problems with migration which are not related to your patch, e.g. zcache, zram and zswap. Their pages cannot be migrated out, so I think Mel's suggestion below, or some generic infrastructure that can move pinned pages, is the more proper way to go.
"To guarantee CMA can migrate pages pinned by drivers I think you need migrate-related callbacks to unpin, barrier the driver until migration completes and repin."
Thanks.
Hello,
On 3/6/2013 9:47 AM, Minchan Kim wrote:
I discussed a similar patch from the memory-hotplug guys with Mel. Look at http://marc.info/?l=linux-mm&m=136014458829566&w=2
The concern is that we end up forcing FOLL_DURABLE/GUP_NM on all drivers and subsystems to make sure CMA/memory-hotplug works well.
You mentioned that a driver which grabs a page for a long time should use the FOLL_DURABLE flag, but "for a long time" is very ambiguous. For example, there is a driver which does:
get_user_pages() -> some operation -> put_page()
Can you make sure that "some operation" is really always fast?
Well, in our case (judging from the logs) we observed two usage patterns for get_user_pages() calls. One group was lots of short-lived locks, whose call stacks originated in various kernel places; the second group was device drivers which used get_user_pages() to create a buffer for DMA. Such buffers were used for the whole lifetime of the session with the given device, which is equivalent to infinity from the migration/CMA point of view. This was, however, based on the specific use case on our target system; that's why I wanted to start the discussion and find some generic approach.
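For that second group, the intent of this series is that the driver creating such a long-lived buffer passes the new FOLL_DURABLE flag. A hedged sketch, based on the 3.9-era __get_user_pages() signature and not the actual vb2 patch:

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Hedged sketch only, not the actual vb2 code: pin a userspace buffer for
 * the lifetime of a device session, asking GUP for non-movable pages (or
 * migration out of CMA) via the FOLL_DURABLE flag introduced by this series.
 */
static long pin_dma_buffer(unsigned long start, unsigned long n_pages,
			   bool write, struct page **pages)
{
	struct mm_struct *mm = current->mm;
	unsigned int flags = FOLL_GET | FOLL_DURABLE;
	long ret;

	if (write)
		flags |= FOLL_WRITE;

	down_read(&mm->mmap_sem);
	/* 3.9-era __get_user_pages(tsk, mm, start, nr_pages, gup_flags, pages, vmas, nonblocking) */
	ret = __get_user_pages(current, mm, start, n_pages, flags,
			       pages, NULL, NULL);
	up_read(&mm->mmap_sem);
	return ret;
}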
For example, what if it depends on some other event which is normally very fast but quite slow once a week, or it tries to do a dynamic memory allocation while memory pressure is severe?
To get this working 100% of the time, we would eventually need to convert every GUP user to GUP_NM or your FOLL_DURABLE (whatever it ends up being called), but the concern Mel pointed out is that this could cause a lowmem exhaustion problem.
This way we would sooner or later end up without any movable pages at all. I assume that keeping some temporary references on movable/CMA pages must be allowed, because otherwise we limit the functionality too much.
At the moment there are other problems with migration which are not related to your patch, e.g. zcache, zram and zswap. Their pages cannot be migrated out, so I think Mel's suggestion below, or some generic infrastructure that can move pinned pages, is the more proper way to go.
zcache/zram/zswap (zsmalloc-based code) can also be extended to support migration. It requires a significant amount of work, but it is really doable.
"To guarantee CMA can migrate pages pinned by drivers I think you need migrate-related callbacks to unpin, barrier the driver until migration completes and repin."
Right, this might improve the migration reliability. Is there any work being done in this direction?
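For what it's worth, such callbacks might look roughly like the sketch below. This is purely hypothetical, nothing like it exists in mainline, and every name here is invented for illustration:

#include <linux/list.h>
#include <linux/types.h>

/*
 * Purely hypothetical: a driver that pins user pages registers callbacks
 * so the migration code can ask it to temporarily let go of them.
 */
struct pinned_range_ops {
	int  (*unpin)(void *priv);	/* drop page refs, stop DMA to the range */
	void (*barrier)(void *priv);	/* wait until the driver has quiesced */
	int  (*repin)(void *priv);	/* re-grab the (migrated) pages, resume DMA */
};

struct pinned_range {
	struct list_head node;			/* e.g. linked into a per-mm list */
	unsigned long start;			/* pinned user address range */
	unsigned long nr_pages;
	const struct pinned_range_ops *ops;
	void *priv;				/* driver cookie */
};

/* Hypothetical use by the migration core around migrating a pinned range. */
static int migrate_pinned_range(struct pinned_range *r)
{
	int ret = r->ops->unpin(r->priv);

	if (ret)
		return ret;
	r->ops->barrier(r->priv);
	/* ... the actual page migration would run here ... */
	return r->ops->repin(r->priv);
}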
Best regards
On Wed, Mar 06, 2013 at 11:48:36AM +0100, Marek Szyprowski wrote:
Right, this might improve the migration reliability. Is there any work being done in this direction?
See my other mail about how we (ab)use mmu_notifiers in an experimental drm/i915 patch. I have no idea whether that's the right approach though. But I'd certainly welcome a generic approach here which works for all page migration users. And I guess some callback-based approach is better for handling low-memory situations, since at least for drm/i915 userptr-backed buffer objects we might want to slurp in the entire available memory, or as much as we can get hold of at least. So moving pages to a safe area before pinning them might not be feasible. -Daniel