The patch titled
Subject: mm/migrate.c: move_pages: fix the return value if there are not-migrated pages
has been added to the -mm tree. Its filename is
mm-move_pages-fix-the-return-value-if-there-are-not-migrated-pages.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-move_pages-fix-the-return-value…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-move_pages-fix-the-return-value…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Yang Shi <yang.shi(a)linux.alibaba.com>
Subject: mm/migrate.c: move_pages: fix the return value if there are not-migrated pages
do_move_pages_to_node() may return a value > 0, the number of pages that
were not migrated, and that value is then returned to userspace directly.
However, the move_pages() syscall should return only 0 or an errno. So we
need to reset the return value to 0 in that case, as pre-v4.17 kernels did.
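The return-value handling described above can be modeled in userspace. This is only a sketch: fake_move_pages_to_node() and normalize_move_result() are hypothetical names invented for illustration, and the first merely imitates the three kinds of result (negative errno, 0, positive count of unmigrated pages) that the text attributes to do_move_pages_to_node().

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for do_move_pages_to_node(): returns a negative
 * errno on a hard failure, 0 on full success, or the number of pages
 * that could not be migrated.
 */
static int fake_move_pages_to_node(int nr_pages, int nr_failed, int hard_error)
{
	assert(nr_failed >= 0 && nr_failed <= nr_pages);
	if (hard_error)
		return -hard_error;
	return nr_failed;
}

/*
 * Mirrors the check added by the patch: a positive "pages not migrated"
 * count is reported as 0, while negative errno values pass through.
 */
static int normalize_move_result(int err)
{
	if (err > 0)
		err = 0;
	return err;
}
```

With this model, a partially failed migration and a fully successful one both yield 0, and only a hard error such as -ENOMEM is visible to the caller.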
Link: http://lkml.kernel.org/r/1579325203-16405-1-git-send-email-yang.shi@linux.a…
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Wei Yang <richardw.yang(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org> [4.17+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/migrate.c~mm-move_pages-fix-the-return-value-if-there-are-not-migrated-pages
+++ a/mm/migrate.c
@@ -1659,8 +1659,11 @@ static int do_pages_move(struct mm_struc
goto out_flush;
err = do_move_pages_to_node(mm, &pagelist, current_node);
- if (err)
+ if (err) {
+ if (err > 0)
+ err = 0;
goto out;
+ }
if (i > start) {
err = store_status(status, start, current_node, i - start);
if (err)
_
Patches currently in -mm which might be from yang.shi(a)linux.alibaba.com are
mm-move_pages-fix-the-return-value-if-there-are-not-migrated-pages.patch
The patch titled
Subject: mm: thp: don't need care deferred split queue in memcg charge move path
has been added to the -mm tree. Its filename is
mm-thp-remove-the-defer-list-related-code-since-this-will-not-happen.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-remove-the-defer-list-relat…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-remove-the-defer-list-relat…
------------------------------------------------------
From: Wei Yang <richardw.yang(a)linux.intel.com>
Subject: mm: thp: don't need care deferred split queue in memcg charge move path
If compound is true, the page is a PMD-mapped THP, which implies that the
page is not linked to any defer list. So the first code chunk will never
be executed.
For the same reason, it would not be proper to add this page to a defer
list, so the second code chunk is not correct.
Based on this, we should remove the defer-list-related code.
[yang.shi(a)linux.alibaba.com: better patch title]
Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Wei Yang <richardw.yang(a)linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Acked-by: Yang Shi <yang.shi(a)linux.alibaba.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 18 ------------------
1 file changed, 18 deletions(-)
--- a/mm/memcontrol.c~mm-thp-remove-the-defer-list-related-code-since-this-will-not-happen
+++ a/mm/memcontrol.c
@@ -5340,14 +5340,6 @@ static int mem_cgroup_move_account(struc
__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (compound && !list_empty(page_deferred_list(page))) {
- spin_lock(&from->deferred_split_queue.split_queue_lock);
- list_del_init(page_deferred_list(page));
- from->deferred_split_queue.split_queue_len--;
- spin_unlock(&from->deferred_split_queue.split_queue_lock);
- }
-#endif
/*
* It is safe to change page->mem_cgroup here because the page
* is referenced, charged, and isolated - we can't race with
@@ -5357,16 +5349,6 @@ static int mem_cgroup_move_account(struc
/* caller should have done css_get */
page->mem_cgroup = to;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (compound && list_empty(page_deferred_list(page))) {
- spin_lock(&to->deferred_split_queue.split_queue_lock);
- list_add_tail(page_deferred_list(page),
- &to->deferred_split_queue.split_queue);
- to->deferred_split_queue.split_queue_len++;
- spin_unlock(&to->deferred_split_queue.split_queue_lock);
- }
-#endif
-
spin_unlock_irqrestore(&from->move_lock, flags);
ret = 0;
_
Patches currently in -mm which might be from richardw.yang(a)linux.intel.com are
mm-thp-remove-the-defer-list-related-code-since-this-will-not-happen.patch
mm-gupc-use-is_vm_hugetlb_page-to-check-whether-to-follow-huge.patch
mm-huge_memoryc-use-head-to-check-huge-zero-page.patch
mm-huge_memoryc-use-head-to-emphasize-the-purpose-of-page.patch
mm-huge_memoryc-reduce-critical-section-protected-by-split_queue_lock.patch
mm-remove-dead-code-totalram_pages_set.patch
The patch titled
Subject: mm: thp: grab the lock before manipulating defer list
has been removed from the -mm tree. Its filename was
mm-thp-grab-the-lock-before-manipulation-defer-list.patch
This patch was dropped because an updated version will be merged
------------------------------------------------------
From: Wei Yang <richardw.yang(a)linux.intel.com>
Subject: mm: thp: grab the lock before manipulating defer list
As in all the other places, we should grab the lock before manipulating the
defer list. The current implementation may face a race condition.
For example, the potential race would be:

    CPU1                          CPU2
    mem_cgroup_move_account       deferred_split_huge_page
      list_empty
                                    lock
                                    list_empty
                                    list_add_tail
                                    unlock
      lock
      # list_empty might not hold anymore
      list_add_tail
      unlock

When this sequence happens, the list_add_tail() in
mem_cgroup_move_account() corrupts the list, since the page has already been
added to some split_queue in split_huge_page_to_list().
Besides this, David Rientjes points out the split_queue_len would be in a
wrong state, which would be a significant issue for shrinkers.
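The check-under-lock pattern that this patch applies can be sketched in userspace C. This is an illustration, not kernel code: struct split_queue, queue_add_once() and the atomic_flag spinlock are hypothetical stand-ins for the deferred split queue, the guarded list_add_tail() and spin_lock(). The point is that the emptiness test and the insertion happen under the same lock hold, so a second caller cannot insert the same node twice or skew the length counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

/* Hypothetical userspace stand-in for a deferred split queue. */
struct split_queue {
	atomic_flag lock;
	struct list_node head;
	unsigned long len;
};

static void node_init(struct list_node *n) { n->prev = n->next = n; }
static bool node_unlinked(const struct list_node *n) { return n->next == n; }

static void queue_init(struct split_queue *q)
{
	atomic_flag_clear(&q->lock);
	node_init(&q->head);
	q->len = 0;
}

static void queue_lock(struct split_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void queue_unlock(struct split_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Test and insert under one lock hold; returns true if queued by this call. */
static bool queue_add_once(struct split_queue *q, struct list_node *n)
{
	bool added = false;

	queue_lock(q);
	if (node_unlinked(n)) {
		/* Tail insertion, equivalent in effect to list_add_tail(). */
		n->prev = q->head.prev;
		n->next = &q->head;
		q->head.prev->next = n;
		q->head.prev = n;
		q->len++;
		added = true;
	}
	queue_unlock(q);
	assert(!node_unlinked(n));
	return added;
}
```

Calling queue_add_once() twice on the same node queues it once and leaves len at 1, which is the invariant the unlocked list_empty() check could not guarantee.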
Link: http://lkml.kernel.org/r/20200109143054.13203-1-richardw.yang@linux.intel.c…
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Wei Yang <richardw.yang(a)linux.intel.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
--- a/mm/memcontrol.c~mm-thp-grab-the-lock-before-manipulation-defer-list
+++ a/mm/memcontrol.c
@@ -5341,10 +5341,12 @@ static int mem_cgroup_move_account(struc
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (compound && !list_empty(page_deferred_list(page))) {
+ if (compound) {
spin_lock(&from->deferred_split_queue.split_queue_lock);
- list_del_init(page_deferred_list(page));
- from->deferred_split_queue.split_queue_len--;
+ if (!list_empty(page_deferred_list(page))) {
+ list_del_init(page_deferred_list(page));
+ from->deferred_split_queue.split_queue_len--;
+ }
spin_unlock(&from->deferred_split_queue.split_queue_lock);
}
#endif
@@ -5358,11 +5360,13 @@ static int mem_cgroup_move_account(struc
page->mem_cgroup = to;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (compound && list_empty(page_deferred_list(page))) {
+ if (compound) {
spin_lock(&to->deferred_split_queue.split_queue_lock);
- list_add_tail(page_deferred_list(page),
- &to->deferred_split_queue.split_queue);
- to->deferred_split_queue.split_queue_len++;
+ if (list_empty(page_deferred_list(page))) {
+ list_add_tail(page_deferred_list(page),
+ &to->deferred_split_queue.split_queue);
+ to->deferred_split_queue.split_queue_len++;
+ }
spin_unlock(&to->deferred_split_queue.split_queue_lock);
}
#endif
_
Hey Linus,
/* Summary */
Here is an urgent fix for ptrace_may_access() permission checking.
Commit 69f594a38967 ("ptrace: do not audit capability check when outputing
/proc/pid/stat") introduced the ability to opt out of audit
messages for accesses to various proc files since they are not violations of
policy. While doing so it switched the check from ns_capable() to
has_ns_capability{_noaudit}(). That means it switched from checking the
subjective credentials (ktask->cred) of the task to using the objective
credentials (ktask->real_cred). This appears to be wrong. ptrace_has_cap()
is currently only used in ptrace_may_access() and is used to check whether the
calling task (subject) has the CAP_SYS_PTRACE capability in the provided user
namespace to operate on the target task (object). According to the cred.h
comments this means the subjective credentials of the calling task need to be
used.
With this PR we switch ptrace_has_cap() to use security_capable() and thus back
to using the subjective credentials.
As one example where this might be particularly problematic, Jann pointed out
that in combination with the upcoming IORING_OP_OPENAT{2} feature, this bug
might allow unprivileged users to bypass the capability checks while
asynchronously opening files like /proc/*/mem, because the capability checks
for this would be performed against kernel credentials.
To illustrate the former point about this being exploitable: When io_uring
creates a new context it records the subjective credentials of the caller.
Later on, when it starts to do work it creates a kernel thread and registers a
callback. The callback runs with kernel creds for ktask->real_cred and
ktask->cred. To prevent this from becoming a full-blown 0-day, io_uring will
call override_creds() and override ktask->cred with the subjective credentials
of the creator of the io_uring instance. With ptrace_has_cap() currently
looking at ktask->real_cred, this override will be ineffective and the caller
will be able to open arbitrary proc files as mentioned above.
Luckily, this is currently not exploitable but will turn into a 0-day once
IORING_OP_OPENAT{2} lands in v5.6. Let's fix it now.
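The difference between the two credential sets can be shown with a small userspace model. Everything here is a simplified assumption for illustration: struct cred holds a single capability bit, struct task mirrors only the real_cred/cred pair named above, and override_subjective() is a hypothetical stand-in for the kernel's credential override, not its actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified credential set. */
struct cred { bool cap_sys_ptrace; };

/* Simplified task: real_cred is objective, cred is subjective. */
struct task {
	const struct cred *real_cred;
	const struct cred *cred;
};

/* Stand-in for a credential override: swap subjective creds, return old. */
static const struct cred *override_subjective(struct task *t,
					      const struct cred *new_cred)
{
	const struct cred *old = t->cred;

	assert(new_cred != NULL);
	t->cred = new_cred;
	return old;
}

/* The buggy check: consults the objective credentials. */
static bool has_cap_objective(const struct task *t)
{
	return t->real_cred->cap_sys_ptrace;
}

/* The intended check: consults the subjective credentials. */
static bool has_cap_subjective(const struct task *t)
{
	return t->cred->cap_sys_ptrace;
}
```

For a worker whose real_cred is privileged but whose cred has been overridden with an unprivileged caller's credentials, has_cap_objective() still grants access while has_cap_subjective() denies it, which is the mismatch the fix removes.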
To minimize potential regressions I successfully ran the criu testsuite. criu
makes heavy use of ptrace() and extensively hits ptrace_may_access() codepaths
and has a good chance of detecting any regressions.
Additionally, I successfully ran the ptrace and seccomp kernel tests.
/* Testing */
All patches have seen exposure in linux-next and are based on v5.5-rc6.
As mentioned above, the criu test-suite, which is one of the test suites that
makes massive use of ptrace and hits ptrace_may_access() codepaths, successfully
passed on a kernel with this fix:
################## ALL TEST(S) PASSED (TOTAL 178/SKIPPED 16) ###################
I've posted the full test-log at:
https://gitlab.com/snippets/1931214
Additionally, I successfully ran the ptrace and seccomp kernel tests.
We will also add a regression test once IORING_OP_OPENAT{2} has landed for v5.6
since this gives us a really easy test.
/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next.
The following changes since commit b3a987b0264d3ddbb24293ebff10eddfc472f653:
Linux 5.5-rc6 (2020-01-12 16:55:08 -0800)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/for-linus-2020-01-18
for you to fetch changes up to 6b3ad6649a4c75504edeba242d3fd36b3096a57f:
ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() (2020-01-18 13:51:39 +0100)
Please consider pulling these changes from the signed for-linus-2020-01-18 tag.
Thanks!
Christian
----------------------------------------------------------------
for-linus-2020-01-18
----------------------------------------------------------------
Christian Brauner (1):
ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
kernel/ptrace.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
From: Peter Zijlstra <peterz(a)infradead.org>
Architectures that have hardware walkers of the Linux page tables should
flush the TLB on mmu gather batch allocation failures and batch flush. Some
architectures, like POWER, support multiple translation modes (hash and radix),
and in the case of POWER only the radix translation mode needs the above TLBI.
This is because in hash translation mode the kernel wants to avoid this extra
flush, since there are no hardware walkers of the Linux page tables. With radix
translation, the hardware also walks the Linux page tables, and so the kernel
needs to make sure to invalidate the page walk cache before page table pages
are freed.
More details in
commit: d86564a2f085 ("mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE")
The changes to sparc are to make sure we keep the old behavior, since we are now
removing HAVE_RCU_TABLE_NO_INVALIDATE. The default for
tlb_needs_table_invalidate() is to always force an invalidate, and sparc can
avoid the table invalidate. Hence we define tlb_needs_table_invalidate() to
false for the sparc architecture.
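The override mechanism described above, an arch-provided macro with a generic default of true, can be sketched as a standalone C model. The names radix_enabled_model, struct mmu_gather_model, flush_tlb_only() and table_invalidate() are hypothetical; only tlb_needs_table_invalidate() is taken from the patch, and here it is wired to a plain boolean to imitate the POWER case, where the answer depends on the translation mode at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "arch header": the answer depends on the translation mode. */
static bool radix_enabled_model = true;
#define tlb_needs_table_invalidate() (radix_enabled_model)

/* "Generic header": default to invalidating unless the arch overrode it. */
#ifndef tlb_needs_table_invalidate
#define tlb_needs_table_invalidate() (true)
#endif

/* Counts flushes so the effect of the predicate can be observed. */
struct mmu_gather_model { unsigned int flushes; };

static void flush_tlb_only(struct mmu_gather_model *tlb)
{
	tlb->flushes++;
}

/* Flush only when hardware walkers may hold cached table entries. */
static void table_invalidate(struct mmu_gather_model *tlb)
{
	assert(tlb != NULL);
	if (tlb_needs_table_invalidate())
		flush_tlb_only(tlb);
}
```

Compared with a Kconfig symbol, a macro that expands to an expression lets one kernel image decide at run time, which a compile-time #ifndef around the flush could not do.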
Cc: <stable(a)vger.kernel.org>
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
---
arch/Kconfig | 3 ---
arch/powerpc/Kconfig | 1 -
arch/powerpc/include/asm/tlb.h | 11 +++++++++++
arch/sparc/Kconfig | 1 -
arch/sparc/include/asm/tlb_64.h | 9 +++++++++
include/asm-generic/tlb.h | 22 +++++++++++++++-------
mm/mmu_gather.c | 16 ++++++++--------
7 files changed, 43 insertions(+), 20 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 48b5e103bdb0..208aad121630 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -396,9 +396,6 @@ config HAVE_ARCH_JUMP_LABEL_RELATIVE
config HAVE_RCU_TABLE_FREE
bool
-config HAVE_RCU_TABLE_NO_INVALIDATE
- bool
-
config HAVE_MMU_GATHER_PAGE_SIZE
bool
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 04240205f38c..f9970f87612e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,7 +223,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
- select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MMU_GATHER_PAGE_SIZE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b2c0be93929d..7f3a8b902325 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -26,6 +26,17 @@
#define tlb_flush tlb_flush
extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate() radix_enabled()
/* Get the generic bits... */
#include <asm-generic/tlb.h>
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index eb24cb1afc11..18e9fb6fcf1b 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -65,7 +65,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
- select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
#define tlb_flush(tlb) flush_tlb_pending()
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate() (false)
+#endif
+
#include <asm-generic/tlb.h>
#endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2b10036fefd0..9e22ac369d1d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -137,13 +137,6 @@
* When used, an architecture is expected to provide __tlb_remove_table()
* which does the actual freeing of these pages.
*
- * HAVE_RCU_TABLE_NO_INVALIDATE
- *
- * This makes HAVE_RCU_TABLE_FREE avoid calling tlb_flush_mmu_tlbonly() before
- * freeing the page-table pages. This can be avoided if you use
- * HAVE_RCU_TABLE_FREE and your architecture does _NOT_ use the Linux
- * page-tables natively.
- *
* MMU_GATHER_NO_RANGE
*
* Use this if your architecture lacks an efficient flush_tlb_range().
@@ -189,8 +182,23 @@ struct mmu_table_batch {
extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
+#endif
+
+#else
+
+#ifdef tlb_needs_table_invalidate
+#error tlb_needs_table_invalidate() requires HAVE_RCU_TABLE_FREE
#endif
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
+
#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
/*
* If we can't allocate a page to make a big batch of page pointers
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 7d70e5c78f97..7c1b8f67af7b 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -102,14 +102,14 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
*/
static inline void tlb_table_invalidate(struct mmu_gather *tlb)
{
-#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
- /*
- * Invalidate page-table caches used by hardware walkers. Then we still
- * need to RCU-sched wait while freeing the pages because software
- * walkers can still be in-flight.
- */
- tlb_flush_mmu_tlbonly(tlb);
-#endif
+ if (tlb_needs_table_invalidate()) {
+ /*
+ * Invalidate page-table caches used by hardware walkers. Then
+ * we still need to RCU-sched wait while freeing the pages
+ * because software walkers can still be in-flight.
+ */
+ tlb_flush_mmu_tlbonly(tlb);
+ }
}
static void tlb_remove_table_smp_sync(void *arg)
--
2.24.1