From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE allocations
that want to allocate on a different node, e.g. because the allocating
thread has been migrated to a different CPU.
This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page is
put on wrong node's list, then further list manipulations may corrupt the
list because page_to_nid() is used to determine which node's list_lock
should be locked and thus we may take a wrong lock and race.
Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.
We can fix it by simply removing the zonelist reset completely. There is
actually no reason to reset it, because memory policies and cpusets don't
affect the zonelist choice in the first place. This was different when
commit 183f6371aac2 ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.
We might consider this for 4.17 although I don't know if there's anything
currently broken.
SLAB is currently not affected, but in kernels older than 4.7 that
don't yet have 511e3a058812 ("mm/slab: make cache_grow() handle the
page allocated on arbitrary node") it is. That's at least 4.4 LTS.
Older ones I'll have to check.
So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.
Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 1 -
1 file changed, 1 deletion(-)
diff -puN mm/page_alloc.c~mm-page_alloc-do-not-break-__gfp_thisnode-by-zonelist-reset mm/page_alloc.c
--- a/mm/page_alloc.c~mm-page_alloc-do-not-break-__gfp_thisnode-by-zonelist-reset
+++ a/mm/page_alloc.c
@@ -4169,7 +4169,6 @@ retry:
* orientated.
*/
if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
- ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
}
_
Hi Greg,
Please apply 2ae89c7a82ea ("kconfig: Avoid format overflow warning from
GCC 8.1") to all active branches. I am rather tired of seeing these
warnings every time I need to compile. It will apply cleanly.
Thanks!
Nathan
Mark notes that gcc optimization passes have the potential to elide
necessary invocations of this instruction sequence, so include an
optimization barrier.
> I think that either way, we have a potential problem if the compiler
> generates a branch dependent on the result of validate_index_nospec().
>
> In that case, we could end up with codegen approximating:
>
> bool safe = false;
>
> if (idx < bound) {
> idx = array_index_nospec(idx, bound);
> safe = true;
> }
>
> // this branch can be mispredicted
> if (safe) {
> foo = array[idx];
> }
>
> ... and thus we lose the nospec protection.
I see GCC do this at -O0, but so far I haven't tricked it into doing
this at -O1 or above.
Regardless, I worry this is fragile -- GCC *can* generate code as per
the above, even if it's unlikely to.
Cc: <stable(a)vger.kernel.org>
Fixes: babdde2698d4 ("x86: Implement array_index_mask_nospec")
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Ingo Molnar <mingo(a)kernel.org>
Reported-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/x86/include/asm/barrier.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 042b5e892ed1..41f7435c84a7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -38,10 +38,11 @@ static inline unsigned long array_index_mask_nospec(unsigned long index,
{
unsigned long mask;
- asm ("cmp %1,%2; sbb %0,%0;"
+ asm volatile ("cmp %1,%2; sbb %0,%0;"
:"=r" (mask)
:"g"(size),"r" (index)
:"cc");
+ barrier();
return mask;
}
Chris Chiu (1):
tpm: self test failure should not cause suspend to fail
Enric Balletbo i Serra (1):
tpm: do not suspend/resume if power stays on
drivers/char/tpm/tpm-chip.c | 12 ++++++++++++
drivers/char/tpm/tpm-interface.c | 7 +++++++
drivers/char/tpm/tpm.h | 1 +
3 files changed, 20 insertions(+)
--
v3: use CONFIG_OF flag
v2: moved the check from tpm_of.c to tpm-chip.c as in v4.4 chip is
unreachable otherwise. I did compilation test now with BuildRoot
for power arch.
2.17.0