My apology that the v6 patches are missing the first two patch in
the series. Resending the patch series as v7.
Fix in this version bugs causing build problems for UP configuration.
Also merged in Jiri's change to extend STIBP for SECCOMP processes and
renaming TIF_STIBP to TIF_SPEC_INDIR_BRANCH.
I've updated the boot options spectre_v2_app2app to
on, off, auto, prctl and seccomp. This aligns with
the options for other speculation related mitigations.
I tried to incorporate sched_smt_present to detect when we have all SMT
going offline and we can disable the SMT path that Peter suggested.
this is an optimization that can be easily left out of the patch
series. I've put these two patches at the end and they can be considered
separately.
I've dropped the TIF flags re-organization patches
as they are orthogonal to this patch series.
To do: Create a dedicated document on the mitigation options for Spectre V2.
Since Jiri's patchset to always turn on STIBP
has big performance impact, I think that it should
be reverted from 4.20 and stable kernels for now, till this
patchset to mitigate its performance impact can be merged
with it.
Thanks.
Tim
Patch 1 to 3 are clean up patches.
Patch 4 and 5 disable STIBP for enhacned IBRS.
Patch 6 to 9 reorganize and clean up the code without affecting
functionality for easier modification later.
Patch 10 introduces the STIBP flag on a task to dynamically
enable STIBP for that task.
Patch 11 introduces different modes to protect a
task against Spectre v2 user space attack.
Patch 12 adds prctl interface to turn on Spectre v2 user mode defenses on a task.
Patch 13 Put IBPB usage under the mode chosen for app2app mitigation.
Patch 14 Add STIBP protection for SECCOMP tasks.
Patch 15-16 add Spectre v2 defenses for non-dumpable tasks.
Patch 15-16 reorganizes the TIF flags, and can be dropped without affecting this series
Patch 17-18 When there are no paired SMT left, disable SMT specific code
Changes:
v6:
1. Fix bugs for UP build configuration.
2. Add protection for SECCOMP tasks.
3. Rename TIF_STIBP to TIF_SPEC_INDIR_BRANCH.
4. Update boot options to align with other speculation mitigations.
5. Separate out IBPB change that makes it depend on TIF_SPEC_INDIR_BRANCH.
6. Move some checks for SPEC_CTRL updates to spec_ctrl_update_msr to avoid
unnecesseary MSR writes.
7. Drop TIF reorg patches.
8. Incorporate optimization to disable SMT code paths when no paired SMT is present.
v5:
1. Drop patch to extend TIF_STIBP changes to all related threads on
a task's dumpabibility change.
2. Drop patch to replace sched_smt_present with cpu_smt_enabled.
3. Drop export of cpu_smt_control in kernel/cpu.c and replace external
usages of cpu_smt_control with cpu_smt_enabled.
4. Rebase patch series on 4.20-rc2.
v4:
1. Extend STIBP update to all threads of a process changing
it dumpability.
2. Add logic to update SPEC_CTRL MSR on a remote CPU when TIF flags
affecting speculation changes for task running on the remote CPU.
3. Regroup x86 TIF_* flags according to their functions.
4. Various code clean up.
v3:
1. Add logic to skip STIBP when Enhanced IBRS is used.
2. Break up v2 patches into smaller logical patches.
3. Fix bug in arch_set_dumpable that did not update SPEC_CTRL
MSR right away when according to task's STIBP flag clearing which
caused SITBP to be left on.
4. Various code clean up.
v2:
1. Extend per process STIBP to AMD cpus
2. Add prctl option to control per process indirect branch speculation
3. Bug fixes and cleanups
Jiri's patchset to harden Spectre v2 user space mitigation makes IBPB
and STIBP in use for Spectre v2 mitigation on all processes. IBPB will
be issued for switching to an application that's not ptraceable by the
previous application and STIBP will be always turned on.
However, leaving STIBP on all the time is expensive for certain
applications that have frequent indirect branches. One such application
is perlbench in the SpecInt Rate 2006 test suite which shows a
21% reduction in throughput.
There're also reports of drop in performance on Python and PHP benchmarks:
https://www.phoronix.com/scan.php?page=article&item=linux-420-bisect&num=2
Other applications like bzip2 with minimal indirct branches have
only a 0.7% reduction in throughput. IBPB will also impose
overhead during context switches.
Users may not wish to incur performance overhead from IBPB and STIBP for
general non security sensitive processes and use these mitigations only
for security sensitive processes.
This patchset provides a process property based lite protection mode.
In this mode, IBPB and STIBP mitigation are applied only to security
sensitive non-dumpable processes and processes that users want to protect
by having indirect branch speculation disabled via PRCTL. So the overhead
from IBPB and STIBP are avoided for low security processes that don't
require extra protection.
Jiri Kosina (1):
x86/speculation: Add 'seccomp' Spectre v2 app to app protection mode
Peter Zijlstra (1):
sched/smt: Make sched_smt_present track topology
Tim Chen (16):
x86/speculation: Clean up spectre_v2_parse_cmdline()
x86/speculation: Remove unnecessary ret variable in cpu_show_common()
x86/speculation: Reorganize cpu_show_common()
x86/speculation: Add X86_FEATURE_USE_IBRS_ENHANCED
x86/speculation: Disable STIBP when enhanced IBRS is in use
x86/speculation: Rename SSBD update functions
x86/speculation: Reorganize speculation control MSRs update
smt: Create cpu_smt_enabled static key for SMT specific code
x86/smt: Convert cpu_smt_control check to cpu_smt_enabled static key
x86/speculation: Turn on or off STIBP according to a task's TIF_STIBP
x86/speculation: Add Spectre v2 app to app protection modes
x86/speculation: Create PRCTL interface to restrict indirect branch
speculation
x86/speculation: Enable IBPB for tasks with
TIF_SPEC_BRANCH_SPECULATION
security: Update speculation restriction of a process when modifying
its dumpability
x86/speculation: Use STIBP to restrict speculation on non-dumpable
task
x86/smt: Allow disabling of SMT when last SMT is offlined
Documentation/admin-guide/kernel-parameters.txt | 34 +++
Documentation/userspace-api/spec_ctrl.rst | 9 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 6 +-
arch/x86/include/asm/nospec-branch.h | 10 +
arch/x86/include/asm/spec-ctrl.h | 18 +-
arch/x86/include/asm/thread_info.h | 5 +-
arch/x86/kernel/cpu/bugs.c | 336 +++++++++++++++++++++---
arch/x86/kernel/process.c | 58 +++-
arch/x86/kvm/vmx.c | 2 +-
arch/x86/mm/tlb.c | 23 +-
fs/exec.c | 3 +
include/linux/cpu.h | 31 ++-
include/linux/sched.h | 9 +
include/uapi/linux/prctl.h | 1 +
kernel/cpu.c | 28 +-
kernel/cred.c | 5 +-
kernel/sched/core.c | 19 +-
kernel/sched/sched.h | 2 -
kernel/sys.c | 7 +
tools/include/uapi/linux/prctl.h | 1 +
21 files changed, 526 insertions(+), 82 deletions(-)
--
2.9.4
Commit f77084d96355 "x86/mm/pat: Disable preemption around
__flush_tlb_all()" addressed a case where __flush_tlb_all() is called
without preemption being disabled. It also left a warning to catch other
cases where preemption is not disabled. That warning triggers for the
memory hotplug path which is also used for persistent memory enabling:
WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
RIP: 0010:__flush_tlb_all+0x1b/0x3a
[..]
Call Trace:
phys_pud_init+0x29c/0x2bb
kernel_physical_mapping_init+0xfc/0x219
init_memory_mapping+0x1a5/0x3b0
arch_add_memory+0x2c/0x50
devm_memremap_pages+0x3aa/0x610
pmem_attach_disk+0x585/0x700 [nd_pmem]
Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
and Dave confirmed the expectation for TLB flush is for modifying /
invalidating existing pte entries, but not initial population [2]. Drop
the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the
expectation that this path is only ever populating empty entries for the
linear map. Note, at linear map teardown time there is a call to the
all-cpu flush_tlb_all() to invalidate the removed mappings.
[1]: https://lore.kernel.org/patchwork/patch/1009434/#1193941
[2]: https://lore.kernel.org/patchwork/patch/1009434/#1194540
Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: <stable(a)vger.kernel.org>
Reported-by: Andy Lutomirski <luto(a)kernel.org>
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/x86/mm/init_64.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..de95db8ac52f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -584,7 +584,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
paddr_end,
page_size_mask,
prot);
- __flush_tlb_all();
continue;
}
/*
@@ -627,7 +626,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
pud_populate(&init_mm, pud, pmd);
spin_unlock(&init_mm.page_table_lock);
}
- __flush_tlb_all();
update_page_count(PG_LEVEL_1G, pages);
@@ -668,7 +666,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
paddr_last = phys_pud_init(pud, paddr,
paddr_end,
page_size_mask);
- __flush_tlb_all();
continue;
}
@@ -680,7 +677,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
p4d_populate(&init_mm, p4d, pud);
spin_unlock(&init_mm.page_table_lock);
}
- __flush_tlb_all();
return paddr_last;
}
@@ -733,8 +729,6 @@ kernel_physical_mapping_init(unsigned long paddr_start,
if (pgd_changed)
sync_global_pgds(vaddr_start, vaddr_end - 1);
- __flush_tlb_all();
-
return paddr_last;
}
The patch titled
Subject: scripts/spdxcheck.py: make python3 compliant
has been removed from the -mm tree. Its filename was
scripts-spdxcheck-make-python3-compliant.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Subject: scripts/spdxcheck.py: make python3 compliant
Without this change the following happens when using Python3 (3.6.6):
$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 253, in <module>
parser.parse_lines(sys.stdin, args.maxlines, '-')
File "scripts/spdxcheck.py", line 171, in parse_lines
line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'
So as the line is already a string, there is no need to decode it and
the line can be dropped.
/usr/bin/python on Arch is Python 3. So this would indeed be worth
going into 4.19.
Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix…
Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Joe Perches <joe(a)perches.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
scripts/spdxcheck.py | 1 -
1 file changed, 1 deletion(-)
--- a/scripts/spdxcheck.py~scripts-spdxcheck-make-python3-compliant
+++ a/scripts/spdxcheck.py
@@ -168,7 +168,6 @@ class id_parser(object):
self.curline = 0
try:
for line in fd:
- line = line.decode(locale.getpreferredencoding(False), errors='ignore')
self.curline += 1
if self.curline > maxlines:
break
_
Patches currently in -mm which might be from u.kleine-koenig(a)pengutronix.de are
The patch titled
Subject: lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
has been removed from the -mm tree. Its filename was
ubsan-dont-mark-__ubsan_handle_builtin_unreachable-as-noreturn.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Arnd Bergmann <arnd(a)arndb.de>
Subject: lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
gcc-8 complains about the prototype for this function:
lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]
This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210
Workaround this by removing the noreturn attribute.
[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.com
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Signed-off-by: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Acked-by: Olof Johansson <olof(a)lixom.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/ubsan.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/lib/ubsan.c~ubsan-dont-mark-__ubsan_handle_builtin_unreachable-as-noreturn
+++ a/lib/ubsan.c
@@ -427,8 +427,7 @@ void __ubsan_handle_shift_out_of_bounds(
EXPORT_SYMBOL(__ubsan_handle_shift_out_of_bounds);
-void __noreturn
-__ubsan_handle_builtin_unreachable(struct unreachable_data *data)
+void __ubsan_handle_builtin_unreachable(struct unreachable_data *data)
{
unsigned long flags;
_
Patches currently in -mm which might be from arnd(a)arndb.de are
vfs-replace-current_kernel_time64-with-ktime-equivalent.patch
The patch titled
Subject: ocfs2: free up write context when direct IO failed
has been removed from the -mm tree. Its filename was
ocfs2-free-up-write-context-when-direct-io-failed.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Wengang Wang <wen.gang.wang(a)oracle.com>
Subject: ocfs2: free up write context when direct IO failed
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang(a)oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi(a)oracle.com>
Reviewed-by: Changwei Ge <ge.changwei(a)h3c.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/aops.c | 12 ++++++++++--
fs/ocfs2/cluster/masklog.h | 9 +++++++++
2 files changed, 19 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/aops.c~ocfs2-free-up-write-context-when-direct-io-failed
+++ a/fs/ocfs2/aops.c
@@ -2411,8 +2411,16 @@ static int ocfs2_dio_end_io(struct kiocb
/* this io's submitter should not have unlocked this before we could */
BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
- if (bytes > 0 && private)
- ret = ocfs2_dio_end_io_write(inode, private, offset, bytes);
+ if (bytes <= 0)
+ mlog_ratelimited(ML_ERROR, "Direct IO failed, bytes = %lld",
+ (long long)bytes);
+ if (private) {
+ if (bytes > 0)
+ ret = ocfs2_dio_end_io_write(inode, private, offset,
+ bytes);
+ else
+ ocfs2_dio_free_write_ctx(inode, private);
+ }
ocfs2_iocb_clear_rw_locked(iocb);
--- a/fs/ocfs2/cluster/masklog.h~ocfs2-free-up-write-context-when-direct-io-failed
+++ a/fs/ocfs2/cluster/masklog.h
@@ -178,6 +178,15 @@ do { \
##__VA_ARGS__); \
} while (0)
+#define mlog_ratelimited(mask, fmt, ...) \
+do { \
+ static DEFINE_RATELIMIT_STATE(_rs, \
+ DEFAULT_RATELIMIT_INTERVAL, \
+ DEFAULT_RATELIMIT_BURST); \
+ if (__ratelimit(&_rs)) \
+ mlog(mask, fmt, ##__VA_ARGS__); \
+} while (0)
+
#define mlog_errno(st) ({ \
int _st = (st); \
if (_st != -ERESTARTSYS && _st != -EINTR && \
_
Patches currently in -mm which might be from wen.gang.wang(a)oracle.com are
The patch titled
Subject: mm: don't reclaim inodes with many attached pages
has been removed from the -mm tree. Its filename was
mm-dont-reclaim-inodes-with-many-attached-pages.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Roman Gushchin <guro(a)fb.com>
Subject: mm: don't reclaim inodes with many attached pages
Spock reported that the commit 172b06c32b94 ("mm: slowly shrink slabs with
a relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some minimal
background pressure on the inode cache. The problem is that if an inode
is considered to be reclaimed, all belonging pagecache page are stripped,
no matter how many of them are there. So, if a huge multi-gigabyte file
is cached in the memory, and the goal is to reclaim only few slab objects
(unused inodes), we still can eventually evict all gigabytes of the
pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have more
than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro(a)fb.com>
Reported-by: Spock <dairinin(a)gmail.com>
Tested-by: Spock <dairinin(a)gmail.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: <stable(a)vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/inode.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/fs/inode.c~mm-dont-reclaim-inodes-with-many-attached-pages
+++ a/fs/inode.c
@@ -730,8 +730,11 @@ static enum lru_status inode_lru_isolate
return LRU_REMOVED;
}
- /* recently referenced inodes get one more pass */
- if (inode->i_state & I_REFERENCED) {
+ /*
+ * Recently referenced inodes and inodes with many attached pages
+ * get one more pass.
+ */
+ if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
inode->i_state &= ~I_REFERENCED;
spin_unlock(&inode->i_lock);
return LRU_ROTATE;
_
Patches currently in -mm which might be from guro(a)fb.com are
The patch titled
Subject: mm/swapfile.c: use kvzalloc for swap_info_struct allocation
has been removed from the -mm tree. Its filename was
mm-use-kvzalloc-for-swap_info_struct-allocation.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Vasily Averin <vvs(a)virtuozzo.com>
Subject: mm/swapfile.c: use kvzalloc for swap_info_struct allocation
a2468cc9bfdf ("swap: choose swap device according to numa node") changed
'avail_lists' field of 'struct swap_info_struct' to an array. In popular
linux distros it increased size of swap_info_struct up to 40 Kbytes and
now swap_info_struct allocation requires order-4 page. Switch to
kvzmalloc allows to avoid unexpected allocation failures.
Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu(a)intel.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/swapfile.c~mm-use-kvzalloc-for-swap_info_struct-allocation
+++ a/mm/swapfile.c
@@ -2813,7 +2813,7 @@ static struct swap_info_struct *alloc_sw
unsigned int type;
int i;
- p = kzalloc(sizeof(*p), GFP_KERNEL);
+ p = kvzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
return ERR_PTR(-ENOMEM);
@@ -2824,7 +2824,7 @@ static struct swap_info_struct *alloc_sw
}
if (type >= MAX_SWAPFILES) {
spin_unlock(&swap_lock);
- kfree(p);
+ kvfree(p);
return ERR_PTR(-EPERM);
}
if (type >= nr_swapfiles) {
@@ -2838,7 +2838,7 @@ static struct swap_info_struct *alloc_sw
smp_wmb();
nr_swapfiles++;
} else {
- kfree(p);
+ kvfree(p);
p = swap_info[type];
/*
* Do not memset this entry: a racing procfs swap_next()
_
Patches currently in -mm which might be from vvs(a)virtuozzo.com are
The patch titled
Subject: hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
has been removed from the -mm tree. Its filename was
hugetlbfs-fix-kernel-bug-at-fs-hugetlbfs-inodec-444.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
This bug has been experienced several times by Oracle DB team. The BUG is
in remove_inode_hugepages() as follows:
/*
* If page is mapped, it was faulted in after being
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race. Rather
it was incorrectly incremented as the result of a bug in the huge pmd
sharing code. Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
(PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and alignment
such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
with the shared pmd. As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to chage protections for the
mapping back to their original value. pmd remains unshared.
- Process B then forks and process C is created. During the fork process,
we do dup_mm -> dup_mmap -> copy_page_range to copy page tables. Copying
page tables for hugetlb mappings is done in the routine
copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table. In the situation above, process C could share with
either process A or process B. Since process A is first in the list, the
returned pte is a pointer to a pte in process A's page table.
However, the following check for pmd sharing is in
copy_hugetlb_page_range.
/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;
Since process C is sharing with process A instead of process B, the above
test fails. The code in copy_hugetlb_page_range which follows assumes
dst_pte points to a huge_pte_none pte. It copies the pte entry from
src_pte to dst_pte and increments this map count of the associated page.
This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none. If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
--- a/mm/hugetlb.c~hugetlbfs-fix-kernel-bug-at-fs-hugetlbfs-inodec-444
+++ a/mm/hugetlb.c
@@ -3233,7 +3233,7 @@ static int is_hugetlb_entry_hwpoisoned(p
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma)
{
- pte_t *src_pte, *dst_pte, entry;
+ pte_t *src_pte, *dst_pte, entry, dst_entry;
struct page *ptepage;
unsigned long addr;
int cow;
@@ -3261,15 +3261,30 @@ int copy_hugetlb_page_range(struct mm_st
break;
}
- /* If the pagetables are shared don't copy or take references */
- if (dst_pte == src_pte)
+ /*
+ * If the pagetables are shared don't copy or take references.
+ * dst_pte == src_pte is the common case of src/dest sharing.
+ *
+ * However, src could have 'unshared' and dst shares with
+ * another vma. If dst_pte !none, this implies sharing.
+ * Check here before taking page table lock, and once again
+ * after taking the lock below.
+ */
+ dst_entry = huge_ptep_get(dst_pte);
+ if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
continue;
dst_ptl = huge_pte_lock(h, dst, dst_pte);
src_ptl = huge_pte_lockptr(h, src, src_pte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
- if (huge_pte_none(entry)) { /* skip none entry */
+ dst_entry = huge_ptep_get(dst_pte);
+ if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+ /*
+ * Skip if src entry none. Also, skip in the
+ * unlikely case dst entry !none as this implies
+ * sharing with another vma.
+ */
;
} else if (unlikely(is_hugetlb_entry_migration(entry) ||
is_hugetlb_entry_hwpoisoned(entry))) {
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are