Mapping the delay slot emulation page as both writeable & executable
presents a security risk, in that if an exploit can write to & jump into
the page then it can be used as an easy way to execute arbitrary code.
Prevent this by mapping the page read-only for userland, and using
access_process_vm() with the FOLL_FORCE flag to write to it from
mips_dsemul().
This will likely be less efficient due to copy_to_user_page() performing
cache maintenance on a whole page, rather than a single line as in the
previous use of flush_cache_sigtramp(). However this delay slot
emulation code ought not to be running in any performance critical paths
anyway so this isn't really a problem, and we can probably do better in
copy_to_user_page() anyway in future.
A major advantage of this approach is that the fix is small & simple to
backport to stable kernels.
Reported-by: Andy Lutomirski <luto(a)kernel.org>
Signed-off-by: Paul Burton <paul.burton(a)mips.com>
Fixes: 432c6bacbd0c ("MIPS: Use per-mm page to execute branch delay slot instructions")
Cc: stable(a)vger.kernel.org # v4.8+
---
arch/mips/kernel/vdso.c | 4 ++--
arch/mips/math-emu/dsemul.c | 38 +++++++++++++++++++------------------
2 files changed, 22 insertions(+), 20 deletions(-)
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 48a9c6b90e07..9df3ebdc7b0f 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -126,8 +126,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* Map delay slot emulation page */
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
- VM_READ|VM_WRITE|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ VM_READ | VM_EXEC |
+ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
0, NULL);
if (IS_ERR_VALUE(base)) {
ret = base;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 5450f4d1c920..e2d46cb93ca9 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -214,8 +214,9 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
{
int isa16 = get_isa16_mode(regs->cp0_epc);
mips_instruction break_math;
- struct emuframe __user *fr;
- int err, fr_idx;
+ unsigned long fr_uaddr;
+ struct emuframe fr;
+ int fr_idx, ret;
/* NOP is easy */
if (ir == 0)
@@ -250,27 +251,31 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
fr_idx = alloc_emuframe();
if (fr_idx == BD_EMUFRAME_NONE)
return SIGBUS;
- fr = &dsemul_page()[fr_idx];
/* Retrieve the appropriately encoded break instruction */
break_math = BREAK_MATH(isa16);
/* Write the instructions to the frame */
if (isa16) {
- err = __put_user(ir >> 16,
- (u16 __user *)(&fr->emul));
- err |= __put_user(ir & 0xffff,
- (u16 __user *)((long)(&fr->emul) + 2));
- err |= __put_user(break_math >> 16,
- (u16 __user *)(&fr->badinst));
- err |= __put_user(break_math & 0xffff,
- (u16 __user *)((long)(&fr->badinst) + 2));
+ union mips_instruction _emul = {
+ .halfword = { ir >> 16, ir }
+ };
+ union mips_instruction _badinst = {
+ .halfword = { break_math >> 16, break_math }
+ };
+
+ fr.emul = _emul.word;
+ fr.badinst = _badinst.word;
} else {
- err = __put_user(ir, &fr->emul);
- err |= __put_user(break_math, &fr->badinst);
+ fr.emul = ir;
+ fr.badinst = break_math;
}
- if (unlikely(err)) {
+ /* Write the frame to user memory */
+ fr_uaddr = (unsigned long)&dsemul_page()[fr_idx];
+ ret = access_process_vm(current, fr_uaddr, &fr, sizeof(fr),
+ FOLL_FORCE | FOLL_WRITE);
+ if (unlikely(ret != sizeof(fr))) {
MIPS_FPU_EMU_INC_STATS(errors);
free_emuframe(fr_idx, current->mm);
return SIGBUS;
@@ -282,10 +287,7 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
atomic_set(¤t->thread.bd_emu_frame, fr_idx);
/* Change user register context to execute the frame */
- regs->cp0_epc = (unsigned long)&fr->emul | isa16;
-
- /* Ensure the icache observes our newly written frame */
- flush_cache_sigtramp((unsigned long)&fr->emul);
+ regs->cp0_epc = fr_uaddr | isa16;
return 0;
}
--
2.20.0
The AFU Descriptor Template in the PCI config space has a Name Space
field which is a 24 Byte ASCII character string of descriptive name
space for the AFU. The OCXL driver read the string four characters at
a time with pci_read_config_dword().
This optimization is valid on a little-endian system since this is PCI,
but a big-endian system ends up with each subset of four characters in
reverse order.
This could be fixed by switching to read characters one by one. Another
option is to swap the bytes if we're big-endian.
Go for the latter with le32_to_cpu().
Cc: stable(a)vger.kernel.org # v4.16
Signed-off-by: Greg Kurz <groug(a)kaod.org>
Acked-by: Frederic Barrat <fbarrat(a)linux.ibm.com>
---
v2: - silence sparse with (__force __le32) cast
- new changelog
---
drivers/misc/ocxl/config.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 57a6bb1fd3c9..8f2c5d8bd2ee 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -318,7 +318,7 @@ static int read_afu_name(struct pci_dev *dev, struct ocxl_fn_config *fn,
if (rc)
return rc;
ptr = (u32 *) &afu->name[i];
- *ptr = val;
+ *ptr = le32_to_cpu((__force __le32) val);
}
afu->name[OCXL_AFU_NAME_SZ - 1] = '\0'; /* play safe */
return 0;
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without TEXASR[FS] bit set also.
Since TEXASR might not have the FS bit set, and when the process is
scheduled back, it will try to reclaim, which will be aborted because of
the CPU is not in the suspended state, and, then, recheckpoint. This
recheckpoint will restore thread->texasr into TEXASR SPR, which might be
zero, hitting a BUG_ON().
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel(a)lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays the MSR[TS] set, so, if there is any page fault in
the __get_user() section, it does not have regs->msr[TS] set, since the TM
structures are still invalid, thus avoiding doing TM operations for
in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(), thus, this block must
be atomic from a preemption perspective, thus, calling
preempt_disable/enable() on this code.
It is not possible to move tm_recheckpoint to happen earlier, because it is
required to get the checkpointed registers from userspace, with
__get_user(), thus, the only way to avoid this undesired behavior is
delaying the MSR[TS] set.
The 32-bits signal handler seems to be safe this current issue, but, it
might be exposed to the preemption issue, thus, disabling preemption in
this chunk of code.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable(a)vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao(a)debian.org>
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index e6474a45cef5..fd59fef9931b 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -848,7 +848,23 @@ static long restore_tm_user_regs(struct pt_regs *regs,
/* If TM bits are set to the reserved value, it's an invalid context */
if (MSR_TM_RESV(msr_hi))
return 1;
- /* Pull in the MSR TM bits from the user context */
+
+ /*
+ * Disabling preemption, since it is unsafe to be preempted
+ * with MSR[TS] set without recheckpointing.
+ */
+ preempt_disable();
+
+ /*
+ * CAUTION:
+ * After regs->MSR[TS] being updated, make sure that get_user(),
+ * put_user() or similar functions are *not* called. These
+ * functions can generate page faults which will cause the process
+ * to be de-scheduled with MSR[TS] set but without calling
+ * tm_recheckpoint(). This can cause a bug.
+ *
+ * Pull in the MSR TM bits from the user context
+ */
regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr_hi & MSR_TS_MASK);
/* Now, recheckpoint. This loads up all of the checkpointed (older)
* registers, including FP and V[S]Rs. After recheckpointing, the
@@ -873,6 +889,8 @@ static long restore_tm_user_regs(struct pt_regs *regs,
}
#endif
+ preempt_enable();
+
return 0;
}
#endif
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 83d51bf586c7..bbd1c73243d7 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
if (MSR_TM_RESV(msr))
return -EINVAL;
- /* pull in MSR TS bits from user context */
- regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
-
- /*
- * Ensure that TM is enabled in regs->msr before we leave the signal
- * handler. It could be the case that (a) user disabled the TM bit
- * through the manipulation of the MSR bits in uc_mcontext or (b) the
- * TM bit was disabled because a sufficient number of context switches
- * happened whilst in the signal handler and load_tm overflowed,
- * disabling the TM bit. In either case we can end up with an illegal
- * TM state leading to a TM Bad Thing when we return to userspace.
- */
- regs->msr |= MSR_TM;
-
/* pull in MSR LE from user context */
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
@@ -572,6 +558,34 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
tm_enable();
/* Make sure the transaction is marked as failed */
tsk->thread.tm_texasr |= TEXASR_FS;
+
+ /*
+ * Disabling preemption, since it is unsafe to be preempted
+ * with MSR[TS] set without recheckpointing.
+ */
+ preempt_disable();
+
+ /* pull in MSR TS bits from user context */
+ regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
+
+ /*
+ * Ensure that TM is enabled in regs->msr before we leave the signal
+ * handler. It could be the case that (a) user disabled the TM bit
+ * through the manipulation of the MSR bits in uc_mcontext or (b) the
+ * TM bit was disabled because a sufficient number of context switches
+ * happened whilst in the signal handler and load_tm overflowed,
+ * disabling the TM bit. In either case we can end up with an illegal
+ * TM state leading to a TM Bad Thing when we return to userspace.
+ *
+ * CAUTION:
+ * After regs->MSR[TS] being updated, make sure that get_user(),
+ * put_user() or similar functions are *not* called. These
+ * functions can generate page faults which will cause the process
+ * to be de-scheduled with MSR[TS] set but without calling
+ * tm_recheckpoint(). This can cause a bug.
+ */
+ regs->msr |= MSR_TM;
+
/* This loads the checkpointed FP/VEC state, if used */
tm_recheckpoint(&tsk->thread);
@@ -585,6 +599,8 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
regs->msr |= MSR_VEC;
}
+ preempt_enable();
+
return err;
}
#endif
--
2.19.0
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 747df19747bc9752cd40b9cce761e17a033aa5c2 Mon Sep 17 00:00:00 2001
From: Daniel Mack <daniel(a)zonque.org>
Date: Thu, 11 Oct 2018 20:32:05 +0200
Subject: [PATCH] ASoC: sta32x: set ->component pointer in private struct
The ESD watchdog code in sta32x_watchdog() dereferences the pointer
which is never assigned.
This is a regression from a1be4cead9b950 ("ASoC: sta32x: Convert to direct
regmap API usage.") which went unnoticed since nobody seems to use that ESD
workaround.
Fixes: a1be4cead9b950 ("ASoC: sta32x: Convert to direct regmap API usage.")
Signed-off-by: Daniel Mack <daniel(a)zonque.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
diff --git a/sound/soc/codecs/sta32x.c b/sound/soc/codecs/sta32x.c
index d5035f2f2b2b..ce508b4cc85c 100644
--- a/sound/soc/codecs/sta32x.c
+++ b/sound/soc/codecs/sta32x.c
@@ -879,6 +879,9 @@ static int sta32x_probe(struct snd_soc_component *component)
struct sta32x_priv *sta32x = snd_soc_component_get_drvdata(component);
struct sta32x_platform_data *pdata = sta32x->pdata;
int i, ret = 0, thermal = 0;
+
+ sta32x->component = component;
+
ret = regulator_bulk_enable(ARRAY_SIZE(sta32x->supplies),
sta32x->supplies);
if (ret != 0) {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 2d204ee9d671327915260071c19350d84344e096 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dan.carpenter(a)oracle.com>
Date: Mon, 10 Sep 2018 14:12:07 +0300
Subject: [PATCH] cifs: integer overflow in in SMB2_ioctl()
The "le32_to_cpu(rsp->OutputOffset) + *plen" addition can overflow and
wrap around to a smaller value which looks like it would lead to an
information leak.
Fixes: 4a72dafa19ba ("SMB2 FSCTL and IOCTL worker function")
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel(a)suse.com>
CC: Stable <stable(a)vger.kernel.org>
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 6f0e6b42599c..f54d07bda067 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2459,14 +2459,14 @@ SMB2_ioctl(const unsigned int xid, struct cifs_tcon *tcon, u64 persistent_fid,
/* We check for obvious errors in the output buffer length and offset */
if (*plen == 0)
goto ioctl_exit; /* server returned no data */
- else if (*plen > 0xFF00) {
+ else if (*plen > rsp_iov.iov_len || *plen > 0xFF00) {
cifs_dbg(VFS, "srv returned invalid ioctl length: %d\n", *plen);
*plen = 0;
rc = -EIO;
goto ioctl_exit;
}
- if (rsp_iov.iov_len < le32_to_cpu(rsp->OutputOffset) + *plen) {
+ if (rsp_iov.iov_len - *plen < le32_to_cpu(rsp->OutputOffset)) {
cifs_dbg(VFS, "Malformed ioctl resp: len %d offset %d\n", *plen,
le32_to_cpu(rsp->OutputOffset));
*plen = 0;
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem
longer in truncation and hold punch code to cover the call to
remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for
races becomes 'dead' as it can not longer happen. Remove the dead code
and expand comments to explain reasoning. Similarly, checks for races
with truncation in the page fault path can be simplified and removed.
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/hugetlbfs/inode.c | 61 ++++++++++++++++++++------------------------
mm/hugetlb.c | 21 ++++++++-------
2 files changed, 38 insertions(+), 44 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..a2fcea5f8225 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -482,9 +462,20 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2a3162030167..cfd9790b01e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3757,16 +3757,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3856,9 +3856,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3961,8 +3958,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
--
2.17.2
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straight forward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem
longer in truncation and hold punch code to cover the call to
remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for
races becomes 'dead' as it can not longer happen. Remove the dead code
and expand comments to explain reasoning. Similarly, checks for races
with truncation in the page fault path can be simplified and removed.
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/hugetlbfs/inode.c | 50 +++++++++++++++-----------------------------
mm/hugetlb.c | 21 +++++++++----------
2 files changed, 27 insertions(+), 44 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..a9c00c6ef80d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -505,8 +485,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +520,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +604,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ab4c77b8c72c..25a0cd2f8b39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3760,16 +3760,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3859,9 +3859,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3964,8 +3961,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
--
2.17.2
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 55e56f06ed71d9441f3abd5b1d3c1a870812b3fe Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <willy(a)infradead.org>
Date: Tue, 27 Nov 2018 13:16:34 -0800
Subject: [PATCH] dax: Don't access a freed inode
After we drop the i_pages lock, the inode can be freed at any time.
The get_unlocked_entry() code has no choice but to reacquire the lock,
so it can't be used here. Create a new wait_entry_unlocked() which takes
care not to acquire the lock or dereference the address_space in any way.
Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Matthew Wilcox <willy(a)infradead.org>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/fs/dax.c b/fs/dax.c
index e69fc231833b..3f592dc18d67 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -232,6 +232,34 @@ static void *get_unlocked_entry(struct xa_state *xas)
}
}
+/*
+ * The only thing keeping the address space around is the i_pages lock
+ * (it's cycled in clear_inode() after removing the entries from i_pages)
+ * After we call xas_unlock_irq(), we cannot touch xas->xa.
+ */
+static void wait_entry_unlocked(struct xa_state *xas, void *entry)
+{
+ struct wait_exceptional_entry_queue ewait;
+ wait_queue_head_t *wq;
+
+ init_wait(&ewait.wait);
+ ewait.wait.func = wake_exceptional_entry_func;
+
+ wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+ prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+ xas_unlock_irq(xas);
+ schedule();
+ finish_wait(wq, &ewait.wait);
+
+ /*
+ * Entry lock waits are exclusive. Wake up the next waiter since
+ * we aren't sure we will acquire the entry lock and thus wake
+ * the next waiter up on unlock.
+ */
+ if (waitqueue_active(wq))
+ __wake_up(wq, TASK_NORMAL, 1, &ewait.key);
+}
+
static void put_unlocked_entry(struct xa_state *xas, void *entry)
{
/* If we were the only waiter woken, wake the next one */
@@ -389,9 +417,7 @@ bool dax_lock_mapping_entry(struct page *page)
entry = xas_load(&xas);
if (dax_is_locked(entry)) {
rcu_read_unlock();
- entry = get_unlocked_entry(&xas);
- xas_unlock_irq(&xas);
- put_unlocked_entry(&xas, entry);
+ wait_entry_unlocked(&xas, entry);
rcu_read_lock();
continue;
}
Hi Sasha,
On Thu, Dec 20, 2018 at 07:26:15PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 432c6bacbd0c MIPS: Use per-mm page to execute branch delay slot instructions.
>
> The bot has tested the following trees: v4.19.10, v4.14.89, v4.9.146,
Neat! I like the idea of this automation :)
> v4.19.10: Build OK!
> v4.14.89: Build OK!
> v4.9.146: Failed to apply! Possible dependencies:
> 05ce77249d50 ("userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED request")
> 163e11bc4f6e ("userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_HUGETLBFS")
> 67dece7d4c58 ("x86/vdso: Set vDSO pointer only after success")
> 72f87654c696 ("userfaultfd: non-cooperative: add mremap() event")
> 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
> 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps")
> 9cd75c3cd4c3 ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
> d811914d8757 ("userfaultfd: non-cooperative: rename *EVENT_MADVDONTNEED to *EVENT_REMOVE")
This list includes the correct soft dependency - commit 897ab3e0c49e
("userfaultfd: non-cooperative: add event for memory unmaps") which
added an extra argument to mmap_region().
> How should we proceed with this patch?
The backport to v4.9 should simply drop the last argument (NULL) in the
call to mmap_region().
Is there some way I can indicate this sort of thing in future patches so
that the automation can spot that I already know it won't apply cleanly
to a particular range of kernel versions? Or even better, that I could
indicate what needs to be changed when backporting to those kernel
versions?
Thanks,
Paul
+cc Greg, stable
Greensky, James J <james.j.greensky(a)intel.com> 于2018年12月21日周五 上午11:48写道:
>
> Commit d38d272592737ea88a20 ("perf tools: Synthesize GROUP_DESC feature in pipe mode") broke the LT 4.14 branch when using event groups in pipe-mode.
>
> # perf record -e '{cycles,instructions,branches}' -- sleep 4 | perf report
> # To display the perf.data header info, please use --header/--header-only options
> #
> Oxd7c [0x60]: failed to process type: 80
> Error:
> Failed to process sample
>
> Commit a2015516c5c0be932a69 ("perf record: Synthesize features before events in pipe mode") is the fix. Can we get this cherry-picked and applied?
>