From: Stanley Chu <stanley.chu(a)mediatek.com>
Commit 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume") fixed up the inconsistent RPM status between
the request queue and the device. However, the request queue's RPM
status shall be changed only on a successful resume; otherwise the
status may still be inconsistent, as below:
Request queue: RPM_ACTIVE
Device: RPM_SUSPENDED
This ends up in a soft lockup, because requests can be submitted to
underlying devices while those devices and their required resources
are not resumed.
For example, after the above inconsistent status happens, an I/O
request can be submitted to the UFS device driver, but a required
resource (like a clock) is not yet resumed, leading to the warning
below with this call stack:
WARN_ON(hba->clk_gating.state != CLKS_ON);
ufshcd_queuecommand
scsi_dispatch_cmd
scsi_request_fn
__blk_run_queue
cfq_insert_request
__elv_add_request
blk_flush_plug_list
blk_finish_plug
jbd2_journal_commit_transaction
kjournald2
All subsequent I/O requests may then hang because there is no response
from the storage host or device, and a soft lockup occurs in the
system. In the end, the system may crash in many ways.
Fixes: 356fd2663cff ("scsi: Set request queue runtime PM status back to active on resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com>
---
drivers/scsi/scsi_pm.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index a2b4179bfdf7..7639df91b110 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -80,8 +80,22 @@ static int scsi_dev_type_resume(struct device *dev,
if (err == 0) {
pm_runtime_disable(dev);
- pm_runtime_set_active(dev);
+ err = pm_runtime_set_active(dev);
pm_runtime_enable(dev);
+
+ /*
+ * Forcibly set runtime PM status of request queue to "active"
+ * to make sure we can again get requests from the queue
+ * (see also blk_pm_peek_request()).
+ *
+ * The resume hook will correct runtime PM status of the disk.
+ */
+ if (!err && scsi_is_sdev_device(dev)) {
+ struct scsi_device *sdev = to_scsi_device(dev);
+
+ if (sdev->request_queue->dev)
+ blk_set_runtime_active(sdev->request_queue);
+ }
}
return err;
@@ -140,16 +154,6 @@ static int scsi_bus_resume_common(struct device *dev,
else
fn = NULL;
- /*
- * Forcibly set runtime PM status of request queue to "active" to
- * make sure we can again get requests from the queue (see also
- * blk_pm_peek_request()).
- *
- * The resume hook will correct runtime PM status of the disk.
- */
- if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev))
- blk_set_runtime_active(to_scsi_device(dev)->request_queue);
-
if (fn) {
async_schedule_domain(fn, dev, &scsi_sd_pm_domain);
--
2.18.0
From: Eric Biggers <ebiggers(a)google.com>
The memcpy() in crypto_cfb_decrypt_inplace() uses walk->iv as both the
source and destination, which has undefined behavior. It is unneeded
because walk->iv is already used to hold the previous ciphertext block;
thus, walk->iv is already updated to its final value. So, remove it.
Also, note that in-place decryption is the only case where the previous
ciphertext block is not directly available. Therefore, as a related
cleanup I also updated crypto_cfb_encrypt_segment() to directly use the
previous ciphertext block rather than save it into walk->iv. This makes
it consistent with in-place encryption and out-of-place decryption; now
only in-place decryption is different, because it has to be.
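For readers less familiar with the mode, here is a minimal userspace
sketch of in-place CFB decryption (illustrative only -- a dummy XOR
"cipher" stands in for the real block cipher, and none of the kernel
skcipher API is used). It shows why the IV buffer already holds its
final value when the loop ends:

#include <stdio.h>
#include <string.h>

#define BSIZE 16

/* Stand-in for a real block cipher's encrypt primitive. */
static void dummy_encrypt_one(const unsigned char *key,
			      const unsigned char *in, unsigned char *out)
{
	for (int i = 0; i < BSIZE; i++)
		out[i] = in[i] ^ key[i];
}

/* CFB: P[i] = C[i] ^ E(IV_i), with IV_{i+1} = C[i]. */
static void cfb_decrypt_inplace(const unsigned char *key, unsigned char *buf,
				size_t nblocks, unsigned char *iv)
{
	unsigned char ks[BSIZE], tmp[BSIZE];

	while (nblocks--) {
		/* Save the ciphertext before overwriting it: it is the
		 * IV for the next block (the in-place special case). */
		memcpy(tmp, buf, BSIZE);
		dummy_encrypt_one(key, iv, ks);
		for (int i = 0; i < BSIZE; i++)
			buf[i] ^= ks[i];
		memcpy(iv, tmp, BSIZE);
		buf += BSIZE;
	}
	/* iv already holds its final value here; a trailing
	 * memcpy(iv, iv, BSIZE) would be exactly the redundant
	 * (and undefined) copy the patch removes. */
}

int main(void)
{
	unsigned char key[BSIZE] = "0123456789abcde";
	unsigned char iv[BSIZE] = { 0 };
	unsigned char buf[2 * BSIZE] = "two blocks of ciphertext input!";

	cfb_decrypt_inplace(key, buf, 2, iv);
	printf("%.32s\n", buf); /* decrypted (here: garbled) bytes */
	return 0;
}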
Fixes: a7d85e06ed80 ("crypto: cfb - add support for Cipher FeedBack mode")
Cc: <stable(a)vger.kernel.org> # v4.17+
Cc: James Bottomley <James.Bottomley(a)HansenPartnership.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
crypto/cfb.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/crypto/cfb.c b/crypto/cfb.c
index 183e8a9c33128..4abfe32ff8451 100644
--- a/crypto/cfb.c
+++ b/crypto/cfb.c
@@ -77,12 +77,14 @@ static int crypto_cfb_encrypt_segment(struct skcipher_walk *walk,
do {
crypto_cfb_encrypt_one(tfm, iv, dst);
crypto_xor(dst, src, bsize);
- memcpy(iv, dst, bsize);
+ iv = dst;
src += bsize;
dst += bsize;
} while ((nbytes -= bsize) >= bsize);
+ memcpy(walk->iv, iv, bsize);
+
return nbytes;
}
@@ -162,7 +164,7 @@ static int crypto_cfb_decrypt_inplace(struct skcipher_walk *walk,
const unsigned int bsize = crypto_cfb_bsize(tfm);
unsigned int nbytes = walk->nbytes;
u8 *src = walk->src.virt.addr;
- u8 *iv = walk->iv;
+ u8 * const iv = walk->iv;
u8 tmp[MAX_CIPHER_BLOCKSIZE];
do {
@@ -172,8 +174,6 @@ static int crypto_cfb_decrypt_inplace(struct skcipher_walk *walk,
src += bsize;
} while ((nbytes -= bsize) >= bsize);
- memcpy(walk->iv, iv, bsize);
-
return nbytes;
}
--
2.20.1
From: Eric Biggers <ebiggers(a)google.com>
Like some other block cipher mode implementations, the CFB
implementation assumes that while walking through the scatterlist, a
partial block does not occur until the end. But the walk is incorrectly
being done with a blocksize of 1, as 'cra_blocksize' is set to 1 (since
CFB is a stream cipher) but no 'chunksize' is set. This bug causes
incorrect encryption/decryption for some scatterlist layouts.
Fix it by setting the 'chunksize'. Also extend the CFB test vectors to
cover this bug as well as cases where the message length is not a
multiple of the block size.
Fixes: a7d85e06ed80 ("crypto: cfb - add support for Cipher FeedBack mode")
Cc: <stable(a)vger.kernel.org> # v4.17+
Cc: James Bottomley <James.Bottomley(a)HansenPartnership.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
crypto/cfb.c | 6 ++++++
crypto/testmgr.h | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/crypto/cfb.c b/crypto/cfb.c
index e81e456734985..183e8a9c33128 100644
--- a/crypto/cfb.c
+++ b/crypto/cfb.c
@@ -298,6 +298,12 @@ static int crypto_cfb_create(struct crypto_template *tmpl, struct rtattr **tb)
inst->alg.base.cra_blocksize = 1;
inst->alg.base.cra_alignmask = alg->cra_alignmask;
+ /*
+ * To simplify the implementation, configure the skcipher walk to only
+ * give a partial block at the very end, never earlier.
+ */
+ inst->alg.chunksize = alg->cra_blocksize;
+
inst->alg.ivsize = alg->cra_blocksize;
inst->alg.min_keysize = alg->cra_cipher.cia_min_keysize;
inst->alg.max_keysize = alg->cra_cipher.cia_max_keysize;
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index e8f47d7b92cdd..7f4dae7a57a1c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -12870,6 +12870,31 @@ static const struct cipher_testvec aes_cfb_tv_template[] = {
"\x75\xa3\x85\x74\x1a\xb9\xce\xf8"
"\x20\x31\x62\x3d\x55\xb1\xe4\x71",
.len = 64,
+ .also_non_np = 1,
+ .np = 2,
+ .tap = { 31, 33 },
+ }, { /* > 16 bytes, not a multiple of 16 bytes */
+ .key = "\x2b\x7e\x15\x16\x28\xae\xd2\xa6"
+ "\xab\xf7\x15\x88\x09\xcf\x4f\x3c",
+ .klen = 16,
+ .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
+ "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
+ "\xae",
+ .ctext = "\x3b\x3f\xd9\x2e\xb7\x2d\xad\x20"
+ "\x33\x34\x49\xf8\xe8\x3c\xfb\x4a"
+ "\xc8",
+ .len = 17,
+ }, { /* < 16 bytes */
+ .key = "\x2b\x7e\x15\x16\x28\xae\xd2\xa6"
+ "\xab\xf7\x15\x88\x09\xcf\x4f\x3c",
+ .klen = 16,
+ .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f",
+ .ctext = "\x3b\x3f\xd9\x2e\xb7\x2d\xad",
+ .len = 7,
},
};
--
2.20.1
From: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
[Why]
If the "max bpc" isn't explicitly set in the atomic state then it
have a value of 0. This has the correct behavior of limiting a panel
to 8bpc in the case where the panel supports 8bpc. In the case of eDP
panels this isn't a true assumption - there are panels that can only
do 6bpc.
Banding occurs for these displays.
[How]
Initialize the max_bpc when the connector resets to 8bpc. Also carry
over the value when the state is duplicated.
Bugzilla: https://bugs.freedesktop.org/108825
Fixes: 307638884f72 ("drm/amd/display: Support amdgpu "max bpc" connector property")
Backport of commit 49f1c44b581b08e3289127ffe58bd208c3166701 upstream.
Fixes a regression caused by the automatic patch selection of these two
patches:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
Bug: https://bugs.freedesktop.org/show_bug.cgi?id=109122
Cc: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 49f1c44b581b08e3289127ffe58bd208c3166701)
Cc: <stable(a)vger.kernel.org> # 4.19.x
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 299def84e69c..d792735f1365 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2894,6 +2894,7 @@ void amdgpu_dm_connector_funcs_reset(struct drm_connector *connector)
state->underscan_enable = false;
state->underscan_hborder = 0;
state->underscan_vborder = 0;
+ state->max_bpc = 8;
__drm_atomic_helper_connector_reset(connector, &state->base);
}
@@ -2911,6 +2912,7 @@ amdgpu_dm_connector_atomic_duplicate_state(struct drm_connector *connector)
if (new_state) {
__drm_atomic_helper_connector_duplicate_state(connector,
&new_state->base);
+ new_state->max_bpc = state->max_bpc;
return &new_state->base;
}
--
2.20.1
The patch titled
Subject: mm/vmalloc: fix size check for remap_vmalloc_range_partial()
has been added to the -mm tree. Its filename is
mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-vmalloc-fix-size-check-for-rema…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmalloc-fix-size-check-for-rema…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Roman Penyaev <rpenyaev(a)suse.de>
Subject: mm/vmalloc: fix size check for remap_vmalloc_range_partial()
area->size can include the adjacent guard page, but get_vm_area_size()
returns the actual size of the area.
This fixes a possible kernel crash when userspace tries to map an area
one page bigger: the size check passes, but the following
vmalloc_to_page() returns NULL on the last guard (non-existing) page.
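For reference, get_vm_area_size() accounts for the guard page roughly
like this (a sketch of the helper in include/linux/vmalloc.h):

/* area->size includes one trailing guard page unless the area was
 * created with VM_NO_GUARD; return the usable size only. */
static inline size_t get_vm_area_size(const struct vm_struct *area)
{
	if (!(area->flags & VM_NO_GUARD))
		/* return actual size without guard page */
		return area->size - PAGE_SIZE;
	return area->size;
}

With the old check, "size" could legally extend one page into the guard
page, for which vmalloc_to_page() has no struct page to return.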
Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Joe Perches <joe(a)perches.com>
Cc: "Luis R. Rodriguez" <mcgrof(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmalloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmalloc.c~mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial
+++ a/mm/vmalloc.c
@@ -2248,7 +2248,7 @@ int remap_vmalloc_range_partial(struct v
if (!(area->flags & VM_USERMAP))
return -EINVAL;
- if (kaddr + size > area->addr + area->size)
+ if (kaddr + size > area->addr + get_vm_area_size(area))
return -EINVAL;
do {
_
Patches currently in -mm which might be from rpenyaev(a)suse.de are
epoll-make-sure-all-elements-in-ready-list-are-in-fifo-order.patch
epoll-loosen-irq-safety-in-ep_poll_callback.patch
epoll-unify-awaking-of-wakeup-source-on-ep_poll_callback-path.patch
epoll-use-rwlock-in-order-to-reduce-ep_poll_callback-contention.patch
mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial.patch
mm-vmalloc-do-not-call-kmemleak_free-on-not-yet-accounted-memory.patch
mm-vmalloc-pass-vm_usermap-flags-directly-to-__vmalloc_node_range.patch
On 1/3/19 5:52 AM, Sasha Levin wrote:
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: e8c24d3a23a4 x86/pkeys: Allocation/free syscalls.
>
> The bot has tested the following trees: v4.20.0, v4.19.13, v4.14.91, v4.9.148,
>
> v4.20.0: Build OK!
> v4.19.13: Build OK!
> v4.14.91: Build OK!
> v4.9.148: Failed to apply! Possible dependencies:
> c10e83f598d0 ("arch, mm: Allow arch_dup_mmap() to fail")
>
>
> How should we proceed with this patch?
The 4.9 version of arch_dup_mmap() does not contain the
ldt_dup_context() line. We just need to add arch_dup_pkeys(), like so:
void arch_dup_mmap(struct mm_struct *oldmm,
struct mm_struct *mm)
{
+ arch_dup_pkeys(oldmm, mm);
paravirt_arch_dup_mmap(oldmm, mm);
}
Should be a pretty simple merge. We can basically ignore the
ldt_dup_context() changes.
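For reference, the arch_dup_pkeys() helper that line calls looks
roughly like this (a sketch based on the mainline fix; the 4.9 config
symbol names may differ slightly):

static inline void arch_dup_pkeys(struct mm_struct *oldmm,
				  struct mm_struct *mm)
{
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
		return;

	/* Duplicate the oldmm pkey state in mm: */
	mm->context.pkey_allocation_map = oldmm->context.pkey_allocation_map;
	mm->context.execute_only_pkey = oldmm->context.execute_only_pkey;
#endif
}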
The patch titled
Subject: mm/usercopy.c: no check page span for stack objects
has been added to the -mm tree. Its filename is
usercopy-no-check-page-span-for-stack-objects.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/usercopy-no-check-page-span-for-st…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/usercopy-no-check-page-span-for-st…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Qian Cai <cai(a)lca.pw>
Subject: mm/usercopy.c: no check page span for stack objects
It is easy to trigger this with CONFIG_HARDENED_USERCOPY_PAGESPAN=y:
usercopy: Kernel memory overwrite attempt detected to spans multiple
pages (offset 0, size 23)!
kernel BUG at mm/usercopy.c:102!
For example,
print_worker_info
char name[WQ_NAME_LEN] = { };
char desc[WORKER_DESC_LEN] = { };
probe_kernel_read(name, wq->name, sizeof(name) - 1);
probe_kernel_read(desc, worker->desc, sizeof(desc) - 1);
__copy_from_user_inatomic
check_object_size
check_heap_object
check_page_span
This is because on-stack variables can cross a PAGE_SIZE boundary and
fail this check:
if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) ==
((unsigned long)end & (unsigned long)PAGE_MASK)))
ptr = FFFF889007D7EFF8
end = FFFF889007D7F00E
Hence, fix it by checking if it is a stack object first.
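The arithmetic is easy to reproduce in userspace (illustrative only;
PAGE_SIZE hardcoded to 4K, addresses taken from the report above):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Mirrors the kernel check: true if [ptr, ptr + n) stays in one page. */
static int same_page(uintptr_t ptr, unsigned long n)
{
	return (ptr & PAGE_MASK) == ((ptr + n - 1) & PAGE_MASK);
}

int main(void)
{
	/* The 23-byte on-stack object from the report legitimately
	 * crosses from page ...7E000 into page ...7F000. */
	uintptr_t ptr = 0xFFFF889007D7EFF8UL;

	printf("same page: %d\n", same_page(ptr, 23)); /* prints 0 */
	return 0;
}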
Link: http://lkml.kernel.org/r/20181231030254.99441-1-cai@lca.pw
Signed-off-by: Qian Cai <cai(a)lca.pw>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/usercopy.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/usercopy.c~usercopy-no-check-page-span-for-stack-objects
+++ a/mm/usercopy.c
@@ -262,9 +262,6 @@ void __check_object_size(const void *ptr
/* Check for invalid addresses. */
check_bogus_address((const unsigned long)ptr, n, to_user);
- /* Check for bad heap object. */
- check_heap_object(ptr, n, to_user);
-
/* Check for bad stack object. */
switch (check_stack_object(ptr, n)) {
case NOT_STACK:
@@ -282,6 +279,9 @@ void __check_object_size(const void *ptr
usercopy_abort("process stack", NULL, to_user, 0, n);
}
+ /* Check for bad heap object. */
+ check_heap_object(ptr, n, to_user);
+
/* Check for object in kernel to avoid text exposure. */
check_kernel_text_object((const unsigned long)ptr, n, to_user);
}
_
Patches currently in -mm which might be from cai(a)lca.pw are
mm-page_owner-fix-for-deferred-struct-page-init.patch
usercopy-no-check-page-span-for-stack-objects.patch
Hi folks,
I noticed a regression introduced sometime after 4.19.4 in USB power
management. I have a 2015 MacBook Pro. When I try to do a suspend or a
suspend+hibernate, I get the following error messages trying to
suspend usb2 and the suspend fails. This works fine in 4.19.4:
Dec 22 13:50:36 eric-macbookpro kernel: Freezing remaining freezable
tasks ... (elapsed 0.001 seconds) done.
Dec 22 13:50:36 eric-macbookpro kernel: Suspending console(s) (use
no_console_suspend to debug)
Dec 22 13:50:36 eric-macbookpro kernel: dpm_run_callback():
usb_dev_freeze+0x0/0x10 returns -16
Dec 22 13:50:36 eric-macbookpro kernel: PM: Device usb2 failed to
freeze async: error -16
Dec 22 13:50:38 eric-macbookpro systemd[1]:
systemd-hybrid-sleep.service: Main process exited, code=exited,
status=1/FAILURE
Dec 22 13:50:38 eric-macbookpro systemd[1]:
systemd-hybrid-sleep.service: Failed with result 'exit-code'.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Failed to start Hybrid
Suspend+Hibernate.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Dependency failed for
Hybrid Suspend+Hibernate.
Dec 22 13:50:38 eric-macbookpro systemd[1]: hybrid-sleep.target: Job
hybrid-sleep.target/start failed with result 'dependency'.
Dec 22 13:50:38 eric-macbookpro systemd-logind[1573]: Operation
'sleep' finished.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Stopped target Sleep.
The behavior exists in 4.19.8 and 4.19.11, the kernel versions I have
upgraded to with Arch Linux, so the regression was introduced sometime
between 4.19.4 and 4.19.8. Hibernate still works but when I resume
from hibernate, there is a ksoftirqd and kworker thread/process
together taking up 100% of one core. If I turn off auto power control
for usb1 and usb2, the threads stop spinning, i.e.:
echo 'on' > /sys/bus/usb/devices/usb1/power/control
Any suggestions as to where this regression was introduced and what
can be done to fix it?
Thanks,
Eric
From: Stanley Chu <stanley.chu(a)mediatek.com>
Commit 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume") fixed up the inconsistent RPM status between
the request queue and the device. However, the request queue's RPM
status shall be changed only on a successful resume; otherwise the
status may still be inconsistent, as below:
Request queue: RPM_ACTIVE
Device: RPM_SUSPENDED
This ends up in a soft lockup, because requests can be submitted to
underlying devices while those devices and their required resources
are not resumed.
Fixes: 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com>
---
drivers/scsi/scsi_pm.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index a2b4179bfdf7..7639df91b110 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -80,8 +80,22 @@ static int scsi_dev_type_resume(struct device *dev,
if (err == 0) {
pm_runtime_disable(dev);
- pm_runtime_set_active(dev);
+ err = pm_runtime_set_active(dev);
pm_runtime_enable(dev);
+
+ /*
+ * Forcibly set runtime PM status of request queue to "active"
+ * to make sure we can again get requests from the queue
+ * (see also blk_pm_peek_request()).
+ *
+ * The resume hook will correct runtime PM status of the disk.
+ */
+ if (!err && scsi_is_sdev_device(dev)) {
+ struct scsi_device *sdev = to_scsi_device(dev);
+
+ if (sdev->request_queue->dev)
+ blk_set_runtime_active(sdev->request_queue);
+ }
}
return err;
@@ -140,16 +154,6 @@ static int scsi_bus_resume_common(struct device *dev,
else
fn = NULL;
- /*
- * Forcibly set runtime PM status of request queue to "active" to
- * make sure we can again get requests from the queue (see also
- * blk_pm_peek_request()).
- *
- * The resume hook will correct runtime PM status of the disk.
- */
- if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev))
- blk_set_runtime_active(to_scsi_device(dev)->request_queue);
-
if (fn) {
async_schedule_domain(fn, dev, &scsi_sd_pm_domain);
--
2.18.0
Commit 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames")
causes scheduling while atomic bugs followed by a hard freeze whenever
the driver tries to connect to a WEP-encrypted network. Experimentation
showed that the freezes were eliminated when module lib80211 was
preloaded, which can be forced by calling lib80211_get_crypto_ops()
directly rather than indirectly through try_then_request_module().
With this change, no BUG messages are logged.
Fixes: 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames")
Cc: Stable <stable(a)vger.kernel.org> # v4.17+
Cc: Michael Straube <straube.linux(a)gmail.com>
Cc: Ivan Safonov <insafonov(a)gmail.com>
Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net>
---
drivers/staging/rtl8188eu/core/rtw_security.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/staging/rtl8188eu/core/rtw_security.c b/drivers/staging/rtl8188eu/core/rtw_security.c
index 052656a22821..bab96c870042 100644
--- a/drivers/staging/rtl8188eu/core/rtw_security.c
+++ b/drivers/staging/rtl8188eu/core/rtw_security.c
@@ -154,7 +154,7 @@ void rtw_wep_encrypt(struct adapter *padapter, u8 *pxmitframe)
pframe = ((struct xmit_frame *)pxmitframe)->buf_addr + hw_hdr_offset;
- crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep");
+ crypto_ops = lib80211_get_crypto_ops("WEP");
if (!crypto_ops)
return;
@@ -210,7 +210,7 @@ int rtw_wep_decrypt(struct adapter *padapter, u8 *precvframe)
void *crypto_private = NULL;
int status = _SUCCESS;
const int keyindex = prxattrib->key_index;
- struct lib80211_crypto_ops *crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep");
+ struct lib80211_crypto_ops *crypto_ops = lib80211_get_crypto_ops("WEP");
char iv[4], icv[4];
if (!crypto_ops) {
--
2.16.4
An iptables rule like the following on a multicore system will result
in accepting more connections than the limit set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
In the check_hlist() function, connections that are found in the saved
connections but not in the netfilter conntrack table are deleted, on
the assumption that those connections no longer exist. But on
multicore systems there exists a small time window when a connection
has been added to the rb-tree maintained by xt_connlimit but has not
yet made it into the netfilter conntrack table. This causes concurrent
connection counts to be incorrect and to go over the limit set in the
iptables rule.
The fix has been partially backported from the above-mentioned upstream
commit: introduce a timestamp and the owning CPU.
Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com>
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: stable(a)vger.kernel.org # v4.15 and before
Cc: netdev(a)vger.kernel.org
Cc: Dmitry Andrianov <dmitry.andrianov(a)alertme.com>
Cc: Justin Pettit <jpettit(a)vmware.com>
Cc: Yi-Hung Wei <yihung.wei(a)gmail.com>
---
net/netfilter/xt_connlimit.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec..e7b092b 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,8 @@ struct xt_connlimit_conn {
struct hlist_node node;
struct nf_conntrack_tuple tuple;
union nf_inet_addr addr;
+ int cpu;
+ u32 jiffies32;
};
struct xt_connlimit_rb {
@@ -126,6 +128,8 @@ static bool add_hlist(struct hlist_head *head,
return false;
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->cpu = raw_smp_processor_id();
+ conn->jiffies32 = (u32)jiffies;
hlist_add_head(&conn->node, head);
return true;
}
@@ -148,8 +152,26 @@ static unsigned int check_hlist(struct net *net,
hlist_for_each_entry_safe(conn, n, head, node) {
found = nf_conntrack_find_get(net, zone, &conn->tuple);
if (found == NULL) {
- hlist_del(&conn->node);
- kmem_cache_free(connlimit_conn_cachep, conn);
+ /* If connection is not found, it may be because
+ * it has not made into conntrack table yet. We
+ * check if it is a recently created connection
+ * on a different core and do not delete it in that
+ * case.
+ */
+
+ unsigned long a, b;
+ int cpu = raw_smp_processor_id();
+ __u32 age;
+
+ b = conn->jiffies32;
+ a = (u32)jiffies;
+ age = a - b;
+ if (conn->cpu != cpu && age <= 2) {
+ length++;
+ } else {
+ hlist_del(&conn->node);
+ kmem_cache_free(connlimit_conn_cachep, conn);
+ }
continue;
}
@@ -271,6 +293,8 @@ static void tree_nodes_free(struct rb_root *root,
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->cpu = raw_smp_processor_id();
+ conn->jiffies32 = (u32)jiffies;
rbconn->addr = *addr;
INIT_HLIST_HEAD(&rbconn->hhead);
--
1.8.3.1
An iptables rule like the following on a multicore system will result
in accepting more connections than the limit set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
In the check_hlist() function, connections that are found in the saved
connections but not in the netfilter conntrack table are deleted, on
the assumption that those connections no longer exist. But on
multicore systems there exists a small time window when a connection
has been added to the rb-tree maintained by xt_connlimit but has not
yet made it into the netfilter conntrack table. This causes concurrent
connection counts to be incorrect and to go over the limit set in the
iptables rule.
Connection 1 on Core 1               Connection 2 on Core 2

list_length = N
conntrack_table_len = N
spin_lock_bh()
In check_hlist() function
  a. loop over saved connections
     1. call nf_conntrack_find_get()
     2. If not found in 1,
        i. call hlist_del()
  b. return total count to caller
  c. connection gets added to list
     of saved connections.
spin_unlock_bh()
list_length = N + 1
                                     spin_lock_bh() on core 2
                                     In check_hlist() function
                                       a. loop over saved connections
                                          1. call nf_conntrack_find_get()
                                          2. If not found in 1,
                                             i. call hlist_del()
                                                [Connection 1 was in the list
                                                but not in nf_conntrack yet]
                                             ii. connection 1 gets deleted
                                     list_length = N
                                     conntrack_table_len = N
                                       b. return total count to caller
                                       c. connection 2 gets added to list
                                          of saved connections
                                     spin_unlock_bh()
d. connection 1 gets added to
   nf_conntrack
list_length = N + 1
conntrack_table_len = N + 1
                                     e. connection 2 gets added to
                                        nf_conntrack
                                     list_length = N + 1
                                     conntrack_table_len = N + 2
So we end up with N + 1 connections in the list but N + 2 in
nf_conntrack, eventually allowing more connections than the limit set
in the rule.
This fix adds an additional field to track such pending connections,
prevents them from being deleted by another execution thread on a
different core, and returns the correct count.
Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com>
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: stable(a)vger.kernel.org # v4.15 and before
---
net/netfilter/xt_connlimit.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec980e9..bd7563c209a4 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,7 @@ struct xt_connlimit_conn {
struct hlist_node node;
struct nf_conntrack_tuple tuple;
union nf_inet_addr addr;
+ bool pending_add;
};
struct xt_connlimit_rb {
@@ -126,6 +127,7 @@ static bool add_hlist(struct hlist_head *head,
return false;
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->pending_add = true;
hlist_add_head(&conn->node, head);
return true;
}
@@ -144,15 +146,31 @@ static unsigned int check_hlist(struct net *net,
*addit = true;
- /* check the saved connections */
+ /* check the saved connections
+ */
hlist_for_each_entry_safe(conn, n, head, node) {
found = nf_conntrack_find_get(net, zone, &conn->tuple);
if (found == NULL) {
- hlist_del(&conn->node);
- kmem_cache_free(connlimit_conn_cachep, conn);
+ /* It could be an already deleted connection or
+ * a new connection that is not there in conntrack
+ * yet. If former delete it from the list, else
+ * increase count and move on.
+ */
+ if (conn->pending_add) {
+ length++;
+ } else {
+ hlist_del(&conn->node);
+ kmem_cache_free(connlimit_conn_cachep, conn);
+ }
continue;
}
+ /* If it is a connection that was pending insertion to
+ * connection tracking table before, then it's time to clear
+ * the flag.
+ */
+ conn->pending_add = false;
+
found_ct = nf_ct_tuplehash_to_ctrack(found);
if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
--
2.14.4
Hi x86 maintainers,
This is an important fix that I believe needs to be merged for 4.21.
Without it, applications calling fork() can potentially double-allocate
a protection key, causing lots of strange problems.
Thomas's Reviewed-by is on the actual fix, but not the selftest.
I would also be happy to send this as a pull request if you would
prefer.
Cc: x86(a)kernel.org
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: stable(a)vger.kernel.org
The patch titled
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
has been added to the -mm tree. Its filename is
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/slab-alien-caches-must-not-be-init…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/slab-alien-caches-must-not-be-init…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Christoph Lameter <cl(a)linux.com>
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
Callers of __alloc_alien() check for NULL. We must do the same check in
__alloc_alien_cache() to avoid NULL pointer dereferences on allocation
failures.
Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e9897490…
Signed-off-by: Christoph Lameter <cl(a)linux.com>
Reported-by: syzbot+d6ed4ec679652b4fd4e4(a)syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/slab.c~slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed
+++ a/mm/slab.c
@@ -666,8 +666,10 @@ static struct alien_cache *__alloc_alien
struct alien_cache *alc = NULL;
alc = kmalloc_node(memsize, gfp, node);
- init_arraycache(&alc->ac, entries, batch);
- spin_lock_init(&alc->lock);
+ if (alc) {
+ init_arraycache(&alc->ac, entries, batch);
+ spin_lock_init(&alc->lock);
+ }
return alc;
}
_
Patches currently in -mm which might be from cl(a)linux.com are
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
The patch titled
Subject: fork, memcg: fix cached_stacks case
has been added to the -mm tree. Its filename is
fork-memcg-fix-cached_stacks-case.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/fork-memcg-fix-cached_stacks-case.…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/fork-memcg-fix-cached_stacks-case.…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Shakeel Butt <shakeelb(a)google.com>
Subject: fork, memcg: fix cached_stacks case
5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge
fail") fixes a crash caused due to failed memcg charge of the kernel
stack. However the fix misses the cached_stacks case which this patch
fixes. So, the same crash can happen if the memcg charge of a cached
stack is failed.
Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com
Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/fork.c~fork-memcg-fix-cached_stacks-case
+++ a/kernel/fork.c
@@ -221,6 +221,7 @@ static unsigned long *alloc_thread_stack
memset(s->addr, 0, THREAD_SIZE);
tsk->stack_vm_area = s;
+ tsk->stack = s->addr;
return s->addr;
}
_
Patches currently in -mm which might be from shakeelb(a)google.com are
fork-memcg-fix-cached_stacks-case.patch
Recently, Alakesh Haloi reported the following issue [1] with stable/4.14:
"""
An iptables rule like the following on a multicore system will result in
accepting more connections than the limit set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
"""
He also proposed a fix that is not in Linus's tree. The discussion went on to
confirm whether the issue was still reproducible with mainline/nf.git tip,
and to either identify the upstream fix or re-submit the non-upstream fix.
Alakesh eventually was able to test with upstream, and reported that the
issue was still reproducible [2].
On that, our findings diverge, at least in my test environment:
First, I verified that the suggested mainline fix for the issue [3] indeed
fixes it, by testing with it applied and reverted on v4.18, a clean revert.
(The issue is reproducible with the commit reverted).
Then, with a consistent reproducer, I moved to nf.git, with HEAD on commit
a007232 ("netfilter: nf_conncount: fix argument order to find_next_bit"),
and the issue was not reproducible (even with 20+ threads on the client
side, the number at which Alakesh reported achieving 2150+ connections [4],
and with the network interface IRQ affinity spread over more and more CPUs).
Either way, the suggested mainline fix does actually fix the issue in 4.14
for at least one environment. So, it might well be the case that Alakesh's
test environment has differences/subtleties that lead to more connections
being accepted, and more commits are needed for that particular environment
type.
But for now, with one bare-metal environment (24-core server, 4-core client)
verified, I thought of submitting the patches for review/comments/testing,
then looking for additional fixes for that environment separately.
The fix is PATCH 4/4, and PATCHes 1-3/4 are helpers for a cleaner backport.
All backports are simple, and essentially consist of refreshing context
lines and using older struct/file names.
Reviews from netfilter maintainers are very appreciated, as I've no previous
experience in this area, and although the backports look simple and build/run
correctly, there's usually stuff that only more experienced people may notice.
Thanks,
Mauricio
Links:
=====
[1] https://www.spinics.net/lists/stable/msg270040.html
[2] https://www.spinics.net/lists/stable/msg273669.html
[3] https://www.spinics.net/lists/stable/msg271300.html
[4] https://www.spinics.net/lists/stable/msg273669.html
Test-case:
=========
- v4.14.91 (original): client achieves 2000+ connections (6000 target)
with 3 threads.
server # iptables -F
server # iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit --connlimit-above 2000 --connlimit-mask 0 -j DROP
server # iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp dpt:7777 flags:FIN,SYN,RST,ACK/SYN #conn src/0 > 2000
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
server # ulimit -SHn 65000
server # ruby server.rb
<... listening ...>
client # ulimit -SHn 65000
client # ruby client.rb 10.230.56.100 7777 6000 3
Connecting to ["10.230.56.100"]:7777 6000 times with 3
1
2
3
<...>
2000
<...>
6000
Target reached. Thread finishing
6001
Target reached. Thread finishing
6002
Target reached. Thread finishing
Threads done. 6002 connections
press enter to exit
- v4.14.91 + patches: client only achieved 2000 connections.
server # (same procedure)
client # (same procedure)
Connecting to ["10.230.56.100"]:7777 6000 times with 3
1
2
3
<...>
2000
<... blocked for a while...>
failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777
Threads done. 2000 connections
press enter to exit
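(The server.rb/client.rb scripts are not attached to this message; for
illustration only, a rough single-threaded C equivalent of the client
would be the following -- a hypothetical reconstruction, not the script
actually used:)

#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct sockaddr_in sa;
	int i, count;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <host> <port> <count>\n", argv[0]);
		return 1;
	}
	count = atoi(argv[3]);

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(atoi(argv[2]));
	if (inet_pton(AF_INET, argv[1], &sa.sin_addr) != 1) {
		fprintf(stderr, "bad address\n");
		return 1;
	}

	for (i = 0; i < count; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0 || connect(fd, (struct sockaddr *)&sa,
				      sizeof(sa)) < 0) {
			perror("connect"); /* e.g. times out once DROPped */
			break;
		}
		printf("%d\n", i + 1); /* fd deliberately left open */
	}

	puts("press enter to exit");
	getchar();
	return 0;
}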
Florian Westphal (2):
netfilter: xt_connlimit: don't store address in the conn nodes
netfilter: nf_conncount: fix garbage collection confirm race
Pablo Neira Ayuso (1):
netfilter: nf_conncount: expose connection list interface
Yi-Hung Wei (1):
netfilter: nf_conncount: Fix garbage collection with zones
include/net/netfilter/nf_conntrack_count.h | 15 +++++
net/netfilter/xt_connlimit.c | 99 +++++++++++++++++++++++-------
2 files changed, 91 insertions(+), 23 deletions(-)
create mode 100644 include/net/netfilter/nf_conntrack_count.h
--
2.7.4
The patch titled
Subject: lib/gen_crc64table.c: don't depend on linux headers being installed
has been removed from the -mm tree. Its filename was
lib-dont-depend-on-linux-headers-being-installed.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: NeilBrown <neil(a)brown.name>
Subject: lib/gen_crc64table.c: don't depend on linux headers being installed
gen_crc64table requires Linux include files to be installed in
/usr/include/linux. This is a new requirement, so hosts that could
previously build the kernel now cannot.
gen_crc64table imposes this requirement by including <linux/swab.h>,
but nothing from that header is actually used.
So remove the #include, so that the Linux headers no longer need to be
installed.
Link: http://lkml.kernel.org/r/87y3899gzi.fsf@notabene.neil.brown.name
Fixes: feba04fd2cf8 ("lib: add crc64 calculation routines")
Signed-off-by: NeilBrown <neil(a)brown.name>
Acked-by: Coly Li <colyli(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/lib/gen_crc64table.c~lib-dont-depend-on-linux-headers-being-installed
+++ a/lib/gen_crc64table.c
@@ -16,8 +16,6 @@
#include <inttypes.h>
#include <stdio.h>
-#include <linux/swab.h>
-
#define CRC64_ECMA182_POLY 0x42F0E1EBA9EA3693ULL
static uint64_t crc64_table[256] = {0};
_
Patches currently in -mm which might be from neil(a)brown.name are
The patch titled
Subject: memcg, oom: notify on oom killer invocation from the charge path
has been removed from the -mm tree. Its filename was
memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: memcg, oom: notify on oom killer invocation from the charge path
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path. While doing so the notification was left behind in
mem_cgroup_oom_synchronize.
Fix the issue by replicating the oom hierarchy locking and the
notification.
Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Burt Holzman <burt(a)fnal.gov>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memcontrol.c~memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path
+++ a/mm/memcontrol.c
@@ -1673,6 +1673,9 @@ enum oom_status {
static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
+ enum oom_status ret;
+ bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
@@ -1707,10 +1710,23 @@ static enum oom_status mem_cgroup_oom(st
return OOM_ASYNC;
}
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
+ if (locked)
+ mem_cgroup_oom_notify(memcg);
+
+ mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
- return OOM_SUCCESS;
+ ret = OOM_SUCCESS;
+ else
+ ret = OOM_FAILED;
+
+ if (locked)
+ mem_cgroup_oom_unlock(memcg);
- return OOM_FAILED;
+ return ret;
}
/**
_
Patches currently in -mm which might be from mhocko(a)suse.com are
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
The patch titled
Subject: mm, swap: fix swapoff with KSM pages
has been removed from the -mm tree. Its filename was
mm-swap-fix-swapoff-with-ksm-pages.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Huang Ying <ying.huang(a)intel.com>
Subject: mm, swap: fix swapoff with KSM pages
KSM pages may be mapped into multiple VMAs that cannot be reached from
one anon_vma. So during swapin, a new copy of the page needs to be
generated if a different anon_vma is needed; please refer to the
comments of ksm_might_need_to_copy() for details.
During swapoff, unuse_vma() uses the anon_vma (if available) to locate
the VMA and the virtual address mapped to the page, so not all mappings
to a swapped-out KSM page can be found. So in try_to_unuse(), even if
the swap count of a swap entry isn't zero, the page needs to be deleted
from the swap cache, so that in the next round a new page can be
allocated and swapped in for the other mappings of the swapped-out KSM
page.
But this contradicts the THP swap support, where the THP can be deleted
from the swap cache only after the swap count of every swap entry in
the huge swap cluster backing the THP has reached 0. So try_to_unuse()
was changed in commit e07098294adf ("mm, THP, swap: support to reclaim
swap space for THP swapped out") to check that before deleting a page
from the swap cache, but this broke KSM swapoff too.
Fortunately, KSM is for normal pages only, so the original behavior for
KSM pages can be restored easily by checking PageTransCompound(). That
is how this patch works.
The bug was introduced by e07098294adf ("mm, THP, swap: support to
reclaim swap space for THP swapped out"), which was merged in
v4.14-rc1. So I think we should backport the fix from 4.14 on. But
Hugh thinks it may be rare for KSM pages to be in the swap device at
swapoff time, which would explain why nobody has reported the bug so
far.
Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Hugh Dickins <hughd(a)google.com>
Tested-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Shaohua Li <shli(a)kernel.org>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/swapfile.c~mm-swap-fix-swapoff-with-ksm-pages
+++ a/mm/swapfile.c
@@ -2197,7 +2197,8 @@ int try_to_unuse(unsigned int type, bool
*/
if (PageSwapCache(page) &&
likely(page_private(page) == entry.val) &&
- !page_swapped(page))
+ (!PageTransCompound(page) ||
+ !swap_page_trans_huge_swapped(si, entry)))
delete_from_swap_cache(compound_head(page));
/*
_
Patches currently in -mm which might be from ying.huang(a)intel.com are
mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch
mm-swap-fix-race-between-swapoff-and-some-swap-operations-v6.patch
mm-fix-race-between-swapoff-and-mincore.patch
The patch titled
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
has been removed from the -mm tree. Its filename was
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in the
truncation and hole punch code to cover the call to remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for races
becomes 'dead' as it can no longer happen. Remove the dead code and
expand comments to explain reasoning. Similarly, checks for races with
truncation in the page fault path can be simplified and removed.
[mike.kravetz(a)oracle.com: incorporate suggestions from Kirill]
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cac
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struc
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struc
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -482,9 +462,20 @@ static void remove_inode_hugepages(struc
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct ino
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct f
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/mm/hugetlb.c
@@ -3755,16 +3755,16 @@ static vm_fault_t hugetlb_no_page(struct
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3854,9 +3854,6 @@ retry:
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3959,8 +3956,10 @@ vm_fault_t hugetlb_fault(struct mm_struc
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
The patch titled
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
has been removed from the -mm tree. Its filename was
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called.
[mike.kravetz(a)oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: Colin Ian King <colin.king(a)canonical.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/hugetlb.c
@@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -3249,13 +3250,23 @@ int copy_hugetlb_page_range(struct mm_st
mmu_notifier_range_init(&range, src, vma->vm_start,
vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
+ } else {
+ /*
+ * For shared mappings i_mmap_rwsem must be held to call
+ * huge_pte_alloc, otherwise the returned ptep could go
+ * away if part of a shared pmd and another thread calls
+ * huge_pmd_unshare.
+ */
+ i_mmap_lock_read(mapping);
}
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
@@ -3326,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_st
if (cow)
mmu_notifier_invalidate_range_end(&range);
+ else
+ i_mmap_unlock_read(mapping);
return ret;
}
@@ -3771,14 +3784,18 @@ retry:
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3926,6 +3943,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3933,20 +3955,31 @@ vm_fault_t hugetlb_fault(struct mm_struc
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already been assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4034,6 +4067,7 @@ out_ptl:
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4638,10 +4672,12 @@ void adjust_range_if_pmd_sharing_possibl
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4658,7 +4694,6 @@ pte_t *huge_pmd_share(struct mm_struct *
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4688,7 +4723,6 @@ pte_t *huge_pmd_share(struct mm_struct *
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4699,7 +4733,7 @@ out:
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
--- a/mm/memory-failure.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/memory-failure.c
@@ -966,7 +966,7 @@ static bool hwpoison_user_mappings(struc
enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
- bool unmap_success;
+ bool unmap_success = true;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
bool mlocked = PageMlocked(hpage);
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struc
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else if (mapping) {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higher level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
--- a/mm/migrate.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/migrate.c
@@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
--- a/mm/rmap.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/rmap.c
@@ -25,6 +25,7 @@
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1378,6 +1379,9 @@ static bool try_to_unmap_one(struct page
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
--- a/mm/userfaultfd.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/userfaultfd.c
@@ -267,10 +267,14 @@ retry:
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ retry:
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ retry:
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ retry:
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
_
The patch titled
Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
has been removed from the -mm tree. Its filename was
hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4-based kernel. The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing. There are two problems with that.
First of all it is dubious to migrate the poisoned page because we know
that accessing that memory may fail. Secondly it doesn't make any sense to
migrate potentially broken content and preserve the memory corruption in a
new location.
Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase
===
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* PAGE_ALIGN is not provided by libc; assume 4K pages. */
#define PAGE_SIZE	4096UL
#define PAGE_ALIGN(a)	(((a) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	int ret;
	int i;
	int fd;
	char *array = malloc(4096);
	char *array_locked = malloc(4096);

	fd = open("/tmp/data", O_RDONLY);
	read(fd, array, 4095);

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	/* note: sizeof(array_locked) is the pointer size, as in the original */
	ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
	if (ret)
		perror("mlock");

	sleep(20);

	ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
	if (ret)
		perror("madvise");

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	return 0;
}
===
plus offlining this memory.
In 4.4 kernels he saw the hwpoisoned page returned to the LRU list:
kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel: [<ffffffff81215f08>] do_execve+0x28/0x30
kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80
And the latter confuses the hotremove path because it attempts to migrate
an LRU page, which fails due to an elevated reference count. It is quite
possible that the reuse of the HWPoisoned page is some kind of fixed race
condition, but I am not really sure about that.
With the upstream kernel the failure is slightly different. The page
doesn't seem to have the LRU bit set, but isolate_movable_page simply
fails, and do_migrate_range puts all the isolated pages back on the LRU;
therefore no progress is made and scan_movable_pages finds the same set of
pages over and over again.
Fix both cases by explicitly checking for HWPoisoned pages before we even
try to get a reference on the page, and by trying to unmap such a page if
it is still mapped. As explained by Naoya:
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.
Also add a WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages, which shouldn't happen, but I couldn't convince myself of
that. Naoya has noted the following:
: Theoretically no such guarantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after the hwpoison_user_mappings block) also
: implies that the target page can still have the PageLRU flag.
:
: /*
: * Torn down by someone else?
: */
: if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
: action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
: res = -EBUSY;
: goto out;
: }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.
Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.com>
Debugged-by: Oscar Salvador <osalvador(a)suse.com>
Tested-by: Oscar Salvador <osalvador(a)suse.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memory_hotplug.c~hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined
+++ a/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
#include <linux/hugetlb.h>
#include <linux/memblock.h>
#include <linux/compaction.h>
+#include <linux/rmap.h>
#include <asm/tlbflush.h>
@@ -1368,6 +1369,21 @@ do_migrate_range(unsigned long start_pfn
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;
+ /*
+ * HWPoison pages have elevated reference counts so the migration would
+ * fail on them. It also doesn't make any sense to migrate them in the
+ * first place. Still try to unmap such a page in case it is still mapped
+ * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
+ * the unmap as the catch all safety net).
+ */
+ if (PageHWPoison(page)) {
+ if (WARN_ON(PageLRU(page)))
+ isolate_lru_page(page);
+ if (page_mapped(page))
+ try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+ continue;
+ }
+
if (!get_page_unless_zero(page))
continue;
/*
_
Patches currently in -mm which might be from mhocko(a)suse.com are
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
The patch titled
Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
has been removed from the -mm tree. Its filename was
mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
At Maintainer Summit, Greg brought up a topic I proposed around
EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
step of reclassifying an existing export. Specifically, I wanted to make
the case that although the line is fuzzy and hard to specify in abstract
terms, it is nonetheless clear that devm_memremap_pages() and HMM
(Heterogeneous Memory Management) have crossed it. The
devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
beginning, and HMM as a derivative of that functionality should have
naturally picked up that designation as well.
Contrary to typical rules, the HMM infrastructure was merged upstream with
zero in-tree consumers. There was a promise at the time that those users
would be merged "soon", but it has been over a year with no drivers
arriving. While the Nouveau driver is about to belatedly make good on
that promise it is clear that HMM was targeted first and foremost at an
out-of-tree consumer.
HMM is derived from devm_memremap_pages(), a facility Christoph and I
spearheaded to support persistent memory. It combines a device lifetime
model with a dynamically created 'struct page' / memmap array for any
physical address range. It enables coordination and control of the many
code paths in the kernel built to interact with memory via 'struct page'
objects. With HMM the integration goes even deeper by allowing device
drivers to hook and manipulate page fault and page free events.
One interpretation of when EXPORT_SYMBOL is suitable is when it is
exporting stable and generic leaf functionality. The
devm_memremap_pages() facility continues to see expanding use cases,
peer-to-peer DMA being the most recent, with no clear end date when it
will stop attracting reworks and semantic changes. It is not suitable to
export devm_memremap_pages() as a stable 3rd party driver API due to the
fact that it is still changing and manipulates core behavior. Moreover,
it is not in the best interest of the long term development of the core
memory management subsystem to permit any external driver to effectively
define its own system-wide memory management policies with no
encouragement to engage with upstream.
I am also concerned that HMM was designed in a way to minimize further
engagement with the core-MM. That, with these hooks in place,
device-drivers are free to implement their own policies without much
consideration for whether and how the core-MM could grow to meet that
need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
core-MM should be allowed the opportunity and stimulus to change and
address these new use cases as first class functionality.
Original changelog:
hmm_devmem_add() and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix and
hook page-idle events. The devm_memremap_pages() interface is base
infrastructure for HMM. HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE which is itself a
EXPORT_SYMBOL_GPL facility.
Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the
implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
this cleanup to move forward.
In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:
mmu_notifier_unregister_no_release
percpu_ref
region_intersects
__class_create
It goes further to consume / indirectly expose functionality that is not
exported to any other driver:
alloc_pages_vma
walk_page_range
HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().
[logang(a)deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>,
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/drivers/pci/p2pdma.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl
+++ a/drivers/pci/p2pdma.c
@@ -82,10 +82,8 @@ static void pci_p2pdma_percpu_release(st
complete_all(&p2p->devmap_ref_done);
}
-static void pci_p2pdma_percpu_kill(void *data)
+static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
{
- struct percpu_ref *ref = data;
-
/*
* pci_p2pdma_add_resource() may be called multiple times
* by a driver and may register the percpu_kill devm action multiple
@@ -198,6 +196,7 @@ int pci_p2pdma_add_resource(struct pci_d
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
pci_resource_start(pdev, bar);
+ pgmap->kill = pci_p2pdma_percpu_kill;
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -211,11 +210,6 @@ int pci_p2pdma_add_resource(struct pci_d
if (error)
goto pgmap_free;
- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
- &pdev->p2pdma->devmap_ref);
- if (error)
- goto pgmap_free;
-
pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
&pgmap->res);
--- a/mm/hmm.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl
+++ a/mm/hmm.c
@@ -1110,7 +1110,7 @@ struct hmm_devmem *hmm_devmem_add(const
return result;
return devmem;
}
-EXPORT_SYMBOL(hmm_devmem_add);
+EXPORT_SYMBOL_GPL(hmm_devmem_add);
struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
struct device *device,
@@ -1164,7 +1164,7 @@ struct hmm_devmem *hmm_devmem_add_resour
return result;
return devmem;
}
-EXPORT_SYMBOL(hmm_devmem_add_resource);
+EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
/*
* A device driver that wants to handle multiple devices memory through a
_
The patch titled
Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support
has been removed from the -mm tree. Its filename was
mm-devm_memremap_pages-add-memory_device_private-support.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support
In preparation for consolidating all ZONE_DEVICE enabling via
devm_memremap_pages(), teach it how to handle the constraints of
MEMORY_DEVICE_PRIVATE ranges.
[jglisse(a)redhat.com: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE]
Link: http://lkml.kernel.org/r/154275559036.76910.12434636179931292607.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Jérôme Glisse <jglisse(a)redhat.com>
Acked-by: Christoph Hellwig <hch(a)lst.de>
Reported-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/kernel/memremap.c~mm-devm_memremap_pages-add-memory_device_private-support
+++ a/kernel/memremap.c
@@ -98,9 +98,15 @@ static void devm_memremap_pages_release(
- align_start;
mem_hotplug_begin();
- arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
- &pgmap->altmap : NULL);
- kasan_remove_zero_shadow(__va(align_start), align_size);
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ pfn = align_start >> PAGE_SHIFT;
+ __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
+ align_size >> PAGE_SHIFT, NULL);
+ } else {
+ arch_remove_memory(align_start, align_size,
+ pgmap->altmap_valid ? &pgmap->altmap : NULL);
+ kasan_remove_zero_shadow(__va(align_start), align_size);
+ }
mem_hotplug_done();
untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device
goto err_pfn_remap;
mem_hotplug_begin();
- error = kasan_add_zero_shadow(__va(align_start), align_size);
- if (error) {
- mem_hotplug_done();
- goto err_kasan;
+
+ /*
+ * For device private memory we call add_pages() as we only need to
+ * allocate and initialize struct page for the device memory. More-
+ * over the device memory is un-accessible thus we do not want to
+ * create a linear mapping for the memory like arch_add_memory()
+ * would do.
+ *
+ * For all other device memory types, which are accessible by
+ * the CPU, we do want the linear mapping and thus use
+ * arch_add_memory().
+ */
+ if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+ error = add_pages(nid, align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, NULL, false);
+ } else {
+ error = kasan_add_zero_shadow(__va(align_start), align_size);
+ if (error) {
+ mem_hotplug_done();
+ goto err_kasan;
+ }
+
+ error = arch_add_memory(nid, align_start, align_size, altmap,
+ false);
+ }
+
+ if (!error) {
+ struct zone *zone;
+
+ zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+ move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT,
+ align_size >> PAGE_SHIFT, altmap);
}
- error = arch_add_memory(nid, align_start, align_size, altmap, false);
- if (!error)
- move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
- align_start >> PAGE_SHIFT,
- align_size >> PAGE_SHIFT, altmap);
mem_hotplug_done();
if (error)
goto err_add_memory;
_
The patch titled
Subject: mm, devm_memremap_pages: fix shutdown handling
has been removed from the -mm tree. Its filename was
mm-devm_memremap_pages-fix-shutdown-handling.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: fix shutdown handling
The last step before devm_memremap_pages() returns success is to allocate
a release action, devm_memremap_pages_release(), to tear the entire setup
down. However, the result from devm_add_action() is not checked.
Checking the error from devm_add_action() is not enough. The API
currently relies on the fact that the percpu_ref it is using is killed by
the time the devm_memremap_pages_release() is run. Rather than continue
this awkward situation, offload the responsibility of killing the
percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing relative to init failures and
shutdown.
Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large, given that any future reconfiguration, or
disable/enable, of an nvdimm namespace will fail forever, as subsequent
calls to devm_memremap_pages() will fail to set up the pgmap_radix since there will
be stale entries for the physical address range.
An argument could be made to require that the ->kill() operation be set in
the @pgmap arg rather than passed in separately. However, being able to
grep the kill routine directly at the devm_memremap_pages() call site
helps code readability and makes the lifetime of a given instance easier
to track.
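Under the new contract the caller wires up both fields before the call and
registers no separate kill action. A minimal sketch, with hypothetical
names my_ref and my_pgmap_kill (the pmem and dax conversions below follow
the same shape):

	static void my_pgmap_kill(struct percpu_ref *ref)
	{
		percpu_ref_kill(ref);	/* transition @ref to the dead state */
	}

	...
	pgmap->ref = &my_ref;
	pgmap->kill = my_pgmap_kill;	/* mandatory after this patch */
	addr = devm_memremap_pages(dev, pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);	/* ->kill() has already been invoked */

devm_memremap_pages() now invokes ->kill() itself, both on its own error
paths and from devm_memremap_pages_release(), which is what makes the
init-failure and shutdown cases uniform.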
Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwill…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Reviewed-by: "Jérôme Glisse" <jglisse(a)redhat.com>
Reported-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/drivers/dax/pmem.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/drivers/dax/pmem.c
@@ -48,9 +48,8 @@ static void dax_pmem_percpu_exit(void *d
percpu_ref_exit(ref);
}
-static void dax_pmem_percpu_kill(void *data)
+static void dax_pmem_percpu_kill(struct percpu_ref *ref)
{
- struct percpu_ref *ref = data;
struct dax_pmem *dax_pmem = to_dax_pmem(ref);
dev_dbg(dax_pmem->dev, "trace\n");
@@ -112,17 +111,10 @@ static int dax_pmem_probe(struct device
}
dax_pmem->pgmap.ref = &dax_pmem->ref;
+ dax_pmem->pgmap.kill = dax_pmem_percpu_kill;
addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
- if (IS_ERR(addr)) {
- devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
- percpu_ref_exit(&dax_pmem->ref);
+ if (IS_ERR(addr))
return PTR_ERR(addr);
- }
-
- rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill,
- &dax_pmem->ref);
- if (rc)
- return rc;
/* adjust the dax_region resource to the start of data */
memcpy(&res, &dax_pmem->pgmap.res, sizeof(res));
--- a/drivers/nvdimm/pmem.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/drivers/nvdimm/pmem.c
@@ -309,8 +309,11 @@ static void pmem_release_queue(void *q)
blk_cleanup_queue(q);
}
-static void pmem_freeze_queue(void *q)
+static void pmem_freeze_queue(struct percpu_ref *ref)
{
+ struct request_queue *q;
+
+ q = container_of(ref, typeof(*q), q_usage_counter);
blk_freeze_queue_start(q);
}
@@ -402,6 +405,7 @@ static int pmem_attach_disk(struct devic
pmem->pfn_flags = PFN_DEV;
pmem->pgmap.ref = &q->q_usage_counter;
+ pmem->pgmap.kill = pmem_freeze_queue;
if (is_nd_pfn(dev)) {
if (setup_pagemap_fsdax(dev, &pmem->pgmap))
return -ENOMEM;
@@ -427,13 +431,6 @@ static int pmem_attach_disk(struct devic
memcpy(&bb_res, &nsio->res, sizeof(bb_res));
}
- /*
- * At release time the queue must be frozen before
- * devm_memremap_pages is unwound
- */
- if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
- return -ENOMEM;
-
if (IS_ERR(addr))
return PTR_ERR(addr);
pmem->virt_addr = addr;
--- a/include/linux/memremap.h~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/include/linux/memremap.h
@@ -111,6 +111,7 @@ typedef void (*dev_page_free_t)(struct p
* @altmap: pre-allocated/reserved memory for vmemmap allocations
* @res: physical address range covered by @ref
* @ref: reference count that pins the devm_memremap_pages() mapping
+ * @kill: callback to transition @ref to the dead state
* @dev: host device of the mapping for debug
* @data: private data pointer for page_free()
* @type: memory type: see MEMORY_* in memory_hotplug.h
@@ -122,6 +123,7 @@ struct dev_pagemap {
bool altmap_valid;
struct resource res;
struct percpu_ref *ref;
+ void (*kill)(struct percpu_ref *ref);
struct device *dev;
void *data;
enum memory_type type;
--- a/kernel/memremap.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/kernel/memremap.c
@@ -88,14 +88,10 @@ static void devm_memremap_pages_release(
resource_size_t align_start, align_size;
unsigned long pfn;
+ pgmap->kill(pgmap->ref);
for_each_device_pfn(pfn, pgmap)
put_page(pfn_to_page(pfn));
- if (percpu_ref_tryget_live(pgmap->ref)) {
- dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
- percpu_ref_put(pgmap->ref);
- }
-
/* pages are dead and unused, undo the arch mapping */
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -116,7 +112,7 @@ static void devm_memremap_pages_release(
/**
* devm_memremap_pages - remap and provide memmap backing for the given resource
* @dev: hosting device for @res
- * @pgmap: pointer to a struct dev_pgmap
+ * @pgmap: pointer to a struct dev_pagemap
*
* Notes:
* 1/ At a minimum the res, ref and type members of @pgmap must be initialized
@@ -125,11 +121,8 @@ static void devm_memremap_pages_release(
* 2/ The altmap field may optionally be initialized, in which case altmap_valid
* must be set to true
*
- * 3/ pgmap.ref must be 'live' on entry and 'dead' before devm_memunmap_pages()
- * time (or devm release event). The expected order of events is that ref has
- * been through percpu_ref_kill() before devm_memremap_pages_release(). The
- * wait for the completion of all references being dropped and
- * percpu_ref_exit() must occur after devm_memremap_pages_release().
+ * 3/ pgmap->ref must be 'live' on entry and will be killed at
+ * devm_memremap_pages_release() time, or if this routine fails.
*
* 4/ res is expected to be a host memory range that could feasibly be
* treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -145,6 +138,9 @@ void *devm_memremap_pages(struct device
pgprot_t pgprot = PAGE_KERNEL;
int error, nid, is_ram;
+ if (!pgmap->ref || !pgmap->kill)
+ return ERR_PTR(-EINVAL);
+
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
- align_start;
@@ -170,12 +166,10 @@ void *devm_memremap_pages(struct device
if (is_ram != REGION_DISJOINT) {
WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
is_ram == REGION_MIXED ? "mixed" : "ram", res);
- return ERR_PTR(-ENXIO);
+ error = -ENXIO;
+ goto err_array;
}
- if (!pgmap->ref)
- return ERR_PTR(-EINVAL);
-
pgmap->dev = dev;
error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start),
@@ -217,7 +211,10 @@ void *devm_memremap_pages(struct device
align_size >> PAGE_SHIFT, pgmap);
percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
- devm_add_action(dev, devm_memremap_pages_release, pgmap);
+ error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
+ pgmap);
+ if (error)
+ return ERR_PTR(error);
return __va(res->start);
@@ -228,6 +225,7 @@ void *devm_memremap_pages(struct device
err_pfn_remap:
pgmap_array_delete(res);
err_array:
+ pgmap->kill(pgmap->ref);
return ERR_PTR(error);
}
EXPORT_SYMBOL_GPL(devm_memremap_pages);
--- a/tools/testing/nvdimm/test/iomap.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/tools/testing/nvdimm/test/iomap.c
@@ -104,13 +104,26 @@ void *__wrap_devm_memremap(struct device
}
EXPORT_SYMBOL(__wrap_devm_memremap);
+static void nfit_test_kill(void *_pgmap)
+{
+ struct dev_pagemap *pgmap = _pgmap;
+
+ pgmap->kill(pgmap->ref);
+}
+
void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
{
resource_size_t offset = pgmap->res.start;
struct nfit_test_resource *nfit_res = get_nfit_res(offset);
- if (nfit_res)
+ if (nfit_res) {
+ int rc;
+
+ rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap);
+ if (rc)
+ return ERR_PTR(rc);
return nfit_res->buf + offset - nfit_res->res.start;
+ }
return devm_memremap_pages(dev, pgmap);
}
EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
_
The patch titled
Subject: mm, devm_memremap_pages: kill mapping "System RAM" support
has been removed from the -mm tree. Its filename was
mm-devm_memremap_pages-kill-mapping-system-ram-support.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: kill mapping "System RAM" support
Given the fact that devm_memremap_pages() requires a percpu_ref that is
torn down by devm_memremap_pages_release(), the current support for mapping
RAM is broken.
Support for remapping "System RAM" has been broken since the beginning and
there is no existing user of this code path, so just kill the support
and make it an explicit error.
This cleanup also simplifies a follow-on patch to fix the error path when
setting a devm release action for devm_memremap_pages_release() fails.
Link: http://lkml.kernel.org/r/154275557997.76910.14689813630968180480.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: "Jérôme Glisse" <jglisse(a)redhat.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/kernel/memremap.c~mm-devm_memremap_pages-kill-mapping-system-ram-support
+++ a/kernel/memremap.c
@@ -167,15 +167,12 @@ void *devm_memremap_pages(struct device
is_ram = region_intersects(align_start, align_size,
IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
- if (is_ram == REGION_MIXED) {
- WARN_ONCE(1, "%s attempted on mixed region %pr\n",
- __func__, res);
+ if (is_ram != REGION_DISJOINT) {
+ WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
+ is_ram == REGION_MIXED ? "mixed" : "ram", res);
return ERR_PTR(-ENXIO);
}
- if (is_ram == REGION_INTERSECTS)
- return __va(res->start);
-
if (!pgmap->ref)
return ERR_PTR(-EINVAL);
_
The patch titled
Subject: mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
has been removed from the -mm tree. Its filename was
mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give drivers the ability to subvert core
aspects of page management.
Specifically the facility is tightly integrated with the kernel's memory
hotplug functionality. It injects an altmap argument deep into the
architecture specific vmemmap implementation to allow allocating from
specific reserved pages, and it has Linux specific assumptions about page
structure reference counting relative to get_user_pages() and
get_user_pages_fast(). It was an oversight and a mistake that this was
not marked EXPORT_SYMBOL_GPL from the outset.
Again, devm_memremap_pages() exposes and relies upon core kernel internal
assumptions and will continue to evolve along with 'struct page', memory
hotplug, and support for new memory types / topologies. Only an in-kernel
GPL-only driver is expected to keep up with this ongoing evolution. This
interface, and functionality derived from this interface, is not suitable
for kernel-external drivers.
Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/kernel/memremap.c~mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl
+++ a/kernel/memremap.c
@@ -233,7 +233,7 @@ void *devm_memremap_pages(struct device
err_array:
return ERR_PTR(error);
}
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
{
--- a/tools/testing/nvdimm/test/iomap.c~mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl
+++ a/tools/testing/nvdimm/test/iomap.c
@@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct
return nfit_res->buf + offset - nfit_res->res.start;
return devm_memremap_pages(dev, pgmap);
}
-EXPORT_SYMBOL(__wrap_devm_memremap_pages);
+EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags)
{
_
From: Dan Carpenter <dan.carpenter(a)oracle.com>
[ Upstream commit 1376b0a2160319125c3a2822e8c09bd283cd8141 ]
There is a '>' vs '<' typo so this loop is a no-op.
Fixes: d35dcc89fc93 ("staging: comedi: quatech_daqp_cs: fix daqp_ao_insn_write()")
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Reviewed-by: Ian Abbott <abbotti(a)mev.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/comedi/drivers/quatech_daqp_cs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/staging/comedi/drivers/quatech_daqp_cs.c b/drivers/staging/comedi/drivers/quatech_daqp_cs.c
index b3bbec0a0d23..f89a863ea04c 100644
--- a/drivers/staging/comedi/drivers/quatech_daqp_cs.c
+++ b/drivers/staging/comedi/drivers/quatech_daqp_cs.c
@@ -649,7 +649,7 @@ static int daqp_ao_insn_write(struct comedi_device *dev,
/* Make sure D/A update mode is direct update */
outb(0, dev->iobase + DAQP_AUX);
- for (i = 0; i > insn->n; i++) {
+ for (i = 0; i < insn->n; i++) {
val = data[0];
val &= 0x0fff;
val ^= 0x0800; /* Flip the sign */
--
2.18.0
While looking at BUGs associated with invalid huge page map counts,
it was discovered that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation,
it unmaps everyone who has the file mapped. If the range being
truncated is covered by a shared pmd, huge_pmd_unshare will be called.
For all but the last user of the shared pmd, huge_pmd_unshare will
clear the pud pointing to the pmd. If the task in the middle of the
page fault is not the last user, the ptep returned by huge_pte_alloc
now points to another task's page table or worse. This leads to bad
things such as incorrect page map/reference counts or invalid memory
references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
Cc: <stable(a)vger.kernel.org>
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
mm/hugetlb.c | 67 ++++++++++++++++++++++++++++++++++-----------
mm/memory-failure.c | 14 +++++++++-
mm/migrate.c | 13 ++++++++-
mm/rmap.c | 4 +++
mm/userfaultfd.c | 11 ++++++--
5 files changed, 89 insertions(+), 20 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 309fb8c969af..2a3162030167 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3239,6 +3239,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
int cow;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
+ struct address_space *mapping = vma->vm_file->f_mapping;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
int ret = 0;
@@ -3247,14 +3248,25 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
- if (cow)
+ if (cow) {
mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
+ } else {
+ /*
+ * For shared mappings i_mmap_rwsem must be held to call
+ * huge_pte_alloc, otherwise the returned ptep could go
+ * away if part of a shared pmd and another thread calls
+ * huge_pmd_unshare.
+ */
+ i_mmap_lock_read(mapping);
+ }
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
@@ -3325,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow)
mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
+ else
+ i_mmap_unlock_read(mapping);
return ret;
}
@@ -3772,14 +3786,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3927,6 +3945,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3934,20 +3957,31 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already been assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4035,6 +4069,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4639,10 +4674,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4659,7 +4696,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4689,7 +4725,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4700,7 +4735,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cd3de3550f0..b992d1295578 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higher level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 84381b55b2bd..725edaef238a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1307,8 +1307,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 85b7f9423352..c566bd552535 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -25,6 +25,7 @@
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1374,6 +1375,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &start, &end);
}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..48368589f519 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -267,10 +267,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
--
2.17.2
The xfstests generic/475 test switches the underlying device with
dm-error while running a stress test. This results in a large number
of file system errors, and since we can't lock the buffer head when
marking the superblock dirty in the ext4_grp_locked_error() case, it's
possible for the superblock to be !buffer_uptodate() without
buffer_write_io_error() being true.
We need to set buffer_uptodate() before we call mark_buffer_dirty() or
this will trigger a WARN_ON. It's safe to do this since the
superblock must have been properly read into memory or the mount would
not have succeeded. So if buffer_uptodate() is not set, we can
safely assume that this happened due to a failed attempt to write the
superblock.
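The ordering rule in miniature (a sketch; sbh is the superblock's
buffer_head, the hunk below only widens the condition, and the existing
body of that if-block already clears the error bit and reasserts the
uptodate flag):

	if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) {
		/* A previous write failed; the in-memory copy is still valid. */
		clear_buffer_write_io_error(sbh);
		set_buffer_uptodate(sbh);	/* must precede mark_buffer_dirty() */
	}
	...
	mark_buffer_dirty(sbh);		/* warns if the buffer is !uptodate */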
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Cc: stable(a)vger.kernel.org
---
fs/ext4/super.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d6c142d73d99..fb12d3c17c1b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4902,7 +4902,7 @@ static int ext4_commit_super(struct super_block *sb, int sync)
ext4_superblock_csum_set(sb);
if (sync)
lock_buffer(sbh);
- if (buffer_write_io_error(sbh)) {
+ if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) {
/*
* Oh, dear. A previous attempt to write the
* superblock failed. This could happen because the
--
2.19.1
Delay the drm_modeset_acquire_init() until after we check for an
allocation failure so that we can return immediately upon error without
having to unwind.
WARNING: lock held when returning to user space!
4.20.0+ #174 Not tainted
------------------------------------------------
syz-executor556/8153 is leaving the kernel with locks still held!
1 lock held by syz-executor556/8153:
#0: 000000005100c85c (crtc_ww_class_acquire){+.+.}, at:
set_property_atomic+0xb3/0x330 drivers/gpu/drm/drm_mode_object.c:462
Reported-by: syzbot+6ea337c427f5083ebdf2(a)syzkaller.appspotmail.com
Fixes: 144a7999d633 ("drm: Handle properties in the core for atomic drivers")
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Sean Paul <sean(a)poorly.run>
Cc: David Airlie <airlied(a)linux.ie>
Cc: <stable(a)vger.kernel.org> # v4.14+
---
drivers/gpu/drm/drm_mode_object.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/drm_mode_object.c b/drivers/gpu/drm/drm_mode_object.c
index bb1dd46496cd..a9005c1c2384 100644
--- a/drivers/gpu/drm/drm_mode_object.c
+++ b/drivers/gpu/drm/drm_mode_object.c
@@ -459,12 +459,13 @@ static int set_property_atomic(struct drm_mode_object *obj,
struct drm_modeset_acquire_ctx ctx;
int ret;
- drm_modeset_acquire_init(&ctx, 0);
-
state = drm_atomic_state_alloc(dev);
if (!state)
return -ENOMEM;
+
+ drm_modeset_acquire_init(&ctx, 0);
state->acquire_ctx = &ctx;
+
retry:
if (prop == state->dev->mode_config.dpms_property) {
if (obj->type != DRM_MODE_OBJECT_CONNECTOR) {
--
2.20.1
Commit 33a48ab105a7 ("ibmveth: Fix DMA unmap error") fixed an issue in the
normal code path of ibmveth_start_xmit() that was originally introduced by
commit 6e8ab30ec677 ("ibmveth: Add scatter-gather support"). That original
fix missed the error path, where dma_unmap_page() is wrongly called on the
header portion in descs[0], which was mapped with dma_map_single(). As a
result, a failure to DMA-map any of the frags results in a dmesg warning
when CONFIG_DMA_API_DEBUG is enabled.
------------[ cut here ]------------
DMA-API: ibmveth 30000002: device driver frees DMA memory with wrong function
[device address=0x000000000a430000] [size=172 bytes] [mapped as page] [unmapped as single]
WARNING: CPU: 1 PID: 8426 at kernel/dma/debug.c:1085 check_unmap+0x4fc/0xe10
...
<snip>
...
DMA-API: Mapped at:
ibmveth_start_xmit+0x30c/0xb60
dev_hard_start_xmit+0x100/0x450
sch_direct_xmit+0x224/0x490
__qdisc_run+0x20c/0x980
__dev_queue_xmit+0x1bc/0xf20
This fixes the API misuse by unmapping descs[0] with dma_unmap_single.
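To illustrate the pairing rule that CONFIG_DMA_API_DEBUG enforces (a
buffer mapped with dma_map_single() must be unmapped with
dma_unmap_single(), and one mapped with dma_map_page() must be unmapped
with dma_unmap_page()), here is a minimal sketch of an error path that
keeps the pairs matched; the function and parameter names are
hypothetical:

#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int map_header_and_frag(struct device *dev,
			       void *hdr, size_t hdr_len,
			       struct page *frag, size_t frag_len,
			       dma_addr_t *hdr_dma, dma_addr_t *frag_dma)
{
	/* The header is mapped as "single"... */
	*hdr_dma = dma_map_single(dev, hdr, hdr_len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, *hdr_dma))
		return -ENOMEM;

	*frag_dma = dma_map_page(dev, frag, 0, frag_len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, *frag_dma)) {
		/* ...so it must be unmapped as "single", not as a page. */
		dma_unmap_single(dev, *hdr_dma, hdr_len, DMA_TO_DEVICE);
		return -ENOMEM;
	}

	return 0;
}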
Fixes: 6e8ab30ec677 ("ibmveth: Add scatter-gather support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Tyrel Datwyler <tyreld(a)linux.vnet.ibm.com>
---
drivers/net/ethernet/ibm/ibmveth.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index a468178..098d876 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1171,11 +1171,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
map_failed_frags:
last = i+1;
- for (i = 0; i < last; i++)
+ for (i = 1; i < last; i++)
dma_unmap_page(&adapter->vdev->dev, descs[i].fields.address,
descs[i].fields.flags_len & IBMVETH_BUF_LEN_MASK,
DMA_TO_DEVICE);
+ dma_unmap_single(&adapter->vdev->dev,
+ descs[0].fields.address,
+ descs[0].fields.flags_len & IBMVETH_BUF_LEN_MASK,
+ DMA_TO_DEVICE);
map_failed:
if (!firmware_has_feature(FW_FEATURE_CMO))
netdev_err(netdev, "tx: unable to map xmit buffer\n");
--
2.7.4
The patch titled
Subject: lib/gen_crc64table.c: don't depend on linux headers being installed
has been added to the -mm tree. Its filename is
lib-dont-depend-on-linux-headers-being-installed.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/lib-dont-depend-on-linux-headers-b…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/lib-dont-depend-on-linux-headers-b…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: NeilBrown <neil(a)brown.name>
Subject: lib/gen_crc64table.c: don't depend on linux headers being installed
gen_crc64table requires the linux include files to be installed in
/usr/include/linux. This is a new requirement, so hosts that could
previously build the kernel now cannot.
gen_crc64table imposes this requirement by including <linux/swab.h>, but
nothing from that header is actually used.
So remove the #include, so that the linux headers no longer need to be
installed.
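For reference, the table can be generated with nothing beyond hosted C,
which is the point of the fix. A minimal sketch of the usual MSB-first
CRC64 table construction (an illustration, not the literal
lib/gen_crc64table.c):

#include <inttypes.h>
#include <stdio.h>

#define CRC64_ECMA182_POLY 0x42F0E1EBA9EA3693ULL

int main(void)
{
	uint64_t crc;
	int i, j;

	for (i = 0; i < 256; i++) {
		/* Place the table index in the top byte (MSB-first). */
		crc = (uint64_t)i << 56;
		for (j = 0; j < 8; j++) {
			if (crc & (1ULL << 63))
				crc = (crc << 1) ^ CRC64_ECMA182_POLY;
			else
				crc <<= 1;
		}
		printf("\t0x%016" PRIx64 "ULL,\n", crc);
	}
	return 0;
}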
Link: http://lkml.kernel.org/r/87y3899gzi.fsf@notabene.neil.brown.name
Fixes: feba04fd2cf8 ("lib: add crc64 calculation routines")
Signed-off-by: NeilBrown <neil(a)brown.name>
Acked-by: Coly Li <colyli(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/lib/gen_crc64table.c~lib-dont-depend-on-linux-headers-being-installed
+++ a/lib/gen_crc64table.c
@@ -16,8 +16,6 @@
#include <inttypes.h>
#include <stdio.h>
-#include <linux/swab.h>
-
#define CRC64_ECMA182_POLY 0x42F0E1EBA9EA3693ULL
static uint64_t crc64_table[256] = {0};
_
Patches currently in -mm which might be from neil(a)brown.name are
lib-dont-depend-on-linux-headers-being-installed.patch
I'm announcing the release of the 4.9.148 kernel.
All users of the 4.9 kernel series must upgrade.
The updated 4.9.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.9.y
and can be browsed at the normal kernel.org git web browser:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
arch/x86/kernel/cpu/mtrr/if.c | 2 +
arch/x86/kernel/fpu/signal.c | 4 +--
block/blk-lib.c | 22 +++++++++++++++++---
drivers/gpio/gpio-max7301.c | 12 ++---------
drivers/gpu/drm/drm_ioctl.c | 10 +++++++--
drivers/hv/vmbus_drv.c | 20 ++++++++++++++++++
drivers/infiniband/ulp/srpt/ib_srpt.c | 4 +--
drivers/mmc/core/mmc.c | 24 +++++++++++++---------
drivers/mmc/host/omap_hsmmc.c | 12 ++++++++++-
drivers/net/usb/hso.c | 18 ++++++++++++++--
drivers/usb/host/xhci-hub.c | 3 +-
drivers/usb/serial/option.c | 16 +++++++++++++-
fs/proc/proc_sysctl.c | 13 +++++------
fs/ubifs/replay.c | 37 ++++++++++++++++++++++++++++++++++
kernel/panic.c | 6 ++++-
16 files changed, 164 insertions(+), 41 deletions(-)
Bart Van Assche (1):
ib_srpt: Fix a use-after-free in __srpt_close_all_ch()
Christophe Leroy (1):
gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
Colin Ian King (1):
x86/mtrr: Don't copy uninitialized gentry fields back to userspace
Dexuan Cui (1):
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
Greg Kroah-Hartman (1):
Linux 4.9.148
Gustavo A. R. Silva (1):
drm/ioctl: Fix Spectre v1 vulnerabilities
Hui Peng (1):
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
Ivan Delalande (1):
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
Jens Axboe (1):
block: break discard submissions into the user defined size
Jörgen Storvist (4):
USB: serial: option: add GosunCn ZTE WeLink ME3630
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
USB: serial: option: add Fibocom NL668 series
USB: serial: option: add Telit LN940 series
Mathias Nyman (1):
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
Mikulas Patocka (1):
block: fix infinite loop if the device loses discard capability
Richard Weinberger (1):
ubifs: Handle re-linking of inodes correctly while recovery
Russell King (1):
mmc: omap_hsmmc: fix DMA API warning
Sebastian Andrzej Siewior (1):
x86/fpu: Disable bottom halves while loading FPU registers
Sergey Senozhatsky (1):
panic: avoid deadlocks in re-entrant console drivers
Tore Anderson (1):
USB: serial: option: add HP lt4132
Ulf Hansson (3):
mmc: core: Reset HPI enabled state during re-init and in case of errors
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
I'm announcing the release of the 4.14.91 kernel.
All users of the 4.14 kernel series must upgrade.
The updated 4.14.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.14.y
and can be browsed at the normal kernel.org git web browser:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/x86/include/asm/msr-index.h | 1
arch/x86/kernel/cpu/mtrr/if.c | 2
arch/x86/kvm/vmx.c | 2
arch/x86/kvm/x86.c | 2
block/blk-lib.c | 22 +++
drivers/gpio/gpio-max7301.c | 12 --
drivers/gpio/gpiolib-acpi.c | 144 +++++++++++++++-----------
drivers/gpu/drm/drm_ioctl.c | 10 +
drivers/hv/vmbus_drv.c | 20 +++
drivers/infiniband/ulp/srpt/ib_srpt.c | 4
drivers/mmc/core/mmc.c | 24 ++--
drivers/mmc/host/omap_hsmmc.c | 12 +-
drivers/net/usb/hso.c | 18 ++-
drivers/net/wireless/intel/iwlwifi/mvm/fw.c | 9 +
drivers/net/wireless/intel/iwlwifi/pcie/drv.c | 50 +++++++++
drivers/scsi/sd.c | 23 +++-
drivers/spi/spi-imx.c | 91 ++++++++++++----
drivers/usb/host/xhci-hub.c | 3
drivers/usb/host/xhci.h | 4
drivers/usb/serial/option.c | 16 ++
fs/cifs/smb2pdu.c | 4
fs/proc/proc_sysctl.c | 13 +-
fs/ubifs/dir.c | 5
fs/ubifs/replay.c | 37 ++++++
include/linux/math64.h | 3
kernel/panic.c | 6 -
kernel/time/posix-timers.c | 5
mm/vmscan.c | 6 -
sound/soc/codecs/sta32x.c | 3
tools/perf/builtin-record.c | 18 +--
31 files changed, 429 insertions(+), 142 deletions(-)
Bart Van Assche (1):
ib_srpt: Fix a use-after-free in __srpt_close_all_ch()
Cfir Cohen (1):
KVM: Fix UAF in nested posted interrupt processing
Christophe Leroy (1):
gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
Colin Ian King (1):
x86/mtrr: Don't copy uninitialized gentry fields back to userspace
Dan Carpenter (1):
cifs: integer overflow in in SMB2_ioctl()
Daniel Mack (1):
ASoC: sta32x: set ->component pointer in private struct
Dexuan Cui (1):
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
Eduardo Habkost (1):
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
Emmanuel Grumbach (1):
iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares
Greg Kroah-Hartman (1):
Linux 4.14.91
Gustavo A. R. Silva (1):
drm/ioctl: Fix Spectre v1 vulnerabilities
Hans de Goede (1):
gpiolib-acpi: Only defer request_irq for GpioInt ACPI event handlers
Hui Peng (1):
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
Ihab Zhaika (1):
iwlwifi: add new cards for 9560, 9462, 9461 and killer series
Ivan Delalande (1):
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
Jens Axboe (2):
block: break discard submissions into the user defined size
scsi: sd: use mempool for discard special page
Jiri Olsa (1):
perf record: Synthesize features before events in pipe mode
Jörgen Storvist (4):
USB: serial: option: add GosunCn ZTE WeLink ME3630
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
USB: serial: option: add Fibocom NL668 series
USB: serial: option: add Telit LN940 series
Mathias Nyman (1):
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
Mikulas Patocka (1):
block: fix infinite loop if the device loses discard capability
Nicolas Saenz Julienne (1):
USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
Richard Weinberger (2):
ubifs: Fix directory size calculation for symlinks
ubifs: Handle re-linking of inodes correctly while recovery
Roman Gushchin (1):
mm: don't miss the last page because of round-off error
Russell King (1):
mmc: omap_hsmmc: fix DMA API warning
Sergey Senozhatsky (1):
panic: avoid deadlocks in re-entrant console drivers
Thomas Gleixner (1):
posix-timers: Fix division by zero bug
Tore Anderson (1):
USB: serial: option: add HP lt4132
Ulf Hansson (3):
mmc: core: Reset HPI enabled state during re-init and in case of errors
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
Uwe Kleine-König (2):
spi: imx: add a device specific prepare_message callback
spi: imx: mx51-ecspi: Move some initialisation to prepare_message hook.
I'm announcing the release of the 4.19.13 kernel.
All users of the 4.19 kernel series must upgrade.
The updated 4.19.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.19.y
and can be browsed at the normal kernel.org git web browser:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/arm/include/asm/pgtable-2level.h | 2
arch/m68k/include/asm/pgtable_mm.h | 4
arch/microblaze/include/asm/pgtable.h | 2
arch/nds32/include/asm/pgtable.h | 2
arch/parisc/include/asm/pgtable.h | 2
arch/x86/entry/vdso/Makefile | 3
arch/x86/include/asm/msr-index.h | 1
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 4
arch/x86/kernel/cpu/mtrr/if.c | 2
arch/x86/kvm/vmx.c | 2
arch/x86/kvm/x86.c | 4
arch/x86/mm/pat.c | 13 +
drivers/gpio/gpio-max7301.c | 12 -
drivers/gpio/gpiolib-acpi.c | 144 +++++++++++--------
drivers/gpu/drm/drm_ioctl.c | 10 +
drivers/hv/vmbus_drv.c | 20 ++
drivers/input/mouse/elantech.c | 18 ++
drivers/media/i2c/ov5640.c | 17 +-
drivers/mmc/core/mmc.c | 24 +--
drivers/mmc/host/omap_hsmmc.c | 12 +
drivers/net/usb/hso.c | 18 ++
drivers/net/wireless/intel/iwlwifi/mvm/fw.c | 9 +
drivers/net/wireless/intel/iwlwifi/pcie/drv.c | 50 ++++++
drivers/net/wireless/marvell/mwifiex/11n.c | 5
drivers/net/wireless/marvell/mwifiex/11n_rxreorder.c | 96 ++++++------
drivers/net/wireless/marvell/mwifiex/uap_txrx.c | 3
drivers/net/wireless/realtek/rtlwifi/base.c | 1
drivers/scsi/sd.c | 23 ++-
drivers/usb/host/xhci-hub.c | 3
drivers/usb/host/xhci.h | 4
drivers/usb/serial/option.c | 16 +-
fs/iomap.c | 7
fs/namei.c | 3
fs/proc/proc_sysctl.c | 13 -
fs/ubifs/replay.c | 37 ++++
include/asm-generic/4level-fixup.h | 2
include/asm-generic/5level-fixup.h | 2
include/asm-generic/pgtable-nop4d-hack.h | 2
include/asm-generic/pgtable-nop4d.h | 2
include/asm-generic/pgtable-nopmd.h | 2
include/asm-generic/pgtable-nopud.h | 2
include/asm-generic/pgtable.h | 16 ++
include/linux/math64.h | 3
include/linux/mm.h | 8 +
include/linux/t10-pi.h | 9 -
include/net/xfrm.h | 1
kernel/futex.c | 69 ++++++++-
kernel/panic.c | 6
kernel/time/posix-timers.c | 5
mm/huge_memory.c | 20 +-
mm/page_alloc.c | 19 ++
mm/vmscan.c | 6
net/xfrm/xfrm_state.c | 8 -
net/xfrm/xfrm_user.c | 4
55 files changed, 555 insertions(+), 219 deletions(-)
Alistair Strachan (1):
x86/vdso: Pass --eh-frame-hdr to the linker
Benjamin Tissoires (1):
Input: elantech - disable elan-i2c for P52 and P72
Brian Norris (1):
Revert "mwifiex: restructure rx_reorder_tbl_lock usage"
Cfir Cohen (1):
KVM: Fix UAF in nested posted interrupt processing
Christian Brauner (1):
Revert "vfs: Allow userns root to call mknod on owned filesystems."
Christophe Leroy (1):
gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
Colin Ian King (1):
x86/mtrr: Don't copy uninitialized gentry fields back to userspace
Dan Williams (1):
x86/mm: Fix decoy address handling vs 32-bit builds
Dave Chinner (1):
iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"
Dexuan Cui (1):
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
Eduardo Habkost (1):
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
Emmanuel Grumbach (1):
iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares
Greg Kroah-Hartman (1):
Linux 4.19.13
Gustavo A. R. Silva (1):
drm/ioctl: Fix Spectre v1 vulnerabilities
Hans de Goede (1):
gpiolib-acpi: Only defer request_irq for GpioInt ACPI event handlers
Hui Peng (1):
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
Ihab Zhaika (1):
iwlwifi: add new cards for 9560, 9462, 9461 and killer series
Ivan Delalande (1):
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
Jacopo Mondi (1):
media: ov5640: Fix set format regression
Jens Axboe (1):
scsi: sd: use mempool for discard special page
Jörgen Storvist (4):
USB: serial: option: add GosunCn ZTE WeLink ME3630
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
USB: serial: option: add Fibocom NL668 series
USB: serial: option: add Telit LN940 series
Larry Finger (1):
rtlwifi: Fix leak of skb when processing C2H_BT_INFO
Martin K. Petersen (1):
scsi: t10-pi: Return correct ref tag when queue has no integrity profile
Martin Schwidefsky (3):
mm: add mm_pxd_folded checks to pgtable_bytes accounting functions
mm: make the __PAGETABLE_PxD_FOLDED defines non-empty
mm: introduce mm_[p4d|pud|pmd]_folded
Mathias Krause (1):
xfrm_user: fix freeing of xfrm states on acquire
Mathias Nyman (1):
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
Mikhail Zaslonko (1):
mm, memory_hotplug: initialize struct pages for the full memory section
Nicolas Saenz Julienne (1):
USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
Oscar Salvador (1):
mm, page_alloc: fix has_unmovable_pages for HugePages
Peter Xu (1):
mm: thp: fix flags for pmd migration when split
Reinette Chatre (1):
x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence
Richard Weinberger (1):
ubifs: Handle re-linking of inodes correctly while recovery
Roman Gushchin (1):
mm: don't miss the last page because of round-off error
Russell King (1):
mmc: omap_hsmmc: fix DMA API warning
Sergey Senozhatsky (1):
panic: avoid deadlocks in re-entrant console drivers
Thomas Gleixner (2):
posix-timers: Fix division by zero bug
futex: Cure exit race
Tore Anderson (1):
USB: serial: option: add HP lt4132
Ulf Hansson (3):
mmc: core: Reset HPI enabled state during re-init and in case of errors
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
Wanpeng Li (1):
KVM: X86: Fix NULL deref in vcpu_scan_ioapic
This is the start of the stable review cycle for the 4.19.13 release.
There are 46 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun Dec 30 11:30:49 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.13-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.13-rc1
Gustavo A. R. Silva <gustavo(a)embeddedor.com>
drm/ioctl: Fix Spectre v1 vulnerabilities
Ivan Delalande <colona(a)arista.com>
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
Benjamin Tissoires <benjamin.tissoires(a)redhat.com>
Input: elantech - disable elan-i2c for P52 and P72
Roman Gushchin <guro(a)fb.com>
mm: don't miss the last page because of round-off error
Oscar Salvador <osalvador(a)suse.de>
mm, page_alloc: fix has_unmovable_pages for HugePages
Peter Xu <peterx(a)redhat.com>
mm: thp: fix flags for pmd migration when split
Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
mm, memory_hotplug: initialize struct pages for the full memory section
Jacopo Mondi <jacopo+renesas(a)jmondi.org>
media: ov5640: Fix set format regression
Ihab Zhaika <ihab.zhaika(a)intel.com>
iwlwifi: add new cards for 9560, 9462, 9461 and killer series
Brian Norris <briannorris(a)chromium.org>
Revert "mwifiex: restructure rx_reorder_tbl_lock usage"
Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares
Larry Finger <Larry.Finger(a)lwfinger.net>
rtlwifi: Fix leak of skb when processing C2H_BT_INFO
Mathias Krause <minipli(a)googlemail.com>
xfrm_user: fix freeing of xfrm states on acquire
Martin Schwidefsky <schwidefsky(a)de.ibm.com>
mm: introduce mm_[p4d|pud|pmd]_folded
Martin Schwidefsky <schwidefsky(a)de.ibm.com>
mm: make the __PAGETABLE_PxD_FOLDED defines non-empty
Martin Schwidefsky <schwidefsky(a)de.ibm.com>
mm: add mm_pxd_folded checks to pgtable_bytes accounting functions
Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
panic: avoid deadlocks in re-entrant console drivers
Reinette Chatre <reinette.chatre(a)intel.com>
x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence
Alistair Strachan <astrachan(a)google.com>
x86/vdso: Pass --eh-frame-hdr to the linker
Dan Williams <dan.j.williams(a)intel.com>
x86/mm: Fix decoy address handling vs 32-bit builds
Colin Ian King <colin.king(a)canonical.com>
x86/mtrr: Don't copy uninitialized gentry fields back to userspace
Thomas Gleixner <tglx(a)linutronix.de>
futex: Cure exit race
Dexuan Cui <decui(a)microsoft.com>
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
Cfir Cohen <cfir(a)google.com>
KVM: Fix UAF in nested posted interrupt processing
Eduardo Habkost <ehabkost(a)redhat.com>
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
Wanpeng Li <wanpengli(a)tencent.com>
KVM: X86: Fix NULL deref in vcpu_scan_ioapic
Thomas Gleixner <tglx(a)linutronix.de>
posix-timers: Fix division by zero bug
Hans de Goede <hdegoede(a)redhat.com>
gpiolib-acpi: Only defer request_irq for GpioInt ACPI event handlers
Christophe Leroy <christophe.leroy(a)c-s.fr>
gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
Russell King <rmk+kernel(a)armlinux.org.uk>
mmc: omap_hsmmc: fix DMA API warning
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Reset HPI enabled state during re-init and in case of errors
Jens Axboe <axboe(a)kernel.dk>
scsi: sd: use mempool for discard special page
Martin K. Petersen <martin.petersen(a)oracle.com>
scsi: t10-pi: Return correct ref tag when queue has no integrity profile
Richard Weinberger <richard(a)nod.at>
ubifs: Handle re-linking of inodes correctly while recovery
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add Telit LN940 series
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add Fibocom NL668 series
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
Tore Anderson <tore(a)fud.no>
USB: serial: option: add HP lt4132
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add GosunCn ZTE WeLink ME3630
Nicolas Saenz Julienne <nsaenzjulienne(a)suse.de>
USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
Hui Peng <benquike(a)gmail.com>
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
Christian Brauner <christian(a)brauner.io>
Revert "vfs: Allow userns root to call mknod on owned filesystems."
Dave Chinner <dchinner(a)redhat.com>
iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"
-------------
Diffstat:
Makefile | 4 +-
arch/arm/include/asm/pgtable-2level.h | 2 +-
arch/m68k/include/asm/pgtable_mm.h | 4 +-
arch/microblaze/include/asm/pgtable.h | 2 +-
arch/nds32/include/asm/pgtable.h | 2 +-
arch/parisc/include/asm/pgtable.h | 2 +-
arch/x86/entry/vdso/Makefile | 3 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 4 +
arch/x86/kernel/cpu/mtrr/if.c | 2 +
arch/x86/kvm/vmx.c | 2 +
arch/x86/kvm/x86.c | 4 +-
arch/x86/mm/pat.c | 13 +-
drivers/gpio/gpio-max7301.c | 12 +-
drivers/gpio/gpiolib-acpi.c | 144 ++++++++++++---------
drivers/gpu/drm/drm_ioctl.c | 10 +-
drivers/hv/vmbus_drv.c | 20 +++
drivers/input/mouse/elantech.c | 18 ++-
drivers/media/i2c/ov5640.c | 17 ++-
drivers/mmc/core/mmc.c | 24 ++--
drivers/mmc/host/omap_hsmmc.c | 12 +-
drivers/net/usb/hso.c | 18 ++-
drivers/net/wireless/intel/iwlwifi/mvm/fw.c | 9 ++
drivers/net/wireless/intel/iwlwifi/pcie/drv.c | 50 +++++++
drivers/net/wireless/marvell/mwifiex/11n.c | 5 +-
.../net/wireless/marvell/mwifiex/11n_rxreorder.c | 96 +++++++-------
drivers/net/wireless/marvell/mwifiex/uap_txrx.c | 3 -
drivers/net/wireless/realtek/rtlwifi/base.c | 1 +
drivers/scsi/sd.c | 23 +++-
drivers/usb/host/xhci-hub.c | 3 +-
drivers/usb/host/xhci.h | 4 +-
drivers/usb/serial/option.c | 16 ++-
fs/iomap.c | 7 -
fs/namei.c | 3 +-
fs/proc/proc_sysctl.c | 13 +-
fs/ubifs/replay.c | 37 ++++++
include/asm-generic/4level-fixup.h | 2 +-
include/asm-generic/5level-fixup.h | 2 +-
include/asm-generic/pgtable-nop4d-hack.h | 2 +-
include/asm-generic/pgtable-nop4d.h | 2 +-
include/asm-generic/pgtable-nopmd.h | 2 +-
include/asm-generic/pgtable-nopud.h | 2 +-
include/asm-generic/pgtable.h | 16 +++
include/linux/math64.h | 3 +
include/linux/mm.h | 8 ++
include/linux/t10-pi.h | 9 +-
include/net/xfrm.h | 1 +
kernel/futex.c | 69 +++++++++-
kernel/panic.c | 6 +-
kernel/time/posix-timers.c | 5 +-
mm/huge_memory.c | 20 +--
mm/page_alloc.c | 19 ++-
mm/vmscan.c | 6 +-
net/xfrm/xfrm_state.c | 8 +-
net/xfrm/xfrm_user.c | 4 +-
55 files changed, 556 insertions(+), 220 deletions(-)
This is the start of the stable review cycle for the 4.14.91 release.
There are 36 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun Dec 30 11:30:54 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.91-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.14.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.14.91-rc1
Gustavo A. R. Silva <gustavo(a)embeddedor.com>
drm/ioctl: Fix Spectre v1 vulnerabilities
Ivan Delalande <colona(a)arista.com>
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
Roman Gushchin <guro(a)fb.com>
mm: don't miss the last page because of round-off error
Richard Weinberger <richard(a)nod.at>
ubifs: Handle re-linking of inodes correctly while recovery
Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
spi: imx: mx51-ecspi: Move some initialisation to prepare_message hook.
Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
spi: imx: add a device specific prepare_message callback
Ihab Zhaika <ihab.zhaika(a)intel.com>
iwlwifi: add new cards for 9560, 9462, 9461 and killer series
Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares
Sergey Senozhatsky <sergey.senozhatsky.work(a)gmail.com>
panic: avoid deadlocks in re-entrant console drivers
Colin Ian King <colin.king(a)canonical.com>
x86/mtrr: Don't copy uninitialized gentry fields back to userspace
Dexuan Cui <decui(a)microsoft.com>
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
Cfir Cohen <cfir(a)google.com>
KVM: Fix UAF in nested posted interrupt processing
Eduardo Habkost <ehabkost(a)redhat.com>
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
Thomas Gleixner <tglx(a)linutronix.de>
posix-timers: Fix division by zero bug
Hans de Goede <hdegoede(a)redhat.com>
gpiolib-acpi: Only defer request_irq for GpioInt ACPI event handlers
Christophe Leroy <christophe.leroy(a)c-s.fr>
gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
Russell King <rmk+kernel(a)armlinux.org.uk>
mmc: omap_hsmmc: fix DMA API warning
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
Ulf Hansson <ulf.hansson(a)linaro.org>
mmc: core: Reset HPI enabled state during re-init and in case of errors
Jens Axboe <axboe(a)kernel.dk>
scsi: sd: use mempool for discard special page
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add Telit LN940 series
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add Fibocom NL668 series
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
Tore Anderson <tore(a)fud.no>
USB: serial: option: add HP lt4132
Jörgen Storvist <jorgen.storvist(a)gmail.com>
USB: serial: option: add GosunCn ZTE WeLink ME3630
Nicolas Saenz Julienne <nsaenzjulienne(a)suse.de>
USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
Hui Peng <benquike(a)gmail.com>
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
Dan Carpenter <dan.carpenter(a)oracle.com>
cifs: integer overflow in in SMB2_ioctl()
Jiri Olsa <jolsa(a)kernel.org>
perf record: Synthesize features before events in pipe mode
Bart Van Assche <bart.vanassche(a)wdc.com>
ib_srpt: Fix a use-after-free in __srpt_close_all_ch()
Richard Weinberger <richard(a)nod.at>
ubifs: Fix directory size calculation for symlinks
Daniel Mack <daniel(a)zonque.org>
ASoC: sta32x: set ->component pointer in private struct
Mikulas Patocka <mpatocka(a)redhat.com>
block: fix infinite loop if the device loses discard capability
Jens Axboe <axboe(a)kernel.dk>
block: break discard submissions into the user defined size
-------------
Diffstat:
Makefile | 4 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/kernel/cpu/mtrr/if.c | 2 +
arch/x86/kvm/vmx.c | 2 +
arch/x86/kvm/x86.c | 2 +
block/blk-lib.c | 22 +++-
drivers/gpio/gpio-max7301.c | 12 +--
drivers/gpio/gpiolib-acpi.c | 144 +++++++++++++++-----------
drivers/gpu/drm/drm_ioctl.c | 10 +-
drivers/hv/vmbus_drv.c | 20 ++++
drivers/infiniband/ulp/srpt/ib_srpt.c | 4 +-
drivers/mmc/core/mmc.c | 24 +++--
drivers/mmc/host/omap_hsmmc.c | 12 ++-
drivers/net/usb/hso.c | 18 +++-
drivers/net/wireless/intel/iwlwifi/mvm/fw.c | 9 ++
drivers/net/wireless/intel/iwlwifi/pcie/drv.c | 50 +++++++++
drivers/scsi/sd.c | 23 +++-
drivers/spi/spi-imx.c | 91 ++++++++++++----
drivers/usb/host/xhci-hub.c | 3 +-
drivers/usb/host/xhci.h | 4 +-
drivers/usb/serial/option.c | 16 ++-
fs/cifs/smb2pdu.c | 4 +-
fs/proc/proc_sysctl.c | 13 ++-
fs/ubifs/dir.c | 5 +-
fs/ubifs/replay.c | 37 +++++++
include/linux/math64.h | 3 +
kernel/panic.c | 6 +-
kernel/time/posix-timers.c | 5 +-
mm/vmscan.c | 6 +-
sound/soc/codecs/sta32x.c | 3 +
tools/perf/builtin-record.c | 18 ++--
31 files changed, 430 insertions(+), 143 deletions(-)
Hello, please consider backporting to 4.14.y the following commit from
kernel-net-next by Vlastimil Babka [CC'ed]:
d6a24df00638 ("mm, page_alloc: actually ignore mempolicies for high
priority allocations"). It cherry-picks cleanly and builds fine.
The reason for the request is that the commit 1d26c112959f ("mm,
page_alloc: do not break __GFP_THISNODE by zonelist reset") that was
previously backported to 4.14.y broke some of our functionality after we
upgraded from an earlier 4.14 kernel without the fix. Why this happens
is not clear; the commit was only identified by bisect.
Fortunately the requested commit resolves the issue.
Best Regards,
Mike Manning
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2e83ee1d8694a61d0d95a5b694f2e61e8dde8627 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx(a)redhat.com>
Date: Fri, 21 Dec 2018 14:30:50 -0800
Subject: [PATCH] mm: thp: fix flags for pmd migration when split
When splitting a huge migrating PMD, we'll transfer all the existing PMD
bits and apply them again onto the small PTEs. However, we are fetching
the bits unconditionally via pmd_soft_dirty(), pmd_write() or
pmd_young(), while they actually don't make sense at all when it's a
migration entry. Fix them up. While at it, drop the ifdef as it is no
longer needed.
Note that, if my understanding of the problem is correct, then without
the patch there is a chance of losing some of the dirty bits in the
migrating pmd pages (on x86_64 we're fetching bit 11, which is part of
the swap offset, instead of bit 2) and it could potentially corrupt the
memory of a userspace program which depends on the dirty bit.
Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Reviewed-by: William Kucharski <william.kucharski(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Souptick Joarder <jrdr.linux(a)gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5da55b38b1b7..e84a10b0d310 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2144,23 +2144,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
old_pmd = pmdp_invalidate(vma, haddr, pmd);
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
pmd_migration = is_pmd_migration_entry(old_pmd);
- if (pmd_migration) {
+ if (unlikely(pmd_migration)) {
swp_entry_t entry;
entry = pmd_to_swp_entry(old_pmd);
page = pfn_to_page(swp_offset(entry));
- } else
-#endif
+ write = is_write_migration_entry(entry);
+ young = false;
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ } else {
page = pmd_page(old_pmd);
+ if (pmd_dirty(old_pmd))
+ SetPageDirty(page);
+ write = pmd_write(old_pmd);
+ young = pmd_young(old_pmd);
+ soft_dirty = pmd_soft_dirty(old_pmd);
+ }
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
- if (pmd_dirty(old_pmd))
- SetPageDirty(page);
- write = pmd_write(old_pmd);
- young = pmd_young(old_pmd);
- soft_dirty = pmd_soft_dirty(old_pmd);
/*
* Withdraw the table only after we mark the pmd entry invalid.
Please apply 071154499193 ("media: ov5640: Fix set format regression")
to 4.19.y.
Due to the regression, the set_fmt operation updates the sensor format
only when the image format is changed; this patch fixes that.
Thank you,
adam
From: Michal Hocko <mhocko(a)suse.com>
Subject: memcg, oom: notify on oom killer invocation from the charge path
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path. While doing so the notification was left behind in
mem_cgroup_oom_synchronize.
Fix the issue by replicating the oom hierarchy locking and the
notification.
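Condensed from the hunks below, the replicated sequence around the
charge-path OOM kill looks like this (a sketch; the surrounding checks
are elided):

	mem_cgroup_mark_under_oom(memcg);
	locked = mem_cgroup_oom_trylock(memcg);
	if (locked)
		mem_cgroup_oom_notify(memcg);	/* fires the v1 eventfd */
	mem_cgroup_unmark_under_oom(memcg);

	if (mem_cgroup_out_of_memory(memcg, mask, order))
		ret = OOM_SUCCESS;
	else
		ret = OOM_FAILED;

	if (locked)
		mem_cgroup_oom_unlock(memcg);

	return ret;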
Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Burt Holzman <burt(a)fnal.gov>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memcontrol.c~memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path
+++ a/mm/memcontrol.c
@@ -1673,6 +1673,9 @@ enum oom_status {
static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
+ enum oom_status ret;
+ bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
@@ -1707,10 +1710,23 @@ static enum oom_status mem_cgroup_oom(st
return OOM_ASYNC;
}
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
+ if (locked)
+ mem_cgroup_oom_notify(memcg);
+
+ mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
- return OOM_SUCCESS;
+ ret = OOM_SUCCESS;
+ else
+ ret = OOM_FAILED;
+
+ if (locked)
+ mem_cgroup_oom_unlock(memcg);
- return OOM_FAILED;
+ return ret;
}
/**
_
From: Huang Ying <ying.huang(a)intel.com>
Subject: mm, swap: fix swapoff with KSM pages
KSM pages may be mapped in multiple VMAs that cannot all be reached from
one anon_vma. So during swapin, a new copy of the page needs to be
generated if a different anon_vma is needed; please refer to the comments
of ksm_might_need_to_copy() for details.
During swapoff, unuse_vma() uses the anon_vma (if available) to locate the
VMA and virtual address mapped to the page, so not all mappings of a
swapped-out KSM page can be found. So in try_to_unuse(), even if the swap
count of a swap entry isn't zero, the page needs to be deleted from the
swap cache, so that in the next round a new page can be allocated and
swapped in for the other mappings of the swapped-out KSM page.
But this conflicts with THP swap support, where a THP can be deleted from
the swap cache only after the swap count of every swap entry in the huge
swap cluster backing the THP has reached 0. try_to_unuse() was changed in
commit e07098294adf ("mm, THP, swap: support to reclaim swap space for
THP swapped out") to check for that before deleting a page from the swap
cache, but this broke KSM swapoff too.
Fortunately, KSM is used for normal pages only, so the original behavior
for KSM pages can be restored easily by checking PageTransCompound().
That is what this patch does.
The bug was introduced by e07098294adf ("mm, THP, swap: support to reclaim
swap space for THP swapped out"), which was merged in v4.14-rc1, so I
think we should backport the fix to 4.14 and later. But Hugh thinks it may
be rare for KSM pages to be in the swap device at swapoff time, which
would explain why nobody has reported the bug so far.
Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com
Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Hugh Dickins <hughd(a)google.com>
Tested-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Shaohua Li <shli(a)kernel.org>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/swapfile.c~mm-swap-fix-swapoff-with-ksm-pages
+++ a/mm/swapfile.c
@@ -2197,7 +2197,8 @@ int try_to_unuse(unsigned int type, bool
*/
if (PageSwapCache(page) &&
likely(page_private(page) == entry.val) &&
- !page_swapped(page))
+ (!PageTransCompound(page) ||
+ !swap_page_trans_huge_swapped(si, entry)))
delete_from_swap_cache(compound_head(page));
/*
_
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation, which could fail, so that needs to be taken
into account as well.
Instead of writing the complicated code required for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in the
truncation and hole punch code to cover the call to remove_inode_hugepages.
With this modification, the code in remove_inode_hugepages that checks for
races becomes 'dead' as the races can no longer happen. Remove the dead
code and expand the comments to explain the reasoning. Similarly, checks
for races with truncation in the page fault path can be simplified and
removed.
[mike.kravetz(a)oracle.com: incorporate suggestions from Kirill]
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cac
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struc
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struc
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -482,9 +462,20 @@ static void remove_inode_hugepages(struc
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct ino
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct f
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/mm/hugetlb.c
@@ -3755,16 +3755,16 @@ static vm_fault_t hugetlb_no_page(struct
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3854,9 +3854,6 @@ retry:
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3959,8 +3956,10 @@ vm_fault_t hugetlb_fault(struct mm_struc
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
_
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it
was discovered that a huge pte pointer could become 'invalid' and point
to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
called (see the sketch after this list).
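Condensed from the hugetlb_fault() hunks below, the fault-path ordering
these rules imply looks like this (a sketch; error handling and the
fault itself are elided):

	mapping = vma->vm_file->f_mapping;

	/* Keep any shared pmd stable while we hold the ptep. */
	i_mmap_lock_read(mapping);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	if (!ptep) {
		i_mmap_unlock_read(mapping);
		return VM_FAULT_OOM;
	}

	idx = vma_hugecache_offset(h, vma, haddr);
	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);

	/* ... handle the fault using ptep ... */

	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	i_mmap_unlock_read(mapping);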
[mike.kravetz(a)oracle.com: add explicit check for mapping != null]
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: Colin Ian King <colin.king(a)canonical.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/hugetlb.c
@@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -3249,13 +3250,23 @@ int copy_hugetlb_page_range(struct mm_st
mmu_notifier_range_init(&range, src, vma->vm_start,
vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
+ } else {
+ /*
+ * For shared mappings i_mmap_rwsem must be held to call
+ * huge_pte_alloc, otherwise the returned ptep could go
+ * away if part of a shared pmd and another thread calls
+ * huge_pmd_unshare.
+ */
+ i_mmap_lock_read(mapping);
}
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
@@ -3326,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_st
if (cow)
mmu_notifier_invalidate_range_end(&range);
+ else
+ i_mmap_unlock_read(mapping);
return ret;
}
@@ -3771,14 +3784,18 @@ retry:
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3926,6 +3943,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3933,20 +3955,31 @@ vm_fault_t hugetlb_fault(struct mm_struc
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4034,6 +4067,7 @@ out_ptl:
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4638,10 +4672,12 @@ void adjust_range_if_pmd_sharing_possibl
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4658,7 +4694,6 @@ pte_t *huge_pmd_share(struct mm_struct *
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4688,7 +4723,6 @@ pte_t *huge_pmd_share(struct mm_struct *
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4699,7 +4733,7 @@ out:
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
--- a/mm/memory-failure.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/memory-failure.c
@@ -966,7 +966,7 @@ static bool hwpoison_user_mappings(struc
enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
- bool unmap_success;
+ bool unmap_success = true;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
bool mlocked = PageMlocked(hpage);
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struc
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else if (mapping) {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higer level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
--- a/mm/migrate.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/migrate.c
@@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
--- a/mm/rmap.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/rmap.c
@@ -25,6 +25,7 @@
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone_lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1378,6 +1379,9 @@ static bool try_to_unmap_one(struct page
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
--- a/mm/userfaultfd.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/userfaultfd.c
@@ -267,10 +267,14 @@ retry:
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ retry:
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ retry:
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ retry:
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
_
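Taken together, the hunks above follow two locking disciplines:
fault-side paths take i_mmap_rwsem in read mode before the hugetlb fault
mutex, while paths that may reach huge_pmd_unshare take it in write mode
and pass TTU_RMAP_LOCKED down to try_to_unmap. A condensed sketch of
both, with names taken from the hunks above (not a literal kernel
excerpt):

	/* fault-style path, e.g. the userfaultfd hunk */
	i_mmap_lock_read(mapping);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	/* ... look up or allocate the pte ... */
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	i_mmap_unlock_read(mapping);

	/* unmap-style path, e.g. the memory-failure hunk */
	i_mmap_lock_write(mapping);
	try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
	i_mmap_unlock_write(mapping);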
From: Michal Hocko <mhocko(a)suse.com>
Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
We have received a bug report that an injected MCE about faulty memory
prevents memory offline from succeeding on a 4.4 base kernel. The
underlying reason was that the HWPoison page has an elevated reference
count and the migration keeps failing. There are two problems with that.
First of all it is dubious to migrate the poisoned page because we know
that accessing that memory is prone to fail. Secondly it doesn't make any
sense to migrate potentially broken content and preserve the memory
corruption over to a new location.
Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simple testcase
===
#define _GNU_SOURCE	/* for MADV_HWPOISON */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/* Userspace stand-in for the kernel macro; 4KB pages assumed. */
#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	int ret;
	int i;
	int fd;
	char *array = malloc(4096);
	char *array_locked = malloc(4096);

	fd = open("/tmp/data", O_RDONLY);
	read(fd, array, 4095);

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	/* note: sizeof(array_locked) is the pointer size, not 4096,
	 * kept as in the original report */
	ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
	if (ret)
		perror("mlock");

	sleep(20);

	ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
	if (ret)
		perror("madvise");

	for (i = 0; i < 4096; i++)
		array_locked[i] = 'd';

	return 0;
}
===
plus an attempt to offline that memory.
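(To reproduce, the program presumably needs a pre-existing /tmp/data to
read from, and MADV_HWPOISON requires CONFIG_MEMORY_FAILURE plus
CAP_SYS_ADMIN, so the test has to run as root.)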
In 4.4 kernels he saw the hwpoisoned page returned back to the LRU
list:
kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340
kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40
kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50
kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
kernel: [<ffffffff81215f08>] do_execve+0x28/0x30
kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80
And the latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count. It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition, but I am not really sure about that.
With the upstream kernel the failure is slightly different. The page
doesn't seem to have the LRU bit set, but isolate_movable_page simply
fails, do_migrate_range puts all the isolated pages back to the LRU, and
therefore no progress is made and scan_movable_pages finds the same set
of pages over and over again.
Fix both cases by explicitly checking for HWPoisoned pages before we
even try to get a reference on the page, and by trying to unmap such a
page if it is still mapped. As explained by Naoya:
: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.
Also put a WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages, which shouldn't happen but I couldn't convince myself
about that. Naoya has noted the following:
: Theoretically no such guarantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also
: implies that the target page can still have PageLRU flag.
:
: /*
:  * Torn down by someone else?
:  */
: if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
:         action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
:         res = -EBUSY;
:         goto out;
: }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.
Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.com>
Debugged-by: Oscar Salvador <osalvador(a)suse.com>
Tested-by: Oscar Salvador <osalvador(a)suse.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memory_hotplug.c~hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined
+++ a/mm/memory_hotplug.c
@@ -34,6 +34,7 @@
#include <linux/hugetlb.h>
#include <linux/memblock.h>
#include <linux/compaction.h>
+#include <linux/rmap.h>
#include <asm/tlbflush.h>
@@ -1368,6 +1369,21 @@ do_migrate_range(unsigned long start_pfn
pfn = page_to_pfn(compound_head(page))
+ hpage_nr_pages(page) - 1;
+ /*
+ * HWPoison pages have elevated reference counts so the migration would
+ * fail on them. It also doesn't make any sense to migrate them in the
+ * first place. Still try to unmap such a page in case it is still mapped
+ * (e.g. current hwpoison implementation doesn't unmap KSM pages but keeps
+ * the unmap as the catch all safety net).
+ */
+ if (PageHWPoison(page)) {
+ if (WARN_ON(PageLRU(page)))
+ isolate_lru_page(page);
+ if (page_mapped(page))
+ try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
+ continue;
+ }
+
if (!get_page_unless_zero(page))
continue;
/*
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
At Maintainer Summit, Greg brought up a topic I proposed around
EXPORT_SYMBOL_GPL usage. The motivation was considerations for when
EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
step of reclassifying an existing export. Specifically, I wanted to make
the case that although the line is fuzzy and hard to specify in abstract
terms, it is nonetheless clear that devm_memremap_pages() and HMM
(Heterogeneous Memory Management) have crossed it. The
devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
beginning, and HMM as a derivative of that functionality should have
naturally picked up that designation as well.
Contrary to typical rules, the HMM infrastructure was merged upstream with
zero in-tree consumers. There was a promise at the time that those users
would be merged "soon", but it has been over a year with no drivers
arriving. While the Nouveau driver is about to belatedly make good on
that promise, it is clear that HMM was targeted first and foremost at an
out-of-tree consumer.
HMM is derived from devm_memremap_pages(), a facility Christoph and I
spearheaded to support persistent memory. It combines a device lifetime
model with a dynamically created 'struct page' / memmap array for any
physical address range. It enables coordination and control of the many
code paths in the kernel built to interact with memory via 'struct page'
objects. With HMM the integration goes even deeper by allowing device
drivers to hook and manipulate page fault and page free events.
One interpretation of when EXPORT_SYMBOL is suitable is when it is
exporting stable and generic leaf functionality. The
devm_memremap_pages() facility continues to see expanding use cases,
peer-to-peer DMA being the most recent, with no clear end date when it
will stop attracting reworks and semantic changes. It is not suitable to
export devm_memremap_pages() as a stable 3rd party driver API due to the
fact that it is still changing and manipulates core behavior. Moreover,
it is not in the best interest of the long term development of the core
memory management subsystem to permit any external driver to effectively
define its own system-wide memory management policies with no
encouragement to engage with upstream.
I am also concerned that HMM was designed in a way to minimize further
engagement with the core-MM. That is, with these hooks in place,
device drivers are free to implement their own policies without much
consideration for whether and how the core-MM could grow to meet that
need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
core-MM should be allowed the opportunity and stimulus to change and
address these new use cases as first class functionality.
Original changelog:
hmm_devmem_add() and hmm_devmem_add_resource() duplicated
devm_memremap_pages() and are now simple wrappers around the core
facility to inject a dev_pagemap instance into the global pgmap_radix and
hook page-idle events. The devm_memremap_pages() interface is base
infrastructure for HMM. HMM has more and deeper ties into the kernel
memory management implementation than base ZONE_DEVICE, which is itself
an EXPORT_SYMBOL_GPL facility.
Originally, the HMM page structure creation routines copied the
devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the
implementations was discussed during the initial review:
http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
this cleanup to move forward.
In addition to the integration with devm_memremap_pages() HMM depends on
other GPL-only symbols:
mmu_notifier_unregister_no_release
percpu_ref
region_intersects
__class_create
It goes further to consume / indirectly expose functionality that is not
exported to any other driver:
alloc_pages_vma
walk_page_range
HMM is derived from devm_memremap_pages(), and extends deep core-kernel
fundamentals. Similar to devm_memremap_pages(), mark its entry points
EXPORT_SYMBOL_GPL().
[logang(a)deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/drivers/pci/p2pdma.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl
+++ a/drivers/pci/p2pdma.c
@@ -82,10 +82,8 @@ static void pci_p2pdma_percpu_release(st
complete_all(&p2p->devmap_ref_done);
}
-static void pci_p2pdma_percpu_kill(void *data)
+static void pci_p2pdma_percpu_kill(struct percpu_ref *ref)
{
- struct percpu_ref *ref = data;
-
/*
* pci_p2pdma_add_resource() may be called multiple times
* by a driver and may register the percpu_kill devm action multiple
@@ -198,6 +196,7 @@ int pci_p2pdma_add_resource(struct pci_d
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
pci_resource_start(pdev, bar);
+ pgmap->kill = pci_p2pdma_percpu_kill;
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -211,11 +210,6 @@ int pci_p2pdma_add_resource(struct pci_d
if (error)
goto pgmap_free;
- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
- &pdev->p2pdma->devmap_ref);
- if (error)
- goto pgmap_free;
-
pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
&pgmap->res);
--- a/mm/hmm.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl
+++ a/mm/hmm.c
@@ -1110,7 +1110,7 @@ struct hmm_devmem *hmm_devmem_add(const
return result;
return devmem;
}
-EXPORT_SYMBOL(hmm_devmem_add);
+EXPORT_SYMBOL_GPL(hmm_devmem_add);
struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
struct device *device,
@@ -1164,7 +1164,7 @@ struct hmm_devmem *hmm_devmem_add_resour
return result;
return devmem;
}
-EXPORT_SYMBOL(hmm_devmem_add_resource);
+EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
/*
* A device driver that wants to handle multiple devices memory through a
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: fix shutdown handling
The last step before devm_memremap_pages() returns success is to allocate
a release action, devm_memremap_pages_release(), to tear the entire setup
down. However, the result from devm_add_action() is not checked.
Checking the error from devm_add_action() is not enough. The API
currently relies on the fact that the percpu_ref it is using is killed by
the time the devm_memremap_pages_release() is run. Rather than continue
this awkward situation, offload the responsibility of killing the
percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing relative to init failures and
shutdown.
Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large given any future reconfiguration, or disable/enable,
of an nvdimm namespace will fail forever as subsequent calls to
devm_memremap_pages() will fail to setup the pgmap_radix since there will
be stale entries for the physical address range.
An argument could be made to require that the ->kill() operation be set in
the @pgmap arg rather than passed in separately. However, being able to
grep for the kill routine directly at the devm_memremap_pages() call site
helps code readability and tracking the lifetime of a given instance.
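For a driver author the resulting contract is small: initialize the res,
ref and type members as before, and now also kill, before calling
devm_memremap_pages(). A minimal sketch under that contract (the driver
names here are hypothetical, not from the patch):

	static void my_pgmap_kill(struct percpu_ref *ref)
	{
		percpu_ref_kill(ref);	/* transition @ref to the dead state */
	}

	/* ... in probe ... */
	pgmap->ref = &mydev->ref;	/* must be 'live' on entry */
	pgmap->kill = my_pgmap_kill;
	addr = devm_memremap_pages(dev, pgmap);
	if (IS_ERR(addr))
		return PTR_ERR(addr);	/* @ref has already been killed */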
Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwill…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Reviewed-by: "Jérôme Glisse" <jglisse(a)redhat.com>
Reported-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/drivers/dax/pmem.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/drivers/dax/pmem.c
@@ -48,9 +48,8 @@ static void dax_pmem_percpu_exit(void *d
percpu_ref_exit(ref);
}
-static void dax_pmem_percpu_kill(void *data)
+static void dax_pmem_percpu_kill(struct percpu_ref *ref)
{
- struct percpu_ref *ref = data;
struct dax_pmem *dax_pmem = to_dax_pmem(ref);
dev_dbg(dax_pmem->dev, "trace\n");
@@ -112,17 +111,10 @@ static int dax_pmem_probe(struct device
}
dax_pmem->pgmap.ref = &dax_pmem->ref;
+ dax_pmem->pgmap.kill = dax_pmem_percpu_kill;
addr = devm_memremap_pages(dev, &dax_pmem->pgmap);
- if (IS_ERR(addr)) {
- devm_remove_action(dev, dax_pmem_percpu_exit, &dax_pmem->ref);
- percpu_ref_exit(&dax_pmem->ref);
+ if (IS_ERR(addr))
return PTR_ERR(addr);
- }
-
- rc = devm_add_action_or_reset(dev, dax_pmem_percpu_kill,
- &dax_pmem->ref);
- if (rc)
- return rc;
/* adjust the dax_region resource to the start of data */
memcpy(&res, &dax_pmem->pgmap.res, sizeof(res));
--- a/drivers/nvdimm/pmem.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/drivers/nvdimm/pmem.c
@@ -309,8 +309,11 @@ static void pmem_release_queue(void *q)
blk_cleanup_queue(q);
}
-static void pmem_freeze_queue(void *q)
+static void pmem_freeze_queue(struct percpu_ref *ref)
{
+ struct request_queue *q;
+
+ q = container_of(ref, typeof(*q), q_usage_counter);
blk_freeze_queue_start(q);
}
@@ -402,6 +405,7 @@ static int pmem_attach_disk(struct devic
pmem->pfn_flags = PFN_DEV;
pmem->pgmap.ref = &q->q_usage_counter;
+ pmem->pgmap.kill = pmem_freeze_queue;
if (is_nd_pfn(dev)) {
if (setup_pagemap_fsdax(dev, &pmem->pgmap))
return -ENOMEM;
@@ -427,13 +431,6 @@ static int pmem_attach_disk(struct devic
memcpy(&bb_res, &nsio->res, sizeof(bb_res));
}
- /*
- * At release time the queue must be frozen before
- * devm_memremap_pages is unwound
- */
- if (devm_add_action_or_reset(dev, pmem_freeze_queue, q))
- return -ENOMEM;
-
if (IS_ERR(addr))
return PTR_ERR(addr);
pmem->virt_addr = addr;
--- a/include/linux/memremap.h~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/include/linux/memremap.h
@@ -111,6 +111,7 @@ typedef void (*dev_page_free_t)(struct p
* @altmap: pre-allocated/reserved memory for vmemmap allocations
* @res: physical address range covered by @ref
* @ref: reference count that pins the devm_memremap_pages() mapping
+ * @kill: callback to transition @ref to the dead state
* @dev: host device of the mapping for debug
* @data: private data pointer for page_free()
* @type: memory type: see MEMORY_* in memory_hotplug.h
@@ -122,6 +123,7 @@ struct dev_pagemap {
bool altmap_valid;
struct resource res;
struct percpu_ref *ref;
+ void (*kill)(struct percpu_ref *ref);
struct device *dev;
void *data;
enum memory_type type;
--- a/kernel/memremap.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/kernel/memremap.c
@@ -88,14 +88,10 @@ static void devm_memremap_pages_release(
resource_size_t align_start, align_size;
unsigned long pfn;
+ pgmap->kill(pgmap->ref);
for_each_device_pfn(pfn, pgmap)
put_page(pfn_to_page(pfn));
- if (percpu_ref_tryget_live(pgmap->ref)) {
- dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
- percpu_ref_put(pgmap->ref);
- }
-
/* pages are dead and unused, undo the arch mapping */
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
@@ -116,7 +112,7 @@ static void devm_memremap_pages_release(
/**
* devm_memremap_pages - remap and provide memmap backing for the given resource
* @dev: hosting device for @res
- * @pgmap: pointer to a struct dev_pgmap
+ * @pgmap: pointer to a struct dev_pagemap
*
* Notes:
* 1/ At a minimum the res, ref and type members of @pgmap must be initialized
@@ -125,11 +121,8 @@ static void devm_memremap_pages_release(
* 2/ The altmap field may optionally be initialized, in which case altmap_valid
* must be set to true
*
- * 3/ pgmap.ref must be 'live' on entry and 'dead' before devm_memunmap_pages()
- * time (or devm release event). The expected order of events is that ref has
- * been through percpu_ref_kill() before devm_memremap_pages_release(). The
- * wait for the completion of all references being dropped and
- * percpu_ref_exit() must occur after devm_memremap_pages_release().
+ * 3/ pgmap->ref must be 'live' on entry and will be killed at
+ * devm_memremap_pages_release() time, or if this routine fails.
*
* 4/ res is expected to be a host memory range that could feasibly be
* treated as a "System RAM" range, i.e. not a device mmio range, but
@@ -145,6 +138,9 @@ void *devm_memremap_pages(struct device
pgprot_t pgprot = PAGE_KERNEL;
int error, nid, is_ram;
+ if (!pgmap->ref || !pgmap->kill)
+ return ERR_PTR(-EINVAL);
+
align_start = res->start & ~(SECTION_SIZE - 1);
align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
- align_start;
@@ -170,12 +166,10 @@ void *devm_memremap_pages(struct device
if (is_ram != REGION_DISJOINT) {
WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
is_ram == REGION_MIXED ? "mixed" : "ram", res);
- return ERR_PTR(-ENXIO);
+ error = -ENXIO;
+ goto err_array;
}
- if (!pgmap->ref)
- return ERR_PTR(-EINVAL);
-
pgmap->dev = dev;
error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start),
@@ -217,7 +211,10 @@ void *devm_memremap_pages(struct device
align_size >> PAGE_SHIFT, pgmap);
percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
- devm_add_action(dev, devm_memremap_pages_release, pgmap);
+ error = devm_add_action_or_reset(dev, devm_memremap_pages_release,
+ pgmap);
+ if (error)
+ return ERR_PTR(error);
return __va(res->start);
@@ -228,6 +225,7 @@ void *devm_memremap_pages(struct device
err_pfn_remap:
pgmap_array_delete(res);
err_array:
+ pgmap->kill(pgmap->ref);
return ERR_PTR(error);
}
EXPORT_SYMBOL_GPL(devm_memremap_pages);
--- a/tools/testing/nvdimm/test/iomap.c~mm-devm_memremap_pages-fix-shutdown-handling
+++ a/tools/testing/nvdimm/test/iomap.c
@@ -104,13 +104,26 @@ void *__wrap_devm_memremap(struct device
}
EXPORT_SYMBOL(__wrap_devm_memremap);
+static void nfit_test_kill(void *_pgmap)
+{
+ struct dev_pagemap *pgmap = _pgmap;
+
+ pgmap->kill(pgmap->ref);
+}
+
void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
{
resource_size_t offset = pgmap->res.start;
struct nfit_test_resource *nfit_res = get_nfit_res(offset);
- if (nfit_res)
+ if (nfit_res) {
+ int rc;
+
+ rc = devm_add_action_or_reset(dev, nfit_test_kill, pgmap);
+ if (rc)
+ return ERR_PTR(rc);
return nfit_res->buf + offset - nfit_res->res.start;
+ }
return devm_memremap_pages(dev, pgmap);
}
EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: kill mapping "System RAM" support
Given the fact that devm_memremap_pages() requires a percpu_ref that is
torn down by devm_memremap_pages_release(), the current support for
mapping RAM is broken.
Support for remapping "System RAM" has been broken since the beginning
and there is no existing user of this code path, so just kill the support
and make it an explicit error.
This cleanup also simplifies a follow-on patch to fix the error path when
setting a devm release action for devm_memremap_pages_release() fails.
Link: http://lkml.kernel.org/r/154275557997.76910.14689813630968180480.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: "Jérôme Glisse" <jglisse(a)redhat.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/kernel/memremap.c~mm-devm_memremap_pages-kill-mapping-system-ram-support
+++ a/kernel/memremap.c
@@ -167,15 +167,12 @@ void *devm_memremap_pages(struct device
is_ram = region_intersects(align_start, align_size,
IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
- if (is_ram == REGION_MIXED) {
- WARN_ONCE(1, "%s attempted on mixed region %pr\n",
- __func__, res);
+ if (is_ram != REGION_DISJOINT) {
+ WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
+ is_ram == REGION_MIXED ? "mixed" : "ram", res);
return ERR_PTR(-ENXIO);
}
- if (is_ram == REGION_INTERSECTS)
- return __va(res->start);
-
if (!pgmap->ref)
return ERR_PTR(-EINVAL);
_
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give drivers the ability to subvert core
aspects of page management.
Specifically the facility is tightly integrated with the kernel's memory
hotplug functionality. It injects an altmap argument deep into the
architecture specific vmemmap implementation to allow allocating from
specific reserved pages, and it has Linux specific assumptions about page
structure reference counting relative to get_user_pages() and
get_user_pages_fast(). It was an oversight and a mistake that this was
not marked EXPORT_SYMBOL_GPL from the outset.
Again, devm_memremap_pages() exposes and relies upon core kernel internal
assumptions and will continue to evolve along with 'struct page', memory
hotplug, and support for new memory types / topologies. Only an in-kernel
GPL-only driver is expected to keep up with this ongoing evolution. This
interface, and functionality derived from this interface, is not suitable
for kernel-external drivers.
Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwil…
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: Balbir Singh <bsingharora(a)gmail.com>
Cc: Logan Gunthorpe <logang(a)deltatee.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/kernel/memremap.c~mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl
+++ a/kernel/memremap.c
@@ -233,7 +233,7 @@ void *devm_memremap_pages(struct device
err_array:
return ERR_PTR(error);
}
-EXPORT_SYMBOL(devm_memremap_pages);
+EXPORT_SYMBOL_GPL(devm_memremap_pages);
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
{
--- a/tools/testing/nvdimm/test/iomap.c~mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl
+++ a/tools/testing/nvdimm/test/iomap.c
@@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct
return nfit_res->buf + offset - nfit_res->res.start;
return devm_memremap_pages(dev, pgmap);
}
-EXPORT_SYMBOL(__wrap_devm_memremap_pages);
+EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags)
{
_
The patch titled
Subject: hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3
has been removed from the -mm tree. Its filename was
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3.patch
This patch was dropped because it was folded into hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3
Incorporated suggestions from Kirill. The code change is to hold
i_mmap_rwsem for the duration of the copy in copy_hugetlb_page_range (see
the sketch below). Also took i_mmap_rwsem in hugetlbfs_evict_inode to be
consistent with other callers. Other changes were to
documentation/comments.
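A sketch of the copy_hugetlb_page_range change described above, under the
assumption that it follows the same write-mode convention as the other
callers in this series (the actual hunk is not shown in this mail):

	mapping = vma->vm_file->f_mapping;
	i_mmap_lock_write(mapping);	/* held for the duration of the copy */
	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
		/* ... allocate destination ptes and copy the entries;
		 * pmd sharing cannot change underneath us while the
		 * semaphore is held ... */
	}
	i_mmap_unlock_write(mapping);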
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3
+++ a/fs/hugetlbfs/inode.c
@@ -462,9 +462,20 @@ static void remove_inode_hugepages(struc
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
The patch titled
Subject: hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3
has been added to the -mm tree. Its filename is
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/hugetlbfs-use-i_mmap_rwsem-to-fix-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/hugetlbfs-use-i_mmap_rwsem-to-fix-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3
Incorporated suggestions from Kirill. The code change is to hold
i_mmap_rwsem for the duration of the copy in copy_hugetlb_page_range.
Also took i_mmap_rwsem in hugetlbfs_evict_inode to be consistent with
other callers. Other changes were to documentation/comments.
Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3
+++ a/fs/hugetlbfs/inode.c
@@ -462,9 +462,20 @@ static void remove_inode_hugepages(struc
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization-fix.patch
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race-v3.patch
Hi,
Static analysis with CoverityScan on linux-next detected a potential
null pointer dereference with the following commit:
From d8a1051ed4ba55679ef24e838a1942c9c40f0a14 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Sat, 22 Dec 2018 10:55:57 +1100
Subject: [PATCH] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
The earlier check implies that "mapping" may be a null pointer:
var_compare_op: Comparing mapping to null implies that mapping might be
null.
1008 if (!(flags & MF_MUST_KILL) && !PageDirty(hpage) && mapping &&
1009 mapping_cap_writeback_dirty(mapping)) {
However, "mapping" is later dereferenced when it may potentially be null:
1034         /*
1035          * For hugetlb pages, try_to_unmap could potentially call
1036          * huge_pmd_unshare. Because of this, take semaphore in
1037          * write mode here and set TTU_RMAP_LOCKED to indicate we
1038          * have taken the lock at this higher level.
1039          */
CID 1476097 (#1 of 1): Dereference after null check (FORWARD_NULL)
var_deref_model: Passing null pointer mapping to
i_mmap_lock_write, which dereferences it.
1040         i_mmap_lock_write(mapping);
1041         unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
1042         i_mmap_unlock_write(mapping);
Colin
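For what it is worth, the version of this hunk earlier in this digest
already appears to address the report: the lock is only taken under an
"else if (mapping)" guard, with unmap_success pre-initialized to true, so
a NULL mapping is never passed to i_mmap_lock_write(). Condensed from
that hunk:

	if (!PageHuge(hpage)) {
		unmap_success = try_to_unmap(hpage, ttu);
	} else if (mapping) {
		i_mmap_lock_write(mapping);
		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
		i_mmap_unlock_write(mapping);
	}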
The patch titled
Subject: memcg, oom: notify on oom killer invocation from the charge path
has been added to the -mm tree. Its filename is
memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/memcg-oom-notify-on-oom-killer-inv…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/memcg-oom-notify-on-oom-killer-inv…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Michal Hocko <mhocko(a)suse.com>
Subject: memcg, oom: notify on oom killer invocation from the charge path
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path. While doing so the notification was left behind in
mem_cgroup_oom_synchronize.
Fix the issue by replicating the oom hierarchy locking and the
notification.
Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Burt Holzman <burt(a)fnal.gov>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/memcontrol.c~memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path
+++ a/mm/memcontrol.c
@@ -1673,6 +1673,9 @@ enum oom_status {
static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
+ enum oom_status ret;
+ bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
@@ -1707,10 +1710,23 @@ static enum oom_status mem_cgroup_oom(st
return OOM_ASYNC;
}
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
+ if (locked)
+ mem_cgroup_oom_notify(memcg);
+
+ mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
- return OOM_SUCCESS;
+ ret = OOM_SUCCESS;
+ else
+ ret = OOM_FAILED;
+
+ if (locked)
+ mem_cgroup_oom_unlock(memcg);
- return OOM_FAILED;
+ return ret;
}
/**
_
Patches currently in -mm which might be from mhocko(a)suse.com are
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
mm-print-more-information-about-mapping-in-__dump_page.patch
mm-lower-the-printk-loglevel-for-__dump_page-messages.patch
mm-memory_hotplug-drop-pointless-block-alignment-checks-from-__offline_pages.patch
mm-memory_hotplug-print-reason-for-the-offlining-failure.patch
mm-memory_hotplug-be-more-verbose-for-memory-offline-failures.patch
mm-memory_hotplug-be-more-verbose-for-memory-offline-failures-update.patch
mm-only-report-isolation-failures-when-offlining-memory.patch
mm-memory_hotplug-do-not-clear-numa_node-association-after-hot_remove.patch
hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined.patch
mm-proc-be-more-verbose-about-unstable-vma-flags-in-proc-pid-smaps.patch
mm-thp-proc-report-thp-eligibility-for-each-vma.patch
mm-proc-report-pr_set_thp_disable-in-proc.patch
mm-memory_hotplug-try-to-migrate-full-pfn-range.patch
mm-memory_hotplug-deobfuscate-migration-part-of-offlining.patch
mm-fault_around-do-not-take-a-reference-to-a-locked-page.patch
memory_hotplug-add-missing-newlines-to-debugging-output.patch
memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch
Martin,
Right now you get a message
"BUG: non-zero pgtables_bytes on freeing mm: -16384"
for EVERY process that exits in 4.19.5 and later.
bisect points to
commit 4136161d676a93fc8df6bdb80d720c15522d6c24
Author: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Date: Mon Oct 15 11:09:16 2018 +0200
s390/mm: fix mis-accounting of pgtable_bytes
[ Upstream commit e12e4044aede97974f2222eb7f0ed726a5179a32 ]
Turns out that this patch requires several dependencies, which the
autoselection of the patch missed.
Can we either revert this patch or add the dependencies?
Christian
From: Richard Weinberger <richard(a)nod.at>
commit e58725d51fa8da9133f3f1c54170aa2e43056b91 upstream.
UBIFS's recovery code strictly assumes that a deleted inode will never
come back; therefore it removes all data which belongs to that inode
as soon as it encounters an inode with link count 0 in the replay list.
Before O_TMPFILE this assumption was perfectly fine. With O_TMPFILE
it can lead to data loss upon a power-cut.
Consider a journal with entries like:
0: inode X (nlink = 0) /* O_TMPFILE was created */
1: data for inode X /* Someone writes to the temp file */
2: inode X (nlink = 0) /* inode was changed, xattr, chmod, … */
3: inode X (nlink = 1) /* inode was re-linked via linkat() */
Upon replay of entry #2, UBIFS will drop all data that belongs to inode X;
this will lead to an empty file after mounting.
As a solution to this problem, scan the replay list for a re-link entry
before dropping data.
Fixes: 474b93704f32 ("ubifs: Implement O_TMPFILE")
Cc: stable(a)vger.kernel.org # 4.9-4.18
Cc: Russell Senior <russell(a)personaltelco.net>
Cc: Rafał Miłecki <zajec5(a)gmail.com>
Reported-by: Russell Senior <russell(a)personaltelco.net>
Reported-by: Rafał Miłecki <zajec5(a)gmail.com>
Tested-by: Rafał Miłecki <rafal(a)milecki.pl>
Signed-off-by: Richard Weinberger <richard(a)nod.at>
[rmilecki: update ubifs_assert() calls to compile with 4.18 and older]
Signed-off-by: Rafał Miłecki <rafal(a)milecki.pl>
(cherry picked from commit e58725d51fa8da9133f3f1c54170aa2e43056b91)
---
fs/ubifs/replay.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index ae5c02f22f3e..d998fbf7de30 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -210,6 +210,38 @@ static int trun_remove_range(struct ubifs_info *c, struct replay_entry *r)
}
/**
+ * inode_still_linked - check whether inode in question will be re-linked.
+ * @c: UBIFS file-system description object
+ * @rino: replay entry to test
+ *
+ * O_TMPFILE files can be re-linked, this means link count goes from 0 to 1.
+ * This case needs special care, otherwise all references to the inode will
+ * be removed when the first replay entry of an inode with link count 0
+ * is found.
+ */
+static bool inode_still_linked(struct ubifs_info *c, struct replay_entry *rino)
+{
+ struct replay_entry *r;
+
+ ubifs_assert(rino->deletion);
+ ubifs_assert(key_type(c, &rino->key) == UBIFS_INO_KEY);
+
+ /*
+ * Find the most recent entry for the inode behind @rino and check
+ * whether it is a deletion.
+ */
+ list_for_each_entry_reverse(r, &c->replay_list, list) {
+ ubifs_assert(r->sqnum >= rino->sqnum);
+ if (key_inum(c, &r->key) == key_inum(c, &rino->key))
+ return r->deletion == 0;
+
+ }
+
+ ubifs_assert(0);
+ return false;
+}
+
+/**
* apply_replay_entry - apply a replay entry to the TNC.
* @c: UBIFS file-system description object
* @r: replay entry to apply
@@ -239,6 +271,11 @@ static int apply_replay_entry(struct ubifs_info *c, struct replay_entry *r)
{
ino_t inum = key_inum(c, &r->key);
+ if (inode_still_linked(c, r)) {
+ err = 0;
+ break;
+ }
+
err = ubifs_tnc_remove_ino(c, inum);
break;
}
--
2.13.7
Hello,
even though the subject sounds harmless it fixes a real bug.
(Yes, I'm aware that commit isn't in Linus' tree yet, but I assume your
tracking of patches targeting stable is better than mine, so I didn't
wait :-)
For backporting you also need its parent commit (i.e. e697271c4e29
("spi: imx: add a device specific prepare_message callback")). I have a
local backport to v4.14 here which isn't entirely trivial. So just tell
me if you need help.
Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | http://www.pengutronix.de/ |
On Thu, 2018-12-27 at 22:34 +0800, Icenowy Zheng wrote:
> The SMI SM3350 USB-UFS bridge controller cannot handle long sense
> requests correctly and will refuse to do read/write when a long sense
> is requested.
>
> Add a bad sense quirk for it.
>
> Signed-off-by: Icenowy Zheng <icenowy(a)aosc.io>
> ---
I forgot to:
Cc: stable(a)vger.kernel.org
> drivers/usb/storage/unusual_devs.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/usb/storage/unusual_devs.h b/drivers/usb/storage/unusual_devs.h
> index f7f83b21dc74..ea0d27a94afe 100644
> --- a/drivers/usb/storage/unusual_devs.h
> +++ b/drivers/usb/storage/unusual_devs.h
> @@ -1265,6 +1265,18 @@ UNUSUAL_DEV( 0x090c, 0x1132, 0x0000, 0xffff,
> USB_SC_DEVICE, USB_PR_DEVICE, NULL,
> US_FL_FIX_CAPACITY ),
>
> +/*
> + * Reported by Icenowy Zheng <icenowy(a)aosc.io>
> + * The SMI SM3350 USB-UFS bridge controller will enter a wrong state
> + * that does not process read/write commands if a long sense is requested,
> + * so force the use of 18-byte sense.
> + */
> +UNUSUAL_DEV( 0x090c, 0x3350, 0x0000, 0xffff,
> + "SMI",
> + "SM3350 UFS-to-USB-Mass-Storage bridge",
> + USB_SC_DEVICE, USB_PR_DEVICE, NULL,
> + US_FL_BAD_SENSE ),
> +
> /*
> * Reported by Paul Hartman <paul.hartman+linux(a)gmail.com>
> * This card reader returns "Illegal Request, Logical Block Address
On Thu, 2018-12-27 at 22:34 +0800, Icenowy Zheng wrote:
> Currently the code will set the US_FL_SANE_SENSE flag unconditionally if
> the device claims SPC3+; however, we should allow the US_FL_BAD_SENSE
> flag to prevent this behavior, because the SMI SM3350 UFS-USB bridge
> controller, which claims SPC4, will show strange behavior with 96-byte
> sense (putting the chip into a wrong state that cannot read/write
> anything).
>
> Check the presence of US_FL_BAD_SENSE when assuming US_FL_SANE_SENSE on
> SPC4+ devices.
>
> Signed-off-by: Icenowy Zheng <icenowy(a)aosc.io>
> ---
I forgot to:
Cc: stable(a)vger.kernel.org
> drivers/usb/storage/scsiglue.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c
> index fde2e71a6ade..699fe9557127 100644
> --- a/drivers/usb/storage/scsiglue.c
> +++ b/drivers/usb/storage/scsiglue.c
> @@ -236,7 +236,8 @@ static int slave_configure(struct scsi_device *sdev)
> sdev->try_rc_10_first = 1;
>
> /* assume SPC3 or latter devices support sense size > 18 */
> - if (sdev->scsi_level > SCSI_SPC_2)
> + if (sdev->scsi_level > SCSI_SPC_2 &&
> + !(us->fflags & US_FL_BAD_SENSE))
> us->fflags |= US_FL_SANE_SENSE;
>
> /*
The patch titled
Subject: fork,memcg: fix crash in free_thread_stack on memcg charge fail
has been removed from the -mm tree. Its filename was
forkmemcg-fix-crash-in-free_thread_stack-on-memcg-charge-fail.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Rik van Riel <riel(a)surriel.com>
Subject: fork,memcg: fix crash in free_thread_stack on memcg charge fail
Changeset 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
will result in fork failing if allocating a kernel stack for a task
in dup_task_struct exceeds the kernel memory allowance for that cgroup.
Unfortunately, it also results in a crash.
This is due to the code jumping to free_stack and calling free_thread_stack
when the memcg kernel stack charge fails, but without tsk->stack pointing
at the freshly allocated stack.
This in turn results in the vfree_atomic in free_thread_stack oopsing
with a backtrace like this:
#5 [ffffc900244efc88] die at ffffffff8101f0ab
#6 [ffffc900244efcb8] do_general_protection at ffffffff8101cb86
#7 [ffffc900244efce0] general_protection at ffffffff818ff082
[exception RIP: llist_add_batch+7]
RIP: ffffffff8150d487 RSP: ffffc900244efd98 RFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88085ef55980 RCX: 0000000000000000
RDX: ffff88085ef55980 RSI: 343834343531203a RDI: 343834343531203a
RBP: ffffc900244efd98 R8: 0000000000000001 R9: ffff8808578c3600
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88029f6c21c0
R13: 0000000000000286 R14: ffff880147759b00 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffffc900244efda0] vfree_atomic at ffffffff811df2c7
#9 [ffffc900244efdb8] copy_process at ffffffff81086e37
#10 [ffffc900244efe98] _do_fork at ffffffff810884e0
#11 [ffffc900244eff10] sys_vfork at ffffffff810887ff
#12 [ffffc900244eff20] do_syscall_64 at ffffffff81002a43
RIP: 000000000049b948 RSP: 00007ffcdb307830 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000896030 RCX: 000000000049b948
RDX: 0000000000000000 RSI: 00007ffcdb307790 RDI: 00000000005d7421
RBP: 000000000067370f R8: 00007ffcdb3077b0 R9: 000000000001ed00
R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000040
R13: 000000000000000f R14: 0000000000000000 R15: 000000000088d018
ORIG_RAX: 000000000000003a CS: 0033 SS: 002b
The simplest fix is to assign tsk->stack right where it is allocated.
Link: http://lkml.kernel.org/r/20181214231726.7ee4843c@imladris.surriel.com
Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
Signed-off-by: Rik van Riel <riel(a)surriel.com>
Acked-by: Roman Gushchin <guro(a)fb.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/fork.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/kernel/fork.c~forkmemcg-fix-crash-in-free_thread_stack-on-memcg-charge-fail
+++ a/kernel/fork.c
@@ -240,8 +240,10 @@ static unsigned long *alloc_thread_stack
* free_thread_stack() can be called in interrupt context,
* so cache the vm_struct.
*/
- if (stack)
+ if (stack) {
tsk->stack_vm_area = find_vm_area(stack);
+ tsk->stack = stack;
+ }
return stack;
#else
struct page *page = alloc_pages_node(node, THREADINFO_GFP,
@@ -288,7 +290,10 @@ static struct kmem_cache *thread_stack_c
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
int node)
{
- return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
+ unsigned long *stack;
+ stack = kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
+ tsk->stack = stack;
+ return stack;
}
static void free_thread_stack(struct task_struct *tsk)
_
Patches currently in -mm which might be from riel(a)surriel.com are
The patch titled
Subject: mm: thp: fix flags for pmd migration when split
has been removed from the -mm tree. Its filename was
mm-thp-fix-flags-for-pmd-migration-when-split.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm: thp: fix flags for pmd migration when split
When splitting a huge migrating PMD, we'll transfer all the existing PMD
bits and apply them again onto the small PTEs. However, we are fetching
the bits unconditionally via pmd_soft_dirty(), pmd_write() or pmd_young()
while actually they don't make sense at all when it's a migration entry.
Fix them up. While at it, drop the ifdef as it is no longer needed.
Note that if my understanding of the problem is correct, then without
the patch there is a chance of losing some of the dirty bits in migrating
pmd pages (on x86_64 we're fetching bit 11, which is part of the swap
offset, instead of bit 2) and it could potentially corrupt the memory of
a userspace program which depends on the dirty bit.
Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Reviewed-by: William Kucharski <william.kucharski(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Souptick Joarder <jrdr.linux(a)gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
--- a/mm/huge_memory.c~mm-thp-fix-flags-for-pmd-migration-when-split
+++ a/mm/huge_memory.c
@@ -2144,23 +2144,25 @@ static void __split_huge_pmd_locked(stru
*/
old_pmd = pmdp_invalidate(vma, haddr, pmd);
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
pmd_migration = is_pmd_migration_entry(old_pmd);
- if (pmd_migration) {
+ if (unlikely(pmd_migration)) {
swp_entry_t entry;
entry = pmd_to_swp_entry(old_pmd);
page = pfn_to_page(swp_offset(entry));
- } else
-#endif
+ write = is_write_migration_entry(entry);
+ young = false;
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ } else {
page = pmd_page(old_pmd);
+ if (pmd_dirty(old_pmd))
+ SetPageDirty(page);
+ write = pmd_write(old_pmd);
+ young = pmd_young(old_pmd);
+ soft_dirty = pmd_soft_dirty(old_pmd);
+ }
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
- if (pmd_dirty(old_pmd))
- SetPageDirty(page);
- write = pmd_write(old_pmd);
- young = pmd_young(old_pmd);
- soft_dirty = pmd_soft_dirty(old_pmd);
/*
* Withdraw the table only after we mark the pmd entry invalid.
_
Patches currently in -mm which might be from peterx(a)redhat.com are
userfaultfd-clear-flag-if-remap-event-not-enabled.patch
The patch titled
Subject: mm, memory_hotplug: initialize struct pages for the full memory section
has been removed from the -mm tree. Its filename was
mm-memory_hotplug-initialize-struct-pages-for-the-full-memory-section.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
Subject: mm, memory_hotplug: initialize struct pages for the full memory section
If the memory end is not aligned with the sparse memory section
boundary, the mapping of such a section is only partly initialized. This
may lead to a VM_BUG_ON due to an uninitialized struct page access from
the is_mem_section_removable() or test_pages_in_a_zone() functions,
triggered by the memory_hotplug sysfs handlers.
Here are the panic examples:
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y
kernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
([<0000000000385b26>] test_pages_in_a_zone+0xde/0x160)
[<00000000008f15c4>] show_valid_zones+0x5c/0x190
[<00000000008cf9c4>] dev_attr_show+0x34/0x70
[<0000000000463ad0>] sysfs_kf_seq_show+0xc8/0x148
[<00000000003e4194>] seq_read+0x204/0x480
[<00000000003b53ea>] __vfs_read+0x32/0x178
[<00000000003b55b2>] vfs_read+0x82/0x138
[<00000000003b5be2>] ksys_read+0x5a/0xb0
[<0000000000b86ba0>] system_call+0xdc/0x2d8
Last Breaking-Event-Address:
[<0000000000385b26>] test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oops
kernel parameter mem=3075M
--------------------------
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
([<000000000038596c>] is_mem_section_removable+0xb4/0x190)
[<00000000008f12fa>] show_mem_removable+0x9a/0xd8
[<00000000008cf9c4>] dev_attr_show+0x34/0x70
[<0000000000463ad0>] sysfs_kf_seq_show+0xc8/0x148
[<00000000003e4194>] seq_read+0x204/0x480
[<00000000003b53ea>] __vfs_read+0x32/0x178
[<00000000003b55b2>] vfs_read+0x82/0x138
[<00000000003b5be2>] ksys_read+0x5a/0xb0
[<0000000000b86ba0>] system_call+0xdc/0x2d8
Last Breaking-Event-Address:
[<000000000038596c>] is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oops
Fix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.
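As a worked example of the mis-alignment (assuming 256MB sparse memory
sections, a plausible s390 value, and 4KB pages):

	mem=2050M         -> end_pfn = 2050 * 256 = 524800 pages
	PAGES_PER_SECTION =  256MB / 4KB         =  65536
	524800 % 65536    =  512

so only 512 of the last section's 65536 struct pages were initialized,
and the remaining ones are what the loop added below tops up.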
Michal said:
: This has always been a problem AFAIU. It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided
: garbage.
:
: So I guess we do care for post f7f99100d8d9 kernels mostly and
: therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
: allocation in vmemmap")
Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)de.ibm.com>
Suggested-by: Michal Hocko <mhocko(a)kernel.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
Cc: Pasha Tatashin <Pavel.Tatashin(a)microsoft.com>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
--- a/mm/page_alloc.c~mm-memory_hotplug-initialize-struct-pages-for-the-full-memory-section
+++ a/mm/page_alloc.c
@@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned
cond_resched();
}
}
+#ifdef CONFIG_SPARSEMEM
+ /*
+ * If the zone does not span the rest of the section then
+ * we should at least initialize those pages. Otherwise we
+ * could blow up on a poisoned page in some paths which depend
+ * on full sections being initialized (e.g. memory hotplug).
+ */
+ while (end_pfn % PAGES_PER_SECTION) {
+ __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
+ end_pfn++;
+ }
+#endif
}
#ifdef CONFIG_ZONE_DEVICE
_
Patches currently in -mm which might be from zaslonko(a)linux.ibm.com are
From: Paul Mackerras <paulus(a)ozlabs.org>
[ Upstream commit 5564597d51c8ff5b88d95c76255e18b13b760879 ]
Commit 6975a783d7b4 ("powerpc/boot: Allow building the zImage wrapper
as a relocatable ET_DYN", 2011-04-12) changed the procedure descriptor
at the start of crt0.S to have a hard-coded start address of 0x500000
rather than a reference to _zimage_start, presumably because having
a reference to a symbol introduced a relocation which is awkward to
handle in a position-independent executable. Unfortunately, what is
at 0x500000 in the COFF image is not the first instruction, but the
procedure descriptor itself, that is, a word containing 0x500000,
which is not a valid instruction. Hence, booting a COFF zImage
results in a "DEFAULT CATCH!, code=FFF00700" message from Open
Firmware.
This fixes the problem by (a) putting the procedure descriptor in the
data section and (b) adding a branch to _zimage_start as the first
instruction in the program.
Fixes: 6975a783d7b4 ("powerpc/boot: Allow building the zImage wrapper as a relocatable ET_DYN")
Signed-off-by: Paul Mackerras <paulus(a)ozlabs.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/boot/crt0.S | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S
index 8539ac93b0de..dbb06588b594 100644
--- a/arch/powerpc/boot/crt0.S
+++ b/arch/powerpc/boot/crt0.S
@@ -15,7 +15,7 @@
RELA = 7
RELACOUNT = 0x6ffffff9
- .text
+ .data
/* A procedure descriptor used when booting this as a COFF file.
* When making COFF, this comes first in the link and we're
* linked at 0x500000.
@@ -23,6 +23,8 @@ RELACOUNT = 0x6ffffff9
.globl _zimage_start_opd
_zimage_start_opd:
.long 0x500000, 0, 0, 0
+ .text
+ b _zimage_start
#ifdef __powerpc64__
.balign 8
--
2.19.1
From: Paul Mackerras <paulus(a)ozlabs.org>
[ Upstream commit 5564597d51c8ff5b88d95c76255e18b13b760879 ]
Commit 6975a783d7b4 ("powerpc/boot: Allow building the zImage wrapper
as a relocatable ET_DYN", 2011-04-12) changed the procedure descriptor
at the start of crt0.S to have a hard-coded start address of 0x500000
rather than a reference to _zimage_start, presumably because having
a reference to a symbol introduced a relocation which is awkward to
handle in a position-independent executable. Unfortunately, what is
at 0x500000 in the COFF image is not the first instruction, but the
procedure descriptor itself, that is, a word containing 0x500000,
which is not a valid instruction. Hence, booting a COFF zImage
results in a "DEFAULT CATCH!, code=FFF00700" message from Open
Firmware.
This fixes the problem by (a) putting the procedure descriptor in the
data section and (b) adding a branch to _zimage_start as the first
instruction in the program.
Fixes: 6975a783d7b4 ("powerpc/boot: Allow building the zImage wrapper as a relocatable ET_DYN")
Signed-off-by: Paul Mackerras <paulus(a)ozlabs.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/powerpc/boot/crt0.S | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S
index 5c2199857aa8..a3550e8f1a77 100644
--- a/arch/powerpc/boot/crt0.S
+++ b/arch/powerpc/boot/crt0.S
@@ -15,7 +15,7 @@
RELA = 7
RELACOUNT = 0x6ffffff9
- .text
+ .data
/* A procedure descriptor used when booting this as a COFF file.
* When making COFF, this comes first in the link and we're
* linked at 0x500000.
@@ -23,6 +23,8 @@ RELACOUNT = 0x6ffffff9
.globl _zimage_start_opd
_zimage_start_opd:
.long 0x500000, 0, 0, 0
+ .text
+ b _zimage_start
#ifdef __powerpc64__
.balign 8
--
2.19.1
From: Jerome Brunet <jbrunet(a)baylibre.com>
[ Upstream commit 614b1868a125a0ba24be08f3a7fa832ddcde6bca ]
We just changed the code so we apply bias disable on the correct
register but forgot to align the register calculation. The result
is that we apply the change to the correct register, but possibly
at the incorrect offset/bit.
This went undetected because offsets tend to be the same between
REG_PULL and REG_PULLEN for a given pin in the EE controller. This
is not true for the AO controller.
Fixes: e39f9dd8206a ("pinctrl: meson: fix pinconf bias disable")
Signed-off-by: Jerome Brunet <jbrunet(a)baylibre.com>
Acked-by: Neil Armstrong <narmstrong(a)baylibre.com>
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/pinctrl/meson/pinctrl-meson.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pinctrl/meson/pinctrl-meson.c b/drivers/pinctrl/meson/pinctrl-meson.c
index df61a71420b1..8e73641bd823 100644
--- a/drivers/pinctrl/meson/pinctrl-meson.c
+++ b/drivers/pinctrl/meson/pinctrl-meson.c
@@ -274,7 +274,8 @@ static int meson_pinconf_set(struct pinctrl_dev *pcdev, unsigned int pin,
case PIN_CONFIG_BIAS_DISABLE:
dev_dbg(pc->dev, "pin %u: disable bias\n", pin);
- meson_calc_reg_and_bit(bank, pin, REG_PULL, &reg, &bit);
+ meson_calc_reg_and_bit(bank, pin, REG_PULLEN, &reg,
+ &bit);
ret = regmap_update_bits(pc->reg_pullen, reg,
BIT(bit), 0);
if (ret)
--
2.19.1
From: Jerome Brunet <jbrunet(a)baylibre.com>
[ Upstream commit 614b1868a125a0ba24be08f3a7fa832ddcde6bca ]
We just changed the code so we apply bias disable on the correct
register but forgot to align the register calculation. The result
is that we apply the change to the correct register, but possibly
at the incorrect offset/bit.
This went undetected because offsets tend to be the same between
REG_PULL and REG_PULLEN for a given pin in the EE controller. This
is not true for the AO controller.
Fixes: e39f9dd8206a ("pinctrl: meson: fix pinconf bias disable")
Signed-off-by: Jerome Brunet <jbrunet(a)baylibre.com>
Acked-by: Neil Armstrong <narmstrong(a)baylibre.com>
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/pinctrl/meson/pinctrl-meson.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pinctrl/meson/pinctrl-meson.c b/drivers/pinctrl/meson/pinctrl-meson.c
index 6c43322dbb97..2998941fdeca 100644
--- a/drivers/pinctrl/meson/pinctrl-meson.c
+++ b/drivers/pinctrl/meson/pinctrl-meson.c
@@ -272,7 +272,8 @@ static int meson_pinconf_set(struct pinctrl_dev *pcdev, unsigned int pin,
case PIN_CONFIG_BIAS_DISABLE:
dev_dbg(pc->dev, "pin %u: disable bias\n", pin);
- meson_calc_reg_and_bit(bank, pin, REG_PULL, &reg, &bit);
+ meson_calc_reg_and_bit(bank, pin, REG_PULLEN, &reg,
+ &bit);
ret = regmap_update_bits(pc->reg_pullen, reg,
BIT(bit), 0);
if (ret)
--
2.19.1
From: Corentin Labbe <clabbe(a)baylibre.com>
[ Upstream commit 5f8208f557065163f9a8089ea2ea7888f9d96922 ]
Since commit d7c5f6863550 ("ARM: dts: sun8i: a83t: bananapi-m3: Add
AXP813 regulator nodes") my BPIM3 no longer works at gigabit speed.
With the default setting, dldo3 is regulated at 2.9V, which seems
sufficient for the PHY, but the aforementioned commit drops it to 2.5V,
which is insufficient. Note that this behaviour varies across BPIM3
boards: some work at 2.5V, but some don't.
Finally, someone from Bananapi confirmed that this regulator must be set
to 3.3V.
Fixes: d7c5f6863550 ("ARM: dts: sun8i: a83t: bananapi-m3: Add AXP813
regulator nodes")
Signed-off-by: Corentin Labbe <clabbe(a)baylibre.com>
[wens(a)csie.org: Reworked commit message]
Signed-off-by: Chen-Yu Tsai <wens(a)csie.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
arch/arm/boot/dts/sun8i-a83t-bananapi-m3.dts | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/boot/dts/sun8i-a83t-bananapi-m3.dts b/arch/arm/boot/dts/sun8i-a83t-bananapi-m3.dts
index c7ce4158d6c8..f250b20af493 100644
--- a/arch/arm/boot/dts/sun8i-a83t-bananapi-m3.dts
+++ b/arch/arm/boot/dts/sun8i-a83t-bananapi-m3.dts
@@ -309,8 +309,8 @@
&reg_dldo3 {
regulator-always-on;
- regulator-min-microvolt = <2500000>;
- regulator-max-microvolt = <2500000>;
+ regulator-min-microvolt = <3300000>;
+ regulator-max-microvolt = <3300000>;
regulator-name = "vcc-pd";
};
--
2.19.1
From: Tang Junhui <tang.junhui.linux(a)gmail.com>
Stale && dirty keys can be produced in the following way:
After writeback in write_dirty_finish(), dirty key k1 will
be replaced by clean key k2
==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
==>static int bch_btree_insert_node(struct btree *b,
struct btree_op *op,
struct keylist *insert_keys,
atomic_t *journal_ref,
Then two steps:
A) update k1 to k2 in btree node memory;
bch_btree_insert_keys(b, op, insert_keys, replace_key)
B) write the bset (containing k2) to the cache disk via a 30s delayed work
bch_btree_leaf_dirty(b, journal_ref).
But before the 30s delayed work writes the bset to the cache device,
the following can happen:
A) GC runs and reclaims the bucket k2 points to;
B) the allocator invalidates the bucket k2 points to,
increases the gen of the bucket, and places it into the free_inc
fifo;
C) the 30s delayed work has still not finished, so on the disk
the key is still k1; it is dirty and stale
(its gen is smaller than the gen of the bucket), and then the
machine suddenly powers off;
D) when the machine powers on again, after the btree reconstruction,
the stale dirty key appears.
In bch_extent_bad(), when expensive_debug_checks is off, it would
treat the dirty key as good even if it is stale, and this would
cause the problems below:
A) In read_dirty() it would cause a machine crash:
BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
B) Worse, when a read hits a stale dirty key, it would
return old, incorrect data.
This patch tolerates the existence of these stale && dirty keys,
and treats them as bad keys in bch_extent_bad().
(Coly Li: fix indent format which was modified by sender's email
client)
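For illustration, the staleness test the patch relies on can be sketched as
a generation comparison; this hypothetical helper only mirrors what
ptr_stale() checks conceptually and is not the bcache source:

#include <linux/types.h>

/*
 * A pointer stored in a key is stale once the bucket's generation has
 * advanced past the generation recorded in the key, i.e. the bucket was
 * invalidated (and possibly reused) after the key was written. The
 * signed subtraction keeps the comparison safe across gen wraparound.
 */
static inline bool ptr_stale_sketch(u8 bucket_gen, u8 key_gen)
{
	return (s8)(bucket_gen - key_gen) > 0;
}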
Signed-off-by: Tang Junhui <tang.junhui.linux(a)gmail.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Coly Li <colyli(a)suse.de>
---
drivers/md/bcache/extents.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/md/bcache/extents.c b/drivers/md/bcache/extents.c
index 956004366699..886710043025 100644
--- a/drivers/md/bcache/extents.c
+++ b/drivers/md/bcache/extents.c
@@ -538,6 +538,7 @@ static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k)
{
struct btree *b = container_of(bk, struct btree, keys);
unsigned int i, stale;
+ char buf[80];
if (!KEY_PTRS(k) ||
bch_extent_invalid(bk, k))
@@ -547,19 +548,19 @@ static bool bch_extent_bad(struct btree_keys *bk, const struct bkey *k)
if (!ptr_available(b->c, k, i))
return true;
- if (!expensive_debug_checks(b->c) && KEY_DIRTY(k))
- return false;
-
for (i = 0; i < KEY_PTRS(k); i++) {
stale = ptr_stale(b->c, k, i);
+ if (stale && KEY_DIRTY(k)) {
+ bch_extent_to_text(buf, sizeof(buf), k);
+ pr_info("stale dirty pointer, stale %u, key: %s",
+ stale, buf);
+ }
+
btree_bug_on(stale > BUCKET_GC_GEN_MAX, b,
"key too stale: %i, need_gc %u",
stale, b->c->need_gc);
- btree_bug_on(stale && KEY_DIRTY(k) && KEY_SIZE(k),
- b, "stale dirty pointer");
-
if (stale)
return true;
--
2.16.4
On Tue, 2018-12-25 at 15:45 +0000, ? ? wrote:
> Hi, Greg
>
> I found that Debian testing with kernel 4.18.20 fails to boot, with a
> kernel panic in i915, and reported it as Debian bug 917280 [0], with the
> panic log [1].
>
> after revert:
>
> commit 06e562e7f515292ea7721475950f23554214adde
> Author: Chris Wilson <chris(a)chris-wilson.co.uk>
> Date: Mon Nov 5 09:43:05 2018 +0000
>
> drm/i915/ringbuffer: Delay after EMIT_INVALIDATE for gen4/gen5
>
> System boots to desktop.
>
> [0]: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=917280
> [1]:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=917280;filename=dme…
The 4.18 stable branch is no longer maintained.
I suspect this is the same as <https://bugs.debian.org/914495> and
<https://bugs.freedesktop.org/show_bug.cgi?id=108850>, which is fixed
in 4.19 (currently in unstable).
Ben.
--
Ben Hutchings
It is impossible to make anything foolproof
because fools are so ingenious.
Commit f6aa5beb45be ("serial: 8250: Fix clearing FIFOs in RS485 mode
again") makes a change to FIFO clearing code which its commit message
suggests was intended to be specific to use with RS485 mode, however:
1) The change made does not just affect __do_stop_tx_rs485(), it also
affects other uses of serial8250_clear_fifos() including paths for
starting up, shutting down or auto-configuring a port regardless of
whether it's an RS485 port or not.
2) It makes the assumption that resetting the FIFOs is a no-op when
FIFOs are disabled, and as such it checks for this case & explicitly
avoids setting the FIFO reset bits when the FIFO enable bit is
clear. A reading of the PC16550D manual would suggest that this is
OK since the FIFO should automatically be reset if it is later
enabled, but we support many 16550-compatible devices and have never
required this auto-reset behaviour for at least the whole git era.
Starting to rely on it now seems risky, offers no benefit, and
indeed breaks at least the Ingenic JZ4780's UARTs which reads
garbage when the RX FIFO is enabled if we don't explicitly reset it.
3) By only resetting the FIFOs if they're enabled, the behaviour of
serial8250_do_startup() during boot now depends on what the value of
FCR is before the 8250 driver is probed. This in itself seems
questionable and leaves us with FCR=0 & no FIFO reset if the UART
was used by 8250_early, otherwise it depends upon what the
bootloader left behind.
4) Although the naming of serial8250_clear_fifos() may be unclear, it
is clear that callers of it expect that it will disable FIFOs. Both
serial8250_do_startup() & serial8250_do_shutdown() contain comments
to that effect, and other callers explicitly re-enable the FIFOs
after calling serial8250_clear_fifos(). The premise of that patch,
that disabling the FIFOs is incorrect, therefore seems wrong.
For these reasons, this reverts commit f6aa5beb45be ("serial: 8250: Fix
clearing FIFOs in RS485 mode again").
Signed-off-by: Paul Burton <paul.burton(a)mips.com>
Fixes: f6aa5beb45be ("serial: 8250: Fix clearing FIFOs in RS485 mode again")
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Daniel Jedrychowski <avistel(a)gmail.com>
Cc: Marek Vasut <marex(a)denx.de>
Cc: linux-mips(a)vger.kernel.org
Cc: linux-serial(a)vger.kernel.org
Cc: stable <stable(a)vger.kernel.org> # 4.10+
---
I did suggest an alternative approach which would rename
serial8250_clear_fifos() and split it into 2 variants - one that
disables FIFOs & one that does not, then use the latter in
__do_stop_tx_rs485():
https://lore.kernel.org/lkml/20181213014805.77u5dzydo23cm6fq@pburton-laptop/
However I have no access to the OMAP3 hardware that Marek's patch was
attempting to fix & have heard nothing back with regards to him testing
that approach, so here's a simple revert that fixes the Ingenic JZ4780.
I've marked it for stable back to v4.10, presuming that this is how far the
broken patch may be backported, given that this is where commit
2bed8a8e7072 ("Clearing FIFOs in RS485 emulation mode causes subsequent
transmits to break") that it tried to fix was introduced.
---
drivers/tty/serial/8250/8250_port.c | 29 +++++------------------------
1 file changed, 5 insertions(+), 24 deletions(-)
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index f776b3eafb96..3f779d25ec0c 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -552,30 +552,11 @@ static unsigned int serial_icr_read(struct uart_8250_port *up, int offset)
*/
static void serial8250_clear_fifos(struct uart_8250_port *p)
{
- unsigned char fcr;
- unsigned char clr_mask = UART_FCR_CLEAR_RCVR | UART_FCR_CLEAR_XMIT;
-
if (p->capabilities & UART_CAP_FIFO) {
- /*
- * Make sure to avoid changing FCR[7:3] and ENABLE_FIFO bits.
- * In case ENABLE_FIFO is not set, there is nothing to flush
- * so just return. Furthermore, on certain implementations of
- * the 8250 core, the FCR[7:3] bits may only be changed under
- * specific conditions and changing them if those conditions
- * are not met can have nasty side effects. One such core is
- * the 8250-omap present in TI AM335x.
- */
- fcr = serial_in(p, UART_FCR);
-
- /* FIFO is not enabled, there's nothing to clear. */
- if (!(fcr & UART_FCR_ENABLE_FIFO))
- return;
-
- fcr |= clr_mask;
- serial_out(p, UART_FCR, fcr);
-
- fcr &= ~clr_mask;
- serial_out(p, UART_FCR, fcr);
+ serial_out(p, UART_FCR, UART_FCR_ENABLE_FIFO);
+ serial_out(p, UART_FCR, UART_FCR_ENABLE_FIFO |
+ UART_FCR_CLEAR_RCVR | UART_FCR_CLEAR_XMIT);
+ serial_out(p, UART_FCR, 0);
}
}
@@ -1467,7 +1448,7 @@ static void __do_stop_tx_rs485(struct uart_8250_port *p)
* Enable previously disabled RX interrupts.
*/
if (!(p->port.rs485.flags & SER_RS485_RX_DURING_TX)) {
- serial8250_clear_fifos(p);
+ serial8250_clear_and_reinit_fifos(p);
p->ier |= UART_IER_RLSI | UART_IER_RDI;
serial_port_out(&p->port, UART_IER, p->ier);
--
2.20.0
Omer Tripp's analysis of a Spectre V1 gadget in __close_fd():
"1. __close_fd() is reachable via the close() syscall with a
user-controlled fd.
2. If said bounds check is mispredicted, then a user-controlled
address fdt->fd[fd] is obtained and dereferenced, and the value at
that user-controlled address is loaded into the local variable file.
3. file is then passed as an argument to filp_close, where the cache
lines secret + offsetof(f_op) and secret + offsetof(f_mode) are hot
and vulnerable to a timing channel attack."
Address this by using array_index_nospec() to prevent speculation past
the end of current->fdt.
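For illustration, the generic shape of the gadget and its mitigation, as a
sketch with hypothetical names rather than the patch below:

#include <linux/nospec.h>

/*
 * The clamp must sit between the bounds check and the dereference: if
 * the branch is mispredicted, array_index_nospec() forces the index to
 * 0 under speculation, so table[idx] can never speculatively load from
 * an attacker-controlled out-of-bounds address.
 */
static void *lookup_nospec(void **table, unsigned int nr, unsigned int idx)
{
	if (idx >= nr)
		return NULL;
	idx = array_index_nospec(idx, nr);
	return table[idx];
}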
Reported-by: Omer Tripp <trippo(a)google.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Greg Hackmann <ghackmann(a)android.com>
---
v2: include Omer Tripp's analysis in commit message, and update my email
address
fs/file.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..a80cf82be96b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -18,6 +18,7 @@
#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
+#include <linux/nospec.h>
unsigned int sysctl_nr_open __read_mostly = 1024*1024;
unsigned int sysctl_nr_open_min = BITS_PER_LONG;
@@ -626,6 +627,7 @@ int __close_fd(struct files_struct *files, unsigned fd)
fdt = files_fdtable(files);
if (fd >= fdt->max_fds)
goto out_unlock;
+ fd = array_index_nospec(fd, fdt->max_fds);
file = fdt->fd[fd];
if (!file)
goto out_unlock;
--
2.19.1
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From da791a667536bf8322042e38ca85d55a78d3c273 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon, 10 Dec 2018 14:35:14 +0100
Subject: [PATCH] futex: Cure exit race
Stefan reported that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():
CPU0 CPU1
sys_exit() sys_futex()
do_exit() futex_lock_pi()
exit_signals(tsk) No waiters:
tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
mm_release(tsk) Set waiter bit
exit_robust_list(tsk) { *uaddr = 0x80000PID;
Set owner died attach_to_pi_owner() {
*uaddr = 0xC0000000; tsk = get_task(PID);
} if (!tsk->flags & PF_EXITING) {
... attach();
tsk->flags |= PF_EXITPIDONE; } else {
if (!(tsk->flags & PF_EXITPIDONE))
return -EAGAIN;
return -ESRCH; <--- FAIL
}
ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
is set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
Reported-by: Stefan Liebler <stli(a)linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Sasha Levin <sashal(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
return ret;
}
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+ struct task_struct *tsk)
+{
+ u32 uval2;
+
+ /*
+ * If PF_EXITPIDONE is not yet set, then try again.
+ */
+ if (tsk && !(tsk->flags & PF_EXITPIDONE))
+ return -EAGAIN;
+
+ /*
+ * Reread the user space value to handle the following situation:
+ *
+ * CPU0 CPU1
+ *
+ * sys_exit() sys_futex()
+ * do_exit() futex_lock_pi()
+ * futex_lock_pi_atomic()
+ * exit_signals(tsk) No waiters:
+ * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
+ * mm_release(tsk) Set waiter bit
+ * exit_robust_list(tsk) { *uaddr = 0x80000PID;
+ * Set owner died attach_to_pi_owner() {
+ * *uaddr = 0xC0000000; tsk = get_task(PID);
+ * } if (!tsk->flags & PF_EXITING) {
+ * ... attach();
+ * tsk->flags |= PF_EXITPIDONE; } else {
+ * if (!(tsk->flags & PF_EXITPIDONE))
+ * return -EAGAIN;
+ * return -ESRCH; <--- FAIL
+ * }
+ *
+ * Returning ESRCH unconditionally is wrong here because the
+ * user space value has been changed by the exiting task.
+ *
+ * The same logic applies to the case where the exiting task is
+ * already gone.
+ */
+ if (get_futex_value_locked(&uval2, uaddr))
+ return -EFAULT;
+
+ /* If the user space value has changed, try again. */
+ if (uval2 != uval)
+ return -EAGAIN;
+
+ /*
+ * The exiting task did not have a robust list, the robust list was
+ * corrupted or the user space value in *uaddr is simply bogus.
+ * Give up and tell user space.
+ */
+ return -ESRCH;
+}
+
/*
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
struct futex_pi_state **ps)
{
pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
/*
* We are the first waiter - try to look up the real owner and attach
* the new pi_state to it, but bail out when TID = 0 [1]
+ *
+ * The !pid check is paranoid. None of the call sites should end up
+ * with pid == 0, but better safe than sorry. Let the caller retry
*/
if (!pid)
- return -ESRCH;
+ return -EAGAIN;
p = find_get_task_by_vpid(pid);
if (!p)
- return -ESRCH;
+ return handle_exit_race(uaddr, uval, NULL);
if (unlikely(p->flags & PF_KTHREAD)) {
put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
* set, we know that the task has finished the
* cleanup:
*/
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+ int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps);
}
/**
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From da791a667536bf8322042e38ca85d55a78d3c273 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx(a)linutronix.de>
Date: Mon, 10 Dec 2018 14:35:14 +0100
Subject: [PATCH] futex: Cure exit race
Stefan reported that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():
CPU0 CPU1
sys_exit() sys_futex()
do_exit() futex_lock_pi()
exit_signals(tsk) No waiters:
tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
mm_release(tsk) Set waiter bit
exit_robust_list(tsk) { *uaddr = 0x80000PID;
Set owner died attach_to_pi_owner() {
*uaddr = 0xC0000000; tsk = get_task(PID);
} if (!tsk->flags & PF_EXITING) {
... attach();
tsk->flags |= PF_EXITPIDONE; } else {
if (!(tsk->flags & PF_EXITPIDONE))
return -EAGAIN;
return -ESRCH; <--- FAIL
}
ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
is set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
Reported-by: Stefan Liebler <stli(a)linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Sasha Levin <sashal(a)kernel.org>
Cc: stable(a)vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..5cc8083a4c89 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1148,11 +1148,65 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
return ret;
}
+static int handle_exit_race(u32 __user *uaddr, u32 uval,
+ struct task_struct *tsk)
+{
+ u32 uval2;
+
+ /*
+ * If PF_EXITPIDONE is not yet set, then try again.
+ */
+ if (tsk && !(tsk->flags & PF_EXITPIDONE))
+ return -EAGAIN;
+
+ /*
+ * Reread the user space value to handle the following situation:
+ *
+ * CPU0 CPU1
+ *
+ * sys_exit() sys_futex()
+ * do_exit() futex_lock_pi()
+ * futex_lock_pi_atomic()
+ * exit_signals(tsk) No waiters:
+ * tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
+ * mm_release(tsk) Set waiter bit
+ * exit_robust_list(tsk) { *uaddr = 0x80000PID;
+ * Set owner died attach_to_pi_owner() {
+ * *uaddr = 0xC0000000; tsk = get_task(PID);
+ * } if (!tsk->flags & PF_EXITING) {
+ * ... attach();
+ * tsk->flags |= PF_EXITPIDONE; } else {
+ * if (!(tsk->flags & PF_EXITPIDONE))
+ * return -EAGAIN;
+ * return -ESRCH; <--- FAIL
+ * }
+ *
+ * Returning ESRCH unconditionally is wrong here because the
+ * user space value has been changed by the exiting task.
+ *
+ * The same logic applies to the case where the exiting task is
+ * already gone.
+ */
+ if (get_futex_value_locked(&uval2, uaddr))
+ return -EFAULT;
+
+ /* If the user space value has changed, try again. */
+ if (uval2 != uval)
+ return -EAGAIN;
+
+ /*
+ * The exiting task did not have a robust list, the robust list was
+ * corrupted or the user space value in *uaddr is simply bogus.
+ * Give up and tell user space.
+ */
+ return -ESRCH;
+}
+
/*
* Lookup the task for the TID provided from user space and attach to
* it after doing proper sanity checks.
*/
-static int attach_to_pi_owner(u32 uval, union futex_key *key,
+static int attach_to_pi_owner(u32 __user *uaddr, u32 uval, union futex_key *key,
struct futex_pi_state **ps)
{
pid_t pid = uval & FUTEX_TID_MASK;
@@ -1162,12 +1216,15 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
/*
* We are the first waiter - try to look up the real owner and attach
* the new pi_state to it, but bail out when TID = 0 [1]
+ *
+ * The !pid check is paranoid. None of the call sites should end up
+ * with pid == 0, but better safe than sorry. Let the caller retry
*/
if (!pid)
- return -ESRCH;
+ return -EAGAIN;
p = find_get_task_by_vpid(pid);
if (!p)
- return -ESRCH;
+ return handle_exit_race(uaddr, uval, NULL);
if (unlikely(p->flags & PF_KTHREAD)) {
put_task_struct(p);
@@ -1187,7 +1244,7 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
* set, we know that the task has finished the
* cleanup:
*/
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+ int ret = handle_exit_race(uaddr, uval, p);
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
@@ -1244,7 +1301,7 @@ static int lookup_pi_state(u32 __user *uaddr, u32 uval,
* We are the first waiter - try to look up the owner based on
* @uval and attach to it.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, uval, key, ps);
}
static int lock_pi_update_atomic(u32 __user *uaddr, u32 uval, u32 newval)
@@ -1352,7 +1409,7 @@ static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
* attach to the owner. If that fails, no harm done, we only
* set the FUTEX_WAITERS bit in the user space variable.
*/
- return attach_to_pi_owner(uval, key, ps);
+ return attach_to_pi_owner(uaddr, newval, key, ps);
}
/**
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e58725d51fa8da9133f3f1c54170aa2e43056b91 Mon Sep 17 00:00:00 2001
From: Richard Weinberger <richard(a)nod.at>
Date: Wed, 7 Nov 2018 23:04:43 +0100
Subject: [PATCH] ubifs: Handle re-linking of inodes correctly while recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
UBIFS's recovery code strictly assumes that a deleted inode will never
come back; therefore it removes all data which belongs to that inode
as soon as it faces an inode with link count 0 in the replay list.
Before O_TMPFILE this assumption was perfectly fine. With O_TMPFILE
it can lead to data loss upon a power-cut.
Consider a journal with entries like:
0: inode X (nlink = 0) /* O_TMPFILE was created */
1: data for inode X /* Someone writes to the temp file */
2: inode X (nlink = 0) /* inode was changed, xattr, chmod, … */
3: inode X (nlink = 1) /* inode was re-linked via linkat() */
Upon replay of entry #2 UBIFS will drop all data that belongs to inode X,
this will lead to an empty file after mounting.
As a solution to this problem, scan the replay list for a re-link entry
before dropping data.
Fixes: 474b93704f32 ("ubifs: Implement O_TMPFILE")
Cc: stable(a)vger.kernel.org
Cc: Russell Senior <russell(a)personaltelco.net>
Cc: Rafał Miłecki <zajec5(a)gmail.com>
Reported-by: Russell Senior <russell(a)personaltelco.net>
Reported-by: Rafał Miłecki <zajec5(a)gmail.com>
Tested-by: Rafał Miłecki <rafal(a)milecki.pl>
Signed-off-by: Richard Weinberger <richard(a)nod.at>
diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index a08c5b7030ea..0a0e65c07c6d 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -212,6 +212,38 @@ static int trun_remove_range(struct ubifs_info *c, struct replay_entry *r)
return ubifs_tnc_remove_range(c, &min_key, &max_key);
}
+/**
+ * inode_still_linked - check whether inode in question will be re-linked.
+ * @c: UBIFS file-system description object
+ * @rino: replay entry to test
+ *
+ * O_TMPFILE files can be re-linked, this means link count goes from 0 to 1.
+ * This case needs special care, otherwise all references to the inode will
+ * be removed upon the first replay entry of an inode with link count 0
+ * is found.
+ */
+static bool inode_still_linked(struct ubifs_info *c, struct replay_entry *rino)
+{
+ struct replay_entry *r;
+
+ ubifs_assert(c, rino->deletion);
+ ubifs_assert(c, key_type(c, &rino->key) == UBIFS_INO_KEY);
+
+ /*
+ * Find the most recent entry for the inode behind @rino and check
+ * whether it is a deletion.
+ */
+ list_for_each_entry_reverse(r, &c->replay_list, list) {
+ ubifs_assert(c, r->sqnum >= rino->sqnum);
+ if (key_inum(c, &r->key) == key_inum(c, &rino->key))
+ return r->deletion == 0;
+
+ }
+
+ ubifs_assert(c, 0);
+ return false;
+}
+
/**
* apply_replay_entry - apply a replay entry to the TNC.
* @c: UBIFS file-system description object
@@ -239,6 +271,11 @@ static int apply_replay_entry(struct ubifs_info *c, struct replay_entry *r)
{
ino_t inum = key_inum(c, &r->key);
+ if (inode_still_linked(c, r)) {
+ err = 0;
+ break;
+ }
+
err = ubifs_tnc_remove_ino(c, inum);
break;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e58725d51fa8da9133f3f1c54170aa2e43056b91 Mon Sep 17 00:00:00 2001
From: Richard Weinberger <richard(a)nod.at>
Date: Wed, 7 Nov 2018 23:04:43 +0100
Subject: [PATCH] ubifs: Handle re-linking of inodes correctly while recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
UBIFS's recovery code strictly assumes that a deleted inode will never
come back; therefore it removes all data which belongs to that inode
as soon as it faces an inode with link count 0 in the replay list.
Before O_TMPFILE this assumption was perfectly fine. With O_TMPFILE
it can lead to data loss upon a power-cut.
Consider a journal with entries like:
0: inode X (nlink = 0) /* O_TMPFILE was created */
1: data for inode X /* Someone writes to the temp file */
2: inode X (nlink = 0) /* inode was changed, xattr, chmod, … */
3: inode X (nlink = 1) /* inode was re-linked via linkat() */
Upon replay of entry #2 UBIFS will drop all data that belongs to inode X,
this will lead to an empty file after mounting.
As a solution to this problem, scan the replay list for a re-link entry
before dropping data.
Fixes: 474b93704f32 ("ubifs: Implement O_TMPFILE")
Cc: stable(a)vger.kernel.org
Cc: Russell Senior <russell(a)personaltelco.net>
Cc: Rafał Miłecki <zajec5(a)gmail.com>
Reported-by: Russell Senior <russell(a)personaltelco.net>
Reported-by: Rafał Miłecki <zajec5(a)gmail.com>
Tested-by: Rafał Miłecki <rafal(a)milecki.pl>
Signed-off-by: Richard Weinberger <richard(a)nod.at>
diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index a08c5b7030ea..0a0e65c07c6d 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -212,6 +212,38 @@ static int trun_remove_range(struct ubifs_info *c, struct replay_entry *r)
return ubifs_tnc_remove_range(c, &min_key, &max_key);
}
+/**
+ * inode_still_linked - check whether inode in question will be re-linked.
+ * @c: UBIFS file-system description object
+ * @rino: replay entry to test
+ *
+ * O_TMPFILE files can be re-linked, this means link count goes from 0 to 1.
+ * This case needs special care, otherwise all references to the inode will
+ * be removed upon the first replay entry of an inode with link count 0
+ * is found.
+ */
+static bool inode_still_linked(struct ubifs_info *c, struct replay_entry *rino)
+{
+ struct replay_entry *r;
+
+ ubifs_assert(c, rino->deletion);
+ ubifs_assert(c, key_type(c, &rino->key) == UBIFS_INO_KEY);
+
+ /*
+ * Find the most recent entry for the inode behind @rino and check
+ * whether it is a deletion.
+ */
+ list_for_each_entry_reverse(r, &c->replay_list, list) {
+ ubifs_assert(c, r->sqnum >= rino->sqnum);
+ if (key_inum(c, &r->key) == key_inum(c, &rino->key))
+ return r->deletion == 0;
+
+ }
+
+ ubifs_assert(c, 0);
+ return false;
+}
+
/**
* apply_replay_entry - apply a replay entry to the TNC.
* @c: UBIFS file-system description object
@@ -239,6 +271,11 @@ static int apply_replay_entry(struct ubifs_info *c, struct replay_entry *r)
{
ino_t inum = key_inum(c, &r->key);
+ if (inode_still_linked(c, r)) {
+ err = 0;
+ break;
+ }
+
err = ubifs_tnc_remove_ino(c, inum);
break;
}
From: Michal Hocko <mhocko(a)suse.com>
Burt Holzman has noticed that memcg v1 doesn't notify about OOM events
via eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back
to the charge path. While doing so the notification was left behind in
mem_cgroup_oom_synchronize.
Fix the issue by replicating the oom hierarchy locking and the
notification.
Reported-by: Burt Holzman <burt(a)fnal.gov>
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Cc: stable # 4.19+
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
---
Hi Andrew,
I forgot to CC you on the patch sent as a reply to the original bug
report [1] so I am reposting with Ack from Johannes. Burt has confirmed
this is resolving the regression for him [2]. 4.20 is out but I have
marked the patch for stable so it should hit both 4.19 and 4.20.
[1] http://lkml.kernel.org/r/20181221153302.GB6410@dhcp22.suse.cz
[2] http://lkml.kernel.org/r/96D4815C-420F-41B7-B1E9-A741E7523596@services.fnal…
mm/memcontrol.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b80cb7..7e6bf74ddb1e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1666,6 +1666,9 @@ enum oom_status {
static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
+ enum oom_status ret;
+ bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
@@ -1700,10 +1703,23 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
return OOM_ASYNC;
}
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
+ if (locked)
+ mem_cgroup_oom_notify(memcg);
+
+ mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
- return OOM_SUCCESS;
+ ret = OOM_SUCCESS;
+ else
+ ret = OOM_FAILED;
- return OOM_FAILED;
+ if (locked)
+ mem_cgroup_oom_unlock(memcg);
+
+ return ret;
}
/**
--
2.19.2
Big endian machines (at least the one I have access to) cannot mount
f2fs filesystems anymore.
This is with Linux 4.14.89 but I suspect that 4.9.144 (and later) is
affected as well.
commit 0cfe75c5b01199 ("f2fs: enhance sanity_check_raw_super() to avoid
potential overflows") treats the "block_count" from struct
f2fs_super_block as a 32-bit little-endian value instead of a 64-bit
little-endian value.
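A minimal sketch of the width mismatch, assuming a __le64 block_count field
as in struct f2fs_super_block; the struct and helper here are illustrative,
not the exact code in fs/f2fs/super.c:

#include <linux/types.h>

struct f2fs_sb_sketch {
	__le64 block_count;	/* total block count, little endian on disk */
};

static u64 read_block_count(const struct f2fs_sb_sketch *raw)
{
	/*
	 * Buggy form: le32_to_cpu() on a __le64 field truncates the value
	 * and, on big-endian hosts, byte-swaps the wrong width, so the
	 * sanity check sees garbage and rejects a valid superblock:
	 *
	 *	return le32_to_cpu(raw->block_count);
	 *
	 * Correct form: match the accessor to the on-disk field width.
	 */
	return le64_to_cpu(raw->block_count);
}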
I tested this fix on top of Linux 4.14.49 but it seems that all stable
and mainline kernel versions are affected:
- 4.9.144 and later because 0cfe75c5b01199 was backported there
- 4.14.86 and later because 0cfe75c5b01199 was backported there
- 4.19
- 4.20-rcX
Martin Blumenstingl (1):
f2fs: fix validation of the block count in sanity_check_raw_super
fs/f2fs/super.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--
2.20.1
Den 23-12-2018 kl. 01:28, skrev Linus Torvalds:
> On Sat, Dec 22, 2018 at 3:07 PM Christian Brauner
> <christian.brauner(a)canonical.com> wrote:
>>
>> However, for this case should I resend the revert?
>
> Since I was pointed at the original email thread, I just picked it up
> from there directly. It still applied cleanly, nothing had changed in
> that area.
>
> Linus
>
This should also be picked up for the 4.19 LTS tree.
Greg, it's now upstream as:
From 94f82008ce30e2624537d240d64ce718255e0b80 Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian(a)brauner.io>
Date: Thu, 5 Jul 2018 17:51:20 +0200
Subject: Revert "vfs: Allow userns root to call mknod on owned filesystems."
--
Thomas
Mapping the delay slot emulation page as both writeable & executable
presents a security risk, in that if an exploit can write to & jump into
the page then it can be used as an easy way to execute arbitrary code.
Prevent this by mapping the page read-only for userland, and using
access_process_vm() with the FOLL_FORCE flag to write to it from
mips_dsemul().
This will likely be less efficient due to copy_to_user_page() performing
cache maintenance on a whole page, rather than a single line as in the
previous use of flush_cache_sigtramp(). However this delay slot
emulation code ought not to be running in any performance critical paths
anyway so this isn't really a problem, and we can probably do better in
copy_to_user_page() anyway in future.
A major advantage of this approach is that the fix is small & simple to
backport to stable kernels.
Reported-by: Andy Lutomirski <luto(a)kernel.org>
Signed-off-by: Paul Burton <paul.burton(a)mips.com>
Fixes: 432c6bacbd0c ("MIPS: Use per-mm page to execute branch delay slot instructions")
Cc: stable(a)vger.kernel.org # v4.8+
---
arch/mips/kernel/vdso.c | 4 ++--
arch/mips/math-emu/dsemul.c | 38 +++++++++++++++++++------------------
2 files changed, 22 insertions(+), 20 deletions(-)
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 48a9c6b90e07..9df3ebdc7b0f 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -126,8 +126,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* Map delay slot emulation page */
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
- VM_READ|VM_WRITE|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+ VM_READ | VM_EXEC |
+ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
0, NULL);
if (IS_ERR_VALUE(base)) {
ret = base;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 5450f4d1c920..e2d46cb93ca9 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -214,8 +214,9 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
{
int isa16 = get_isa16_mode(regs->cp0_epc);
mips_instruction break_math;
- struct emuframe __user *fr;
- int err, fr_idx;
+ unsigned long fr_uaddr;
+ struct emuframe fr;
+ int fr_idx, ret;
/* NOP is easy */
if (ir == 0)
@@ -250,27 +251,31 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
fr_idx = alloc_emuframe();
if (fr_idx == BD_EMUFRAME_NONE)
return SIGBUS;
- fr = &dsemul_page()[fr_idx];
/* Retrieve the appropriately encoded break instruction */
break_math = BREAK_MATH(isa16);
/* Write the instructions to the frame */
if (isa16) {
- err = __put_user(ir >> 16,
- (u16 __user *)(&fr->emul));
- err |= __put_user(ir & 0xffff,
- (u16 __user *)((long)(&fr->emul) + 2));
- err |= __put_user(break_math >> 16,
- (u16 __user *)(&fr->badinst));
- err |= __put_user(break_math & 0xffff,
- (u16 __user *)((long)(&fr->badinst) + 2));
+ union mips_instruction _emul = {
+ .halfword = { ir >> 16, ir }
+ };
+ union mips_instruction _badinst = {
+ .halfword = { break_math >> 16, break_math }
+ };
+
+ fr.emul = _emul.word;
+ fr.badinst = _badinst.word;
} else {
- err = __put_user(ir, &fr->emul);
- err |= __put_user(break_math, &fr->badinst);
+ fr.emul = ir;
+ fr.badinst = break_math;
}
- if (unlikely(err)) {
+ /* Write the frame to user memory */
+ fr_uaddr = (unsigned long)&dsemul_page()[fr_idx];
+ ret = access_process_vm(current, fr_uaddr, &fr, sizeof(fr),
+ FOLL_FORCE | FOLL_WRITE);
+ if (unlikely(ret != sizeof(fr))) {
MIPS_FPU_EMU_INC_STATS(errors);
free_emuframe(fr_idx, current->mm);
return SIGBUS;
@@ -282,10 +287,7 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
atomic_set(&current->thread.bd_emu_frame, fr_idx);
/* Change user register context to execute the frame */
- regs->cp0_epc = (unsigned long)&fr->emul | isa16;
-
- /* Ensure the icache observes our newly written frame */
- flush_cache_sigtramp((unsigned long)&fr->emul);
+ regs->cp0_epc = fr_uaddr | isa16;
return 0;
}
--
2.20.0
The AFU Descriptor Template in the PCI config space has a Name Space
field, which is a 24-byte ASCII string holding a descriptive name
for the AFU. The OCXL driver reads the string four characters at
a time with pci_read_config_dword().
This optimization is valid on a little-endian system since this is PCI,
but a big-endian system ends up with each group of four characters in
reverse order.
This could be fixed by switching to reading characters one by one. Another
option is to swap the bytes if we're big-endian.
Go for the latter with le32_to_cpu().
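For illustration, a minimal sketch of the resulting read loop, assuming the
24-byte name field described above; the offset parameter and helper name are
assumptions for clarity, not the driver's exact code:

#include <linux/pci.h>
#include <linux/string.h>

#define AFU_NAME_SZ 24	/* 24-byte Name Space field, per the text above */

/*
 * PCI config space is little endian: each 32-bit word must be converted
 * to CPU order before its bytes are stored into the name buffer, or a
 * big-endian host stores every group of four characters reversed.
 */
static int read_afu_name_sketch(struct pci_dev *dev, int name_off, char *name)
{
	u32 val;
	int i, rc;

	for (i = 0; i < AFU_NAME_SZ; i += 4) {
		rc = pci_read_config_dword(dev, name_off + i, &val);
		if (rc)
			return rc;
		val = le32_to_cpu((__force __le32) val);
		memcpy(&name[i], &val, 4);
	}
	name[AFU_NAME_SZ - 1] = '\0';	/* play safe */
	return 0;
}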
Cc: stable(a)vger.kernel.org # v4.16
Signed-off-by: Greg Kurz <groug(a)kaod.org>
Acked-by: Frederic Barrat <fbarrat(a)linux.ibm.com>
---
v2: - silence sparse with (__force __le32) cast
- new changelog
---
drivers/misc/ocxl/config.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 57a6bb1fd3c9..8f2c5d8bd2ee 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -318,7 +318,7 @@ static int read_afu_name(struct pci_dev *dev, struct ocxl_fn_config *fn,
if (rc)
return rc;
ptr = (u32 *) &afu->name[i];
- *ptr = val;
+ *ptr = le32_to_cpu((__force __le32) val);
}
afu->name[OCXL_AFU_NAME_SZ - 1] = '\0'; /* play safe */
return 0;
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without the TEXASR[FS] bit set either.
Since TEXASR might not have the FS bit set, when the process is
scheduled back it will try to reclaim, which will be aborted because
the CPU is not in the suspended state, and then recheckpoint. This
recheckpoint will restore thread->texasr into the TEXASR SPR, which might
be zero, hitting a BUG_ON():
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel(a)lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays setting MSR[TS], so, if any page fault happens in
the __get_user() section, regs->msr[TS] is not yet set, since the TM
structures are still invalid, thus avoiding TM operations for
in-kernel exceptions and a possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(); thus this block must
be atomic from a preemption perspective, hence the
preempt_disable()/preempt_enable() calls around this code.
It is not possible to move tm_recheckpoint earlier, because the
checkpointed registers must first be fetched from userspace with
__get_user(); thus, the only way to avoid this undesired behavior is
to delay setting MSR[TS].
The 32-bit signal handler seems to be safe from this particular issue, but
it might be exposed to the preemption issue, so preemption is disabled in
that chunk of code as well.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable(a)vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao(a)debian.org>
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index e6474a45cef5..fd59fef9931b 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -848,7 +848,23 @@ static long restore_tm_user_regs(struct pt_regs *regs,
/* If TM bits are set to the reserved value, it's an invalid context */
if (MSR_TM_RESV(msr_hi))
return 1;
- /* Pull in the MSR TM bits from the user context */
+
+ /*
+ * Disabling preemption, since it is unsafe to be preempted
+ * with MSR[TS] set without recheckpointing.
+ */
+ preempt_disable();
+
+ /*
+ * CAUTION:
+ * After regs->MSR[TS] being updated, make sure that get_user(),
+ * put_user() or similar functions are *not* called. These
+ * functions can generate page faults which will cause the process
+ * to be de-scheduled with MSR[TS] set but without calling
+ * tm_recheckpoint(). This can cause a bug.
+ *
+ * Pull in the MSR TM bits from the user context
+ */
regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr_hi & MSR_TS_MASK);
/* Now, recheckpoint. This loads up all of the checkpointed (older)
* registers, including FP and V[S]Rs. After recheckpointing, the
@@ -873,6 +889,8 @@ static long restore_tm_user_regs(struct pt_regs *regs,
}
#endif
+ preempt_enable();
+
return 0;
}
#endif
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 83d51bf586c7..bbd1c73243d7 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
if (MSR_TM_RESV(msr))
return -EINVAL;
- /* pull in MSR TS bits from user context */
- regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
-
- /*
- * Ensure that TM is enabled in regs->msr before we leave the signal
- * handler. It could be the case that (a) user disabled the TM bit
- * through the manipulation of the MSR bits in uc_mcontext or (b) the
- * TM bit was disabled because a sufficient number of context switches
- * happened whilst in the signal handler and load_tm overflowed,
- * disabling the TM bit. In either case we can end up with an illegal
- * TM state leading to a TM Bad Thing when we return to userspace.
- */
- regs->msr |= MSR_TM;
-
/* pull in MSR LE from user context */
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
@@ -572,6 +558,34 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
tm_enable();
/* Make sure the transaction is marked as failed */
tsk->thread.tm_texasr |= TEXASR_FS;
+
+ /*
+ * Disabling preemption, since it is unsafe to be preempted
+ * with MSR[TS] set without recheckpointing.
+ */
+ preempt_disable();
+
+ /* pull in MSR TS bits from user context */
+ regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
+
+ /*
+ * Ensure that TM is enabled in regs->msr before we leave the signal
+ * handler. It could be the case that (a) user disabled the TM bit
+ * through the manipulation of the MSR bits in uc_mcontext or (b) the
+ * TM bit was disabled because a sufficient number of context switches
+ * happened whilst in the signal handler and load_tm overflowed,
+ * disabling the TM bit. In either case we can end up with an illegal
+ * TM state leading to a TM Bad Thing when we return to userspace.
+ *
+ * CAUTION:
+ * After regs->MSR[TS] being updated, make sure that get_user(),
+ * put_user() or similar functions are *not* called. These
+ * functions can generate page faults which will cause the process
+ * to be de-scheduled with MSR[TS] set but without calling
+ * tm_recheckpoint(). This can cause a bug.
+ */
+ regs->msr |= MSR_TM;
+
/* This loads the checkpointed FP/VEC state, if used */
tm_recheckpoint(&tsk->thread);
@@ -585,6 +599,8 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
regs->msr |= MSR_VEC;
}
+ preempt_enable();
+
return err;
}
#endif
--
2.19.0
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 747df19747bc9752cd40b9cce761e17a033aa5c2 Mon Sep 17 00:00:00 2001
From: Daniel Mack <daniel(a)zonque.org>
Date: Thu, 11 Oct 2018 20:32:05 +0200
Subject: [PATCH] ASoC: sta32x: set ->component pointer in private struct
The ESD watchdog code in sta32x_watchdog() dereferences the pointer
which is never assigned.
This is a regression from a1be4cead9b950 ("ASoC: sta32x: Convert to direct
regmap API usage.") which went unnoticed since nobody seems to use that ESD
workaround.
Fixes: a1be4cead9b950 ("ASoC: sta32x: Convert to direct regmap API usage.")
Signed-off-by: Daniel Mack <daniel(a)zonque.org>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
diff --git a/sound/soc/codecs/sta32x.c b/sound/soc/codecs/sta32x.c
index d5035f2f2b2b..ce508b4cc85c 100644
--- a/sound/soc/codecs/sta32x.c
+++ b/sound/soc/codecs/sta32x.c
@@ -879,6 +879,9 @@ static int sta32x_probe(struct snd_soc_component *component)
struct sta32x_priv *sta32x = snd_soc_component_get_drvdata(component);
struct sta32x_platform_data *pdata = sta32x->pdata;
int i, ret = 0, thermal = 0;
+
+ sta32x->component = component;
+
ret = regulator_bulk_enable(ARRAY_SIZE(sta32x->supplies),
sta32x->supplies);
if (ret != 0) {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2d204ee9d671327915260071c19350d84344e096 Mon Sep 17 00:00:00 2001
From: Dan Carpenter <dan.carpenter(a)oracle.com>
Date: Mon, 10 Sep 2018 14:12:07 +0300
Subject: [PATCH] cifs: integer overflow in in SMB2_ioctl()
The "le32_to_cpu(rsp->OutputOffset) + *plen" addition can overflow and
wrap around to a smaller value which looks like it would lead to an
information leak.
Fixes: 4a72dafa19ba ("SMB2 FSCTL and IOCTL worker function")
Signed-off-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel(a)suse.com>
CC: Stable <stable(a)vger.kernel.org>
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 6f0e6b42599c..f54d07bda067 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2459,14 +2459,14 @@ SMB2_ioctl(const unsigned int xid, struct cifs_tcon *tcon, u64 persistent_fid,
/* We check for obvious errors in the output buffer length and offset */
if (*plen == 0)
goto ioctl_exit; /* server returned no data */
- else if (*plen > 0xFF00) {
+ else if (*plen > rsp_iov.iov_len || *plen > 0xFF00) {
cifs_dbg(VFS, "srv returned invalid ioctl length: %d\n", *plen);
*plen = 0;
rc = -EIO;
goto ioctl_exit;
}
- if (rsp_iov.iov_len < le32_to_cpu(rsp->OutputOffset) + *plen) {
+ if (rsp_iov.iov_len - *plen < le32_to_cpu(rsp->OutputOffset)) {
cifs_dbg(VFS, "Malformed ioctl resp: len %d offset %d\n", *plen,
le32_to_cpu(rsp->OutputOffset));
*plen = 0;
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation, which could fail, so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem
longer in the truncation and hole punch code to cover the call to
remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for
races becomes 'dead' as it can no longer happen. Remove the dead code
and expand comments to explain the reasoning. Similarly, checks for races
with truncation in the page fault path can be simplified and removed.
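To make the new locking discipline concrete, here is a sketch of the
intended i_mmap_rwsem usage on both sides; the function bodies are
placeholders, not the patched kernel code:

#include <linux/fs.h>
#include <linux/mm_types.h>

/* Fault side: hold the semaphore shared for all of fault processing. */
static vm_fault_t fault_path_sketch(struct address_space *mapping)
{
	vm_fault_t ret = 0;

	i_mmap_lock_read(mapping);
	/* ... check i_size, allocate the page, add it to the cache ... */
	i_mmap_unlock_read(mapping);
	return ret;
}

/* Truncate/hole punch side: hold it exclusive across both the unmap and
 * the page removal, so no fault can interleave with the removal. */
static void truncate_path_sketch(struct address_space *mapping)
{
	i_mmap_lock_write(mapping);
	/* ... hugetlb_vmdelete_list(&mapping->i_mmap, ...) to unmap ... */
	/* ... remove_inode_hugepages(inode, ...) now under the lock ... */
	i_mmap_unlock_write(mapping);
}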
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/hugetlbfs/inode.c | 61 ++++++++++++++++++++------------------------
mm/hugetlb.c | 21 ++++++++-------
2 files changed, 38 insertions(+), 44 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..a2fcea5f8225 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -482,9 +462,20 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
static void hugetlbfs_evict_inode(struct inode *inode)
{
+ struct address_space *mapping = inode->i_mapping;
struct resv_map *resv_map;
+ /*
+ * The vfs layer guarantees that there are no other users of this
+ * inode. Therefore, it would be safe to call remove_inode_hugepages
+ * without holding i_mmap_rwsem. We acquire and hold here to be
+ * consistent with other callers. Since there will be no contention
+ * on the semaphore, overhead is negligible.
+ */
+ i_mmap_lock_write(mapping);
remove_inode_hugepages(inode, 0, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
+
resv_map = (struct resv_map *)inode->i_mapping->private_data;
/* root inode doesn't have the resv_map, so we should check it */
if (resv_map)
@@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2a3162030167..cfd9790b01e3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3757,16 +3757,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3856,9 +3856,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3961,8 +3958,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
--
2.17.2
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem
longer in the truncation and hole punch code to cover the call to
remove_inode_hugepages.
With this modification, code in remove_inode_hugepages checking for
races becomes 'dead' as it can no longer happen. Remove the dead code
and expand comments to explain reasoning. Similarly, checks for races
with truncation in the page fault path can be simplified and removed.
Cc: <stable(a)vger.kernel.org>
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
fs/hugetlbfs/inode.c | 50 +++++++++++++++-----------------------------
mm/hugetlb.c | 21 +++++++++----------
2 files changed, 27 insertions(+), 44 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 32920a10100e..a9c00c6ef80d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -505,8 +485,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +520,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +604,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ab4c77b8c72c..25a0cd2f8b39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3760,16 +3760,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3859,9 +3859,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3964,8 +3961,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
--
2.17.2
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 55e56f06ed71d9441f3abd5b1d3c1a870812b3fe Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <willy(a)infradead.org>
Date: Tue, 27 Nov 2018 13:16:34 -0800
Subject: [PATCH] dax: Don't access a freed inode
After we drop the i_pages lock, the inode can be freed at any time.
The get_unlocked_entry() code has no choice but to reacquire the lock,
so it can't be used here. Create a new wait_entry_unlocked() which takes
care not to acquire the lock or dereference the address_space in any way.
Fixes: c2a7d2a11552 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Matthew Wilcox <willy(a)infradead.org>
Reviewed-by: Jan Kara <jack(a)suse.cz>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
diff --git a/fs/dax.c b/fs/dax.c
index e69fc231833b..3f592dc18d67 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -232,6 +232,34 @@ static void *get_unlocked_entry(struct xa_state *xas)
}
}
+/*
+ * The only thing keeping the address space around is the i_pages lock
+ * (it's cycled in clear_inode() after removing the entries from i_pages)
+ * After we call xas_unlock_irq(), we cannot touch xas->xa.
+ */
+static void wait_entry_unlocked(struct xa_state *xas, void *entry)
+{
+ struct wait_exceptional_entry_queue ewait;
+ wait_queue_head_t *wq;
+
+ init_wait(&ewait.wait);
+ ewait.wait.func = wake_exceptional_entry_func;
+
+ wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+ prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+ xas_unlock_irq(xas);
+ schedule();
+ finish_wait(wq, &ewait.wait);
+
+ /*
+ * Entry lock waits are exclusive. Wake up the next waiter since
+ * we aren't sure we will acquire the entry lock and thus wake
+ * the next waiter up on unlock.
+ */
+ if (waitqueue_active(wq))
+ __wake_up(wq, TASK_NORMAL, 1, &ewait.key);
+}
+
static void put_unlocked_entry(struct xa_state *xas, void *entry)
{
/* If we were the only waiter woken, wake the next one */
@@ -389,9 +417,7 @@ bool dax_lock_mapping_entry(struct page *page)
entry = xas_load(&xas);
if (dax_is_locked(entry)) {
rcu_read_unlock();
- entry = get_unlocked_entry(&xas);
- xas_unlock_irq(&xas);
- put_unlocked_entry(&xas, entry);
+ wait_entry_unlocked(&xas, entry);
rcu_read_lock();
continue;
}
Hi Sasha,
On Thu, Dec 20, 2018 at 07:26:15PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 432c6bacbd0c MIPS: Use per-mm page to execute branch delay slot instructions.
>
> The bot has tested the following trees: v4.19.10, v4.14.89, v4.9.146,
Neat! I like the idea of this automation :)
> v4.19.10: Build OK!
> v4.14.89: Build OK!
> v4.9.146: Failed to apply! Possible dependencies:
> 05ce77249d50 ("userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED request")
> 163e11bc4f6e ("userfaultfd: hugetlbfs: UFFD_FEATURE_MISSING_HUGETLBFS")
> 67dece7d4c58 ("x86/vdso: Set vDSO pointer only after success")
> 72f87654c696 ("userfaultfd: non-cooperative: add mremap() event")
> 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event")
> 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps")
> 9cd75c3cd4c3 ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
> d811914d8757 ("userfaultfd: non-cooperative: rename *EVENT_MADVDONTNEED to *EVENT_REMOVE")
This list includes the correct soft dependency - commit 897ab3e0c49e
("userfaultfd: non-cooperative: add event for memory unmaps") which
added an extra argument to mmap_region().
> How should we proceed with this patch?
The backport to v4.9 should simply drop the last argument (NULL) in the
call to mmap_region().
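Concretely (prototypes paraphrased from the two trees, so worth
double-checking against the actual sources), the difference amounts to:

/* v4.14+, after 897ab3e0c49e added the userfaultfd unmap list: */
unsigned long mmap_region(struct file *file, unsigned long addr,
			  unsigned long len, vm_flags_t vm_flags,
			  unsigned long pgoff, struct list_head *uf);

/* v4.9, before that commit: */
unsigned long mmap_region(struct file *file, unsigned long addr,
			  unsigned long len, vm_flags_t vm_flags,
			  unsigned long pgoff);

so a v4.9 backport only needs to delete the trailing NULL at the
patch's call site.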
Is there some way I can indicate this sort of thing in future patches so
that the automation can spot that I already know it won't apply cleanly
to a particular range of kernel versions? Or even better, that I could
indicate what needs to be changed when backporting to those kernel
versions?
Thanks,
Paul
+cc Greg, stable
Greensky, James J <james.j.greensky(a)intel.com> wrote on Fri, Dec 21, 2018 at 11:48 AM:
>
> Commit d38d272592737ea88a20 ("perf tools: Synthesize GROUP_DESC feature in pipe mode") broke the LTS 4.14 branch when using event groups in pipe mode.
>
> # perf record -e '{cycles,instructions,branches}' -- sleep 4 | perf report
> # To display the perf.data header info, please use --header/--header-only options
> #
> 0xd7c [0x60]: failed to process type: 80
> Error:
> Failed to process sample
>
> Commit a2015516c5c0be932a69 ("perf record: Synthesize features before events in pipe mode") is the fix. Can we get this cherry-picked and applied?
>
If we fail to pin the ggtt vma slot for the ppgtt page tables, we need
to unwind the local state before reporting the error. Otherwise, on
subsequent attempts to bind the page tables into the ggtt, we will
already believe that the vma has been pinned and continue on blithely.
If something else should happen to be at that location, chaos ensues.
Fixes: a2bbf7148342 ("drm/i915/gtt: Only keep gen6 page directories pinned while active")
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala(a)linux.intel.com>
Cc: Matthew Auld <matthew.william.auld(a)gmail.com>
Cc: <stable(a)vger.kernel.org> # v4.19+
---
drivers/gpu/drm/i915/i915_gem_gtt.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
index 6e31745f6156..4ed2f3e61347 100644
--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -2073,6 +2073,7 @@ static struct i915_vma *pd_vma_create(struct gen6_hw_ppgtt *ppgtt, int size)
int gen6_ppgtt_pin(struct i915_hw_ppgtt *base)
{
struct gen6_hw_ppgtt *ppgtt = to_gen6_ppgtt(base);
+ int err;
/*
* Workaround the limited maximum vma->pin_count and the aliasing_ppgtt
@@ -2088,9 +2089,17 @@ int gen6_ppgtt_pin(struct i915_hw_ppgtt *base)
* allocator works in address space sizes, so it's multiplied by page
* size. We allocate at the top of the GTT to avoid fragmentation.
*/
- return i915_vma_pin(ppgtt->vma,
- 0, GEN6_PD_ALIGN,
- PIN_GLOBAL | PIN_HIGH);
+ err = i915_vma_pin(ppgtt->vma,
+ 0, GEN6_PD_ALIGN,
+ PIN_GLOBAL | PIN_HIGH);
+ if (err)
+ goto unpin;
+
+ return 0;
+
+unpin:
+ ppgtt->pin_count = 0;
+ return err;
}
void gen6_ppgtt_unpin(struct i915_hw_ppgtt *base)
--
2.20.1
Big endian machines (at least the one I have access to) cannot mount
f2fs filesystems anymore.
This is with Linux 4.14.89 but I suspect that 4.9.144 (and later) is
affected as well.
commit 0cfe75c5b01199 ("f2fs: enhance sanity_check_raw_super() to avoid
potential overflows") treats the "block_count" from struct
f2fs_super_block as a 32-bit little-endian value instead of a 64-bit
little-endian value.
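A rough userspace sketch of the failure mode (the byte values are
invented for illustration): decoding the same eight little-endian bytes
with a 32-bit accessor silently drops the upper half of the block
count, so the sanity check then operates on a bogus value and the
mount is rejected.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* a 64-bit block_count as stored on disk, little-endian */
	uint8_t raw[8] = { 0x00, 0x10, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00 };
	uint64_t v64 = 0;
	uint32_t v32 = 0;
	int i;

	for (i = 7; i >= 0; i--)	/* what a le64 read yields */
		v64 = (v64 << 8) | raw[i];
	for (i = 3; i >= 0; i--)	/* what a le32 read yields */
		v32 = (v32 << 8) | raw[i];

	printf("64-bit read: 0x%llx\n", (unsigned long long)v64); /* 0x100001000 */
	printf("32-bit read: 0x%x\n", v32);			  /* 0x1000 */
	return 0;
}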
I tested this fix on top of Linux 4.14.89, but it seems that all stable
and mainline kernel versions are affected:
- 4.9.144 and later because 0cfe75c5b01199 was backported there
- 4.14.86 and later because 0cfe75c5b01199 was backported there
- 4.19
- 4.20-rcX
changes since v1 at [0]:
- change the printk format for block_count from "%u" to "%llu" (thanks
to "kbuild test robot" for spotting this)
- added Chao Yu's reviewed by
[0] https://lore.kernel.org/patchwork/cover/1027285/
Martin Blumenstingl (1):
f2fs: fix validation of the block count in sanity_check_raw_super
fs/f2fs/super.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--
2.20.1
All fields in the PE are big-endian. Use cpu_to_be32() like everywhere
else something is written to the PE. Otherwise a wrong TID will be used
by the NPU. If this TID happens to point to an existing thread sharing
the same mm, it could be woken up by mistake. This is highly improbable
though. The likely outcome of this is the NPU not finding the target
thread and forcing the AFU into sending an interrupt, which userspace
is supposed to handle anyway.
Fixes: e948e06fc63a ("ocxl: Expose the thread_id needed for wait on POWER9")
Cc: stable(a)vger.kernel.org # v4.18
Signed-off-by: Greg Kurz <groug(a)kaod.org>
---
This bug remained unnoticed so far because the current OCXL test suite
happens to call OCXL_IOCTL_ENABLE_P9_WAIT before attaching a context.
This causes ocxl_link_update_pe() to be called before ocxl_link_add_pe()
which re-writes the TID in the PE with the appropriate endianness.
I have some patches that change the behavior of the OCXL test suite so that
it can catch the issue:
https://github.com/gkurz/libocxl/commits/wake-host-thread-rework
---
drivers/misc/ocxl/link.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 31695a078485..646d16450066 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -566,7 +566,7 @@ int ocxl_link_update_pe(void *link_handle, int pasid, __u16 tid)
mutex_lock(&spa->spa_lock);
- pe->tid = tid;
+ pe->tid = cpu_to_be32(tid);
/*
* The barrier makes sure the PE is updated
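A quick userspace illustration of the hazard (htonl()/ntohl() stand in
for cpu_to_be32()/be32_to_cpu(), and the TID value is made up): a raw
host-endian store into a big-endian field hands the NPU a byte-swapped
TID on little-endian hosts, while big-endian hosts happen to get away
with it.

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t tid = 0x1234;
	uint32_t be_field;

	be_field = tid;		/* the bug: raw store, wrong on LE hosts */
	printf("NPU sees: 0x%08x\n", ntohl(be_field));	/* 0x34120000 on LE */

	be_field = htonl(tid);	/* the fix, analogous to cpu_to_be32() */
	printf("NPU sees: 0x%08x\n", ntohl(be_field));	/* 0x00001234 */
	return 0;
}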
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 00ee8b60102862f4daf0814d12a2ea2744fc0b9b Mon Sep 17 00:00:00 2001
From: Richard Weinberger <richard(a)nod.at>
Date: Mon, 11 Jun 2018 23:41:09 +0200
Subject: [PATCH] ubifs: Fix directory size calculation for symlinks
We have to account for the name of the symlink and not the target length.
Fixes: ca7f85be8d6c ("ubifs: Add support for encrypted symlinks")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Richard Weinberger <richard(a)nod.at>
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 9da224d4f2da..e8616040bffc 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1123,8 +1123,7 @@ static int ubifs_symlink(struct inode *dir, struct dentry *dentry,
struct ubifs_inode *ui;
struct ubifs_inode *dir_ui = ubifs_inode(dir);
struct ubifs_info *c = dir->i_sb->s_fs_info;
- int err, len = strlen(symname);
- int sz_change = CALC_DENT_SIZE(len);
+ int err, sz_change, len = strlen(symname);
struct fscrypt_str disk_link;
struct ubifs_budget_req req = { .new_ino = 1, .new_dent = 1,
.new_ino_d = ALIGN(len, 8),
@@ -1151,6 +1150,8 @@ static int ubifs_symlink(struct inode *dir, struct dentry *dentry,
if (err)
goto out_budg;
+ sz_change = CALC_DENT_SIZE(fname_len(&nm));
+
inode = ubifs_new_inode(c, dir, S_IFLNK | S_IRWXUGO);
if (IS_ERR(inode)) {
err = PTR_ERR(inode);
From: Dave Chinner <dchinner(a)redhat.com>
This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3.
The reverted commit added page reference counting to iomap page
structures that are used to track block size < page size state. This
was supposed to align the code with page migration page accounting
assumptions, but what it has done instead is break XFS filesystems.
Every fstests run I've done on sub-page block size XFS filesystems
has since picking up this commit 2 days ago has failed with bad page
state errors such as:
# ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
....
SECTION -- xfs
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+
MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
generic/038 454s ...
run fstests generic/038 at 2018-12-20 18:43:05
XFS (sdc): Unmounting Filesystem
XFS (sdc): Mounting V5 Filesystem
XFS (sdc): Ending clean mount
BUG: Bad page state in process kswapd0 pfn:3a7fa
page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
flags: 0xfffffc0000000()
raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
raw: 0000000000000001 0000000000000000 00000000ffffffff
page dumped because: non-NULL mapping
CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
Call Trace:
dump_stack+0x67/0x90
bad_page.cold.116+0x8a/0xbd
free_pcppages_bulk+0x4bf/0x6a0
free_unref_page_list+0x10f/0x1f0
shrink_page_list+0x49d/0xf50
shrink_inactive_list+0x19d/0x3b0
shrink_node_memcg.constprop.77+0x398/0x690
? shrink_slab.constprop.81+0x278/0x3f0
shrink_node+0x7a/0x2f0
kswapd+0x34b/0x6d0
? node_reclaim+0x240/0x240
kthread+0x11f/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x24/0x30
Disabling lock debugging due to kernel taint
....
The failures come from anywhere that frees pages and empties the
per-cpu page magazines, so it's not a predictable failure or an easy
one to debug.
generic/038 is a reliable reproducer of this problem - it has a 9 in
10 failure rate on one of my test machines. Failures on other
machines have been at random points in fstests runs, but every run
has ended up tripping this problem. Hence generic/038 was used to
bisect the failure because it was the most reliable failure.
It is too close to the 4.20 release (not to mention holidays) to
try to diagnose, fix and test the underlying cause of the problem,
so reverting the commit is the only option we have right now. The
revert has been tested against a current tot 4.20-rc7+ kernel across
multiple machines running sub-page block size XFS filesystems and
none of the bad page state failures have been seen.
Signed-off-by: Dave Chinner <dchinner(a)redhat.com>
---
fs/iomap.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/fs/iomap.c b/fs/iomap.c
index 5bc172f3dfe8..d6bc98ae8d35 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -116,12 +116,6 @@ iomap_page_create(struct inode *inode, struct page *page)
atomic_set(&iop->read_count, 0);
atomic_set(&iop->write_count, 0);
bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
-
- /*
- * migrate_page_move_mapping() assumes that pages with private data have
- * their count elevated by 1.
- */
- get_page(page);
set_page_private(page, (unsigned long)iop);
SetPagePrivate(page);
return iop;
@@ -138,7 +132,6 @@ iomap_page_release(struct page *page)
WARN_ON_ONCE(atomic_read(&iop->write_count));
ClearPagePrivate(page);
set_page_private(page, 0);
- put_page(page);
kfree(iop);
}
--
2.19.1
From: Peter Xu <peterx(a)redhat.com>
Subject: mm: thp: fix flags for pmd migration when split
When splitting a huge migrating PMD, we'll transfer all the existing PMD
bits and apply them again onto the small PTEs. However we are fetching
the bits unconditionally via pmd_soft_dirty(), pmd_write() or pmd_yound()
while actually they don't make sense at all when it's a migration entry.
Fix them up. Since at it, drop the ifdef together as not needed.
Note that if my understanding is correct about the problem then if without
the patch there is chance to lose some of the dirty bits in the migrating
pmd pages (on x86_64 we're fetching bit 11 which is part of swap offset
instead of bit 2) and it could potentially corrupt the memory of an
userspace program which depends on the dirty bit.
Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Reviewed-by: William Kucharski <william.kucharski(a)oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Dave Jiang <dave.jiang(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Souptick Joarder <jrdr.linux(a)gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov(a)yandex-team.ru>
Cc: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
--- a/mm/huge_memory.c~mm-thp-fix-flags-for-pmd-migration-when-split
+++ a/mm/huge_memory.c
@@ -2144,23 +2144,25 @@ static void __split_huge_pmd_locked(stru
*/
old_pmd = pmdp_invalidate(vma, haddr, pmd);
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
pmd_migration = is_pmd_migration_entry(old_pmd);
- if (pmd_migration) {
+ if (unlikely(pmd_migration)) {
swp_entry_t entry;
entry = pmd_to_swp_entry(old_pmd);
page = pfn_to_page(swp_offset(entry));
- } else
-#endif
+ write = is_write_migration_entry(entry);
+ young = false;
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ } else {
page = pmd_page(old_pmd);
+ if (pmd_dirty(old_pmd))
+ SetPageDirty(page);
+ write = pmd_write(old_pmd);
+ young = pmd_young(old_pmd);
+ soft_dirty = pmd_soft_dirty(old_pmd);
+ }
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
- if (pmd_dirty(old_pmd))
- SetPageDirty(page);
- write = pmd_write(old_pmd);
- young = pmd_young(old_pmd);
- soft_dirty = pmd_soft_dirty(old_pmd);
/*
* Withdraw the table only after we mark the pmd entry invalid.
_
From: Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
Subject: mm, memory_hotplug: initialize struct pages for the full memory section
If memory end is not aligned with the sparse memory section boundary, the
mapping of such a section is only partly initialized. This may lead to
VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered by
memory_hotplug sysfs handlers:
Here are the panic examples:
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=y
kernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
([<0000000000385b26>] test_pages_in_a_zone+0xde/0x160)
[<00000000008f15c4>] show_valid_zones+0x5c/0x190
[<00000000008cf9c4>] dev_attr_show+0x34/0x70
[<0000000000463ad0>] sysfs_kf_seq_show+0xc8/0x148
[<00000000003e4194>] seq_read+0x204/0x480
[<00000000003b53ea>] __vfs_read+0x32/0x178
[<00000000003b55b2>] vfs_read+0x82/0x138
[<00000000003b5be2>] ksys_read+0x5a/0xb0
[<0000000000b86ba0>] system_call+0xdc/0x2d8
Last Breaking-Event-Address:
[<0000000000385b26>] test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oops
kernel parameter mem=3075M
--------------------------
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
([<000000000038596c>] is_mem_section_removable+0xb4/0x190)
[<00000000008f12fa>] show_mem_removable+0x9a/0xd8
[<00000000008cf9c4>] dev_attr_show+0x34/0x70
[<0000000000463ad0>] sysfs_kf_seq_show+0xc8/0x148
[<00000000003e4194>] seq_read+0x204/0x480
[<00000000003b53ea>] __vfs_read+0x32/0x178
[<00000000003b55b2>] vfs_read+0x82/0x138
[<00000000003b5be2>] ksys_read+0x5a/0xb0
[<0000000000b86ba0>] system_call+0xdc/0x2d8
Last Breaking-Event-Address:
[<000000000038596c>] is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oops
Fix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.
Michal said:
: This has always been a problem AFAIU. It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided a
: garbage.
:
: So I guess we do care for post f7f99100d8d9 kernels mostly and
: therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
: allocation in vmemmap")
Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko <zaslonko(a)linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer(a)de.ibm.com>
Suggested-by: Michal Hocko <mhocko(a)kernel.org>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Alexander Duyck <alexander.h.duyck(a)linux.intel.com>
Cc: Pasha Tatashin <Pavel.Tatashin(a)microsoft.com>
Cc: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
Cc: Heiko Carstens <heiko.carstens(a)de.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
--- a/mm/page_alloc.c~mm-memory_hotplug-initialize-struct-pages-for-the-full-memory-section
+++ a/mm/page_alloc.c
@@ -5542,6 +5542,18 @@ void __meminit memmap_init_zone(unsigned
cond_resched();
}
}
+#ifdef CONFIG_SPARSEMEM
+ /*
+ * If the zone does not span the rest of the section then
+ * we should at least initialize those pages. Otherwise we
+ * could blow up on a poisoned page in some paths which depend
+ * on full sections being initialized (e.g. memory hotplug).
+ */
+ while (end_pfn % PAGES_PER_SECTION) {
+ __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
+ end_pfn++;
+ }
+#endif
}
#ifdef CONFIG_ZONE_DEVICE
_
While looking at BUGs associated with invalid huge page map counts,
it was discovered that a huge pte pointer could become
'invalid' and point to another task's page table. Consider the
following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation,
it unmaps everyone who has the file mapped. If the range being
truncated is covered by a shared pmd, huge_pmd_unshare will be called.
For all but the last user of the shared pmd, huge_pmd_unshare will
clear the pud pointing to the pmd. If the task in the middle of the
page fault is not the last user, the ptep returned by huge_pte_alloc
now points to another task's page table or worse. This leads to bad
things such as incorrect page map/reference counts or invalid memory
references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
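Schematically, the resulting fault-path ordering looks as follows (a
sketch distilled from the hugetlb_fault() hunk in the diff below, not
literal kernel code):

	i_mmap_lock_read(mapping);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	/* ... handle the fault; ptep stays valid because
	 * huge_pmd_unshare() needs i_mmap_rwsem in write mode ... */
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	i_mmap_unlock_read(mapping);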
Cc: <stable(a)vger.kernel.org>
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
---
mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++++-----------
mm/memory-failure.c | 14 ++++++++-
mm/migrate.c | 13 ++++++++-
mm/rmap.c | 3 ++
mm/userfaultfd.c | 11 +++++--
5 files changed, 91 insertions(+), 20 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 309fb8c969af..ab4c77b8c72c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3239,6 +3239,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
int cow;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
+ struct address_space *mapping = vma->vm_file->f_mapping;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
int ret = 0;
@@ -3252,11 +3253,23 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
+ /*
+ * i_mmap_rwsem must be held to call huge_pte_alloc.
+ * Continue to hold until finished with dst_pte, otherwise
+ * it could go away if part of a shared pmd.
+ *
+ * Technically, i_mmap_rwsem is only needed in the non-cow
+ * case as cow mappings are not shared.
+ */
+ i_mmap_lock_read(mapping);
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
+ i_mmap_unlock_read(mapping);
ret = -ENOMEM;
break;
}
@@ -3271,8 +3284,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* after taking the lock below.
*/
dst_entry = huge_ptep_get(dst_pte);
- if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
+ if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
+ i_mmap_unlock_read(mapping);
continue;
+ }
dst_ptl = huge_pte_lock(h, dst, dst_pte);
src_ptl = huge_pte_lockptr(h, src, src_pte);
@@ -3321,6 +3336,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+
+ i_mmap_unlock_read(mapping);
}
if (cow)
@@ -3772,14 +3789,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3927,6 +3948,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3934,20 +3960,31 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4035,6 +4072,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4639,10 +4677,12 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4659,7 +4699,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4689,7 +4728,6 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4700,7 +4738,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cd3de3550f0..b992d1295578 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+	 * have taken the lock at this higher level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 84381b55b2bd..725edaef238a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1307,8 +1307,19 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 85b7f9423352..322e656d0225 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &start, &end);
}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 458acda96f20..48368589f519 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -267,10 +267,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
--
2.17.2
This is a note to let you know that I've just added the patch titled
USB: serial: pl2303: add ids for Hewlett-Packard HP POS pole displays
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From 8d503f206c336677954160ac62f0c7d9c219cd89 Mon Sep 17 00:00:00 2001
From: Scott Chen <scott(a)labau.com.tw>
Date: Thu, 13 Dec 2018 06:01:47 -0500
Subject: USB: serial: pl2303: add ids for Hewlett-Packard HP POS pole displays
Add device ids to pl2303 for the HP POS pole displays:
LM920: 03f0:026b
TD620: 03f0:0956
LD960TA: 03f0:4439
LD220TA: 03f0:4349
LM940: 03f0:5039
Signed-off-by: Scott Chen <scott(a)labau.com.tw>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/serial/pl2303.c | 5 +++++
drivers/usb/serial/pl2303.h | 5 +++++
2 files changed, 10 insertions(+)
diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
index a4e0d13fc121..98e7a5df0f6d 100644
--- a/drivers/usb/serial/pl2303.c
+++ b/drivers/usb/serial/pl2303.c
@@ -91,9 +91,14 @@ static const struct usb_device_id id_table[] = {
{ USB_DEVICE(YCCABLE_VENDOR_ID, YCCABLE_PRODUCT_ID) },
{ USB_DEVICE(SUPERIAL_VENDOR_ID, SUPERIAL_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LD220_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LD220TA_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LD960_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LD960TA_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LCM220_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LCM960_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LM920_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LM940_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_TD620_PRODUCT_ID) },
{ USB_DEVICE(CRESSI_VENDOR_ID, CRESSI_EDY_PRODUCT_ID) },
{ USB_DEVICE(ZEAGLE_VENDOR_ID, ZEAGLE_N2ITION3_PRODUCT_ID) },
{ USB_DEVICE(SONY_VENDOR_ID, SONY_QN3USB_PRODUCT_ID) },
diff --git a/drivers/usb/serial/pl2303.h b/drivers/usb/serial/pl2303.h
index 26965cc23c17..4e2554d55362 100644
--- a/drivers/usb/serial/pl2303.h
+++ b/drivers/usb/serial/pl2303.h
@@ -119,10 +119,15 @@
/* Hewlett-Packard POS Pole Displays */
#define HP_VENDOR_ID 0x03f0
+#define HP_LM920_PRODUCT_ID 0x026b
+#define HP_TD620_PRODUCT_ID 0x0956
#define HP_LD960_PRODUCT_ID 0x0b39
#define HP_LCM220_PRODUCT_ID 0x3139
#define HP_LCM960_PRODUCT_ID 0x3239
#define HP_LD220_PRODUCT_ID 0x3524
+#define HP_LD220TA_PRODUCT_ID 0x4349
+#define HP_LD960TA_PRODUCT_ID 0x4439
+#define HP_LM940_PRODUCT_ID 0x5039
/* Cressi Edy (diving computer) PC interface */
#define CRESSI_VENDOR_ID 0x04b8
--
2.20.1
> From: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> Date: Thu, Dec 20, 2018, 10:39 AM
> Subject: [PATCH 4.14 00/72] 4.14.90-stable review
> To: <linux-kernel(a)vger.kernel.org>
> Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>,
> <torvalds(a)linux-foundation.org>, <akpm(a)linux-foundation.org>,
> <linux(a)roeck-us.net>, <shuah(a)kernel.org>, <patches(a)kernelci.org>,
> <ben.hutchings(a)codethink.co.uk>, <lkft-triage(a)lists.linaro.org>,
> <stable(a)vger.kernel.org>
>
>
> This is the start of the stable review cycle for the 4.14.90 release.
> There are 72 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Sat Dec 22 08:59:06 UTC 2018.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.90-rc…
> or in the git tree and branch at:
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> linux-4.14.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
Merged, basic functional tests, no regression found.
Thanks,
--
Jack Wang
Linux Kernel Developer
1&1 IONOS Cloud GmbH | Greifswalder Str. 207 | 10405 Berlin | Germany
Phone: +49 30 57700-8042 | Fax: +49 30 57700-8598
E-mail: jinpu.wang(a)cloud.ionos.com | Web: www.ionos.de
Head Office: Berlin, Germany
District Court Berlin Charlottenburg, Registration number: HRB 125506 B
Executive Management: Christoph Steffens, Matthias Steinberg, Achim Weiss
Member of United Internet
This is the start of the stable review cycle for the 4.9.147 release.
There are 61 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Dec 22 08:58:31 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.147-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.9.147-rc1
Trent Piepho <tpiepho(a)impinj.com>
rtc: snvs: Add timeouts to avoid kernel lockups
Guy Shapiro <guy.shapiro(a)mobi-wize.com>
rtc: snvs: add a missing write sync
Israel Rukshin <israelr(a)mellanox.com>
nvmet-rdma: fix response use after free
Hans de Goede <hdegoede(a)redhat.com>
i2c: scmi: Fix probe error on devices with an empty SMB0001 ACPI device node
Adamski, Krzysztof (Nokia - PL/Wroclaw) <krzysztof.adamski(a)nokia.com>
i2c: axxia: properly handle master timeout
Stefan Hajnoczi <stefanha(a)redhat.com>
vhost/vsock: fix reset orphans race with close timeout
Steve French <stfrench(a)microsoft.com>
cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
Sam Bobroff <sbobroff(a)linux.ibm.com>
drm/ast: Fix connector leak during driver unload
Nicolas Saenz Julienne <nsaenzjulienne(a)suse.de>
ethernet: fman: fix wrong of_node_put() in probe function
Vladimir Murzin <vladimir.murzin(a)arm.com>
ARM: 8815/1: V7M: align v7m_dma_inv_range() with v7 counterpart
Chris Cole <chris(a)sageembedded.com>
ARM: 8814/1: mm: improve/fix ARM v7_dma_inv_range() unaligned address handling
Alexei Starovoitov <ast(a)kernel.org>
bpf: check pending signals while verifying programs
Saeed Mahameed <saeedm(a)mellanox.com>
net/mlx4_en: Fix build break when CONFIG_INET is off
Anderson Luiz Alves <alacn1(a)gmail.com>
mv88e6060: disable hardware level MAC learning
Juha-Matti Tilli <juha-matti.tilli(a)iki.fi>
libata: whitelist all SAMSUNG MZ7KM* solid-state disks
Tony Lindgren <tony(a)atomide.com>
Input: omap-keypad - fix keyboard debounce configuration
Dan Carpenter <dan.carpenter(a)oracle.com>
clk: mmp: Off by one in mmp_clk_add()
Dan Carpenter <dan.carpenter(a)oracle.com>
clk: mvebu: Off by one bugs in cp110_of_clk_get()
Yangtao Li <tiny.windzz(a)gmail.com>
ide: pmac: add of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
drivers/tty: add missing of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
drivers/sbus/char: add of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
sbus: char: add of_node_put()
Trond Myklebust <trond.myklebust(a)hammerspace.com>
SUNRPC: Fix a potential race in xprt_connect()
Dave Kleikamp <dave.kleikamp(a)oracle.com>
nfs: don't dirty kernel pages read by direct-io
Toni Peltonen <peltzi(a)peltzi.fi>
bonding: fix 802.3ad state sent to partner when unbinding slave
Jose Abreu <joabreu(a)synopsys.com>
ARC: io.h: Implement reads{x}()/writes{x}()
Sean Paul <seanpaul(a)chromium.org>
drm/msm: Grab a vblank reference when waiting for commit_done
YiFei Zhu <zhuyifei1999(a)gmail.com>
x86/earlyprintk/efi: Fix infinite loop on some screen widths
Cathy Avery <cavery(a)redhat.com>
scsi: vmw_pscsi: Rearrange code to avoid multiple calls to free_irq during unload
Fred Herard <fred.herard(a)oracle.com>
scsi: libiscsi: Fix NULL pointer dereference in iscsi_eh_session_reset
Alexey Khoroshilov <khoroshilov(a)ispras.ru>
mac80211_hwsim: fix module init error paths for netlink
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
locking/qspinlock: Fix build for anonymous union in older GCC compilers
Peter Zijlstra <peterz(a)infradead.org>
locking/qspinlock, x86: Provide liveness guarantee
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock/x86: Increase _Q_PENDING_LOOPS upper bound
Peter Zijlstra <peterz(a)infradead.org>
locking/qspinlock: Re-order code
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock: Kill cmpxchg() loop when claiming lock from head of queue
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock: Remove duplicate clear_pending() function from PV code
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock: Bound spinning on pending->locked transition in slowpath
Will Deacon <will.deacon(a)arm.com>
locking/qspinlock: Ensure node is initialised before updating prev->next
Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
locking: Remove smp_read_barrier_depends() from queued_spin_lock_slowpath()
Michael J. Ruhl <michael.j.ruhl(a)intel.com>
IB/hfi1: Remove race conditions in user_sdma send path
Ilan Peer <ilan.peer(a)intel.com>
mac80211: Fix condition validating WMM IE
Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
mac80211: don't WARN on bad WMM parameters from buggy APs
Chris Wilson <chris(a)chris-wilson.co.uk>
drm/i915/execlists: Apply a full mb before execution for Braswell
Brian Norris <briannorris(a)chromium.org>
Revert "drm/rockchip: Allow driver to be shutdown on reboot/kexec"
Radu Rendec <radu.rendec(a)gmail.com>
powerpc/msi: Fix NULL pointer access in teardown code
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix memory leak of instance function hash filters
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix memory leak in set_trigger_filter()
Lubomir Rintel <lkundrak(a)v3.sk>
ARM: mmp/mmp2: fix cpu_is_mmp2() on mmp2-dt
Aaro Koskinen <aaro.koskinen(a)iki.fi>
MMC: OMAP: fix broken MMC on OMAP15XX/OMAP5910/OMAP310
Jeff Moyer <jmoyer(a)redhat.com>
aio: fix spectre gadget in lookup_ioctx
Chen-Yu Tsai <wens(a)csie.org>
pinctrl: sunxi: a83t: Fix IRQ offset typo for PH11
Ingo Molnar <mingo(a)kernel.org>
timer/debug: Change /proc/timer_list from 0444 to 0400
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: allow users to limit scope of endpoint
Davidlohr Bueso <dave(a)stgolabs.net>
lib/rbtree-test: lower default params
Davidlohr Bueso <dave(a)stgolabs.net>
lib/rbtree_test.c: make input module parameters
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: allow full tree search
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: make test options module parameters
Will Deacon <will.deacon(a)arm.com>
signal: Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstack
-------------
Diffstat:
Makefile | 4 +-
arch/arc/include/asm/io.h | 72 ++++++++++
arch/arm/mach-mmp/cputype.h | 6 +-
arch/arm/mm/cache-v7.S | 8 +-
arch/arm/mm/cache-v7m.S | 14 +-
arch/powerpc/kernel/msi.c | 7 +-
arch/x86/include/asm/qspinlock.h | 25 +++-
arch/x86/include/asm/qspinlock_paravirt.h | 3 +-
arch/x86/platform/efi/early_printk.c | 2 +-
drivers/ata/libata-core.c | 1 +
drivers/clk/mmp/clk.c | 2 +-
drivers/clk/mvebu/cp110-system-controller.c | 4 +-
drivers/gpu/drm/ast/ast_fb.c | 1 +
drivers/gpu/drm/i915/intel_lrc.c | 7 +-
drivers/gpu/drm/msm/msm_atomic.c | 5 +
drivers/gpu/drm/rockchip/rockchip_drm_drv.c | 6 -
drivers/i2c/busses/i2c-axxia.c | 40 ++++--
drivers/i2c/busses/i2c-scmi.c | 10 +-
drivers/ide/pmac.c | 1 +
drivers/infiniband/hw/hfi1/user_sdma.c | 28 ++--
drivers/infiniband/hw/hfi1/user_sdma.h | 7 +-
drivers/input/keyboard/omap4-keypad.c | 18 ++-
drivers/mmc/host/omap.c | 11 +-
drivers/net/bonding/bond_3ad.c | 3 +
drivers/net/dsa/mv88e6060.c | 10 +-
drivers/net/ethernet/freescale/fman/fman.c | 5 +-
drivers/net/ethernet/mellanox/mlx4/Kconfig | 2 +-
drivers/net/wireless/mac80211_hwsim.c | 12 +-
drivers/nvme/target/rdma.c | 3 +-
drivers/pinctrl/sunxi/pinctrl-sun8i-a83t.c | 2 +-
drivers/rtc/rtc-snvs.c | 104 ++++++++++-----
drivers/sbus/char/display7seg.c | 1 +
drivers/sbus/char/envctrl.c | 2 +
drivers/scsi/libiscsi.c | 4 +-
drivers/scsi/vmw_pvscsi.c | 4 +-
drivers/tty/serial/suncore.c | 1 +
drivers/vhost/vsock.c | 22 +++-
fs/aio.c | 2 +
fs/cifs/Kconfig | 2 +-
fs/nfs/direct.c | 9 +-
include/asm-generic/qspinlock_types.h | 32 ++++-
include/linux/compat.h | 3 +
kernel/bpf/verifier.c | 3 +
kernel/locking/qspinlock.c | 195 ++++++++++++++--------------
kernel/locking/qspinlock_paravirt.h | 42 ++----
kernel/signal.c | 17 ++-
kernel/time/timer_list.c | 2 +-
kernel/trace/ftrace.c | 1 +
kernel/trace/trace_events_trigger.c | 6 +-
lib/interval_tree_test.c | 93 ++++++++-----
lib/rbtree_test.c | 55 +++++---
net/mac80211/mlme.c | 3 +-
net/sunrpc/xprt.c | 11 +-
53 files changed, 604 insertions(+), 329 deletions(-)
The current implementation of elan_i2c is known to not support those
2 laptops.
A proper fix is to tweak both elantech and elan_i2c to transmit the
correct information from PS/2, but such a change would be a bad
candidate for stable.
So to give us some time for fixing the root of the problem, disable
elan_i2c for the devices we know are not behaving properly.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803600
Link: https://bugs.archlinux.org/task/59714
Fixes: df077237cf55 ("Input: elantech - detect new ICs and setup Host Notify for them")
Cc: stable(a)vger.kernel.org # v4.18+
Signed-off-by: Benjamin Tissoires <benjamin.tissoires(a)redhat.com>
---
drivers/input/mouse/elantech.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/input/mouse/elantech.c b/drivers/input/mouse/elantech.c
index 2d95e8d93cc7..830ae9f07045 100644
--- a/drivers/input/mouse/elantech.c
+++ b/drivers/input/mouse/elantech.c
@@ -1767,6 +1767,18 @@ static int elantech_smbus = IS_ENABLED(CONFIG_MOUSE_ELAN_I2C_SMBUS) ?
module_param_named(elantech_smbus, elantech_smbus, int, 0644);
MODULE_PARM_DESC(elantech_smbus, "Use a secondary bus for the Elantech device.");
+static const char * const i2c_blacklist_pnp_ids[] = {
+ /*
+ * these are known to not be working properly as bits are missing
+ * in elan_i2c
+ */
+ "LEN2131", /* ThinkPad P52 w/ NFC */
+ "LEN2132", /* ThinkPad P52 */
+ "LEN2133", /* ThinkPad P72 w/ NFC */
+ "LEN2134", /* ThinkPad P72 */
+ NULL
+};
+
static int elantech_create_smbus(struct psmouse *psmouse,
struct elantech_device_info *info,
bool leave_breadcrumbs)
@@ -1802,10 +1814,12 @@ static int elantech_setup_smbus(struct psmouse *psmouse,
if (elantech_smbus == ELANTECH_SMBUS_NOT_SET) {
/*
- * New ICs are enabled by default.
+ * New ICs are enabled by default, unless mentioned in
+ * i2c_blacklist_pnp_ids.
* Old ICs are up to the user to decide.
*/
- if (!ETP_NEW_IC_SMBUS_HOST_NOTIFY(info->fw_version))
+ if (!ETP_NEW_IC_SMBUS_HOST_NOTIFY(info->fw_version) ||
+ psmouse_matches_pnp_id(psmouse, i2c_blacklist_pnp_ids))
return -ENXIO;
}
--
2.19.2
This is a note to let you know that I've just added the patch titled
USB: serial: pl2303: add ids for Hewlett-Packard HP POS pole displays
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the usb-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
From 8d503f206c336677954160ac62f0c7d9c219cd89 Mon Sep 17 00:00:00 2001
From: Scott Chen <scott(a)labau.com.tw>
Date: Thu, 13 Dec 2018 06:01:47 -0500
Subject: USB: serial: pl2303: add ids for Hewlett-Packard HP POS pole displays
Add device ids to pl2303 for the HP POS pole displays:
LM920: 03f0:026b
TD620: 03f0:0956
LD960TA: 03f0:4439
LD220TA: 03f0:4349
LM940: 03f0:5039
Signed-off-by: Scott Chen <scott(a)labau.com.tw>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/serial/pl2303.c | 5 +++++
drivers/usb/serial/pl2303.h | 5 +++++
2 files changed, 10 insertions(+)
diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
index a4e0d13fc121..98e7a5df0f6d 100644
--- a/drivers/usb/serial/pl2303.c
+++ b/drivers/usb/serial/pl2303.c
@@ -91,9 +91,14 @@ static const struct usb_device_id id_table[] = {
{ USB_DEVICE(YCCABLE_VENDOR_ID, YCCABLE_PRODUCT_ID) },
{ USB_DEVICE(SUPERIAL_VENDOR_ID, SUPERIAL_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LD220_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LD220TA_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LD960_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LD960TA_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LCM220_PRODUCT_ID) },
{ USB_DEVICE(HP_VENDOR_ID, HP_LCM960_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LM920_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_LM940_PRODUCT_ID) },
+ { USB_DEVICE(HP_VENDOR_ID, HP_TD620_PRODUCT_ID) },
{ USB_DEVICE(CRESSI_VENDOR_ID, CRESSI_EDY_PRODUCT_ID) },
{ USB_DEVICE(ZEAGLE_VENDOR_ID, ZEAGLE_N2ITION3_PRODUCT_ID) },
{ USB_DEVICE(SONY_VENDOR_ID, SONY_QN3USB_PRODUCT_ID) },
diff --git a/drivers/usb/serial/pl2303.h b/drivers/usb/serial/pl2303.h
index 26965cc23c17..4e2554d55362 100644
--- a/drivers/usb/serial/pl2303.h
+++ b/drivers/usb/serial/pl2303.h
@@ -119,10 +119,15 @@
/* Hewlett-Packard POS Pole Displays */
#define HP_VENDOR_ID 0x03f0
+#define HP_LM920_PRODUCT_ID 0x026b
+#define HP_TD620_PRODUCT_ID 0x0956
#define HP_LD960_PRODUCT_ID 0x0b39
#define HP_LCM220_PRODUCT_ID 0x3139
#define HP_LCM960_PRODUCT_ID 0x3239
#define HP_LD220_PRODUCT_ID 0x3524
+#define HP_LD220TA_PRODUCT_ID 0x4349
+#define HP_LD960TA_PRODUCT_ID 0x4439
+#define HP_LM940_PRODUCT_ID 0x5039
/* Cressi Edy (diving computer) PC interface */
#define CRESSI_VENDOR_ID 0x04b8
--
2.20.1
This is a note to let you know that I've just added the patch titled
staging: wilc1000: fix missing read_write setting when reading data
to my staging git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
in the staging-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
From c58eef061dda7d843dcc0ad6fea7e597d4c377c0 Mon Sep 17 00:00:00 2001
From: Colin Ian King <colin.king(a)canonical.com>
Date: Wed, 19 Dec 2018 16:30:07 +0000
Subject: staging: wilc1000: fix missing read_write setting when reading data
Currently the cmd.read_write setting is not initialized so it contains
garbage from the stack. Fix this by setting it to 0 to indicate a
read is required.
Detected by CoverityScan, CID#1357925 ("Uninitialized scalar variable")
Fixes: c5c77ba18ea6 ("staging: wilc1000: Add SDIO/SPI 802.11 driver")
Signed-off-by: Colin Ian King <colin.king(a)canonical.com>
Cc: stable <stable(a)vger.kernel.org>
Acked-by: Ajay Singh <ajay.kathat(a)microchip.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/wilc1000/wilc_sdio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/staging/wilc1000/wilc_sdio.c b/drivers/staging/wilc1000/wilc_sdio.c
index 27fdfbdda5c0..e2f739fef21c 100644
--- a/drivers/staging/wilc1000/wilc_sdio.c
+++ b/drivers/staging/wilc1000/wilc_sdio.c
@@ -861,6 +861,7 @@ static int sdio_read_int(struct wilc *wilc, u32 *int_status)
if (!sdio_priv->irq_gpio) {
int i;
+ cmd.read_write = 0;
cmd.function = 1;
cmd.address = 0x04;
cmd.data = 0;
--
2.20.1
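As an aside on the fix above: an equally valid, more defensive variant
would zero-initialize the whole command at declaration, so every field,
including read_write, starts in a known state. A minimal sketch,
assuming the local variable is wilc's struct sdio_cmd52 as in the hunk:

	/* Hypothetical alternative to the one-line fix above. */
	struct sdio_cmd52 cmd = { 0 };	/* read_write and all other fields = 0 */

	cmd.function = 1;
	cmd.address = 0x04;

The upstream fix keeps the diff minimal instead, which is preferable for
a stable backport.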
This is the start of the stable review cycle for the 4.4.169 release.
There are 40 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Dec 22 08:58:16 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.169-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.169-rc1
Dan Carpenter <dan.carpenter(a)oracle.com>
ALSA: isa/wavefront: prevent some out of bound writes
Trent Piepho <tpiepho(a)impinj.com>
rtc: snvs: Add timeouts to avoid kernel lockups
Guy Shapiro <guy.shapiro(a)mobi-wize.com>
rtc: snvs: add a missing write sync
Hans de Goede <hdegoede(a)redhat.com>
i2c: scmi: Fix probe error on devices with an empty SMB0001 ACPI device node
Adamski, Krzysztof (Nokia - PL/Wroclaw) <krzysztof.adamski(a)nokia.com>
i2c: axxia: properly handle master timeout
Steve French <stfrench(a)microsoft.com>
cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
Chris Cole <chris(a)sageembedded.com>
ARM: 8814/1: mm: improve/fix ARM v7_dma_inv_range() unaligned address handling
Anderson Luiz Alves <alacn1(a)gmail.com>
mv88e6060: disable hardware level MAC learning
Juha-Matti Tilli <juha-matti.tilli(a)iki.fi>
libata: whitelist all SAMSUNG MZ7KM* solid-state disks
Tony Lindgren <tony(a)atomide.com>
Input: omap-keypad - fix keyboard debounce configuration
Dan Carpenter <dan.carpenter(a)oracle.com>
clk: mmp: Off by one in mmp_clk_add()
Yangtao Li <tiny.windzz(a)gmail.com>
ide: pmac: add of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
drivers/tty: add missing of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
drivers/sbus/char: add of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
sbus: char: add of_node_put()
Trond Myklebust <trond.myklebust(a)hammerspace.com>
SUNRPC: Fix a potential race in xprt_connect()
Toni Peltonen <peltzi(a)peltzi.fi>
bonding: fix 802.3ad state sent to partner when unbinding slave
Jose Abreu <joabreu(a)synopsys.com>
ARC: io.h: Implement reads{x}()/writes{x}()
Sean Paul <seanpaul(a)chromium.org>
drm/msm: Grab a vblank reference when waiting for commit_done
YiFei Zhu <zhuyifei1999(a)gmail.com>
x86/earlyprintk/efi: Fix infinite loop on some screen widths
Cathy Avery <cavery(a)redhat.com>
scsi: vmw_pscsi: Rearrange code to avoid multiple calls to free_irq during unload
Fred Herard <fred.herard(a)oracle.com>
scsi: libiscsi: Fix NULL pointer dereference in iscsi_eh_session_reset
Alexey Khoroshilov <khoroshilov(a)ispras.ru>
mac80211_hwsim: fix module init error paths for netlink
Ilan Peer <ilan.peer(a)intel.com>
mac80211: Fix condition validating WMM IE
Emmanuel Grumbach <emmanuel.grumbach(a)intel.com>
mac80211: don't WARN on bad WMM parameters from buggy APs
Yunlei He <heyunlei(a)huawei.com>
f2fs: fix a panic caused by NULL flush_cmd_control
Brian Norris <briannorris(a)chromium.org>
Revert "drm/rockchip: Allow driver to be shutdown on reboot/kexec"
Radu Rendec <radu.rendec(a)gmail.com>
powerpc/msi: Fix NULL pointer access in teardown code
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix memory leak of instance function hash filters
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix memory leak in set_trigger_filter()
Aaro Koskinen <aaro.koskinen(a)iki.fi>
MMC: OMAP: fix broken MMC on OMAP15XX/OMAP5910/OMAP310
Jeff Moyer <jmoyer(a)redhat.com>
aio: fix spectre gadget in lookup_ioctx
Chen-Yu Tsai <wens(a)csie.org>
pinctrl: sunxi: a83t: Fix IRQ offset typo for PH11
Guenter Roeck <linux(a)roeck-us.net>
powerpc/boot: Fix random libfdt related build errors
Ingo Molnar <mingo(a)kernel.org>
timer/debug: Change /proc/timer_list from 0444 to 0400
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: allow users to limit scope of endpoint
Davidlohr Bueso <dave(a)stgolabs.net>
lib/rbtree-test: lower default params
Davidlohr Bueso <dave(a)stgolabs.net>
lib/rbtree_test.c: make input module parameters
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: allow full tree search
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: make test options module parameters
-------------
Diffstat:
Makefile | 4 +-
arch/arc/include/asm/io.h | 72 +++++++++++++++++++
arch/arm/mm/cache-v7.S | 8 ++-
arch/powerpc/boot/Makefile | 3 +-
arch/powerpc/kernel/msi.c | 7 +-
arch/x86/platform/efi/early_printk.c | 2 +-
drivers/ata/libata-core.c | 1 +
drivers/clk/mmp/clk.c | 2 +-
drivers/gpu/drm/msm/msm_atomic.c | 5 ++
drivers/gpu/drm/rockchip/rockchip_drm_drv.c | 6 --
drivers/i2c/busses/i2c-axxia.c | 40 ++++++++---
drivers/i2c/busses/i2c-scmi.c | 10 ++-
drivers/ide/pmac.c | 1 +
drivers/input/keyboard/omap4-keypad.c | 18 +++--
drivers/mmc/host/omap.c | 11 ++-
drivers/net/bonding/bond_3ad.c | 3 +
drivers/net/dsa/mv88e6060.c | 10 +--
drivers/net/wireless/mac80211_hwsim.c | 12 ++--
drivers/pinctrl/sunxi/pinctrl-sun8i-a83t.c | 2 +-
drivers/rtc/rtc-snvs.c | 104 +++++++++++++++++++---------
drivers/sbus/char/display7seg.c | 1 +
drivers/sbus/char/envctrl.c | 2 +
drivers/scsi/libiscsi.c | 4 +-
drivers/scsi/vmw_pvscsi.c | 4 +-
drivers/tty/serial/suncore.c | 1 +
fs/aio.c | 2 +
fs/cifs/Kconfig | 2 +-
fs/f2fs/segment.c | 5 +-
kernel/time/timer_list.c | 2 +-
kernel/trace/ftrace.c | 1 +
kernel/trace/trace_events_trigger.c | 6 +-
lib/interval_tree_test.c | 93 ++++++++++++++++---------
lib/rbtree_test.c | 55 +++++++++------
net/mac80211/mlme.c | 3 +-
net/sunrpc/xprt.c | 11 ++-
sound/isa/wavefront/wavefront_synth.c | 9 +++
36 files changed, 376 insertions(+), 146 deletions(-)
This is the start of the stable review cycle for the 3.18.131 release.
There are 31 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat Dec 22 08:57:30 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.131-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-3.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 3.18.131-rc1
Lior David <qca_liord(a)qca.qualcomm.com>
wil6210: missing length check in wmi_set_ie
Kees Cook <keescook(a)chromium.org>
swiotlb: clean up reporting
Jens Axboe <axboe(a)kernel.dk>
sr: pass down correctly sized SCSI sense buffer
Thomas Gleixner <tglx(a)linutronix.de>
posix-timers: Sanitize overrun handling
Takashi Sakamoto <o-takashi(a)sakamocchi.jp>
ALSA: pcm: remove SNDRV_PCM_IOCTL1_INFO internal command
Dan Carpenter <dan.carpenter(a)oracle.com>
ALSA: isa/wavefront: prevent some out of bound writes
Hans de Goede <hdegoede(a)redhat.com>
i2c: scmi: Fix probe error on devices with an empty SMB0001 ACPI device node
Steve French <stfrench(a)microsoft.com>
cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
Chris Cole <chris(a)sageembedded.com>
ARM: 8814/1: mm: improve/fix ARM v7_dma_inv_range() unaligned address handling
Juha-Matti Tilli <juha-matti.tilli(a)iki.fi>
libata: whitelist all SAMSUNG MZ7KM* solid-state disks
Tony Lindgren <tony(a)atomide.com>
Input: omap-keypad - fix keyboard debounce configuration
Yangtao Li <tiny.windzz(a)gmail.com>
ide: pmac: add of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
drivers/tty: add missing of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
drivers/sbus/char: add of_node_put()
Yangtao Li <tiny.windzz(a)gmail.com>
sbus: char: add of_node_put()
Trond Myklebust <trond.myklebust(a)hammerspace.com>
SUNRPC: Fix a potential race in xprt_connect()
Toni Peltonen <peltzi(a)peltzi.fi>
bonding: fix 802.3ad state sent to partner when unbinding slave
YiFei Zhu <zhuyifei1999(a)gmail.com>
x86/earlyprintk/efi: Fix infinite loop on some screen widths
Cathy Avery <cavery(a)redhat.com>
scsi: vmw_pscsi: Rearrange code to avoid multiple calls to free_irq during unload
Fred Herard <fred.herard(a)oracle.com>
scsi: libiscsi: Fix NULL pointer dereference in iscsi_eh_session_reset
Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
powerpc: Look for "stdout-path" when setting up legacy consoles
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix memory leak of instance function hash filters
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix memory leak in set_trigger_filter()
Aaro Koskinen <aaro.koskinen(a)iki.fi>
MMC: OMAP: fix broken MMC on OMAP15XX/OMAP5910/OMAP310
Guenter Roeck <linux(a)roeck-us.net>
powerpc/boot: Fix random libfdt related build errors
Ingo Molnar <mingo(a)kernel.org>
timer/debug: Change /proc/timer_list from 0444 to 0400
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: allow users to limit scope of endpoint
Davidlohr Bueso <dave(a)stgolabs.net>
lib/rbtree-test: lower default params
Davidlohr Bueso <dave(a)stgolabs.net>
lib/rbtree_test.c: make input module parameters
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: allow full tree search
Davidlohr Bueso <dave(a)stgolabs.net>
lib/interval_tree_test.c: make test options module parameters
-------------
Diffstat:
Makefile | 4 +-
arch/arm/mm/cache-v7.S | 8 +--
arch/powerpc/boot/Makefile | 3 +-
arch/powerpc/kernel/legacy_serial.c | 6 ++-
arch/x86/platform/efi/early_printk.c | 2 +-
drivers/ata/libata-core.c | 1 +
drivers/i2c/busses/i2c-scmi.c | 10 ++--
drivers/ide/pmac.c | 1 +
drivers/input/keyboard/omap4-keypad.c | 18 +++++--
drivers/mmc/host/omap.c | 11 +++-
drivers/net/bonding/bond_3ad.c | 3 ++
drivers/net/wireless/ath/wil6210/wmi.c | 7 ++-
drivers/sbus/char/display7seg.c | 1 +
drivers/sbus/char/envctrl.c | 2 +
drivers/scsi/libiscsi.c | 4 +-
drivers/scsi/sr_ioctl.c | 21 +++-----
drivers/scsi/vmw_pvscsi.c | 4 +-
drivers/tty/serial/suncore.c | 1 +
fs/cifs/Kconfig | 2 +-
include/linux/posix-timers.h | 4 +-
include/sound/pcm.h | 2 +-
kernel/time/posix-cpu-timers.c | 2 +-
kernel/time/posix-timers.c | 29 +++++++----
kernel/time/timer_list.c | 2 +-
kernel/trace/ftrace.c | 1 +
kernel/trace/trace_events_trigger.c | 6 ++-
lib/interval_tree_test.c | 93 ++++++++++++++++++++++------------
lib/rbtree_test.c | 55 ++++++++++++--------
lib/swiotlb.c | 18 +++----
net/sunrpc/xprt.c | 11 +++-
sound/core/pcm_lib.c | 2 -
sound/core/pcm_native.c | 6 +--
sound/isa/wavefront/wavefront_synth.c | 9 ++++
33 files changed, 224 insertions(+), 125 deletions(-)
AppArmor recently added the ability for profiles to match extended
attributes, with the intent of targeting "security.ima" and
"security.evm" to differentiate between sign and unsigned files.
The current implementation uses a path glob to match the extended
attribute value. To require the presence of an extended attribute,
profiles supply a wildcard:
# Match any file with the "security.apparmor" attribute
profile test /** xattrs=(security.apparmor="**") {
# ...
}
However, the glob matching implementation is intended for file paths and
doesn't handle null characters correctly. It's currently impossible to
write a profile that targets IMA and EVM attributes, since the
signatures can contain a null byte.
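For illustration only (not part of the patch): a minimal userspace
sketch, using getxattr(2), showing why a C-string/glob matcher falls
over on these values. An IMA signature blob typically embeds a NUL
byte, so any comparison that stops at '\0' never sees the full value:

	/* Illustrative sketch; assumes the file actually carries security.ima. */
	#include <stdio.h>
	#include <string.h>
	#include <sys/xattr.h>

	int main(int argc, char **argv)
	{
		char value[1024];
		ssize_t size;

		if (argc < 2)
			return 1;
		size = getxattr(argv[1], "security.ima", value, sizeof(value));
		if (size < 0) {
			perror("getxattr");
			return 1;
		}
		/* An embedded 0x00 truncates any C-string based match. */
		printf("%zd bytes, embedded NUL %s\n", size,
		       memchr(value, '\0', size) ? "present" : "absent");
		return 0;
	}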
Add the ability for AppArmor to check for the presence of an extended
attribute without matching its value. This fixes profile matching,
allowing profiles to be conditional on EVM and IMA signatures:
profile signed_binary /** xattrs=(security.evm security.ima) {
# ...
}
A modified apparmor_parser and associated regression tests to exercise
this fix can be found at:
https://gitlab.com/ericchiang/apparmor/commits/parser-xattrs-keys
Signed-off-by: Eric Chiang <ericchiang(a)google.com>
CC: stable(a)vger.kernel.org
---
security/apparmor/apparmorfs.c | 1 +
security/apparmor/domain.c | 25 +++++++++++++++++++++----
security/apparmor/include/policy.h | 6 ++++++
security/apparmor/policy.c | 3 +++
security/apparmor/policy_unpack.c | 18 ++++++++++++++++++
5 files changed, 49 insertions(+), 4 deletions(-)
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 8963203319ea..03d9b7f8a2fb 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -2212,6 +2212,7 @@ static struct aa_sfs_entry aa_sfs_entry_signal[] = {
static struct aa_sfs_entry aa_sfs_entry_attach[] = {
AA_SFS_FILE_BOOLEAN("xattr", 1),
+ AA_SFS_FILE_BOOLEAN("xattr_key", 1),
{ }
};
static struct aa_sfs_entry aa_sfs_entry_domain[] = {
diff --git a/security/apparmor/domain.c b/security/apparmor/domain.c
index 08c88de0ffda..9f223756b416 100644
--- a/security/apparmor/domain.c
+++ b/security/apparmor/domain.c
@@ -317,16 +317,33 @@ static int aa_xattrs_match(const struct linux_binprm *bprm,
ssize_t size;
struct dentry *d;
char *value = NULL;
- int value_size = 0, ret = profile->xattr_count;
+ int value_size = 0;
+ int ret = profile->xattr_count + profile->xattr_keys_count;
- if (!bprm || !profile->xattr_count)
+ if (!bprm)
return 0;
+ d = bprm->file->f_path.dentry;
+
+ if (profile->xattr_keys_count) {
+ /* validate that these attributes are present, ignore values */
+ for (i = 0; i < profile->xattr_keys_count; i++) {
+ size = vfs_getxattr_alloc(d, profile->xattr_keys[i],
+ &value, value_size,
+ GFP_KERNEL);
+ if (size < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+ }
+
+ if (!profile->xattr_count)
+ goto out;
+
/* transition from exec match to xattr set */
state = aa_dfa_null_transition(profile->xmatch, state);
- d = bprm->file->f_path.dentry;
-
for (i = 0; i < profile->xattr_count; i++) {
size = vfs_getxattr_alloc(d, profile->xattrs[i], &value,
value_size, GFP_KERNEL);
diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index 8e6707c837be..8ed1d30de7ce 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -112,6 +112,10 @@ struct aa_data {
* @policy: general match rules governing policy
* @file: The set of rules governing basic file access and domain transitions
* @caps: capabilities for the profile
+ * @xattr_count: number of xattrs values
+ * @xattrs: extended attributes whose values must match the xmatch
+ * @xattr_keys_count: number of xattr keys values
+ * @xattr_keys: extended attributes that must be present to match the profile
* @rlimits: rlimits for the profile
*
* @dents: dentries for the profiles file entries in apparmorfs
@@ -152,6 +156,8 @@ struct aa_profile {
int xattr_count;
char **xattrs;
+ int xattr_keys_count;
+ char **xattr_keys;
struct aa_rlimit rlimits;
diff --git a/security/apparmor/policy.c b/security/apparmor/policy.c
index df9c5890a878..e0f9cf8b8318 100644
--- a/security/apparmor/policy.c
+++ b/security/apparmor/policy.c
@@ -231,6 +231,9 @@ void aa_free_profile(struct aa_profile *profile)
for (i = 0; i < profile->xattr_count; i++)
kzfree(profile->xattrs[i]);
kzfree(profile->xattrs);
+ for (i = 0; i < profile->xattr_keys_count; i++)
+ kzfree(profile->xattr_keys[i]);
+ kzfree(profile->xattr_keys);
for (i = 0; i < profile->secmark_count; i++)
kzfree(profile->secmark[i].label);
kzfree(profile->secmark);
diff --git a/security/apparmor/policy_unpack.c b/security/apparmor/policy_unpack.c
index 379682e2a8d5..d1fd75093260 100644
--- a/security/apparmor/policy_unpack.c
+++ b/security/apparmor/policy_unpack.c
@@ -535,6 +535,24 @@ static bool unpack_xattrs(struct aa_ext *e, struct aa_profile *profile)
goto fail;
}
+ if (unpack_nameX(e, AA_STRUCT, "xattr_keys")) {
+ int i, size;
+
+ size = unpack_array(e, NULL);
+ profile->xattr_keys_count = size;
+ profile->xattr_keys = kcalloc(size, sizeof(char *), GFP_KERNEL);
+ if (!profile->xattr_keys)
+ goto fail;
+ for (i = 0; i < size; i++) {
+ if (!unpack_strdup(e, &profile->xattr_keys[i], NULL))
+ goto fail;
+ }
+ if (!unpack_nameX(e, AA_ARRAYEND, NULL))
+ goto fail;
+ if (!unpack_nameX(e, AA_STRUCTEND, NULL))
+ goto fail;
+ }
+
return 1;
fail:
--
2.20.1.415.g653613c723-goog
The patch titled
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
has been added to the -mm tree. Its filename is
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/hugetlbfs-use-i_mmap_rwsem-to-fix-…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/hugetlbfs-use-i_mmap_rwsem-to-fix-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation, which could fail, so that needs to be taken
into account as well.
Instead of writing the required complicated code for this rare
occurrence, just eliminate the race. i_mmap_rwsem is now held in read
mode for the duration of page fault processing. Hold i_mmap_rwsem
longer in the truncation and hole punch code to cover the call to
remove_inode_hugepages.
With this modification, the code in remove_inode_hugepages checking for
races becomes 'dead' as it can no longer happen. Remove the dead code
and expand comments to explain reasoning. Similarly, checks for races
with truncation in the page fault path can be simplified and removed.
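The resulting truncate-side lock ordering, as the fs/hugetlbfs/inode.c
hunks below implement it, is roughly the following (simplified sketch,
not the literal kernel code):

	i_mmap_lock_write(mapping);	/* blocks new page faults on this file */
	hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
	remove_inode_hugepages(inode, offset, LLONG_MAX);	/* cannot race now */
	i_mmap_unlock_write(mapping);	/* page faults may proceed again */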
Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 50 +++++++++++++----------------------------
mm/hugetlb.c | 21 ++++++++---------
2 files changed, 27 insertions(+), 44 deletions(-)
--- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/fs/hugetlbfs/inode.c
@@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cac
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserv map
* deleted. The region/reserv map for ranges without associated
- * pages are not modified. Page faults can race with hole punch.
- * This is indicated if we find a mapped page.
+ * pages are not modified.
+ *
+ * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent
+ * races with page faults.
+ *
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
*/
@@ -423,32 +422,14 @@ static void remove_inode_hugepages(struc
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
- u32 hash;
index = page->index;
- hash = hugetlb_fault_mutex_hash(h, current->mm,
- &pseudo_vma,
- mapping, index, 0);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If page is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the page.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * A mapped page is impossible as callers should unmap
+ * all references before calling. And, i_mmap_rwsem
+ * prevents the creation of additional mappings.
*/
- if (unlikely(page_mapped(page))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
- i_mmap_unlock_write(mapping);
- }
+ VM_BUG_ON(page_mapped(page));
lock_page(page);
/*
@@ -470,7 +451,6 @@ static void remove_inode_hugepages(struc
}
unlock_page(page);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -505,8 +485,8 @@ static int hugetlb_vmtruncate(struct ino
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
+ i_mmap_unlock_write(mapping);
return 0;
}
@@ -540,8 +520,8 @@ static long hugetlbfs_punch_hole(struct
hugetlb_vmdelete_list(&mapping->i_mmap,
hole_start >> PAGE_SHIFT,
hole_end >> PAGE_SHIFT);
- i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
+ i_mmap_unlock_write(mapping);
inode_unlock(inode);
}
@@ -624,7 +604,11 @@ static long hugetlbfs_fallocate(struct f
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;
- /* mutex taken here, fault path and hole punch */
+ /*
+ * fault mutex taken here, protects against fault path
+ * and hole punch. inode_lock previously taken protects
+ * against truncation.
+ */
hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race
+++ a/mm/hugetlb.c
@@ -3760,16 +3760,16 @@ static vm_fault_t hugetlb_no_page(struct
}
/*
- * Use page lock to guard against racing truncation
- * before we get page_table_lock.
+ * We can not race with truncation due to holding i_mmap_rwsem.
+ * Check once here for faults beyond end of file.
*/
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
/*
* Check for page in userfault range
*/
@@ -3859,9 +3859,6 @@ retry:
}
ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -3964,8 +3961,10 @@ vm_fault_t hugetlb_fault(struct mm_struc
/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with ptep. This serves two purposes:
+ * 1) It prevents huge_pmd_unshare from being called elsewhere
+ * and making the ptep no longer valid.
+ * 2) It synchronizes us with file truncation.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
The patch titled
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
has been added to the -mm tree. Its filename is
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/hugetlbfs-use-i_mmap_rwsem-for-mor…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/hugetlbfs-use-i_mmap_rwsem-for-mor…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
While looking at BUGs associated with invalid huge page map counts, it
was observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
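Expanding on the read-mode rule above, the fault path after this patch
takes the following shape (simplified from the hugetlb_fault hunk below,
not the literal kernel code):

	mapping = vma->vm_file->f_mapping;
	i_mmap_lock_read(mapping);	/* huge_pmd_share requires this held */
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	/* ... handle the fault; the pmd cannot be unshared underneath us ... */
	i_mmap_unlock_read(mapping);	/* only once finished with ptep */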
Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 70 ++++++++++++++++++++++++++++++++----------
mm/memory-failure.c | 14 +++++++-
mm/migrate.c | 13 +++++++
mm/rmap.c | 3 +
mm/userfaultfd.c | 11 +++++-
5 files changed, 91 insertions(+), 20 deletions(-)
--- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/hugetlb.c
@@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -3253,11 +3254,23 @@ int copy_hugetlb_page_range(struct mm_st
for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
+
+ /*
+ * i_mmap_rwsem must be held to call huge_pte_alloc.
+ * Continue to hold until finished with dst_pte, otherwise
+ * it could go away if part of a shared pmd.
+ *
+ * Technically, i_mmap_rwsem is only needed in the non-cow
+ * case as cow mappings are not shared.
+ */
+ i_mmap_lock_read(mapping);
dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte) {
+ i_mmap_unlock_read(mapping);
ret = -ENOMEM;
break;
}
@@ -3272,8 +3285,10 @@ int copy_hugetlb_page_range(struct mm_st
* after taking the lock below.
*/
dst_entry = huge_ptep_get(dst_pte);
- if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
+ if ((dst_pte == src_pte) || !huge_pte_none(dst_entry)) {
+ i_mmap_unlock_read(mapping);
continue;
+ }
dst_ptl = huge_pte_lock(h, dst, dst_pte);
src_ptl = huge_pte_lockptr(h, src, src_pte);
@@ -3322,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_st
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+
+ i_mmap_unlock_read(mapping);
}
if (cow)
@@ -3772,14 +3789,18 @@ retry:
};
/*
- * hugetlb_fault_mutex must be dropped before
- * handling userfault. Reacquire after handling
- * fault to make calling code simpler.
+ * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * dropped before handling userfault. Reacquire
+ * after handling fault to make calling code simpler.
*/
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+ i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
goto out;
}
@@ -3927,6 +3948,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
@@ -3934,20 +3960,31 @@ vm_fault_t hugetlb_fault(struct mm_struc
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}
+ /*
+ * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something changed.
+ */
mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
+ i_mmap_lock_read(mapping);
+ ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ if (!ptep) {
+ i_mmap_unlock_read(mapping);
+ return VM_FAULT_OOM;
+ }
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -4035,6 +4072,7 @@ out_ptl:
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -4640,10 +4678,12 @@ void adjust_range_if_pmd_sharing_possibl
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_rwsem section - otherwise
- * racing tasks could either miss the sharing (see huge_pte_offset) or select a
- * bad pmd for sharing.
+ * code much cleaner.
+ *
+ * This routine must be called with i_mmap_rwsem held in at least read mode.
+ * For hugetlbfs, this prevents removal of any page table entries associated
+ * with the address space. This is important as we are setting up sharing
+ * based on existing page table entries (mappings).
*/
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -4660,7 +4700,6 @@ pte_t *huge_pmd_share(struct mm_struct *
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -4690,7 +4729,6 @@ pte_t *huge_pmd_share(struct mm_struct *
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
return pte;
}
@@ -4701,7 +4739,7 @@ out:
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with page table lock held.
+ * Called with page table lock held and i_mmap_rwsem held in write mode.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
--- a/mm/memory-failure.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/memory-failure.c
@@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struc
if (kill)
collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = try_to_unmap(hpage, ttu);
+ if (!PageHuge(hpage)) {
+ unmap_success = try_to_unmap(hpage, ttu);
+ } else {
+ /*
+ * For hugetlb pages, try_to_unmap could potentially call
+ * huge_pmd_unshare. Because of this, take semaphore in
+ * write mode here and set TTU_RMAP_LOCKED to indicate we
+ * have taken the lock at this higher level.
+ */
+ i_mmap_lock_write(mapping);
+ unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
+ }
if (!unmap_success)
pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
pfn, page_mapcount(hpage));
--- a/mm/migrate.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/migrate.c
@@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_
goto put_anon;
if (page_mapped(hpage)) {
+ struct address_space *mapping = page_mapping(hpage);
+
+ /*
+ * try_to_unmap could potentially call huge_pmd_unshare.
+ * Because of this, take semaphore in write mode here and
+ * set TTU_RMAP_LOCKED to let lower levels know we have
+ * taken the lock.
+ */
+ i_mmap_lock_write(mapping);
try_to_unmap(hpage,
- TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+ TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
+ TTU_RMAP_LOCKED);
+ i_mmap_unlock_write(mapping);
page_was_mapped = 1;
}
--- a/mm/rmap.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/rmap.c
@@ -1380,6 +1380,9 @@ static bool try_to_unmap_one(struct page
/*
* If sharing is possible, start and end will be adjusted
* accordingly.
+ *
+ * If called for a huge page, caller must hold i_mmap_rwsem
+ * in write mode as it is possible to call huge_pmd_unshare.
*/
adjust_range_if_pmd_sharing_possible(vma, &range.start,
&range.end);
--- a/mm/userfaultfd.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization
+++ a/mm/userfaultfd.c
@@ -267,10 +267,14 @@ retry:
VM_BUG_ON(dst_addr & ~huge_page_mask(h));
/*
- * Serialize via hugetlb_fault_mutex
+ * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
+ * i_mmap_rwsem ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
- idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
+ i_mmap_lock_read(mapping);
+ idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
idx, dst_addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -279,6 +283,7 @@ retry:
dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -286,6 +291,7 @@ retry:
dst_pteval = huge_ptep_get(dst_pte);
if (!huge_pte_none(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
goto out_unlock;
}
@@ -293,6 +299,7 @@ retry:
dst_addr, src_addr, &page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
vm_alloc_shared = vm_shared;
cond_resched();
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch
hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch
This is a note to let you know that I've just added the patch titled
staging: wilc1000: fix missing read_write setting when reading data
to my staging git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
in the staging-testing branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will be merged to the staging-next branch sometime soon,
after it passes testing, and the merge window is open.
If you have any questions about this process, please let me know.
From c58eef061dda7d843dcc0ad6fea7e597d4c377c0 Mon Sep 17 00:00:00 2001
From: Colin Ian King <colin.king(a)canonical.com>
Date: Wed, 19 Dec 2018 16:30:07 +0000
Subject: staging: wilc1000: fix missing read_write setting when reading data
Currently the cmd.read_write setting is not initialized so it contains
garbage from the stack. Fix this by setting it to 0 to indicate a
read is required.
Detected by CoverityScan, CID#1357925 ("Uninitialized scalar variable")
Fixes: c5c77ba18ea6 ("staging: wilc1000: Add SDIO/SPI 802.11 driver")
Signed-off-by: Colin Ian King <colin.king(a)canonical.com>
Cc: stable <stable(a)vger.kernel.org>
Acked-by: Ajay Singh <ajay.kathat(a)microchip.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/wilc1000/wilc_sdio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/staging/wilc1000/wilc_sdio.c b/drivers/staging/wilc1000/wilc_sdio.c
index 27fdfbdda5c0..e2f739fef21c 100644
--- a/drivers/staging/wilc1000/wilc_sdio.c
+++ b/drivers/staging/wilc1000/wilc_sdio.c
@@ -861,6 +861,7 @@ static int sdio_read_int(struct wilc *wilc, u32 *int_status)
if (!sdio_priv->irq_gpio) {
int i;
+ cmd.read_write = 0;
cmd.function = 1;
cmd.address = 0x04;
cmd.data = 0;
--
2.20.1
Mediatek Preloader is a proprietary embedded boot loader for loading
Little Kernel and Linux into device DRAM.
This boot loader also handle firmware update. Mediatek Preloader will be
enumerated as a virtual COM port when the device is connected to Windows
or Linux OS via CDC-ACM class driver. When the USB enumeration has been
done, Mediatek Preloader will send out handshake command "READY" to PC
actively instead of waiting command from the download tool.
Since Linux 4.12, the commit "tty: reset termios state on device
registration" (93857edd9829e144acb6c7e72d593f6e01aead66) causes Mediatek
Preloader to receive an abnormal command like "READYXX" echoed back as
it sends "READY". This is recognized as an incorrect response and makes
the download handshake fail. This change only affects subsequent
connects if the reconnected device happens to get the same minor
number.
Disabling the ECHO termios flag avoids this problem. However, it cannot
be done by user space configuration when the download tool opens
/dev/ttyACM0, because the device running Mediatek Preloader sends the
handshake command "READY" immediately once the CDC-ACM driver is ready.
This patch fixes the above problem by introducing a "DISABLE_ECHO"
property in driver_info. When Mediatek Preloader is connected, the
CDC-ACM driver disables the ECHO flag in termios to avoid the problem.
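For comparison, the user space workaround that cannot win this race
would look like the sketch below (illustrative only; the helper name is
made up). "READY" can arrive, and be echoed, in the window between
open() and tcsetattr():

	#include <fcntl.h>
	#include <termios.h>

	int open_noecho(const char *dev)
	{
		struct termios tio;
		int fd = open(dev, O_RDWR | O_NOCTTY);

		if (fd < 0)
			return -1;
		tcgetattr(fd, &tio);
		tio.c_lflag &= ~ECHO;		/* same bit the quirk clears */
		tcsetattr(fd, TCSANOW, &tio);	/* too late if data was echoed */
		return fd;
	}

Clearing the bit in acm_tty_install() instead guarantees ECHO is off
before the first byte from the device can be echoed.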
Signed-off-by: Macpaul Lin <macpaul.lin(a)mediatek.com>
Cc: stable(a)vger.kernel.org
---
Changes for v2:
- Move quirks testing of DISABLE_ECHO flag into acm_tty_install().
- Change quirks testing into bitwise comparison.
Changes for v3:
- Replace quirks testing from init_termios to tty->termios.
- Remove parenthesis for ECHO flag.
Changes for v4:
- Drop quirks variable to simplify the patch.
- Move termios operation right after the driver_data has been installed.
- Write general style comment for suppressing initial echoing.
Changes for v5:
- Fix: termios operation right above where the driver_data has been installed.
- Update the commit message to note that this patch affects reconnected
devices which get the same minor number.
Changes for v6:
- Update VID/PID:0x0e8d/0x0003 as Mediatek Inc BROM.
- Update VID/PID:0x0e8d/0x2000 as Mediatek Inc Preloader.
Changes for v7:
- Keep VID/PID:0x0e8d/0x0003 unchanged because of 2 different UNION
descriptors implemented in Mediatek Inc BROM (MT6589/MT6765).
drivers/usb/class/cdc-acm.c | 10 ++++++++++
drivers/usb/class/cdc-acm.h | 1 +
2 files changed, 11 insertions(+)
diff --git a/drivers/usb/class/cdc-acm.c b/drivers/usb/class/cdc-acm.c
index 1b68fed..ed8c62b 100644
--- a/drivers/usb/class/cdc-acm.c
+++ b/drivers/usb/class/cdc-acm.c
@@ -581,6 +581,13 @@ static int acm_tty_install(struct tty_driver *driver, struct tty_struct *tty)
if (retval)
goto error_init_termios;
+ /*
+ * Suppress initial echoing for some devices which might send data
+ * immediately after acm driver has been installed.
+ */
+ if (acm->quirks & DISABLE_ECHO)
+ tty->termios.c_lflag &= ~ECHO;
+
tty->driver_data = acm;
return 0;
@@ -1657,6 +1664,9 @@ static int acm_pre_reset(struct usb_interface *intf)
{ USB_DEVICE(0x0e8d, 0x0003), /* FIREFLY, MediaTek Inc; andrey.arapov(a)gmail.com */
.driver_info = NO_UNION_NORMAL, /* has no union descriptor */
},
+ { USB_DEVICE(0x0e8d, 0x2000), /* MediaTek Inc Preloader */
+ .driver_info = DISABLE_ECHO, /* DISABLE ECHO in termios flag */
+ },
{ USB_DEVICE(0x0e8d, 0x3329), /* MediaTek Inc GPS */
.driver_info = NO_UNION_NORMAL, /* has no union descriptor */
},
diff --git a/drivers/usb/class/cdc-acm.h b/drivers/usb/class/cdc-acm.h
index ca06b20..515aad0 100644
--- a/drivers/usb/class/cdc-acm.h
+++ b/drivers/usb/class/cdc-acm.h
@@ -140,3 +140,4 @@ struct acm {
#define QUIRK_CONTROL_LINE_STATE BIT(6)
#define CLEAR_HALT_CONDITIONS BIT(7)
#define SEND_ZERO_PACKET BIT(8)
+#define DISABLE_ECHO BIT(9)
--
1.9.1