This is a v4.17-stable backport of my "Fix DM DAX handling" series which
had backport collisions.
Greg, please let me know if I need to adjust anything in my backport
formatting.
---
Changes since v1:
* Fixed formatting of my "commit X upstream." lines to include the full
hash to match other existing backports in stable.
* Fixed a few of the commit IDs which were from a dev branch and not
the actual upstream commit ID.
* Pulled in updated tags and changlog wording differences from upstream
commits vs commits in my dev branch.
Darrick J. Wong (1):
fs: allow per-device dax status checking for filesystems
Dave Jiang (1):
dax: change bdev_dax_supported() to support boolean returns
Ross Zwisler (3):
pmem: only set QUEUE_FLAG_DAX for fsdax mode
dax: check for QUEUE_FLAG_DAX in bdev_dax_supported()
dm: prevent DAX mounts if not supported
drivers/dax/super.c | 48 ++++++++++++++++++++++++++++--------------------
drivers/md/dm-table.c | 7 ++++---
drivers/md/dm.c | 3 +--
drivers/nvdimm/pmem.c | 3 ++-
fs/ext2/super.c | 3 +--
fs/ext4/super.c | 3 +--
fs/xfs/xfs_ioctl.c | 3 ++-
fs/xfs/xfs_iops.c | 30 +++++++++++++++++++++++++-----
fs/xfs/xfs_super.c | 10 ++++++++--
include/linux/dax.h | 11 ++++++-----
10 files changed, 78 insertions(+), 43 deletions(-)
--
2.14.4
From: Rasmus Villemoes <linux(a)rasmusvillemoes.dk>
Date: Sun, 8 Apr 2018 23:35:28 +0200
commit 9564a8cf422d7b58f6e857e3546d346fa970191e upstream.
I tried building using a freshly built Make (4.2.1-69-g8a731d1), but
already the objtool build broke with
orc_dump.c: In function ‘orc_dump’:
orc_dump.c:106:2: error: ‘elf_getshnum’ is deprecated [-Werror=deprecated-declarations]
if (elf_getshdrnum(elf, &nr_sections)) {
Turns out that with that new Make, the backslash was not removed, so cpp
didn't see a #include directive, grep found nothing, and
-DLIBELF_USE_DEPRECATED was wrongly put in CFLAGS.
Now, that new Make behaviour is documented in their NEWS file:
* WARNING: Backward-incompatibility!
Number signs (#) appearing inside a macro reference or function invocation
no longer introduce comments and should not be escaped with backslashes:
thus a call such as:
foo := $(shell echo '#')
is legal. Previously the number sign needed to be escaped, for example:
foo := $(shell echo '\#')
Now this latter will resolve to "\#". If you want to write makefiles
portable to both versions, assign the number sign to a variable:
C := \#
foo := $(shell echo '$C')
This was claimed to be fixed in 3.81, but wasn't, for some reason.
To detect this change search for 'nocomment' in the .FEATURES variable.
This also fixes up the two make-cmd instances to replace # with $(pound)
rather than with \#. There might very well be other places that need
similar fixup in preparation for whatever future Make release contains
the above change, but at least this builds an x86_64 defconfig with the
new make.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197847
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Signed-off-by: Rasmus Villemoes <linux(a)rasmusvillemoes.dk>
Signed-off-by: Masahiro Yamada <yamada.masahiro(a)socionext.com>
Signed-off-by: Paul Menzel <pmenzel(a)molgen.mpg.de>
---
Fix one conflict in `scripts/Kbuild.include`.
scripts/Kbuild.include | 5 +++--
tools/build/Build.include | 5 +++--
tools/objtool/Makefile | 2 +-
tools/scripts/Makefile.include | 2 ++
4 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 97769465de13..fcbbecf92395 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -8,6 +8,7 @@ squote := '
empty :=
space := $(empty) $(empty)
space_escape := _-_SPACE_-_
+pound := \#
###
# Name of target with a '.' as filename prefix. foo/bar.o => foo/.bar.o
@@ -251,11 +252,11 @@ endif
# Replace >$< with >$$< to preserve $ when reloading the .cmd file
# (needed for make)
-# Replace >#< with >\#< to avoid starting a comment in the .cmd file
+# Replace >#< with >$(pound)< to avoid starting a comment in the .cmd file
# (needed for make)
# Replace >'< with >'\''< to be able to enclose the whole string in '...'
# (needed for the shell)
-make-cmd = $(call escsq,$(subst \#,\\\#,$(subst $$,$$$$,$(cmd_$(1)))))
+make-cmd = $(call escsq,$(subst $(pound),$$(pound),$(subst $$,$$$$,$(cmd_$(1)))))
# Find any prerequisites that is newer than target or that does not exist.
# PHONY targets skipped in both cases.
diff --git a/tools/build/Build.include b/tools/build/Build.include
index 418871d02ebf..a4bbb984941d 100644
--- a/tools/build/Build.include
+++ b/tools/build/Build.include
@@ -12,6 +12,7 @@
# Convenient variables
comma := ,
squote := '
+pound := \#
###
# Name of target with a '.' as filename prefix. foo/bar.o => foo/.bar.o
@@ -43,11 +44,11 @@ echo-cmd = $(if $($(quiet)cmd_$(1)),\
###
# Replace >$< with >$$< to preserve $ when reloading the .cmd file
# (needed for make)
-# Replace >#< with >\#< to avoid starting a comment in the .cmd file
+# Replace >#< with >$(pound)< to avoid starting a comment in the .cmd file
# (needed for make)
# Replace >'< with >'\''< to be able to enclose the whole string in '...'
# (needed for the shell)
-make-cmd = $(call escsq,$(subst \#,\\\#,$(subst $$,$$$$,$(cmd_$(1)))))
+make-cmd = $(call escsq,$(subst $(pound),$$(pound),$(subst $$,$$$$,$(cmd_$(1)))))
###
# Find any prerequisites that is newer than target or that does not exist.
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index e6acc281dd37..8ae824dbfca3 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -35,7 +35,7 @@ CFLAGS += -Wall -Werror $(WARNINGS) -fomit-frame-pointer -O2 -g $(INCLUDES)
LDFLAGS += -lelf $(LIBSUBCMD)
# Allow old libelf to be used:
-elfshdr := $(shell echo '\#include <libelf.h>' | $(CC) $(CFLAGS) -x c -E - | grep elf_getshdr)
+elfshdr := $(shell echo '$(pound)include <libelf.h>' | $(CC) $(CFLAGS) -x c -E - | grep elf_getshdr)
CFLAGS += $(if $(elfshdr),,-DLIBELF_USE_DEPRECATED)
AWK = awk
diff --git a/tools/scripts/Makefile.include b/tools/scripts/Makefile.include
index 654efd9768fd..5f3f1f44ed0a 100644
--- a/tools/scripts/Makefile.include
+++ b/tools/scripts/Makefile.include
@@ -101,3 +101,5 @@ ifneq ($(silent),1)
QUIET_INSTALL = @printf ' INSTALL %s\n' $1;
endif
endif
+
+pound := \#
--
2.17.1
This is backport of the upstream patch
d12067f428c037b4575aaeb2be00847fc214c24a. It should be backported to
stable kernels because this performance problem was seen on the android
4.4 kernel.
commit d12067f428c037b4575aaeb2be00847fc214c24a
Author: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Wed Nov 23 16:52:01 2016 -0500
dm bufio: don't take the lock in dm_bufio_shrink_count
dm_bufio_shrink_count() is called from do_shrink_slab to find out how many
freeable objects are there. The reported value doesn't have to be precise,
so we don't need to take the dm-bufio lock.
Suggested-by: David Rientjes <rientjes(a)google.com>
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
---
drivers/md/dm-bufio.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
Index: linux-stable/drivers/md/dm-bufio.c
===================================================================
--- linux-stable.orig/drivers/md/dm-bufio.c 2018-07-04 15:25:41.000000000 +0200
+++ linux-stable/drivers/md/dm-bufio.c 2018-07-04 15:33:59.000000000 +0200
@@ -1574,19 +1574,11 @@ dm_bufio_shrink_scan(struct shrinker *sh
static unsigned long
dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
- struct dm_bufio_client *c;
- unsigned long count;
- unsigned long retain_target;
+ struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
+ unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
+ READ_ONCE(c->n_buffers[LIST_DIRTY]);
+ unsigned long retain_target = get_retain_buffers(c);
- c = container_of(shrink, struct dm_bufio_client, shrinker);
- if (sc->gfp_mask & __GFP_FS)
- dm_bufio_lock(c);
- else if (!dm_bufio_trylock(c))
- return 0;
-
- count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
- retain_target = get_retain_buffers(c);
- dm_bufio_unlock(c);
return (count < retain_target) ? 0 : (count - retain_target);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3f77f244d8ec28e3a0a81240ffac7d626390060c Mon Sep 17 00:00:00 2001
From: Martin Kaiser <martin(a)kaiser.cx>
Date: Mon, 18 Jun 2018 22:41:03 +0200
Subject: [PATCH] mtd: rawnand: mxc: set spare area size register explicitly
The v21 version of the NAND flash controller contains a Spare Area Size
Register (SPAS) at offset 0x10. Its setting defaults to the maximum
spare area size of 218 bytes. The size that is set in this register is
used by the controller when it calculates the ECC bytes internally in
hardware.
Usually, this register is updated from settings in the IIM fuses when
the system is booting from NAND flash. For other boot media, however,
the SPAS register remains at the default setting, which may not work for
the particular flash chip on the board. The same goes for flash chips
whose configuration cannot be set in the IIM fuses (e.g. chips with 2k
sector size and 128 bytes spare area size can't be configured in the IIM
fuses on imx25 systems).
Set the SPAS register explicitly during the preset operation. Derive the
register value from mtd->oobsize that was detected during probe by
decoding the flash chip's ID bytes.
While at it, rename the define for the spare area register's offset to
NFC_V21_RSLTSPARE_AREA. The register at offset 0x10 on v1 controllers is
different from the register on v21 controllers.
Fixes: d484018 ("mtd: mxc_nand: set NFC registers after reset")
Cc: stable(a)vger.kernel.org
Signed-off-by: Martin Kaiser <martin(a)kaiser.cx>
Reviewed-by: Sascha Hauer <s.hauer(a)pengutronix.de>
Reviewed-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Signed-off-by: Boris Brezillon <boris.brezillon(a)bootlin.com>
diff --git a/drivers/mtd/nand/raw/mxc_nand.c b/drivers/mtd/nand/raw/mxc_nand.c
index 45786e707b7b..26cef218bb43 100644
--- a/drivers/mtd/nand/raw/mxc_nand.c
+++ b/drivers/mtd/nand/raw/mxc_nand.c
@@ -48,7 +48,7 @@
#define NFC_V1_V2_CONFIG (host->regs + 0x0a)
#define NFC_V1_V2_ECC_STATUS_RESULT (host->regs + 0x0c)
#define NFC_V1_V2_RSLTMAIN_AREA (host->regs + 0x0e)
-#define NFC_V1_V2_RSLTSPARE_AREA (host->regs + 0x10)
+#define NFC_V21_RSLTSPARE_AREA (host->regs + 0x10)
#define NFC_V1_V2_WRPROT (host->regs + 0x12)
#define NFC_V1_UNLOCKSTART_BLKADDR (host->regs + 0x14)
#define NFC_V1_UNLOCKEND_BLKADDR (host->regs + 0x16)
@@ -1274,6 +1274,9 @@ static void preset_v2(struct mtd_info *mtd)
writew(config1, NFC_V1_V2_CONFIG1);
/* preset operation */
+ /* spare area size in 16-bit half-words */
+ writew(mtd->oobsize / 2, NFC_V21_RSLTSPARE_AREA);
+
/* Unlock the internal RAM Buffer */
writew(0x2, NFC_V1_V2_CONFIG);
The patch below does not apply to the 4.17-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9aa613674f89d01248ae2e4afe691b515ff8fbb6 Mon Sep 17 00:00:00 2001
From: Peter Rosin <peda(a)axentia.se>
Date: Wed, 20 Jun 2018 11:43:23 +0200
Subject: [PATCH] i2c: smbus: kill memory leak on emulated and failed DMA SMBus
xfers
If DMA safe memory was allocated, but the subsequent I2C transfer
fails the memory is leaked. Plug this leak.
Fixes: 8a77821e74d6 ("i2c: smbus: use DMA safe buffers for emulated SMBus transactions")
Signed-off-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Wolfram Sang <wsa(a)the-dreams.de>
Cc: stable(a)kernel.org
diff --git a/drivers/i2c/i2c-core-smbus.c b/drivers/i2c/i2c-core-smbus.c
index f3f683041e7f..51970bae3c4a 100644
--- a/drivers/i2c/i2c-core-smbus.c
+++ b/drivers/i2c/i2c-core-smbus.c
@@ -465,15 +465,18 @@ static s32 i2c_smbus_xfer_emulated(struct i2c_adapter *adapter, u16 addr,
status = i2c_transfer(adapter, msg, num);
if (status < 0)
- return status;
- if (status != num)
- return -EIO;
+ goto cleanup;
+ if (status != num) {
+ status = -EIO;
+ goto cleanup;
+ }
+ status = 0;
/* Check PEC if last message is a read */
if (i && (msg[num-1].flags & I2C_M_RD)) {
status = i2c_smbus_check_pec(partial_pec, &msg[num-1]);
if (status < 0)
- return status;
+ goto cleanup;
}
if (read_write == I2C_SMBUS_READ)
@@ -499,12 +502,13 @@ static s32 i2c_smbus_xfer_emulated(struct i2c_adapter *adapter, u16 addr,
break;
}
+cleanup:
if (msg[0].flags & I2C_M_DMA_SAFE)
kfree(msg[0].buf);
if (msg[1].flags & I2C_M_DMA_SAFE)
kfree(msg[1].buf);
- return 0;
+ return status;
}
/**
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7810e6781e0fcbca78b91cf65053f895bf59e85f Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka(a)suse.cz>
Date: Thu, 7 Jun 2018 17:09:29 -0700
Subject: [PATCH] mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g. because the
allocating thread has been migrated to a different CPU.
This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.
Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.
We can fix it by simply removing the zonelist reset completely. There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place. This was different
when commit 183f6371aac2 ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.
We might consider this for 4.17 although I don't know if there's
anything currently broken.
SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
ones I'll have to check.
So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.
Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka(a)suse.cz>
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef1811531999..07b3c23762ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4169,7 +4169,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* orientated.
*/
if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
- ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From dc7a10ddee0c56c6d891dd18de5c4ee9869545e0 Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <jaegeuk(a)kernel.org>
Date: Fri, 30 Mar 2018 17:58:13 -0700
Subject: [PATCH] f2fs: truncate preallocated blocks in error case
If write is failed, we must deallocate the blocks that we couldn't write.
Cc: stable(a)vger.kernel.org
Reviewed-by: Chao Yu <yuchao0(a)huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 8068b015ece5..6b94f19b3fa8 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2911,6 +2911,8 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = generic_write_checks(iocb, from);
if (ret > 0) {
+ bool preallocated = false;
+ size_t target_size = 0;
int err;
if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
@@ -2927,6 +2929,9 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
}
} else {
+ preallocated = true;
+ target_size = iocb->ki_pos + iov_iter_count(from);
+
err = f2fs_preallocate_blocks(iocb, from);
if (err) {
clear_inode_flag(inode, FI_NO_PREALLOC);
@@ -2939,6 +2944,10 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
blk_finish_plug(&plug);
clear_inode_flag(inode, FI_NO_PREALLOC);
+ /* if we couldn't write data, we should deallocate blocks. */
+ if (preallocated && i_size_read(inode) < target_size)
+ f2fs_truncate(inode);
+
if (ret > 0)
f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 31286a8484a85e8b4e91ddb0f5415aee8a416827 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Date: Thu, 5 Apr 2018 16:23:05 -0700
Subject: [PATCH] mm: hwpoison: disable memory error handling on 1GB hugepage
Recently the following BUG was reported:
Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
Memory failure: 0x3c0000: recovery action for huge page: Recovered
BUG: unable to handle kernel paging request at ffff8dfcc0003000
IP: gup_pgd_range+0x1f0/0xc20
PGD 17ae72067 P4D 17ae72067 PUD 0
Oops: 0000 [#1] SMP PTI
...
CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage. This happens because get_user_pages_fast() is not aware
of a migration entry on pud that was created in the 1st madvise() event.
I think that conversion to pud-aligned migration entry is working, but
other MM code walking over page table isn't prepared for it. We need
some time and effort to make all this work properly, so this patch
avoids the reported bug by just disabling error handling for 1GB
hugepage.
[n-horiguchi(a)ah.jp.nec.com: v2]
Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.j…
Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.j…
Signed-off-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Punit Agrawal <punit.agrawal(a)arm.com>
Tested-by: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Anshuman Khandual <khandual(a)linux.vnet.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e40a44a1fae..2e2be527642a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2613,6 +2613,7 @@ enum mf_action_page_type {
MF_MSG_POISONED_HUGE,
MF_MSG_HUGE,
MF_MSG_FREE_HUGE,
+ MF_MSG_NON_PMD_HUGE,
MF_MSG_UNMAP_FAILED,
MF_MSG_DIRTY_SWAPCACHE,
MF_MSG_CLEAN_SWAPCACHE,
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8291b75f42c8..2d4bf647cf01 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -502,6 +502,7 @@ static const char * const action_page_types[] = {
[MF_MSG_POISONED_HUGE] = "huge page already hardware poisoned",
[MF_MSG_HUGE] = "huge page",
[MF_MSG_FREE_HUGE] = "free huge page",
+ [MF_MSG_NON_PMD_HUGE] = "non-pmd-sized huge page",
[MF_MSG_UNMAP_FAILED] = "unmapping failed page",
[MF_MSG_DIRTY_SWAPCACHE] = "dirty swapcache page",
[MF_MSG_CLEAN_SWAPCACHE] = "clean swapcache page",
@@ -1084,6 +1085,21 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
return 0;
}
+ /*
+ * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
+ * simply disable it. In order to make it work properly, we need
+ * make sure that:
+ * - conversion of a pud that maps an error hugetlb into hwpoison
+ * entry properly works, and
+ * - other mm code walking over page table is aware of pud-aligned
+ * hwpoison entries.
+ */
+ if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
+ action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
+ res = -EBUSY;
+ goto out;
+ }
+
if (!hwpoison_user_mappings(p, pfn, flags, &head)) {
action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
res = -EBUSY;