Check the allocation size before toggling kfence_allocation_gate.
This way allocations that can't be served by KFENCE will not result in
waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating
anything.
Suggested-by: Marco Elver <elver(a)google.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: stable(a)vger.kernel.org # 5.12+
Signed-off-by: Alexander Potapenko <glider(a)google.com>
---
mm/kfence/core.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 4d21ac44d5d35..33bb20d91bf6a 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -733,6 +733,13 @@ void kfence_shutdown_cache(struct kmem_cache *s)
void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
{
+ /*
+ * Perform size check before switching kfence_allocation_gate, so that
+ * we don't disable KFENCE without making an allocation.
+ */
+ if (size > PAGE_SIZE)
+ return NULL;
+
/*
* allocation_gate only needs to become non-zero, so it doesn't make
* sense to continue writing to it and pay the associated contention
@@ -757,9 +764,6 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
if (!READ_ONCE(kfence_enabled))
return NULL;
- if (size > PAGE_SIZE)
- return NULL;
-
return kfence_guarded_alloc(s, size, flags);
}
--
2.32.0.93.g670b81a890-goog
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
I'm announcing the release of the 4.4.274 kernel.
All users of the 4.4 kernel series must upgrade.
The updated 4.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
Thanks,
Sasha
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcdJAACgkQ3qZv95d3
LNy2nw/8DjXdbC/e2QMD0vKWBKmEeJv4IdTEqZ0UFy6TtXYEu5xP6nPjQwFj1QBe
QEjA2/E/OfbVvw03VmkJCYhIJHNjod7+qc2fjwiYqwjuanzaEL1r1bwjvuAWEWir
4Dv7Ld24qGvPIku/9/O6scdEYhCxJD0nmP1PoYi65+qysdKz0Uz9/H8NGMg/4EOL
N0SO402K38PXMpSqNVQO1j/geBhGMEdKYPCCHbYtuTI2HTp+cz24dHuJ3qCnNZvu
6TLienmeqrfSHWZcy5Ylfe4pvac79ZOu2k8R/bOaldoDJEKsGIvhcXkW148u9J2R
344QhW7kf+qyjWbuYUTRT4MCJMU7gmSxOUpKj9uk1k0flC6kQlOmJgxK2tWcI76R
NhokvUXNDMBhD15FkomU2djDTeZlE213UNPerf3rdIvobCzqWKyQRhmf7UuIKA73
gukTSnVr5qW9PpAWsIoIeNVK/MTgqLX/KMTOOVjudPwo9ub+KeGgffWBK5JUBagX
a0THRt0C+bRoSQw5AWc/Wjbk5yd2ZIfmrLWRTjfAxWkEzOmd5k+1xC4BhrDMw/+f
ltsplL3LVFQ4vnUOiRsJfiLxVSPxdHP0xqhdSi/v2BKVyvwooqygv+2E+yCx6ChM
EkmJ7I3Nx+90WUxN4aKS2B4G8yX+r2p3XXzPomXkkqhkoOm5odc=
=dnqU
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
I'm announcing the release of the 4.9.274 kernel.
All users of the 4.9 kernel series must upgrade.
The updated 4.9.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.9.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
Thanks,
Sasha
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcdIgACgkQ3qZv95d3
LNwY3A/+PUro+6tW4eV+l/s2KhV16XpQeyj4d3r4rA3CxNNS7vRfe36EE2c57Y7y
ij+qHzrSphoA8WVfWbTVwRwxX0WLaOsNRD2UKHzg9Ngnd3EVnEks/g0Cu3FIKaEK
v4GSwNZCUB+TjS/i/VGPdbDMGYNA+S13A4qLf1cX9ZZcbZzX8cQuLO8UBIzeCFzX
M/IajOd+Z8vocrXcThGlXnsVcDGGg7wTd7D5jq2m14xsTYzt8XiS1Bk5a1kBVvrH
XjMhB2k7XaPd8cydK2lg7MQSPqEJzUEssyJKl1odkFcP5h+MrcN7FbBREbZhD3L4
ivvEx0zJnityqdS3av7UQC/OFpL8giSb560hHI3/K+Pe6Ww/DnwqEgNM6SuxmSyX
8ee7DDvUmEOsFIPJPG7zfesowGy1OVFZGPFQtRlb4wUncWJdWrbO1bSez3Kh/2Oa
VaGtmzgRWDaYOvHroVQqAFOXKG01geVgatRojLUCoo0SDKDcF5gCLhUkHVlhSuLG
oZJpboUuAfQcvU9fMFTSn6AuNa1MX9u820/jC3sG8cFh4rH9CKQMBtMDs8D5bxd+
Gp5uPd+KSHY3S2xfSzmXSX5XlOHXDep/4JRwZOZpH/pcv7bzC5qfh9veRjZoIo+a
qcjW8xJ/VqL/+Y/RMYXJAV4lQTsAkQkJK62Stj/Lz0Z5sPxfNkw=
=pF1+
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
I'm announcing the release of the 5.12.14 kernel.
All users of the 5.12 kernel series must upgrade.
The updated 5.12.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.12.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
Thanks,
Sasha
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcc0QACgkQ3qZv95d3
LNwH5w//TFlFjU8wQBo9akQQjQguKKAiuIa6Mw21rWbl7kJbQpANN34+CiSF8IKS
zlDmuAcAGASQttQ5Ud0TgF7wkrSFdDEwGKXeqrAmHTL2ux+1dTImBSeFAlIuEPYH
dcZdhrHcrPnQGcOwrh9oE0DBJRRVyztPgqAZXXKtvraPJQnCmr5UryI31C7t2ZjQ
vvtwMXNuYAzmqe9tsWYdEY4zk1cqMLWKjK1a4ePhvknnq+NSolzmv9WYJER1gjxN
j5BXKms2RUZjbWql456rP0k2gRxsL2A2ZEnUzxDMgbwQASSEcQzLjpZIjrZWrteM
kunUw8Qyq6FHLnmavE/l4g1tY6wPDljw6Y80Yijlr9yYgCEOvaBuKNMpeUoiwNhb
WyeebAFDycuguwjpE1Fds+Qb8LgVKyzrXWgHMU75H9+VFRPIShXWgyll77DsJL8s
9Fgt8IVUCC0IyPKERsgkJ7xfWa6K5Icrhlf5jpymIZjGMfhv9G/C86CdAkBwRB9f
oD6lHy/CPffzzUPuI1eSiGEv3RXHlJwrs0GWrkDJfoObli2lTXR6olNblhJwGh2y
bZ26x4NU88oHUBuizMkmCiL9NK/usMfWQZdlDJ7rfPrLiHjQ+Xdf7+ORCCa35ahE
DRGxrF7EpIxMjgq4LEKTlKbvT6AqsAjdw+mhjpoQoE1W6WcebkM=
=Q/ui
-----END PGP SIGNATURE-----
From: Kan Liang <kan.liang(a)linux.intel.com>
The uncore unit with the type ID 0 and the unit ID 0 is missed.
The table3 of the uncore unit maybe 0. The
uncore_discovery_invalid_unit() mistakenly treated it as an invalid
value.
Remove the !unit.table3 check.
Fixes: edae1f06c2cd ("perf/x86/intel/uncore: Parse uncore discovery tables")
Reviewed-by: Andi Kleen <ak(a)linux.intel.com>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
The patch was posted as a part of the "perf: Add Sapphire Rapids server
uncore support" patch set. But it doesn't depend on the other patches in
the patch set. The bugfix can be accepted and merged separately.
https://lore.kernel.org/lkml/cb0d2d43-102a-994c-f777-e11d61c77bf5@linux.int…
arch/x86/events/intel/uncore_discovery.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore_discovery.h b/arch/x86/events/intel/uncore_discovery.h
index 1d65293..abfb1e8 100644
--- a/arch/x86/events/intel/uncore_discovery.h
+++ b/arch/x86/events/intel/uncore_discovery.h
@@ -30,7 +30,7 @@
#define uncore_discovery_invalid_unit(unit) \
- (!unit.table1 || !unit.ctl || !unit.table3 || \
+ (!unit.table1 || !unit.ctl || \
unit.table1 == -1ULL || unit.ctl == -1ULL || \
unit.table3 == -1ULL)
--
2.7.4
From: Kan Liang <kan.liang(a)linux.intel.com>
The uncore unit with the type ID 0 and the unit ID 0 is missed.
The table3 of the uncore unit maybe 0. The
uncore_discovery_invalid_unit() mistakenly treated it as an invalid
value.
Remove the !unit.table3 check.
Fixes: edae1f06c2cd ("perf/x86/intel/uncore: Parse uncore discovery tables")
Reviewed-by: Andi Kleen <ak(a)linux.intel.com>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/events/intel/uncore_discovery.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore_discovery.h b/arch/x86/events/intel/uncore_discovery.h
index 7280c8a..6d735611 100644
--- a/arch/x86/events/intel/uncore_discovery.h
+++ b/arch/x86/events/intel/uncore_discovery.h
@@ -30,7 +30,7 @@
#define uncore_discovery_invalid_unit(unit) \
- (!unit.table1 || !unit.ctl || !unit.table3 || \
+ (!unit.table1 || !unit.ctl || \
unit.table1 == -1ULL || unit.ctl == -1ULL || \
unit.table3 == -1ULL)
--
2.7.4
Hi,
I realized that this cve(
https://www.openwall.com/lists/oss-security/2021/03/23/2 ) is not in
the 4.19 tree but the commits introduced the vulnerability before
4.19. Is there any reason that the fix was not cherry-picked to 4.19?
Best,
Anil
The Tegra serial driver always prints an error message when enabling the
FIFO for devices that have support for checking the FIFO enable status.
Fix this by only display the error message, when an error occurs.
Fixes: 222dcdff3405 ("serial: tegra: check for FIFO mode enabled status")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Jon Hunter <jonathanh(a)nvidia.com>
---
drivers/tty/serial/serial-tegra.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/tty/serial/serial-tegra.c b/drivers/tty/serial/serial-tegra.c
index 222032792d6c..cd481f7ba8eb 100644
--- a/drivers/tty/serial/serial-tegra.c
+++ b/drivers/tty/serial/serial-tegra.c
@@ -1045,9 +1045,10 @@ static int tegra_uart_hw_init(struct tegra_uart_port *tup)
if (tup->cdata->fifo_mode_enable_status) {
ret = tegra_uart_wait_fifo_mode_enabled(tup);
- dev_err(tup->uport.dev, "FIFO mode not enabled\n");
- if (ret < 0)
+ if (ret < 0) {
+ dev_err(tup->uport.dev, "FIFO mode not enabled\n");
return ret;
+ }
} else {
/*
* For all tegra devices (up to t210), there is a hardware
--
2.25.1
From: Mohammad Athari Bin Ismail <mohammad.athari.ismail(a)intel.com>
Add stmmac_fpe_stop_wq() in stmmac_suspend() to terminate FPE workqueue
during suspend. So, in suspend mode, there will be no FPE workqueue
available. Without this fix, new additional FPE workqueue will be created
in every suspend->resume cycle.
Fixes: 5a5586112b92 ("net: stmmac: support FPE link partner hand-shaking procedure")
Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail(a)intel.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c87202cbd3d6..796ad594543d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -7170,6 +7170,7 @@ int stmmac_suspend(struct device *dev)
priv->plat->rx_queues_to_use, false);
stmmac_fpe_handshake(priv, false);
+ stmmac_fpe_stop_wq(priv);
}
priv->speed = SPEED_UNKNOWN;
--
2.17.1
Allocation requests with __GFP_DMA/__GFP_DMA32 or
SLAB_CACHE_DMA/SLAB_CACHE_DMA32 cannot be fulfilled by KFENCE, because
they must reside in low memory, whereas KFENCE memory pool is located in
high memory.
Skip such allocations to avoid crashes where DMAable memory is expected.
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: stable(a)vger.kernel.org # 5.12+
Signed-off-by: Alexander Potapenko <glider(a)google.com>
---
v2:
- added parentheses around the GFP clause, as requested by Marco
---
mm/kfence/core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 4d21ac44d5d35..f7ce3d876bc9e 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -760,6 +760,14 @@ void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
if (size > PAGE_SIZE)
return NULL;
+ /*
+ * Skip DMA allocations. These must reside in the low memory, which we
+ * cannot guarantee.
+ */
+ if ((flags & (__GFP_DMA | __GFP_DMA32)) ||
+ (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32)))
+ return NULL;
+
return kfence_guarded_alloc(s, size, flags);
}
--
2.32.0.93.g670b81a890-goog
Some boot partitions on the Samsung 4GB KLM4G1YE4C "4YMD1R" and "M4G1YC"
cards appear broken when accessed randomly. CMD6 to switch back to the main
partition randomly stalls after CMD18 access to the boot partition 1, and
the card never comes back online. The accesses to the boot partitions work
several times before this happens.
Some problematic eMMC cards are found in the Samsung GT-S7710 mobile phone.
I tried using only single blocks with CMD17 on the boot partitions with the
result that it crashed even faster.
The card may survive on older kernels, but this seems to be because recent
kernel partition parsers are accessing the boot partitions more, and we get
more block traffic to the boot partitions during kernel startup.
After applying this patch, the main partition can be accessed and mounted
without problems.
After a bit of root cause analysis it turns out that these old eMMC cards
cannot do hardware busy detection (monitoring DAT0) properly. This explains
why older kernels work since we only recently added support for this to the
MMCI driver which is used with these cards on these platforms.
Construct a quirk that makes the MMC cord avoid using the ->card_busy()
callback if the card is listed with MMC_QUIRK_BROKEN_HW_BUSY_DETECT and
register the known problematic cards. The core changes are pretty
straight-forward with a helper inline to check of we can use hardware
busy detection.
On the MMCI host we have to counter the fact that if the host was able to
use busy detect, it would be used unsolicited in the command IRQ callback.
Rewrite this so that MMCI will not attempt to use hardware busy detection
in the command IRQ until:
- A card is attached to the host and
- We know that the card can handle this busy detection
I have glanced over the ->card_busy() callbacks on some other hosts and
they seem to mostly read a register reflecting the value of DAT0 for this
which works fine with the quirk in this patch. However if the error appear
on other hosts they might need additional fixes.
Fixes: cb0335b778c7 ("mmc: mmci: add busy_complete callback")
Cc: stable(a)vger.kernel.org
Cc: phone-devel(a)vger.kernel.org
Cc: Ludovic Barre <ludovic.barre(a)st.com>
Cc: Stephan Gerhold <stephan(a)gerhold.net>
Reported-by: newbyte(a)disroot.org
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
ChangeLog v1->v2:
- Rewrite to reflect the actual problem of broken busy detection.
---
drivers/mmc/core/core.c | 8 ++++----
drivers/mmc/core/core.h | 17 +++++++++++++++++
drivers/mmc/core/mmc_ops.c | 4 ++--
drivers/mmc/core/quirks.h | 21 +++++++++++++++++++++
drivers/mmc/host/mmci.c | 22 ++++++++++++++++++++--
include/linux/mmc/card.h | 1 +
6 files changed, 65 insertions(+), 8 deletions(-)
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index f194940c5974..7142a806f6a6 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -232,7 +232,7 @@ static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
* And bypass I/O abort, reset and bus suspend operations.
*/
if (sdio_is_io_busy(mrq->cmd->opcode, mrq->cmd->arg) &&
- host->ops->card_busy) {
+ mmc_hw_busy_detect(host)) {
int tries = 500; /* Wait aprox 500ms at maximum */
while (host->ops->card_busy(host) && --tries)
@@ -1197,7 +1197,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
*/
if (!host->ops->start_signal_voltage_switch)
return -EPERM;
- if (!host->ops->card_busy)
+ if (!mmc_hw_busy_detect(host))
pr_warn("%s: cannot verify signal voltage switch\n",
mmc_hostname(host));
@@ -1217,7 +1217,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
* after the response of cmd11, but wait 1 ms to be sure
*/
mmc_delay(1);
- if (host->ops->card_busy && !host->ops->card_busy(host)) {
+ if (mmc_hw_busy_detect(host) && !host->ops->card_busy(host)) {
err = -EAGAIN;
goto power_cycle;
}
@@ -1238,7 +1238,7 @@ int mmc_set_uhs_voltage(struct mmc_host *host, u32 ocr)
* Failure to switch is indicated by the card holding
* dat[0:3] low
*/
- if (host->ops->card_busy && host->ops->card_busy(host))
+ if (mmc_hw_busy_detect(host) && host->ops->card_busy(host))
err = -EAGAIN;
power_cycle:
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index db3c9c68875d..68091bbba580 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -172,4 +172,21 @@ static inline bool mmc_cache_enabled(struct mmc_host *host)
return false;
}
+/**
+ * mmc_hw_busy_detect() - Can we use hw busy detection?
+ * @host: the host in question
+ */
+static inline bool mmc_hw_busy_detect(struct mmc_host *host)
+{
+ struct mmc_card *card = host->card;
+ bool has_ops;
+ bool able = true;
+
+ has_ops = (host->ops->card_busy != NULL);
+ if (card)
+ able = !(card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT);
+
+ return (has_ops && able);
+}
+
#endif
diff --git a/drivers/mmc/core/mmc_ops.c b/drivers/mmc/core/mmc_ops.c
index 5756781fef37..5f37e280dc1c 100644
--- a/drivers/mmc/core/mmc_ops.c
+++ b/drivers/mmc/core/mmc_ops.c
@@ -431,7 +431,7 @@ static int mmc_busy_status(struct mmc_card *card, bool retry_crc_err,
u32 status = 0;
int err;
- if (host->ops->card_busy) {
+ if (mmc_hw_busy_detect(host)) {
*busy = host->ops->card_busy(host);
return 0;
}
@@ -480,7 +480,7 @@ static int __mmc_poll_for_busy(struct mmc_card *card, unsigned int timeout_ms,
* capable of polling by using ->card_busy(), then rely on waiting the
* stated timeout to be sufficient.
*/
- if (!send_status && !host->ops->card_busy) {
+ if (!send_status && mmc_hw_busy_detect(host)) {
mmc_delay(timeout_ms);
return 0;
}
diff --git a/drivers/mmc/core/quirks.h b/drivers/mmc/core/quirks.h
index d68e6e513a4f..8da6526f0eb0 100644
--- a/drivers/mmc/core/quirks.h
+++ b/drivers/mmc/core/quirks.h
@@ -99,6 +99,27 @@ static const struct mmc_fixup __maybe_unused mmc_blk_fixups[] = {
MMC_FIXUP("V10016", CID_MANFID_KINGSTON, CID_OEMID_ANY, add_quirk_mmc,
MMC_QUIRK_TRIM_BROKEN),
+ /*
+ * Some older Samsung eMMCs have broken hardware busy detection.
+ * Enabling this feature in the host controller can make the card
+ * accesses lock up completely.
+ */
+ MMC_FIXUP("4YMD1R", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ /* Samsung KLMxGxxE4x eMMCs from 2012: 4, 8, 16, 32 and 64 GB */
+ MMC_FIXUP("M4G1YC", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("M8G1WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MBG4WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MAG2WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+ MMC_FIXUP("MCG8WA", CID_MANFID_SAMSUNG, CID_OEMID_ANY, add_quirk_mmc,
+ MMC_QUIRK_BROKEN_HW_BUSY_DETECT),
+
END_FIXUP
};
diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
index 984d35055156..3046917b2b67 100644
--- a/drivers/mmc/host/mmci.c
+++ b/drivers/mmc/host/mmci.c
@@ -347,6 +347,24 @@ static int mmci_card_busy(struct mmc_host *mmc)
return busy;
}
+/* Use this if the MMCI variant AND the card supports it */
+static bool mmci_use_busy_detect(struct mmci_host *host)
+{
+ struct mmc_card *card = host->mmc->card;
+
+ if (!host->variant->busy_detect)
+ return false;
+
+ /* We don't allow this until we know that the card can handle it */
+ if (!card)
+ return false;
+
+ if (card->quirks & MMC_QUIRK_BROKEN_HW_BUSY_DETECT)
+ return false;
+
+ return true;
+}
+
static void mmci_reg_delay(struct mmci_host *host)
{
/*
@@ -1381,7 +1399,7 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
return;
/* Handle busy detection on DAT0 if the variant supports it. */
- if (busy_resp && host->variant->busy_detect)
+ if (busy_resp && mmci_use_busy_detect(host))
if (!host->ops->busy_complete(host, status, err_msk))
return;
@@ -1725,7 +1743,7 @@ static void mmci_set_max_busy_timeout(struct mmc_host *mmc)
struct mmci_host *host = mmc_priv(mmc);
u32 max_busy_timeout = 0;
- if (!host->variant->busy_detect)
+ if (!mmci_use_busy_detect(host))
return;
if (host->variant->busy_timeout && mmc->actual_clock)
diff --git a/include/linux/mmc/card.h b/include/linux/mmc/card.h
index f9ad35dd6012..b9a12e925a5b 100644
--- a/include/linux/mmc/card.h
+++ b/include/linux/mmc/card.h
@@ -259,6 +259,7 @@ struct mmc_card {
/* for byte mode */
#define MMC_QUIRK_NONSTD_SDIO (1<<2) /* non-standard SDIO card attached */
/* (missing CIA registers) */
+#define MMC_QUIRK_BROKEN_HW_BUSY_DETECT (1<<3) /* Disable hardware busy detection on DAT0 */
#define MMC_QUIRK_NONSTD_FUNC_IF (1<<4) /* SDIO card has nonstd function interfaces */
#define MMC_QUIRK_DISABLE_CD (1<<5) /* disconnect CD/DAT[3] resistor */
#define MMC_QUIRK_INAND_CMD38 (1<<6) /* iNAND devices have broken CMD38 */
--
2.31.1
This is the start of the stable review cycle for the 4.9.274 release.
There are 71 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed 30 Jun 2021 02:39:51 PM UTC.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.9.y
and the diffstat can be found below.
Thanks,
Sasha
-------------
Pseudo-Shortlog of commits:
Alexander Duyck (1):
i40e: Be much more verbose about what we can and cannot offload
Anirudh Rayabharam (1):
HID: usbhid: fix info leak in hid_submit_ctrl
Antti Järvinen (1):
PCI: Mark TI C667X to avoid bus reset
Arnd Bergmann (1):
ARM: 9081/1: fix gcc-10 thumb2-kernel regression
Bixuan Cui (1):
HID: gt683r: add missing MODULE_DEVICE_TABLE
Bumyong Lee (1):
dmaengine: pl330: fix wrong usage of spinlock flags in dma_cyclc
Chen Li (1):
radeon: use memcpy_to/fromio for UVD fw upload
Chengyang Fan (1):
net: ipv4: fix memory leak in ip_mc_add1_src
Christophe JAILLET (4):
alx: Fix an error handling path in 'alx_probe()'
qlcnic: Fix an error handling path in 'qlcnic_probe()'
netxen_nic: Fix an error handling path in 'netxen_nic_probe()'
be2net: Fix an error handling path in 'be_probe()'
Dan Robertson (1):
net: ieee802154: fix null deref in parse dev addr
Dongliang Mu (1):
net: usb: fix possible use-after-free in smsc75xx_bind
Du Cheng (1):
cfg80211: call cfg80211_leave_ocb when switching away from OCB
Eric Dumazet (5):
net/af_unix: fix a data-race in unix_dgram_sendmsg / unix_release_sock
inet: use bigger hash table for IP ID generation
inet: annotate date races around sk->sk_txhash
net/packet: annotate accesses to po->bind
net/packet: annotate accesses to po->ifindex
Esben Haabendal (1):
net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
Fugang Duan (1):
net: fec_ptp: add clock rate zero check
Hillf Danton (1):
gfs2: Fix use-after-free in gfs2_glock_shrink_scan
Ido Schimmel (1):
rtnetlink: Fix regression in bridge VLAN configuration
Jiapeng Chong (2):
ethernet: myri10ge: Fix missing error code in myri10ge_probe()
rtnetlink: Fix missing error code in rtnl_bridge_notify()
Jisheng Zhang (1):
net: stmmac: dwmac1000: Fix extended MAC address registers definition
Johan Hovold (1):
i2c: robotfuzz-osif: fix control-request directions
Johannes Berg (2):
mac80211: remove warning in ieee80211_get_sband()
mac80211: drop multicast fragments
Josh Triplett (1):
net: ipconfig: Don't override command-line hostnames or domains
Kees Cook (4):
r8152: Avoid memcpy() over-reading of ETH_SS_STATS
sh_eth: Avoid memcpy() over-reading of ETH_SS_STATS
r8169: Avoid memcpy() over-reading of ETH_SS_STATS
net: qed: Fix memcpy() overflow of qed_dcbx_params()
Linyu Yuan (1):
net: cdc_eem: fix tx fixup skb leak
Maciej Żenczykowski (1):
net: cdc_ncm: switch to eth%d interface naming
Mark Bolhuis (1):
HID: Add BUS_VIRTUAL to hid_connect logging
Maurizio Lombardi (1):
scsi: target: core: Fix warning on realtime kernels
Maxim Mikityanskiy (1):
netfilter: synproxy: Fix out of bounds when parsing TCP options
Ming Lei (1):
scsi: core: Put .shost_dev in failure path if host state changes to
RUNNING
Nanyong Sun (1):
net: ipv4: fix memory leak in netlbl_cipsov4_add_std
Naoya Horiguchi (1):
mm: hwpoison: change PageHWPoison behavior on hugetlb pages
Nathan Chancellor (1):
Makefile: Move -Wno-unused-but-set-variable out of GCC only block
Norbert Slusarek (1):
can: bcm: fix infoleak in struct bcm_msg_head
Paolo Abeni (1):
udp: fix race between close() and udp_abort()
Pavel Skripkin (5):
net: rds: fix memory leak in rds_recvmsg
net: hamradio: fix memory leak in mkiss_close
net: ethernet: fix potential use-after-free in ec_bhf_remove
net: caif: fix memory leak in ldisc_open
nilfs2: fix memory leak in nilfs_sysfs_delete_device_group
Peter Chen (1):
usb: dwc3: core: fix kernel panic when do reboot
Rafael J. Wysocki (1):
Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
Randy Dunlap (1):
dmaengine: QCOM_HIDMA_MGMT depends on HAS_IOMEM
Sasha Levin (1):
Linux 4.9.274-rc1
Shanker Donthineni (1):
PCI: Mark some NVIDIA GPUs to avoid bus reset
Srinivas Pandruvada (1):
HID: hid-sensor-hub: Return error for hid_set_field() failure
Steven Rostedt (VMware) (3):
tracing: Do no increment trace_clock_global() by one
tracing: Do not stop recording cmdlines when tracing is off
tracing: Do not stop recording comms if the trace file is being read
Suzuki K Poulose (1):
arm64: perf: Disable PMU while processing counter overflows
Sven Eckelmann (1):
batman-adv: Avoid WARN_ON timing related checks
Tetsuo Handa (1):
can: bcm/raw/isotp: use per module netdevice notifier
Thomas Gleixner (1):
x86/fpu: Reset state for all signal restore failures
Vineet Gupta (1):
ARCv2: save ABI registers across signal handling
Yang Yingliang (1):
dmaengine: stedma40: add missing iounmap() on error in d40_probe()
Yongqiang Liu (1):
ARM: OMAP2+: Fix build warning when mmc_omap is not built
Zheng Yongjun (4):
net/x25: Return the correct errno code
net: Return the correct errno code
fib: Return the correct errno code
ping: Check return value of function 'ping_queue_rcv_skb'
Makefile | 7 +-
arch/arc/include/uapi/asm/sigcontext.h | 1 +
arch/arc/kernel/signal.c | 43 +++++++++
arch/arm/kernel/setup.c | 16 ++--
arch/arm/mach-omap2/board-n8x0.c | 2 +-
arch/arm64/kernel/perf_event.c | 50 ++++++-----
arch/x86/kernel/fpu/signal.c | 18 ++--
drivers/dma/pl330.c | 6 +-
drivers/dma/qcom/Kconfig | 1 +
drivers/dma/ste_dma40.c | 3 +
drivers/gpu/drm/radeon/radeon_uvd.c | 4 +-
drivers/hid/hid-core.c | 3 +
drivers/hid/hid-gt683r.c | 1 +
drivers/hid/hid-sensor-hub.c | 13 ++-
drivers/hid/usbhid/hid-core.c | 2 +-
drivers/i2c/busses/i2c-robotfuzz-osif.c | 4 +-
drivers/net/caif/caif_serial.c | 1 +
drivers/net/ethernet/atheros/alx/main.c | 1 +
drivers/net/ethernet/ec_bhf.c | 4 +-
drivers/net/ethernet/emulex/benet/be_main.c | 1 +
drivers/net/ethernet/freescale/fec_ptp.c | 4 +
drivers/net/ethernet/intel/i40e/i40e_main.c | 52 +++++++++--
.../net/ethernet/myricom/myri10ge/myri10ge.c | 1 +
.../ethernet/qlogic/netxen/netxen_nic_main.c | 2 +
drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 4 +-
.../net/ethernet/qlogic/qlcnic/qlcnic_main.c | 1 +
drivers/net/ethernet/realtek/r8169.c | 2 +-
drivers/net/ethernet/renesas/sh_eth.c | 2 +-
.../net/ethernet/stmicro/stmmac/dwmac1000.h | 8 +-
drivers/net/ethernet/xilinx/ll_temac_main.c | 5 ++
drivers/net/hamradio/mkiss.c | 1 +
drivers/net/usb/cdc_eem.c | 2 +-
drivers/net/usb/cdc_ncm.c | 2 +-
drivers/net/usb/r8152.c | 2 +-
drivers/net/usb/smsc75xx.c | 10 ++-
drivers/pci/pci.c | 16 +++-
drivers/pci/quirks.c | 22 +++++
drivers/scsi/hosts.c | 8 +-
drivers/target/target_core_transport.c | 4 +-
drivers/usb/dwc3/core.c | 2 +-
fs/gfs2/glock.c | 2 +-
fs/nilfs2/sysfs.c | 1 +
include/linux/hid.h | 3 +-
include/linux/swapops.h | 9 --
include/net/sock.h | 10 ++-
kernel/trace/trace.c | 12 ---
kernel/trace/trace_clock.c | 6 +-
mm/memory-failure.c | 87 +++++--------------
net/batman-adv/bat_iv_ogm.c | 4 +-
net/can/bcm.c | 64 +++++++++++---
net/can/raw.c | 62 ++++++++++---
net/compat.c | 2 +-
net/core/fib_rules.c | 2 +-
net/core/rtnetlink.c | 4 +
net/ieee802154/nl802154.c | 9 +-
net/ipv4/cipso_ipv4.c | 1 +
net/ipv4/igmp.c | 1 +
net/ipv4/ipconfig.c | 13 +--
net/ipv4/ping.c | 12 +--
net/ipv4/route.c | 42 ++++++---
net/ipv4/udp.c | 10 +++
net/ipv6/udp.c | 3 +
net/mac80211/ieee80211_i.h | 2 +-
net/mac80211/rx.c | 9 +-
net/netfilter/nf_synproxy_core.c | 5 ++
net/packet/af_packet.c | 32 +++----
net/rds/recv.c | 2 +-
net/unix/af_unix.c | 7 +-
net/wireless/util.c | 3 +
net/x25/af_x25.c | 2 +-
70 files changed, 493 insertions(+), 259 deletions(-)
--
2.30.2
As GDSCs are registered and found to be already enabled
gdsc_toggle_logic() will be invoked for votable GDSCs and ensure that
the vote is matching the hardware state. Part of this the related
regulator will be enabled.
But for non-votable GDSCs the regulator and GDSC status will be out of
sync and as the GDSC is later disabled regulator_disable() will face an
unbalanced enable-count, or something might turn off the supply under
the feet of the GDSC.
So ensure that the regulator is enabled even for non-votable GDSCs.
Cc: stable(a)vger.kernel.org
Fixes: 37416e554961 ("clk: qcom: gdsc: Handle GDSC regulator supplies")
Signed-off-by: Bjorn Andersson <bjorn.andersson(a)linaro.org>
---
drivers/clk/qcom/gdsc.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/clk/qcom/gdsc.c b/drivers/clk/qcom/gdsc.c
index 51ed640e527b..f7e7759cdb90 100644
--- a/drivers/clk/qcom/gdsc.c
+++ b/drivers/clk/qcom/gdsc.c
@@ -359,10 +359,17 @@ static int gdsc_init(struct gdsc *sc)
/*
* Votable GDSCs can be ON due to Vote from other masters.
- * If a Votable GDSC is ON, make sure we have a Vote.
+ * If a Votable GDSC is ON, make sure we have a Vote. If
+ * non-votable, ensure that the supply is kept enabled (as
+ * is done by gdsc_enable).
*/
- if ((sc->flags & VOTABLE) && on)
+ if ((sc->flags & VOTABLE) && on) {
gdsc_enable(&sc->pd);
+ } else if (on) {
+ ret = regulator_enable(sc->rsupply);
+ if (ret < 0)
+ return ret;
+ }
/*
* Make sure the retain bit is set if the GDSC is already on, otherwise
--
2.29.2
This is the start of the stable review cycle for the 5.4.129 release.
There are 71 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed 30 Jun 2021 02:29:43 PM UTC.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
Thanks,
Sasha
-------------
Pseudo-Shortlog of commits:
Alex Shi (1):
mm: add VM_WARN_ON_ONCE_PAGE() macro
Arnd Bergmann (1):
ARM: 9081/1: fix gcc-10 thumb2-kernel regression
Austin Kim (1):
net: ethtool: clear heap allocations for ethtool function
Christian König (2):
drm/nouveau: wait for moving fence after pinning v2
drm/radeon: wait for moving fence after pinning
Du Cheng (1):
cfg80211: call cfg80211_leave_ocb when switching away from OCB
Eric Dumazet (3):
inet: annotate date races around sk->sk_txhash
net/packet: annotate accesses to po->bind
net/packet: annotate accesses to po->ifindex
Eric Snowberg (2):
certs: Add EFI_CERT_X509_GUID support for dbx entries
certs: Move load_system_certificate_list to a common function
Esben Haabendal (2):
net: ll_temac: Add memory-barriers for TX BD access
net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
Fabien Dessenne (1):
pinctrl: stm32: fix the reported number of GPIO lines per bank
Fuad Tabba (1):
KVM: selftests: Fix kvm_check_cap() assertion
Guillaume Ranquet (3):
dmaengine: mediatek: free the proper desc in desc_free handler
dmaengine: mediatek: do not issue a new desc if one is still current
dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma
Haibo Chen (1):
spi: spi-nxp-fspi: move the register operation after the clock enable
Hugh Dickins (16):
mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
mm/thp: make is_huge_zero_pmd() safe and quicker
mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
mm/thp: fix vma_address() if virtual address below file offset
mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
mm: page_vma_mapped_walk(): use page for pvmw->page
mm: page_vma_mapped_walk(): settle PageHuge on entry
mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
mm: page_vma_mapped_walk(): crossing page table boundary
mm: page_vma_mapped_walk(): add a level of indentation
mm: page_vma_mapped_walk(): use goto instead of while (1)
mm: page_vma_mapped_walk(): get vma_address_end() earlier
mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
mm, futex: fix shared futex pgoff on shmem huge page
Johan Hovold (1):
i2c: robotfuzz-osif: fix control-request directions
Johannes Berg (3):
mac80211: remove warning in ieee80211_get_sband()
mac80211_hwsim: drop pending frames on stop
mac80211: drop multicast fragments
Jue Wang (1):
mm/thp: fix page_address_in_vma() on file THP tails
Kees Cook (4):
r8152: Avoid memcpy() over-reading of ETH_SS_STATS
sh_eth: Avoid memcpy() over-reading of ETH_SS_STATS
r8169: Avoid memcpy() over-reading of ETH_SS_STATS
net: qed: Fix memcpy() overflow of qed_dcbx_params()
Miaohe Lin (2):
mm/rmap: remove unneeded semicolon in page_not_mapped()
mm/rmap: use page_not_mapped in try_to_unmap()
Mikel Rychliski (1):
PCI: Add AMD RS690 quirk to enable 64-bit DMA
Mimi Zohar (1):
module: limit enabling module.sig_enforce
Nathan Chancellor (1):
MIPS: generic: Update node names to avoid unit addresses
Nayna Jain (2):
certs: Add wrapper function to check blacklisted binary hash
x86/efi: move common keyring handler functions to new file
Neil Armstrong (1):
mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk
Nicholas Piggin (1):
KVM: do not allow mapping valid but non-reference-counted pages
Nick Desaulniers (1):
arm64: link with -z norelro for LLD or aarch64-elf
Pavel Skripkin (2):
net: caif: fix memory leak in ldisc_open
nilfs2: fix memory leak in nilfs_sysfs_delete_device_group
Peter Zijlstra (1):
recordmcount: Correct st_shndx handling
Petr Mladek (2):
kthread_worker: split code for canceling the delayed work timer
kthread: prevent deadlock when kthread_mod_delayed_work() races with
kthread_cancel_delayed_work_sync()
Praneeth Bajjuri (1):
net: phy: dp83867: perform soft reset and retain established link
Rafael J. Wysocki (1):
Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
Sami Tolvanen (1):
kbuild: add CONFIG_LD_IS_LLD
Sasha Levin (1):
Linux 5.4.129-rc1
Xu Yu (1):
mm, thp: use head page in __migration_entry_wait()
Yang Shi (1):
mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
Yifan Zhang (2):
Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."
Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover
full doorbell."
Yu Kuai (1):
dmaengine: zynqmp_dma: Fix PM reference leak in
zynqmp_dma_alloc_chan_resourc()
Zheng Yongjun (2):
net: ipv4: Remove unneed BUG() function
ping: Check return value of function 'ping_queue_rcv_skb'
Zou Wei (1):
dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe()
Makefile | 4 +-
arch/arm/kernel/setup.c | 16 +-
arch/arm64/Makefile | 10 +-
arch/mips/generic/board-boston.its.S | 10 +-
arch/mips/generic/board-ni169445.its.S | 10 +-
arch/mips/generic/board-ocelot.its.S | 20 +--
arch/mips/generic/board-xilfpga.its.S | 10 +-
arch/mips/generic/vmlinux.its.S | 10 +-
arch/x86/pci/fixup.c | 44 +++++
certs/Kconfig | 9 +
certs/Makefile | 2 +-
certs/blacklist.c | 52 ++++++
certs/blacklist.h | 2 +
certs/common.c | 57 +++++++
certs/common.h | 9 +
certs/system_keyring.c | 55 +-----
drivers/dma/mediatek/mtk-uart-apdma.c | 27 +--
drivers/dma/sh/rcar-dmac.c | 2 +-
drivers/dma/xilinx/zynqmp_dma.c | 2 +-
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +-
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 6 +-
drivers/gpu/drm/nouveau/nouveau_prime.c | 17 +-
drivers/gpu/drm/radeon/radeon_prime.c | 16 +-
drivers/i2c/busses/i2c-robotfuzz-osif.c | 4 +-
drivers/mmc/host/meson-gx-mmc.c | 50 +++++-
drivers/net/caif/caif_serial.c | 1 +
drivers/net/ethernet/qlogic/qed/qed_dcbx.c | 4 +-
drivers/net/ethernet/realtek/r8169_main.c | 2 +-
drivers/net/ethernet/renesas/sh_eth.c | 2 +-
drivers/net/ethernet/xilinx/ll_temac_main.c | 19 ++-
drivers/net/phy/dp83867.c | 6 +-
drivers/net/usb/r8152.c | 2 +-
drivers/net/wireless/mac80211_hwsim.c | 5 +
drivers/pci/pci.c | 16 +-
drivers/pinctrl/stm32/pinctrl-stm32.c | 9 +-
drivers/spi/spi-nxp-fspi.c | 11 +-
fs/nilfs2/sysfs.c | 1 +
include/keys/system_keyring.h | 21 +++
include/linux/huge_mm.h | 8 +-
include/linux/hugetlb.h | 16 --
include/linux/mm.h | 3 +
include/linux/mmdebug.h | 13 ++
include/linux/pagemap.h | 13 +-
include/linux/rmap.h | 3 +-
include/net/sock.h | 10 +-
init/Kconfig | 3 +
kernel/futex.c | 2 +-
kernel/kthread.c | 77 ++++++---
kernel/module.c | 14 +-
mm/huge_memory.c | 56 +++---
mm/hugetlb.c | 5 +-
mm/internal.h | 53 ++++--
mm/memory.c | 41 +++++
mm/migrate.c | 1 +
mm/page_vma_mapped.c | 160 +++++++++++-------
mm/pgtable-generic.c | 4 +-
mm/rmap.c | 50 +++---
mm/truncate.c | 43 +++--
net/core/ethtool.c | 10 +-
net/ipv4/devinet.c | 2 +-
net/ipv4/ping.c | 12 +-
net/ipv6/addrconf.c | 2 +-
net/mac80211/ieee80211_i.h | 2 +-
net/mac80211/rx.c | 9 +-
net/packet/af_packet.c | 32 ++--
net/wireless/util.c | 3 +
scripts/recordmcount.h | 15 +-
security/integrity/Makefile | 3 +-
.../platform_certs/keyring_handler.c | 91 ++++++++++
.../platform_certs/keyring_handler.h | 32 ++++
security/integrity/platform_certs/load_uefi.c | 67 +-------
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
virt/kvm/kvm_main.c | 19 ++-
73 files changed, 959 insertions(+), 466 deletions(-)
create mode 100644 certs/common.c
create mode 100644 certs/common.h
create mode 100644 security/integrity/platform_certs/keyring_handler.c
create mode 100644 security/integrity/platform_certs/keyring_handler.h
--
2.30.2
In "qmp_cooling_devices_register", the count value is initially
QMP_NUM_COOLING_RESOURCES, which is 2. Based on the initial count value,
the memory for cooling_devs is allocated. Then while calling the
"qmp_cooling_device_add" function, count value is post-incremented for
each child node.
This makes the out of bound access to the cooling_dev array. Fix it by
resetting the count value to zero before adding cooling devices.
While at it, let's also free the memory allocated to cooling_dev if no
cooling device is found in DT and during unroll phase.
Cc: stable(a)vger.kernel.org # 5.4
Fixes: 05589b30b21a ("soc: qcom: Extend AOSS QMP driver to support resources that are used to wake up the SoC.")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
Bjorn: I've just compile tested this patch.
drivers/soc/qcom/qcom_aoss.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/soc/qcom/qcom_aoss.c b/drivers/soc/qcom/qcom_aoss.c
index 934fcc4d2b05..98c665411768 100644
--- a/drivers/soc/qcom/qcom_aoss.c
+++ b/drivers/soc/qcom/qcom_aoss.c
@@ -488,6 +488,7 @@ static int qmp_cooling_devices_register(struct qmp *qmp)
if (!qmp->cooling_devs)
return -ENOMEM;
+ count = 0;
for_each_available_child_of_node(np, child) {
if (!of_find_property(child, "#cooling-cells", NULL))
continue;
@@ -497,12 +498,16 @@ static int qmp_cooling_devices_register(struct qmp *qmp)
goto unroll;
}
+ if (!count)
+ devm_kfree(qmp->dev, qmp->cooling_devs);
+
return 0;
unroll:
while (--count >= 0)
thermal_cooling_device_unregister
(qmp->cooling_devs[count].cdev);
+ devm_kfree(qmp->dev, qmp->cooling_devs);
return ret;
}
--
2.25.1
In "qmp_cooling_devices_register", the count value is initially
QMP_NUM_COOLING_RESOURCES, which is 2. Based on the initial count value,
the memory for cooling_devs is allocated. Then while calling the
"qmp_cooling_device_add" function, count value is post-incremented for
each child node.
This makes the out of bound access to the cooling_dev array. Fix it by
passing the QMP_NUM_COOLING_RESOURCES definition to devm_kzalloc() and
initializing the count to 0.
While at it, let's also free the memory allocated to cooling_dev if no
cooling device is found in DT and during unroll phase.
Cc: stable(a)vger.kernel.org # 5.4
Fixes: 05589b30b21a ("soc: qcom: Extend AOSS QMP driver to support resources that are used to wake up the SoC.")
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
Changes in v2:
* Used QMP_NUM_COOLING_RESOURCES directly and initialized count to 0
drivers/soc/qcom/qcom_aoss.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/soc/qcom/qcom_aoss.c b/drivers/soc/qcom/qcom_aoss.c
index 934fcc4d2b05..7b6b94332510 100644
--- a/drivers/soc/qcom/qcom_aoss.c
+++ b/drivers/soc/qcom/qcom_aoss.c
@@ -476,12 +476,12 @@ static int qmp_cooling_device_add(struct qmp *qmp,
static int qmp_cooling_devices_register(struct qmp *qmp)
{
struct device_node *np, *child;
- int count = QMP_NUM_COOLING_RESOURCES;
+ int count = 0;
int ret;
np = qmp->dev->of_node;
- qmp->cooling_devs = devm_kcalloc(qmp->dev, count,
+ qmp->cooling_devs = devm_kcalloc(qmp->dev, QMP_NUM_COOLING_RESOURCES,
sizeof(*qmp->cooling_devs),
GFP_KERNEL);
@@ -497,12 +497,16 @@ static int qmp_cooling_devices_register(struct qmp *qmp)
goto unroll;
}
+ if (!count)
+ devm_kfree(qmp->dev, qmp->cooling_devs);
+
return 0;
unroll:
while (--count >= 0)
thermal_cooling_device_unregister
(qmp->cooling_devs[count].cdev);
+ devm_kfree(qmp->dev, qmp->cooling_devs);
return ret;
}
--
2.25.1
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep
being added back to the mdsc->cap_delay_list. This may result in the
watchdog tainting the kernel with the softlockup flag.
This patch breaks this loop if the caps have been recently (i.e. during
the loop execution). Any new caps added to the list will be handled in
the next run.
Cc: stable(a)vger.kernel.org
Link: https://tracker.ceph.com/issues/46284
Signed-off-by: Luis Henriques <lhenriques(a)suse.de>
---
fs/ceph/caps.c | 17 ++++++++++++++++-
fs/ceph/mds_client.c | 7 ++++---
fs/ceph/super.h | 2 +-
3 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a5e93b185515..c79b8dff25d7 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -4224,11 +4224,19 @@ void ceph_handle_caps(struct ceph_mds_session *session,
/*
* Delayed work handler to process end of delayed cap release LRU list.
+ *
+ * If new caps are added to the list while processing it, these won't get
+ * processed in this run. In this case, the ci->i_hold_caps_max will be
+ * returned so that the work can be scheduled accordingly.
*/
-void ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
+unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
{
struct inode *inode;
struct ceph_inode_info *ci;
+ struct ceph_mount_options *opt = mdsc->fsc->mount_options;
+ unsigned long delay_max = opt->caps_wanted_delay_max * HZ;
+ unsigned long loop_start = jiffies;
+ unsigned long delay = 0;
dout("check_delayed_caps\n");
spin_lock(&mdsc->cap_delay_lock);
@@ -4236,6 +4244,11 @@ void ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
ci = list_first_entry(&mdsc->cap_delay_list,
struct ceph_inode_info,
i_cap_delay_list);
+ if (time_before(loop_start, ci->i_hold_caps_max - delay_max)) {
+ dout("%s caps added recently. Exiting loop", __func__);
+ delay = ci->i_hold_caps_max;
+ break;
+ }
if ((ci->i_ceph_flags & CEPH_I_FLUSH) == 0 &&
time_before(jiffies, ci->i_hold_caps_max))
break;
@@ -4252,6 +4265,8 @@ void ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
}
}
spin_unlock(&mdsc->cap_delay_lock);
+
+ return delay;
}
/*
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 08c76bf57fb1..5f1bd7f9dce1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4518,11 +4518,12 @@ static void schedule_delayed(struct ceph_mds_client *mdsc, unsigned long delay)
static void delayed_work(struct work_struct *work)
{
- int i;
struct ceph_mds_client *mdsc =
container_of(work, struct ceph_mds_client, delayed_work.work);
+ unsigned long delay;
int renew_interval;
int renew_caps;
+ int i;
dout("mdsc delayed_work\n");
@@ -4562,7 +4563,7 @@ static void delayed_work(struct work_struct *work)
}
mutex_unlock(&mdsc->mutex);
- ceph_check_delayed_caps(mdsc);
+ delay = ceph_check_delayed_caps(mdsc);
ceph_queue_cap_reclaim_work(mdsc);
@@ -4570,7 +4571,7 @@ static void delayed_work(struct work_struct *work)
maybe_recover_session(mdsc);
- schedule_delayed(mdsc, 0);
+ schedule_delayed(mdsc, delay);
}
int ceph_mdsc_init(struct ceph_fs_client *fsc)
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 839e6b0239ee..3b5207c82767 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1170,7 +1170,7 @@ extern void ceph_flush_snaps(struct ceph_inode_info *ci,
extern bool __ceph_should_report_size(struct ceph_inode_info *ci);
extern void ceph_check_caps(struct ceph_inode_info *ci, int flags,
struct ceph_mds_session *session);
-extern void ceph_check_delayed_caps(struct ceph_mds_client *mdsc);
+extern unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc);
extern void ceph_flush_dirty_caps(struct ceph_mds_client *mdsc);
extern int ceph_drop_caps_for_unlink(struct inode *inode);
extern int ceph_encode_inode_release(void **p, struct inode *inode,
On 6/28/21 3:47 PM, Kajol Jain wrote:
> The performance reporting driver added cpu hotplug
> feature but it didn't add pmu migration call in cpu
> offline function.
> This can create an issue incase the current designated
> cpu being used to collect fme pmu data got offline,
> as based on current code we are not migrating fme pmu to
> new target cpu. Because of that perf will still try to
> fetch data from that offline cpu and hence we will not
> get counter data.
>
> Patch fixed this issue by adding pmu_migrate_context call
> in fme_perf_offline_cpu function.
>
Adding stable(a)vger.kernel.org in cc list as suggested by Moritz Fischer.
Thanks,
Kajol Jain
> Fixes: 724142f8c42a ("fpga: dfl: fme: add performance reporting support")
> Tested-by: Xu Yilun <yilun.xu(a)intel.com>
> Signed-off-by: Kajol Jain <kjain(a)linux.ibm.com>
> ---
> drivers/fpga/dfl-fme-perf.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> ---
> Changelog:
> - Remove RFC tag
> - Did nits changes on subject and commit message as suggested by Xu Yilun
> - Added Tested-by tag
> - Link to rfc patch: https://lkml.org/lkml/2021/6/28/112
> ---
> diff --git a/drivers/fpga/dfl-fme-perf.c b/drivers/fpga/dfl-fme-perf.c
> index 4299145ef347..b9a54583e505 100644
> --- a/drivers/fpga/dfl-fme-perf.c
> +++ b/drivers/fpga/dfl-fme-perf.c
> @@ -953,6 +953,10 @@ static int fme_perf_offline_cpu(unsigned int cpu, struct hlist_node *node)
> return 0;
>
> priv->cpu = target;
> +
> + /* Migrate fme_perf pmu events to the new target cpu */
> + perf_pmu_migrate_context(&priv->pmu, cpu, target);
> +
> return 0;
> }
>
>
From: Pawel Laszczak <pawell(a)cadence.com>
For delayed status phase, the usb_gadget->state was set
to USB_STATE_ADDRESS and it has never been updated to
USB_STATE_CONFIGURED.
Patch updates the gadget state to correct USB_STATE_CONFIGURED.
As a result of this bug the controller was not able to enter to
Test Mode while using MSC function.
Cc: <stable(a)vger.kernel.org>
Fixes: 7733f6c32e36 ("usb: cdns3: Add Cadence USB3 DRD Driver")
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
drivers/usb/cdns3/cdns3-ep0.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/usb/cdns3/cdns3-ep0.c b/drivers/usb/cdns3/cdns3-ep0.c
index 9a17802275d5..ec5bfd8944c3 100644
--- a/drivers/usb/cdns3/cdns3-ep0.c
+++ b/drivers/usb/cdns3/cdns3-ep0.c
@@ -731,6 +731,7 @@ static int cdns3_gadget_ep0_queue(struct usb_ep *ep,
request->actual = 0;
priv_dev->status_completion_no_call = true;
priv_dev->pending_status_request = request;
+ usb_gadget_set_state(&priv_dev->gadget, USB_STATE_CONFIGURED);
spin_unlock_irqrestore(&priv_dev->lock, flags);
/*
--
2.25.1
From: Desmond Cheong Zhi Xi <desmondcheongzx(a)gmail.com>
Subject: ntfs: fix validity check for file name attribute
When checking the file name attribute, we want to ensure that it fits
within the bounds of ATTR_RECORD. To do this, we should check that (attr
record + file name offset + file name length) < (attr record + attr record
length).
However, the original check did not include the file name offset in the
calculation. This means that corrupted on-disk metadata might not caught
by the incorrect file name check, and lead to an invalid memory access.
An example can be seen in the crash report of a memory corruption error
found by Syzbot:
https://syzkaller.appspot.com/bug?id=a1a1e379b225812688566745c3e2f7242bffc2…
Adding the file name offset to the validity check fixes this error and
passes the Syzbot reproducer test.
Link: https://lkml.kernel.org/r/20210614050540.289494-1-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx(a)gmail.com>
Reported-by: syzbot+213ac8bb98f7f4420840(a)syzkaller.appspotmail.com
Tested-by: syzbot+213ac8bb98f7f4420840(a)syzkaller.appspotmail.com
Acked-by: Anton Altaparmakov <anton(a)tuxera.com>
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ntfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/ntfs/inode.c~ntfs-fix-validity-check-for-file-name-attribute
+++ a/fs/ntfs/inode.c
@@ -477,7 +477,7 @@ err_corrupt_attr:
}
file_name_attr = (FILE_NAME_ATTR*)((u8*)attr +
le16_to_cpu(attr->data.resident.value_offset));
- p2 = (u8*)attr + le32_to_cpu(attr->data.resident.value_length);
+ p2 = (u8 *)file_name_attr + le32_to_cpu(attr->data.resident.value_length);
if (p2 < (u8*)attr || p2 > p)
goto err_corrupt_attr;
/* This attribute is ok, but is it in the $Extend directory? */
_
From: Mel Gorman <mgorman(a)techsingularity.net>
Subject: mm/page_alloc: correct return value of populated elements if bulk array is populated
Dave Jones reported the following
This made it into 5.13 final, and completely breaks NFSD for me
(Serving tcp v3 mounts). Existing mounts on clients hang, as do
new mounts from new clients. Rebooting the server back to rc7
everything recovers.
The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure. Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.
Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net
Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Signed-off-by: Mel Gorman <mgorman(a)techsingularity.net>
Reported-by: Dave Jones <davej(a)codemonkey.org.uk>
Tested-by: Dave Jones <davej(a)codemonkey.org.uk>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: Jesper Dangaard Brouer <brouer(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [5.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/page_alloc.c~mm-page_alloc-correct-return-value-of-populated-elements-if-bulk-array-is-populated
+++ a/mm/page_alloc.c
@@ -5058,7 +5058,7 @@ unsigned long __alloc_pages_bulk(gfp_t g
/* Already populated array? */
if (unlikely(page_array && nr_pages - nr_populated == 0))
- return 0;
+ return nr_populated;
/* Use the single page allocator for one page. */
if (nr_pages - nr_populated == 1)
_
From: Mike Rapoport <rppt(a)linux.ibm.com>
Subject: mm/page_alloc: fix memory map initialization for descending nodes
On systems with memory nodes sorted in descending order, for instance Dell
Precision WorkStation T5500, the struct pages for higher PFNs and
respectively lower nodes, could be overwritten by the initialization of
struct pages corresponding to the holes in the memory sections.
For example for the below memory layout
[ 0.245624] Early memory node ranges
[ 0.248496] node 1: [mem 0x0000000000001000-0x0000000000090fff]
[ 0.251376] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[ 0.254256] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
[ 0.257144] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in
the middle of a section and will be considered as a hole during the
initialization of the last section in node 1.
The wrong initialization of the memory map causes panic on boot when
CONFIG_DEBUG_VM is enabled.
Reorder loop order of the memory map initialization so that the outer loop
will always iterate over populated memory regions in the ascending order
and the inner loop will select the zone corresponding to the PFN range.
This way initialization of the struct pages for the memory holes will be
always done for the ranges that are actually not populated.
[akpm(a)linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org
Fixes: 0740a50b9baa ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
Signed-off-by: Mike Rapoport <rppt(a)linux.ibm.com>
Cc: Boris Petkov <bp(a)alien8.de>
Cc: Robert Shteynfeld <robert.shteynfeld(a)gmail.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 1
mm/page_alloc.c | 96 ++++++++++++++++++++++++++-----------------
2 files changed, 59 insertions(+), 38 deletions(-)
--- a/include/linux/mm.h~mm-page_alloc-fix-memory-map-initialization-for-descending-nodes
+++ a/include/linux/mm.h
@@ -2474,7 +2474,6 @@ extern void set_dma_reserve(unsigned lon
extern void memmap_init_range(unsigned long, int, unsigned long,
unsigned long, unsigned long, enum meminit_context,
struct vmem_altmap *, int migratetype);
-extern void memmap_init_zone(struct zone *zone);
extern void setup_per_zone_wmarks(void);
extern int __meminit init_per_zone_wmark_min(void);
extern void mem_init(void);
--- a/mm/page_alloc.c~mm-page_alloc-fix-memory-map-initialization-for-descending-nodes
+++ a/mm/page_alloc.c
@@ -6400,7 +6400,7 @@ void __ref memmap_init_zone_device(struc
return;
/*
- * The call to memmap_init_zone should have already taken care
+ * The call to memmap_init should have already taken care
* of the pages reserved for the memmap, so we can just jump to
* the end of that region and start processing the device pages.
*/
@@ -6465,7 +6465,7 @@ static void __meminit zone_init_free_lis
/*
* Only struct pages that correspond to ranges defined by memblock.memory
* are zeroed and initialized by going through __init_single_page() during
- * memmap_init_zone().
+ * memmap_init_zone_range().
*
* But, there could be struct pages that correspond to holes in
* memblock.memory. This can happen because of the following reasons:
@@ -6484,9 +6484,9 @@ static void __meminit zone_init_free_lis
* zone/node above the hole except for the trailing pages in the last
* section that will be appended to the zone/node below.
*/
-static u64 __meminit init_unavailable_range(unsigned long spfn,
- unsigned long epfn,
- int zone, int node)
+static void __init init_unavailable_range(unsigned long spfn,
+ unsigned long epfn,
+ int zone, int node)
{
unsigned long pfn;
u64 pgcnt = 0;
@@ -6502,56 +6502,77 @@ static u64 __meminit init_unavailable_ra
pgcnt++;
}
- return pgcnt;
+ if (pgcnt)
+ pr_info("On node %d, zone %s: %lld pages in unavailable ranges",
+ node, zone_names[zone], pgcnt);
}
#else
-static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
- int zone, int node)
+static inline void init_unavailable_range(unsigned long spfn,
+ unsigned long epfn,
+ int zone, int node)
{
- return 0;
}
#endif
-void __meminit __weak memmap_init_zone(struct zone *zone)
+static void __init memmap_init_zone_range(struct zone *zone,
+ unsigned long start_pfn,
+ unsigned long end_pfn,
+ unsigned long *hole_pfn)
{
unsigned long zone_start_pfn = zone->zone_start_pfn;
unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
- int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
- static unsigned long hole_pfn;
+ int nid = zone_to_nid(zone), zone_id = zone_idx(zone);
+
+ start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
+ end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
+
+ if (start_pfn >= end_pfn)
+ return;
+
+ memmap_init_range(end_pfn - start_pfn, nid, zone_id, start_pfn,
+ zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+
+ if (*hole_pfn < start_pfn)
+ init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid);
+
+ *hole_pfn = end_pfn;
+}
+
+static void __init memmap_init(void)
+{
unsigned long start_pfn, end_pfn;
- u64 pgcnt = 0;
+ unsigned long hole_pfn = 0;
+ int i, j, zone_id, nid;
- for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
- start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
- end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+ struct pglist_data *node = NODE_DATA(nid);
+
+ for (j = 0; j < MAX_NR_ZONES; j++) {
+ struct zone *zone = node->node_zones + j;
+
+ if (!populated_zone(zone))
+ continue;
- if (end_pfn > start_pfn)
- memmap_init_range(end_pfn - start_pfn, nid,
- zone_id, start_pfn, zone_end_pfn,
- MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
-
- if (hole_pfn < start_pfn)
- pgcnt += init_unavailable_range(hole_pfn, start_pfn,
- zone_id, nid);
- hole_pfn = end_pfn;
+ memmap_init_zone_range(zone, start_pfn, end_pfn,
+ &hole_pfn);
+ zone_id = j;
+ }
}
#ifdef CONFIG_SPARSEMEM
/*
- * Initialize the hole in the range [zone_end_pfn, section_end].
- * If zone boundary falls in the middle of a section, this hole
- * will be re-initialized during the call to this function for the
- * higher zone.
+ * Initialize the memory map for hole in the range [memory_end,
+ * section_end].
+ * Append the pages in this hole to the highest zone in the last
+ * node.
+ * The call to init_unavailable_range() is outside the ifdef to
+ * silence the compiler warining about zone_id set but not used;
+ * for FLATMEM it is a nop anyway
*/
- end_pfn = round_up(zone_end_pfn, PAGES_PER_SECTION);
+ end_pfn = round_up(end_pfn, PAGES_PER_SECTION);
if (hole_pfn < end_pfn)
- pgcnt += init_unavailable_range(hole_pfn, end_pfn,
- zone_id, nid);
#endif
-
- if (pgcnt)
- pr_info(" %s zone: %llu pages in unavailable ranges\n",
- zone->name, pgcnt);
+ init_unavailable_range(hole_pfn, end_pfn, zone_id, nid);
}
static int zone_batchsize(struct zone *zone)
@@ -7254,7 +7275,6 @@ static void __init free_area_init_core(s
set_pageblock_order();
setup_usemap(zone);
init_currently_empty_zone(zone, zone->zone_start_pfn, size);
- memmap_init_zone(zone);
}
}
@@ -7780,6 +7800,8 @@ void __init free_area_init(unsigned long
node_set_state(nid, N_MEMORY);
check_for_memory(pgdat, nid);
}
+
+ memmap_init();
}
static int __init cmdline_parse_core(char *p, unsigned long *core,
_
From: Jann Horn <jannh(a)google.com>
Subject: mm/gup: fix try_grab_compound_head() race with split_huge_page()
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent freeing
of page tables (via local_irq_save()), but not against concurrent TLB
flushes, freeing of data pages, or splitting of compound pages.
Because no reference is held to the page when try_grab_compound_head() is
called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).
The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound page
may have been split up (either explicitly through split_huge_page() or by
freeing the compound page to the buddy allocator and then allocating its
individual order-0 pages). If that happens, get_user_pages_fast() may end
up returning the right page but lifting the refcount on a now-unrelated
page, leading to use-after-free of pages.
To fix it: Re-check whether the pages still belong together after lifting
the refcount on the head page. Move anything else that checks
compound_head(page) below the refcount increment.
This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
that PV TLB flushes are used instead of IPIs).
As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
a warning and bailout. Since the existing code only performed the BUG_ON
check on DEBUG_VM kernels, ensure that the new code also only performs the
check under that configuration - I don't want to mix two logically
separate changes together too much. The macro VM_WARN_ON_ONCE_PAGE()
doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
block. An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
definition for !DEBUG_VM such that it always returns false, but since that
would differ from the behavior of the normal WARN macros, it might be too
confusing for readers.
Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: John Hubbard <jhubbard(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Kirill A. Shutemov <kirill(a)shutemov.name>
Cc: Jan Kara <jack(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 58 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 43 insertions(+), 15 deletions(-)
--- a/mm/gup.c~mm-gup-fix-try_grab_compound_head-race-with-split_huge_page
+++ a/mm/gup.c
@@ -44,6 +44,23 @@ static void hpage_pincount_sub(struct pa
atomic_sub(refs, compound_pincount_ptr(page));
}
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+#ifdef CONFIG_DEBUG_VM
+ if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
+ return;
+#endif
+
+ /*
+ * Calling put_page() for each ref is unnecessarily slow. Only the last
+ * ref needs a put_page().
+ */
+ if (refs > 1)
+ page_ref_sub(page, refs - 1);
+ put_page(page);
+}
+
/*
* Return the compound head page with ref appropriately incremented,
* or NULL if that failed.
@@ -56,6 +73,21 @@ static inline struct page *try_get_compo
return NULL;
if (unlikely(!page_cache_add_speculative(head, refs)))
return NULL;
+
+ /*
+ * At this point we have a stable reference to the head page; but it
+ * could be that between the compound_head() lookup and the refcount
+ * increment, the compound page was split, in which case we'd end up
+ * holding a reference on a page that has nothing to do with the page
+ * we were given anymore.
+ * So now that the head page is stable, recheck that the pages still
+ * belong together.
+ */
+ if (unlikely(compound_head(page) != head)) {
+ put_page_refs(head, refs);
+ return NULL;
+ }
+
return head;
}
@@ -96,6 +128,14 @@ __maybe_unused struct page *try_grab_com
return NULL;
/*
+ * CAUTION: Don't use compound_head() on the page before this
+ * point, the result won't be stable.
+ */
+ page = try_get_compound_head(page, refs);
+ if (!page)
+ return NULL;
+
+ /*
* When pinning a compound page of order > 1 (which is what
* hpage_pincount_available() checks for), use an exact count to
* track it, via hpage_pincount_add/_sub().
@@ -103,15 +143,10 @@ __maybe_unused struct page *try_grab_com
* However, be sure to *also* increment the normal page refcount
* field at least once, so that the page really is pinned.
*/
- if (!hpage_pincount_available(page))
- refs *= GUP_PIN_COUNTING_BIAS;
-
- page = try_get_compound_head(page, refs);
- if (!page)
- return NULL;
-
if (hpage_pincount_available(page))
hpage_pincount_add(page, refs);
+ else
+ page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
orig_refs);
@@ -135,14 +170,7 @@ static void put_compound_head(struct pag
refs *= GUP_PIN_COUNTING_BIAS;
}
- VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
- /*
- * Calling put_page() for each ref is unnecessarily slow. Only the last
- * ref needs a put_page().
- */
- if (refs > 1)
- page_ref_sub(page, refs - 1);
- put_page(page);
+ put_page_refs(page, refs);
}
/**
_
The patch titled
Subject: ntfs: fix validity check for file name attribute
has been added to the -mm tree. Its filename is
ntfs-fix-validity-check-for-file-name-attribute.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/ntfs-fix-validity-check-for-file-…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/ntfs-fix-validity-check-for-file-…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Desmond Cheong Zhi Xi <desmondcheongzx(a)gmail.com>
Subject: ntfs: fix validity check for file name attribute
When checking the file name attribute, we want to ensure that it fits
within the bounds of ATTR_RECORD. To do this, we should check that (attr
record + file name offset + file name length) < (attr record + attr record
length).
However, the original check did not include the file name offset in the
calculation. This means that corrupted on-disk metadata might not caught
by the incorrect file name check, and lead to an invalid memory access.
An example can be seen in the crash report of a memory corruption error
found by Syzbot:
https://syzkaller.appspot.com/bug?id=a1a1e379b225812688566745c3e2f7242bffc2…
Adding the file name offset to the validity check fixes this error and
passes the Syzbot reproducer test.
Link: https://lkml.kernel.org/r/20210614050540.289494-1-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx(a)gmail.com>
Reported-by: syzbot+213ac8bb98f7f4420840(a)syzkaller.appspotmail.com
Tested-by: syzbot+213ac8bb98f7f4420840(a)syzkaller.appspotmail.com
Acked-by: Anton Altaparmakov <anton(a)tuxera.com>
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ntfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/ntfs/inode.c~ntfs-fix-validity-check-for-file-name-attribute
+++ a/fs/ntfs/inode.c
@@ -477,7 +477,7 @@ err_corrupt_attr:
}
file_name_attr = (FILE_NAME_ATTR*)((u8*)attr +
le16_to_cpu(attr->data.resident.value_offset));
- p2 = (u8*)attr + le32_to_cpu(attr->data.resident.value_length);
+ p2 = (u8 *)file_name_attr + le32_to_cpu(attr->data.resident.value_length);
if (p2 < (u8*)attr || p2 > p)
goto err_corrupt_attr;
/* This attribute is ok, but is it in the $Extend directory? */
_
Patches currently in -mm which might be from desmondcheongzx(a)gmail.com are
ntfs-fix-validity-check-for-file-name-attribute.patch
The patch titled
Subject: mm/page_alloc: correct return value of populated elements if bulk array is populated
has been added to the -mm tree. Its filename is
mm-page_alloc-correct-return-value-of-populated-elements-if-bulk-array-is-populated.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-correct-return-valu…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-correct-return-valu…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Mel Gorman <mgorman(a)techsingularity.net>
Subject: mm/page_alloc: correct return value of populated elements if bulk array is populated
Dave Jones reported the following
This made it into 5.13 final, and completely breaks NFSD for me
(Serving tcp v3 mounts). Existing mounts on clients hang, as do
new mounts from new clients. Rebooting the server back to rc7
everything recovers.
The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure. Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.
Link: https://lkml.kernel.org/r/20210628150219.GC3840@techsingularity.net
Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Signed-off-by: Mel Gorman <mgorman(a)techsingularity.net>
Reported-by: Dave Jones <davej(a)codemonkey.org.uk>
Tested-by: Dave Jones <davej(a)codemonkey.org.uk>
Cc: Dan Carpenter <dan.carpenter(a)oracle.com>
Cc: Jesper Dangaard Brouer <brouer(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [5.13+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/page_alloc.c~mm-page_alloc-correct-return-value-of-populated-elements-if-bulk-array-is-populated
+++ a/mm/page_alloc.c
@@ -5058,7 +5058,7 @@ unsigned long __alloc_pages_bulk(gfp_t g
/* Already populated array? */
if (unlikely(page_array && nr_pages - nr_populated == 0))
- return 0;
+ return nr_populated;
/* Use the single page allocator for one page. */
if (nr_pages - nr_populated == 1)
_
Patches currently in -mm which might be from mgorman(a)techsingularity.net are
mm-page_alloc-correct-return-value-of-populated-elements-if-bulk-array-is-populated.patch
mm-page_alloc-split-per-cpu-page-lists-and-zone-stats.patch
mm-page_alloc-split-per-cpu-page-lists-and-zone-stats-fix.patch
mm-page_alloc-split-per-cpu-page-lists-and-zone-stats-fix-fix.patch
mm-page_alloc-convert-per-cpu-list-protection-to-local_lock.patch
mm-page_alloc-convert-per-cpu-list-protection-to-local_lock-fix.patch
mm-vmstat-convert-numa-statistics-to-basic-numa-counters.patch
mm-vmstat-inline-numa-event-counter-updates.patch
mm-page_alloc-batch-the-accounting-updates-in-the-bulk-allocator.patch
mm-page_alloc-reduce-duration-that-irqs-are-disabled-for-vm-counters.patch
mm-page_alloc-explicitly-acquire-the-zone-lock-in-__free_pages_ok.patch
mm-page_alloc-avoid-conflating-irqs-disabled-with-zone-lock.patch
mm-page_alloc-update-pgfree-outside-the-zone-lock-in-__free_pages_ok.patch
mm-page_alloc-delete-vmpercpu_pagelist_fraction.patch
mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch.patch
mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch-fix-2.patch
mm-page_alloc-adjust-pcp-high-after-cpu-hotplug-events.patch
mm-page_alloc-scale-the-number-of-pages-that-are-batch-freed.patch
mm-page_alloc-limit-the-number-of-pages-on-pcp-lists-when-reclaim-is-active.patch
mm-page_alloc-introduce-vmpercpu_pagelist_high_fraction.patch
mm-page_alloc-introduce-vmpercpu_pagelist_high_fraction-fix.patch
mm-page_alloc-move-free_the_page.patch
mm-page_alloc-allow-high-order-pages-to-be-stored-on-the-per-cpu-lists.patch
mm-page_alloc-split-pcp-high-across-all-online-cpus-for-cpuless-nodes.patch
mm-vmscan-remove-kerneldoc-like-comment-from-isolate_lru_pages.patch
mm-vmalloc-include-header-for-prototype-of-set_iounmap_nonlazy.patch
mm-page_alloc-make-should_fail_alloc_page-a-static-function-should_fail_alloc_page-static.patch
mm-mapping_dirty_helpers-remove-double-note-in-kerneldoc.patch
mm-memcontrolc-fix-kerneldoc-comment-for-mem_cgroup_calculate_protection.patch
mm-memory_hotplug-fix-kerneldoc-comment-for-__try_online_node.patch
mm-memory_hotplug-fix-kerneldoc-comment-for-__remove_memory.patch
mm-zbud-add-kerneldoc-fields-for-zbud_pool.patch
mm-z3fold-add-kerneldoc-fields-for-z3fold_pool.patch
mm-swap-make-swap_address_space-an-inline-function.patch
mm-mmap_lock-remove-dead-code-for-config_tracing-configurations.patch
mm-page_alloc-move-prototype-for-find_suitable_fallback.patch
mm-swap-make-node_data-an-inline-function-on-config_flatmem.patch
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From edc0b0bccc9c80d9a44d3002dcca94984b25e7cf Mon Sep 17 00:00:00 2001
From: Mark Bloch <mbloch(a)nvidia.com>
Date: Mon, 7 Jun 2021 11:03:12 +0300
Subject: [PATCH] RDMA/mlx5: Block FDB rules when not in switchdev mode
Allow creating FDB steering rules only when in switchdev mode.
The only software model where a userspace application can manipulate
FDB entries is when it manages the eswitch. This is only possible in
switchdev mode where we expose a single RDMA device with representors
for all the vports that are connected to the eswitch.
Fixes: 52438be44112 ("RDMA/mlx5: Allow inserting a steering rule to the FDB")
Link: https://lore.kernel.org/r/e928ae7c58d07f104716a2a8d730963d1bd01204.16230529…
Reviewed-by: Maor Gottlieb <maorg(a)nvidia.com>
Signed-off-by: Mark Bloch <mbloch(a)nvidia.com>
Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
diff --git a/drivers/infiniband/hw/mlx5/fs.c b/drivers/infiniband/hw/mlx5/fs.c
index 2fc6a60c4e77..f84441ff0c81 100644
--- a/drivers/infiniband/hw/mlx5/fs.c
+++ b/drivers/infiniband/hw/mlx5/fs.c
@@ -2134,6 +2134,12 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_FLOW_MATCHER_CREATE)(
if (err)
goto end;
+ if (obj->ns_type == MLX5_FLOW_NAMESPACE_FDB &&
+ mlx5_eswitch_mode(dev->mdev) != MLX5_ESWITCH_OFFLOADS) {
+ err = -EINVAL;
+ goto end;
+ }
+
uobj->object = obj;
obj->mdev = dev->mdev;
atomic_set(&obj->usecnt, 0);
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 7dd4d9ded249..6e31cedf0b6c 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -216,6 +216,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 67a73ea0a615..5e51a39a0c27 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -216,6 +216,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 5be3d6b7991b..a46fbe2d2ee6 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -216,6 +216,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 45c8bf39ad23..acf0c244141f 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -216,6 +216,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 70a28f6fb1d0..2332b245b182 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -218,6 +218,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index fd4b582110b2..77961f058367 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -220,6 +220,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: ManYi Li <limanyi(a)uniontech.com>
[ Upstream commit 7dd753ca59d6c8cc09aa1ed24f7657524803c7f3 ]
Handle a reported media event code of 3. This indicates that the media has
been removed from the drive and user intervention is required to proceed.
Return DISK_EVENT_EJECT_REQUEST in that case.
Link: https://lore.kernel.org/r/20210611094402.23884-1-limanyi@uniontech.com
Signed-off-by: ManYi Li <limanyi(a)uniontech.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/scsi/sr.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index e4633b84c556..7815ed642d43 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -220,6 +220,8 @@ static unsigned int sr_get_events(struct scsi_device *sdev)
return DISK_EVENT_EJECT_REQUEST;
else if (med->media_event_code == 2)
return DISK_EVENT_MEDIA_CHANGE;
+ else if (med->media_event_code == 3)
+ return DISK_EVENT_EJECT_REQUEST;
return 0;
}
--
2.30.2
From: Kan Liang <kan.liang(a)linux.intel.com>
The uncore unit with the type ID 0 and the unit ID 0 is missed.
The table3 of the uncore unit maybe 0. The
uncore_discovery_invalid_unit() mistakenly treated it as an invalid
value.
Remove the !unit.table3 check.
Fixes: edae1f06c2cd ("perf/x86/intel/uncore: Parse uncore discovery tables")
Reviewed-by: Andi Kleen <ak(a)linux.intel.com>
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/events/intel/uncore_discovery.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/uncore_discovery.h b/arch/x86/events/intel/uncore_discovery.h
index 7280c8a..6d735611 100644
--- a/arch/x86/events/intel/uncore_discovery.h
+++ b/arch/x86/events/intel/uncore_discovery.h
@@ -30,7 +30,7 @@
#define uncore_discovery_invalid_unit(unit) \
- (!unit.table1 || !unit.ctl || !unit.table3 || \
+ (!unit.table1 || !unit.ctl || \
unit.table1 == -1ULL || unit.ctl == -1ULL || \
unit.table3 == -1ULL)
--
2.7.4
We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing. The reason for this is we need to see any ordered extents
that were created by our delalloc flushing. However we're waking up
before we do the submit work, which is before we create the ordered
extents. This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely. Fix this by waking us up after we've done the work to
create ordered extents.
cc: stable(a)vger.kernel.org
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
---
fs/btrfs/inode.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b1f02e3fea5d..e388153c4ae4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1290,11 +1290,6 @@ static noinline void async_cow_submit(struct btrfs_work *work)
nr_pages = (async_chunk->end - async_chunk->start + PAGE_SIZE) >>
PAGE_SHIFT;
- /* atomic_sub_return implies a barrier */
- if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
- 5 * SZ_1M)
- cond_wake_up_nomb(&fs_info->async_submit_wait);
-
/*
* ->inode could be NULL if async_chunk_start has failed to compress,
* in which case we don't have anything to submit, yet we need to
@@ -1303,6 +1298,11 @@ static noinline void async_cow_submit(struct btrfs_work *work)
*/
if (async_chunk->inode)
submit_compressed_extents(async_chunk);
+
+ /* atomic_sub_return implies a barrier */
+ if (atomic_sub_return(nr_pages, &fs_info->async_delalloc_pages) <
+ 5 * SZ_1M)
+ cond_wake_up_nomb(&fs_info->async_submit_wait);
}
static noinline void async_cow_free(struct btrfs_work *work)
--
2.26.3
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be. With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.
Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations. Now assume we have 128MiB of delalloc
outstanding. With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.
Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space. This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.
Fix this two ways. First, change the calculations to be a fraction of
the total delalloc bytes on the system. Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.
Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once. This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.
I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again. This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.
CC: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
---
fs/btrfs/ctree.h | 9 +++++----
fs/btrfs/space-info.c | 35 ++++++++++++++++++++++++++---------
include/trace/events/btrfs.h | 1 +
3 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d7ef4d7d2c1a..232ff1a49ca6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2783,10 +2783,11 @@ enum btrfs_flush_state {
FLUSH_DELAYED_REFS = 4,
FLUSH_DELALLOC = 5,
FLUSH_DELALLOC_WAIT = 6,
- ALLOC_CHUNK = 7,
- ALLOC_CHUNK_FORCE = 8,
- RUN_DELAYED_IPUTS = 9,
- COMMIT_TRANS = 10,
+ FLUSH_DELALLOC_FULL = 7,
+ ALLOC_CHUNK = 8,
+ ALLOC_CHUNK_FORCE = 9,
+ RUN_DELAYED_IPUTS = 10,
+ COMMIT_TRANS = 11,
};
int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index af161eb808a2..0c539a94c6d9 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -494,6 +494,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
long time_left;
int loops;
+ delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
+ ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
+
/* Calc the number of the pages we need flush for space reservation */
if (to_reclaim == U64_MAX) {
items = U64_MAX;
@@ -501,19 +504,21 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
/*
* to_reclaim is set to however much metadata we need to
* reclaim, but reclaiming that much data doesn't really track
- * exactly, so increase the amount to reclaim by 2x in order to
- * make sure we're flushing enough delalloc to hopefully reclaim
- * some metadata reservations.
+ * exactly. What we really want to do is reclaim full inode's
+ * worth of reservations, however that's not available to us
+ * here. We will take a fraction of the delalloc bytes for our
+ * flushing loops and hope for the best. Delalloc will expand
+ * the amount we write to cover an entire dirty extent, which
+ * will reclaim the metadata reservation for that range. If
+ * it's not enough subsequent flush stages will be more
+ * aggressive.
*/
+ to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
- to_reclaim = items * EXTENT_SIZE_PER_ITEM;
}
trans = (struct btrfs_trans_handle *)current->journal_info;
- delalloc_bytes = percpu_counter_sum_positive(
- &fs_info->delalloc_bytes);
- ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
if (delalloc_bytes == 0 && ordered_bytes == 0)
return;
@@ -596,8 +601,11 @@ static void flush_space(struct btrfs_fs_info *fs_info,
break;
case FLUSH_DELALLOC:
case FLUSH_DELALLOC_WAIT:
+ case FLUSH_DELALLOC_FULL:
+ if (state == FLUSH_DELALLOC_FULL)
+ num_bytes = U64_MAX;
shrink_delalloc(fs_info, space_info, num_bytes,
- state == FLUSH_DELALLOC_WAIT, for_preempt);
+ state != FLUSH_DELALLOC, for_preempt);
break;
case FLUSH_DELAYED_REFS_NR:
case FLUSH_DELAYED_REFS:
@@ -907,6 +915,14 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
commit_cycles--;
}
+ /*
+ * We do not want to empty the system of delalloc unless we're
+ * under heavy pressure, so allow one trip through the flushing
+ * logic before we start doing a FLUSH_DELALLOC_FULL.
+ */
+ if (flush_state == FLUSH_DELALLOC_FULL && !commit_cycles)
+ flush_state++;
+
/*
* We don't want to force a chunk allocation until we've tried
* pretty hard to reclaim space. Think of the case where we
@@ -1070,7 +1086,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
* so if we now have space to allocate do the force chunk allocation.
*/
static const enum btrfs_flush_state data_flush_states[] = {
- FLUSH_DELALLOC_WAIT,
+ FLUSH_DELALLOC_FULL,
RUN_DELAYED_IPUTS,
COMMIT_TRANS,
ALLOC_CHUNK_FORCE,
@@ -1159,6 +1175,7 @@ static const enum btrfs_flush_state evict_flush_states[] = {
FLUSH_DELAYED_REFS,
FLUSH_DELALLOC,
FLUSH_DELALLOC_WAIT,
+ FLUSH_DELALLOC_FULL,
ALLOC_CHUNK,
COMMIT_TRANS,
};
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 3d81ba8c37b9..ddf5c250726c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -94,6 +94,7 @@ struct btrfs_space_info;
EM( FLUSH_DELAYED_ITEMS, "FLUSH_DELAYED_ITEMS") \
EM( FLUSH_DELALLOC, "FLUSH_DELALLOC") \
EM( FLUSH_DELALLOC_WAIT, "FLUSH_DELALLOC_WAIT") \
+ EM( FLUSH_DELALLOC_FULL, "FLUSH_DELALLOC_FULL") \
EM( FLUSH_DELAYED_REFS_NR, "FLUSH_DELAYED_REFS_NR") \
EM( FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS") \
EM( ALLOC_CHUNK, "ALLOC_CHUNK") \
--
2.26.3
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
workqueue and it can be kept looping for quite some time if caps keep being
added back to the mdsc->cap_delay_list. This may result in the watchdog
tainting the kernel with the softlockup flag.
This patch re-arranges the loop through the caps list so that it initially
removes all the caps from list, adding them to a temporary list. And then, with
less locking contention, it will eventually call the ceph_check_caps() for each
inode. Any caps added to the list in the meantime will be handled in the next
run.
Cc: stable(a)vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques(a)suse.de>
---
Hi Jeff!
So, I've not based this patch on top of your patchset that gets rid of
ceph_async_iput() so that it will make it easier to backport it for stable
kernels. Of course I'm not 100% this classifies as stable material.
Other than that, I've been testing this patch and I couldn't see anything
breaking. Let me know what you think.
(I *think* I've seen a tracker bug for this in the past but I couldn't
find it. I guess it could be added as a 'Link:' tag.)
Cheers,
--
Luis
fs/ceph/caps.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index a5e93b185515..727e41e3b939 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -4229,6 +4229,7 @@ void ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
{
struct inode *inode;
struct ceph_inode_info *ci;
+ LIST_HEAD(caps_list);
dout("check_delayed_caps\n");
spin_lock(&mdsc->cap_delay_lock);
@@ -4239,19 +4240,23 @@ void ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
if ((ci->i_ceph_flags & CEPH_I_FLUSH) == 0 &&
time_before(jiffies, ci->i_hold_caps_max))
break;
- list_del_init(&ci->i_cap_delay_list);
+ list_move_tail(&ci->i_cap_delay_list, &caps_list);
+ }
+ spin_unlock(&mdsc->cap_delay_lock);
+ while (!list_empty(&caps_list)) {
+ ci = list_first_entry(&caps_list,
+ struct ceph_inode_info,
+ i_cap_delay_list);
+ list_del_init(&ci->i_cap_delay_list);
inode = igrab(&ci->vfs_inode);
if (inode) {
- spin_unlock(&mdsc->cap_delay_lock);
dout("check_delayed_caps on %p\n", inode);
ceph_check_caps(ci, 0, NULL);
/* avoid calling iput_final() in tick thread */
ceph_async_iput(inode);
- spin_lock(&mdsc->cap_delay_lock);
}
}
- spin_unlock(&mdsc->cap_delay_lock);
}
/*
On 4/9/21 1:00 AM, Marek Vasut wrote:
> The Microchip LAN8710Ai PHY requires XTAL1/CLKIN external clock to be
> enabled when the nRST is toggled according to datasheet Microchip
> LAN8710A/LAN8710Ai DS00002164B page 35 section 3.8.5.1 Hardware Reset:
[...]
> Fixes: 34e0c7847dcf ("ARM: dts: stm32: Add DH Electronics DHCOM STM32MP1 SoM and PDK2 board")
Adding stable to CC.
Patch is now part of Linux 5.13 as commit
1cebcf9932ab ("ARM: dts: stm32: Rework LAN8710Ai PHY reset on DHCOM SoM")
It would be nice to pick it into stable, since it fixes ethernet
stability issues on the device.
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea6d0630100b285f059d0a8d8e86f38a46407536 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 24 Jun 2021 18:40:01 -0700
Subject: [PATCH] mm/hwpoison: do not lock page again when me_huge_page()
successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f24105db7081..6f5f78885ab4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, unsigned long pfn)
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state *ps, struct page *p,
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ int memory_failure(unsigned long pfn, int flags)
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea6d0630100b285f059d0a8d8e86f38a46407536 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 24 Jun 2021 18:40:01 -0700
Subject: [PATCH] mm/hwpoison: do not lock page again when me_huge_page()
successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f24105db7081..6f5f78885ab4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, unsigned long pfn)
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state *ps, struct page *p,
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ int memory_failure(unsigned long pfn, int flags)
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea6d0630100b285f059d0a8d8e86f38a46407536 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 24 Jun 2021 18:40:01 -0700
Subject: [PATCH] mm/hwpoison: do not lock page again when me_huge_page()
successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f24105db7081..6f5f78885ab4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, unsigned long pfn)
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state *ps, struct page *p,
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ int memory_failure(unsigned long pfn, int flags)
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea6d0630100b285f059d0a8d8e86f38a46407536 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 24 Jun 2021 18:40:01 -0700
Subject: [PATCH] mm/hwpoison: do not lock page again when me_huge_page()
successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f24105db7081..6f5f78885ab4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, unsigned long pfn)
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state *ps, struct page *p,
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ int memory_failure(unsigned long pfn, int flags)
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea6d0630100b285f059d0a8d8e86f38a46407536 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 24 Jun 2021 18:40:01 -0700
Subject: [PATCH] mm/hwpoison: do not lock page again when me_huge_page()
successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f24105db7081..6f5f78885ab4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, unsigned long pfn)
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state *ps, struct page *p,
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ int memory_failure(unsigned long pfn, int flags)
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From ea6d0630100b285f059d0a8d8e86f38a46407536 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Date: Thu, 24 Jun 2021 18:40:01 -0700
Subject: [PATCH] mm/hwpoison: do not lock page again when me_huge_page()
successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f24105db7081..6f5f78885ab4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, unsigned long pfn)
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p, unsigned long pfn)
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state *ps, struct page *p,
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ int memory_failure(unsigned long pfn, int flags)
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
It's not sufficient to skip reading when the pos is beyond the EOF.
There may be data at the head of the page that we need to fill in
before the write.
Add a new helper function that corrects and clarifies the logic of
when we can skip reads, and have it only zero out the part of the page
that won't have data copied in for the write.
[ Note: this fix is commit 827a746f405d in mainline kernels. The
original bug was in ceph, but got lifted into the fs/netfs
library for v5.13. This backport should apply to stable
kernels v5.10 though v5.12. ]
Cc: <stable(a)vger.kernel.org> # v5.10+
Fixes: 1cc1699070bd ("ceph: fold ceph_update_writeable_page into ceph_write_begin")
Reported-by: Andrew W Elble <aweits(a)rit.edu>
Signed-off-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: David Howells <dhowells(a)redhat.com>
---
fs/ceph/addr.c | 54 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 13 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 26e66436f005..c000fe338f7e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1302,6 +1302,45 @@ ceph_find_incompatible(struct page *page)
return NULL;
}
+/**
+ * prep_noread_page - prep a page for writing without reading first
+ * @page: page being prepared
+ * @pos: starting position for the write
+ * @len: length of write
+ *
+ * In some cases, write_begin doesn't need to read at all:
+ * - full page write
+ * - file is currently zero-length
+ * - write that lies in a page that is completely beyond EOF
+ * - write that covers the the page from start to EOF or beyond it
+ *
+ * If any of these criteria are met, then zero out the unwritten parts
+ * of the page and return true. Otherwise, return false.
+ */
+static bool skip_page_read(struct page *page, loff_t pos, size_t len)
+{
+ struct inode *inode = page->mapping->host;
+ loff_t i_size = i_size_read(inode);
+ size_t offset = offset_in_page(pos);
+
+ /* Full page write */
+ if (offset == 0 && len >= PAGE_SIZE)
+ return true;
+
+ /* pos beyond last page in the file */
+ if (pos - offset >= i_size)
+ goto zero_out;
+
+ /* write that covers the whole page from start to EOF or beyond it */
+ if (offset == 0 && (pos + len) >= i_size)
+ goto zero_out;
+
+ return false;
+zero_out:
+ zero_user_segments(page, 0, offset, offset + len, PAGE_SIZE);
+ return true;
+}
+
/*
* We are only allowed to write into/dirty the page if the page is
* clean, or already dirty within the same snap context.
@@ -1315,7 +1354,6 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
struct ceph_snap_context *snapc;
struct page *page = NULL;
pgoff_t index = pos >> PAGE_SHIFT;
- int pos_in_page = pos & ~PAGE_MASK;
int r = 0;
dout("write_begin file %p inode %p page %p %d~%d\n", file, inode, page, (int)pos, (int)len);
@@ -1350,19 +1388,9 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
break;
}
- /*
- * In some cases we don't need to read at all:
- * - full page write
- * - write that lies completely beyond EOF
- * - write that covers the the page from start to EOF or beyond it
- */
- if ((pos_in_page == 0 && len == PAGE_SIZE) ||
- (pos >= i_size_read(inode)) ||
- (pos_in_page == 0 && (pos + len) >= i_size_read(inode))) {
- zero_user_segments(page, 0, pos_in_page,
- pos_in_page + len, PAGE_SIZE);
+ /* No need to read in some cases */
+ if (skip_page_read(page, pos, len))
break;
- }
/*
* We need to read it. If we get back -EINPROGRESS, then the page was
--
2.31.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 171936ddaf97e6f4e1264f4128bb5cf15691339c Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck(a)intel.com>
Date: Thu, 24 Jun 2021 18:39:55 -0700
Subject: [PATCH] mm/memory-failure: use a mutex to avoid memory_failure()
races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1]. I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner. I
know that this is not a perfect solution, but should work for some
typical case.
[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machi…
This patch (of 2):
There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page. Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.
But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.
Fix by wrapping most of the internal parts of memory_failure() in a
mutex.
[akpm(a)linux-foundation.org: make mf_mutex local to memory_failure()]
Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Aili Yao <yaoaili(a)kingsoft.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0143d32bc666..a7fc1cd6765a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1429,9 +1429,10 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
- int res;
+ int res = 0;
unsigned long page_flags;
bool retry = true;
+ static DEFINE_MUTEX(mf_mutex);
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1449,13 +1450,18 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
+ mutex_lock(&mf_mutex);
+
try_again:
- if (PageHuge(p))
- return memory_failure_hugetlb(pfn, flags);
+ if (PageHuge(p)) {
+ res = memory_failure_hugetlb(pfn, flags);
+ goto unlock_mutex;
+ }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
- return 0;
+ goto unlock_mutex;
}
orig_head = hpage = compound_head(p);
@@ -1488,17 +1494,19 @@ int memory_failure(unsigned long pfn, int flags)
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
+ res = res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
- return -EBUSY;
+ res = -EBUSY;
}
+ goto unlock_mutex;
}
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
- return -EBUSY;
+ res = -EBUSY;
+ goto unlock_mutex;
}
VM_BUG_ON_PAGE(!page_count(p), p);
}
@@ -1522,7 +1530,7 @@ int memory_failure(unsigned long pfn, int flags)
if (PageCompound(p) && compound_head(p) != orig_head) {
action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
/*
@@ -1542,14 +1550,14 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
- return 0;
+ goto unlock_mutex;
}
if (hwpoison_filter(p)) {
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
- return 0;
+ goto unlock_mutex;
}
/*
@@ -1573,7 +1581,7 @@ int memory_failure(unsigned long pfn, int flags)
if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
/*
@@ -1582,13 +1590,15 @@ int memory_failure(unsigned long pfn, int flags)
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
unlock_page(p);
+unlock_mutex:
+ mutex_unlock(&mf_mutex);
return res;
}
EXPORT_SYMBOL_GPL(memory_failure);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 171936ddaf97e6f4e1264f4128bb5cf15691339c Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck(a)intel.com>
Date: Thu, 24 Jun 2021 18:39:55 -0700
Subject: [PATCH] mm/memory-failure: use a mutex to avoid memory_failure()
races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1]. I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner. I
know that this is not a perfect solution, but should work for some
typical case.
[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machi…
This patch (of 2):
There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page. Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.
But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.
Fix by wrapping most of the internal parts of memory_failure() in a
mutex.
[akpm(a)linux-foundation.org: make mf_mutex local to memory_failure()]
Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Aili Yao <yaoaili(a)kingsoft.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0143d32bc666..a7fc1cd6765a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1429,9 +1429,10 @@ int memory_failure(unsigned long pfn, int flags)
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
- int res;
+ int res = 0;
unsigned long page_flags;
bool retry = true;
+ static DEFINE_MUTEX(mf_mutex);
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1449,13 +1450,18 @@ int memory_failure(unsigned long pfn, int flags)
return -ENXIO;
}
+ mutex_lock(&mf_mutex);
+
try_again:
- if (PageHuge(p))
- return memory_failure_hugetlb(pfn, flags);
+ if (PageHuge(p)) {
+ res = memory_failure_hugetlb(pfn, flags);
+ goto unlock_mutex;
+ }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
- return 0;
+ goto unlock_mutex;
}
orig_head = hpage = compound_head(p);
@@ -1488,17 +1494,19 @@ int memory_failure(unsigned long pfn, int flags)
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
+ res = res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
- return -EBUSY;
+ res = -EBUSY;
}
+ goto unlock_mutex;
}
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
- return -EBUSY;
+ res = -EBUSY;
+ goto unlock_mutex;
}
VM_BUG_ON_PAGE(!page_count(p), p);
}
@@ -1522,7 +1530,7 @@ int memory_failure(unsigned long pfn, int flags)
if (PageCompound(p) && compound_head(p) != orig_head) {
action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
/*
@@ -1542,14 +1550,14 @@ int memory_failure(unsigned long pfn, int flags)
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
- return 0;
+ goto unlock_mutex;
}
if (hwpoison_filter(p)) {
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
- return 0;
+ goto unlock_mutex;
}
/*
@@ -1573,7 +1581,7 @@ int memory_failure(unsigned long pfn, int flags)
if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
/*
@@ -1582,13 +1590,15 @@ int memory_failure(unsigned long pfn, int flags)
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
unlock_page(p);
+unlock_mutex:
+ mutex_unlock(&mf_mutex);
return res;
}
EXPORT_SYMBOL_GPL(memory_failure);
The patch below does not apply to the 5.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5f89468e2f060031cd89fd4287298e0eaf246bf6 Mon Sep 17 00:00:00 2001
From: Bumyong Lee <bumyong.lee(a)samsung.com>
Date: Mon, 10 May 2021 18:10:04 +0900
Subject: [PATCH] swiotlb: manipulate orig_addr when tlb_addr has offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
in case of driver wants to sync part of ranges with offset,
swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with
offset and ends up with data mismatch.
It was removed from
"swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single",
but said logic has to be added back in.
>From Linus's email:
"That commit which the removed the offset calculation entirely, because the old
(unsigned long)tlb_addr & (IO_TLB_SIZE - 1)
was wrong, but instead of removing it, I think it should have just
fixed it to be
(tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
instead. That way the slot offset always matches the slot index calculation."
(Unfortunatly that broke NVMe).
The use-case that drivers are hitting is as follow:
1. Get dma_addr_t from dma_map_single()
dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE);
|<---------------vsize------------->|
+-----------------------------------+
| | original buffer
+-----------------------------------+
vaddr
swiotlb_align_offset
|<----->|<---------------vsize------------->|
+-------+-----------------------------------+
| | | swiotlb buffer
+-------+-----------------------------------+
tlb_addr
2. Do something
3. Sync dma_addr_t through dma_sync_single_for_device(..)
dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE);
Error case.
Copy data to original buffer but it is from base addr (instead of
base addr + offset) in original buffer:
swiotlb_align_offset
|<----->|<- offset ->|<- size ->|
+-------+-----------------------------------+
| | |##########| | swiotlb buffer
+-------+-----------------------------------+
tlb_addr
|<- size ->|
+-----------------------------------+
|##########| | original buffer
+-----------------------------------+
vaddr
The fix is to copy the data to the original buffer and take into
account the offset, like so:
swiotlb_align_offset
|<----->|<- offset ->|<- size ->|
+-------+-----------------------------------+
| | |##########| | swiotlb buffer
+-------+-----------------------------------+
tlb_addr
|<- offset ->|<- size ->|
+-----------------------------------+
| |##########| | original buffer
+-----------------------------------+
vaddr
[One fix which was Linus's that made more sense to as it created a
symmetry would break NVMe. The reason for that is the:
unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
would come up with the proper offset, but it would lose the
alignment (which this patch contains).]
Fixes: 16fc3cef33a0 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
Signed-off-by: Bumyong Lee <bumyong.lee(a)samsung.com>
Signed-off-by: Chanho Park <chanho61.park(a)samsung.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reported-by: Dominique MARTINET <dominique.martinet(a)atmark-techno.com>
Reported-by: Horia Geantă <horia.geanta(a)nxp.com>
Tested-by: Horia Geantă <horia.geanta(a)nxp.com>
CC: stable(a)vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8ca7d505d61c..e50df8d8f87e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -334,6 +334,14 @@ void __init swiotlb_exit(void)
io_tlb_default_mem = NULL;
}
+/*
+ * Return the offset into a iotlb slot required to keep the device happy.
+ */
+static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
+{
+ return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
+}
+
/*
* Bounce: copy the swiotlb buffer from or back to the original dma location
*/
@@ -346,10 +354,17 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
size_t alloc_size = mem->slots[index].alloc_size;
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+ unsigned int tlb_offset;
if (orig_addr == INVALID_PHYS_ADDR)
return;
+ tlb_offset = (tlb_addr & (IO_TLB_SIZE - 1)) -
+ swiotlb_align_offset(dev, orig_addr);
+
+ orig_addr += tlb_offset;
+ alloc_size -= tlb_offset;
+
if (size > alloc_size) {
dev_WARN_ONCE(dev, 1,
"Buffer overflow detected. Allocation size: %zu. Mapping size: %zu.\n",
@@ -390,14 +405,6 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
#define slot_addr(start, idx) ((start) + ((idx) << IO_TLB_SHIFT))
-/*
- * Return the offset into a iotlb slot required to keep the device happy.
- */
-static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
-{
- return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
-}
-
/*
* Carefully handle integer overflow which can occur when boundary_mask == ~0UL.
*/
Hello there,
I hope this message finds you in good spirits especially during
this challenging time of coronavirus pandemic. I hope you and
your family are well and keeping safe. Anyway, I am Jotham
Masuku, a broker working with Flippiebecker Wealth. I got your
contact through an online business directory and I thought I
should contact you to see if you are interested in this
opportunity. I am contacting you because one of my high profile
clients is interested in investing abroad and has asked me to
look for individuals and companies in your country with
interesting business ideas and projects that he can invest in. He
wants to invest a substantial amount of asset abroad.
I have decided to keep this brief for now but please kindly
respond back to this email if you are interested in this
opportunity. Once I receive your response, I will give you more
details and we can plan a strategy that will be beneficial to all
parties.
Best regards
J Masuku
Flippiebecker Wealth
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5f89468e2f060031cd89fd4287298e0eaf246bf6 Mon Sep 17 00:00:00 2001
From: Bumyong Lee <bumyong.lee(a)samsung.com>
Date: Mon, 10 May 2021 18:10:04 +0900
Subject: [PATCH] swiotlb: manipulate orig_addr when tlb_addr has offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
in case of driver wants to sync part of ranges with offset,
swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with
offset and ends up with data mismatch.
It was removed from
"swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single",
but said logic has to be added back in.
>From Linus's email:
"That commit which the removed the offset calculation entirely, because the old
(unsigned long)tlb_addr & (IO_TLB_SIZE - 1)
was wrong, but instead of removing it, I think it should have just
fixed it to be
(tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
instead. That way the slot offset always matches the slot index calculation."
(Unfortunatly that broke NVMe).
The use-case that drivers are hitting is as follow:
1. Get dma_addr_t from dma_map_single()
dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE);
|<---------------vsize------------->|
+-----------------------------------+
| | original buffer
+-----------------------------------+
vaddr
swiotlb_align_offset
|<----->|<---------------vsize------------->|
+-------+-----------------------------------+
| | | swiotlb buffer
+-------+-----------------------------------+
tlb_addr
2. Do something
3. Sync dma_addr_t through dma_sync_single_for_device(..)
dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE);
Error case.
Copy data to original buffer but it is from base addr (instead of
base addr + offset) in original buffer:
swiotlb_align_offset
|<----->|<- offset ->|<- size ->|
+-------+-----------------------------------+
| | |##########| | swiotlb buffer
+-------+-----------------------------------+
tlb_addr
|<- size ->|
+-----------------------------------+
|##########| | original buffer
+-----------------------------------+
vaddr
The fix is to copy the data to the original buffer and take into
account the offset, like so:
swiotlb_align_offset
|<----->|<- offset ->|<- size ->|
+-------+-----------------------------------+
| | |##########| | swiotlb buffer
+-------+-----------------------------------+
tlb_addr
|<- offset ->|<- size ->|
+-----------------------------------+
| |##########| | original buffer
+-----------------------------------+
vaddr
[One fix which was Linus's that made more sense to as it created a
symmetry would break NVMe. The reason for that is the:
unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
would come up with the proper offset, but it would lose the
alignment (which this patch contains).]
Fixes: 16fc3cef33a0 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
Signed-off-by: Bumyong Lee <bumyong.lee(a)samsung.com>
Signed-off-by: Chanho Park <chanho61.park(a)samsung.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reported-by: Dominique MARTINET <dominique.martinet(a)atmark-techno.com>
Reported-by: Horia Geantă <horia.geanta(a)nxp.com>
Tested-by: Horia Geantă <horia.geanta(a)nxp.com>
CC: stable(a)vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8ca7d505d61c..e50df8d8f87e 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -334,6 +334,14 @@ void __init swiotlb_exit(void)
io_tlb_default_mem = NULL;
}
+/*
+ * Return the offset into a iotlb slot required to keep the device happy.
+ */
+static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
+{
+ return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
+}
+
/*
* Bounce: copy the swiotlb buffer from or back to the original dma location
*/
@@ -346,10 +354,17 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
size_t alloc_size = mem->slots[index].alloc_size;
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+ unsigned int tlb_offset;
if (orig_addr == INVALID_PHYS_ADDR)
return;
+ tlb_offset = (tlb_addr & (IO_TLB_SIZE - 1)) -
+ swiotlb_align_offset(dev, orig_addr);
+
+ orig_addr += tlb_offset;
+ alloc_size -= tlb_offset;
+
if (size > alloc_size) {
dev_WARN_ONCE(dev, 1,
"Buffer overflow detected. Allocation size: %zu. Mapping size: %zu.\n",
@@ -390,14 +405,6 @@ static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size
#define slot_addr(start, idx) ((start) + ((idx) << IO_TLB_SHIFT))
-/*
- * Return the offset into a iotlb slot required to keep the device happy.
- */
-static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
-{
- return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
-}
-
/*
* Carefully handle integer overflow which can occur when boundary_mask == ~0UL.
*/
The patch titled
Subject: mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
has been removed from the -mm tree. Its filename was
mm-hwpoison-do-not-lock-page-again-when-me_huge_page-successfully-recovers.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Subject: mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on some
tail page in a hugetlb, then accesses the address of the hugetlb range)
showed that page allocation code detects this page lock on buddy page and
printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the page
type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.vnet.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 44 ++++++++++++++++++++++++++++--------------
1 file changed, 30 insertions(+), 14 deletions(-)
--- a/mm/memory-failure.c~mm-hwpoison-do-not-lock-page-again-when-me_huge_page-successfully-recovers
+++ a/mm/memory-failure.c
@@ -658,6 +658,7 @@ static int truncate_error_page(struct pa
*/
static int me_kernel(struct page *p, unsigned long pfn)
{
+ unlock_page(p);
return MF_IGNORED;
}
@@ -667,6 +668,7 @@ static int me_kernel(struct page *p, uns
static int me_unknown(struct page *p, unsigned long pfn)
{
pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+ unlock_page(p);
return MF_FAILED;
}
@@ -675,6 +677,7 @@ static int me_unknown(struct page *p, un
*/
static int me_pagecache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
struct address_space *mapping;
delete_from_lru_cache(p);
@@ -683,8 +686,10 @@ static int me_pagecache_clean(struct pag
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
*/
- if (PageAnon(p))
- return MF_RECOVERED;
+ if (PageAnon(p)) {
+ ret = MF_RECOVERED;
+ goto out;
+ }
/*
* Now truncate the page in the page cache. This is really
@@ -698,7 +703,8 @@ static int me_pagecache_clean(struct pag
/*
* Page has been teared down in the meanwhile
*/
- return MF_FAILED;
+ ret = MF_FAILED;
+ goto out;
}
/*
@@ -706,7 +712,10 @@ static int me_pagecache_clean(struct pag
*
* Open: to take i_mutex or not for this? Right now we don't.
*/
- return truncate_error_page(p, pfn, mapping);
+ ret = truncate_error_page(p, pfn, mapping);
+out:
+ unlock_page(p);
+ return ret;
}
/*
@@ -782,24 +791,26 @@ static int me_pagecache_dirty(struct pag
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
+ int ret;
+
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);
- if (!delete_from_lru_cache(p))
- return MF_DELAYED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
+ unlock_page(p);
+ return ret;
}
static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
+ int ret;
+
delete_from_swap_cache(p);
- if (!delete_from_lru_cache(p))
- return MF_RECOVERED;
- else
- return MF_FAILED;
+ ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
+ unlock_page(p);
+ return ret;
}
/*
@@ -820,6 +831,7 @@ static int me_huge_page(struct page *p,
mapping = page_mapping(hpage);
if (mapping) {
res = truncate_error_page(hpage, pfn, mapping);
+ unlock_page(hpage);
} else {
res = MF_FAILED;
unlock_page(hpage);
@@ -834,7 +846,6 @@ static int me_huge_page(struct page *p,
page_ref_inc(p);
res = MF_RECOVERED;
}
- lock_page(hpage);
}
return res;
@@ -866,6 +877,8 @@ static struct page_state {
unsigned long mask;
unsigned long res;
enum mf_action_page_type type;
+
+ /* Callback ->action() has to unlock the relevant page inside it. */
int (*action)(struct page *p, unsigned long pfn);
} error_states[] = {
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
@@ -929,6 +942,7 @@ static int page_action(struct page_state
int result;
int count;
+ /* page p should be unlocked after returning from ps->action(). */
result = ps->action(p, pfn);
count = page_count(p) - 1;
@@ -1313,7 +1327,7 @@ static int memory_failure_hugetlb(unsign
goto out;
}
- res = identify_page_state(pfn, p, page_flags);
+ return identify_page_state(pfn, p, page_flags);
out:
unlock_page(head);
return res;
@@ -1596,6 +1610,8 @@ try_again:
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
+ mutex_unlock(&mf_mutex);
+ return res;
unlock_page:
unlock_page(p);
unlock_mutex:
_
Patches currently in -mm which might be from naoya.horiguchi(a)nec.com are
mmhwpoison-send-sigbus-with-error-virutal-address.patch
mmhwpoison-send-sigbus-with-error-virutal-address-fix.patch
mmhwpoison-make-get_hwpoison_page-call-get_any_page.patch
mm-hwpoison-disable-pcp-for-page_handle_poison.patch
The patch titled
Subject: mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
has been removed from the -mm tree. Its filename was
mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Aili Yao <yaoaili(a)kingsoft.com>
Subject: mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned
When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS to
the affected process, which results in infinite loop of MCEs.
Currently memory_failure() returns 0 if it's called for already hwpoisoned
page, then the caller, kill_me_maybe(), could return without sending
SIGBUS to current process. An action required MCE is raised when the
current process accesses to the broken memory, so no SIGBUS means that the
current process continues to run and access to the error page again soon,
so running into MCE loop.
This issue can arise for example in the following scenarios:
- Two or more threads access to the poisoned page concurrently. If
local MCE is enabled, MCE handler independently handles the MCE events.
So there's a race among MCE events, and the second or latter threads
fall into the situation in question.
- If there was a precedent memory error event and memory_failure() for
the event failed to unmap the error page for some reason, the subsequent
memory access to the error page triggers the MCE loop situation.
To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned. This allows memory error handler
to control how it sends signals to userspace. And make sure that any
process touching a hwpoisoned page should get a SIGBUS even in "already
hwpoisoned" path of memory_failure() as is done in page fault path.
Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
Signed-off-by: Aili Yao <yaoaili(a)kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Borislav Petkov <bp(a)suse.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jue Wang <juew(a)google.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/memory-failure.c~mmhwpoison-return-ehwpoison-to-denote-that-the-page-has-already-been-poisoned
+++ a/mm/memory-failure.c
@@ -1253,7 +1253,7 @@ static int memory_failure_hugetlb(unsign
if (TestSetPageHWPoison(head)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
- return 0;
+ return -EHWPOISON;
}
num_poisoned_pages_inc();
@@ -1461,6 +1461,7 @@ try_again:
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
+ res = -EHWPOISON;
goto unlock_mutex;
}
_
Patches currently in -mm which might be from yaoaili(a)kingsoft.com are
The patch titled
Subject: mm/memory-failure: use a mutex to avoid memory_failure() races
has been removed from the -mm tree. Its filename was
mm-memory-failure-use-a-mutex-to-avoid-memory_failure-races.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Tony Luck <tony.luck(a)intel.com>
Subject: mm/memory-failure: use a mutex to avoid memory_failure() races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
I wrote this patchset to materialize what I think is the current allowable
solution mentioned by the previous discussion [1]. I simply borrowed
Tony's mutex patch and Aili's return code patch, then I queued another one
to find error virtual address in the best effort manner. I know that this
is not a perfect solution, but should work for some typical case.
[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machi…
This patch (of 2):
There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag and
begins hunting for tasks that map this page. Eventually it invalidates
those mappings and may send a SIGBUS to the affected tasks.
But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been handled
and continue executing.
Fix by wrapping most of the internal parts of memory_failure() in a mutex.
[akpm(a)linux-foundation.org: make mf_mutex local to memory_failure()]
Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Borislav Petkov <bp(a)suse.de>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Aili Yao <yaoaili(a)kingsoft.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jue Wang <juew(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 36 +++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)
--- a/mm/memory-failure.c~mm-memory-failure-use-a-mutex-to-avoid-memory_failure-races
+++ a/mm/memory-failure.c
@@ -1429,9 +1429,10 @@ int memory_failure(unsigned long pfn, in
struct page *hpage;
struct page *orig_head;
struct dev_pagemap *pgmap;
- int res;
+ int res = 0;
unsigned long page_flags;
bool retry = true;
+ static DEFINE_MUTEX(mf_mutex);
if (!sysctl_memory_failure_recovery)
panic("Memory failure on page %lx", pfn);
@@ -1449,13 +1450,18 @@ int memory_failure(unsigned long pfn, in
return -ENXIO;
}
+ mutex_lock(&mf_mutex);
+
try_again:
- if (PageHuge(p))
- return memory_failure_hugetlb(pfn, flags);
+ if (PageHuge(p)) {
+ res = memory_failure_hugetlb(pfn, flags);
+ goto unlock_mutex;
+ }
+
if (TestSetPageHWPoison(p)) {
pr_err("Memory failure: %#lx: already hardware poisoned\n",
pfn);
- return 0;
+ goto unlock_mutex;
}
orig_head = hpage = compound_head(p);
@@ -1488,17 +1494,19 @@ try_again:
res = MF_FAILED;
}
action_result(pfn, MF_MSG_BUDDY, res);
- return res == MF_RECOVERED ? 0 : -EBUSY;
+ res = res == MF_RECOVERED ? 0 : -EBUSY;
} else {
action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
- return -EBUSY;
+ res = -EBUSY;
}
+ goto unlock_mutex;
}
if (PageTransHuge(hpage)) {
if (try_to_split_thp_page(p, "Memory Failure") < 0) {
action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
- return -EBUSY;
+ res = -EBUSY;
+ goto unlock_mutex;
}
VM_BUG_ON_PAGE(!page_count(p), p);
}
@@ -1522,7 +1530,7 @@ try_again:
if (PageCompound(p) && compound_head(p) != orig_head) {
action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
/*
@@ -1542,14 +1550,14 @@ try_again:
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
- return 0;
+ goto unlock_mutex;
}
if (hwpoison_filter(p)) {
if (TestClearPageHWPoison(p))
num_poisoned_pages_dec();
unlock_page(p);
put_page(p);
- return 0;
+ goto unlock_mutex;
}
/*
@@ -1573,7 +1581,7 @@ try_again:
if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
/*
@@ -1582,13 +1590,15 @@ try_again:
if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
res = -EBUSY;
- goto out;
+ goto unlock_page;
}
identify_page_state:
res = identify_page_state(pfn, p, page_flags);
-out:
+unlock_page:
unlock_page(p);
+unlock_mutex:
+ mutex_unlock(&mf_mutex);
return res;
}
EXPORT_SYMBOL_GPL(memory_failure);
_
Patches currently in -mm which might be from tony.luck(a)intel.com are
The patch titled
Subject: mm, futex: fix shared futex pgoff on shmem huge page
has been removed from the -mm tree. Its filename was
mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm, futex: fix shared futex pgoff on shmem huge page
If more than one futex is placed on a shmem huge page, it can happen that
waking the second wakes the first instead, and leaves the second waiting:
the key's shared.pgoff is wrong.
When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key"), the only shared huge pages came from hugetlbfs,
and the code added to deal with its exceptional page->index was put into
hugetlb source. Then that was missed when 4.8 added shmem huge pages.
page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.
Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.
[akpm(a)linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: Neel Natu <neelnatu(a)google.com>
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Zhang Yi <wetpzy(a)gmail.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Darren Hart <dvhart(a)infradead.org>
Cc: Davidlohr Bueso <dave(a)stgolabs.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 16 ----------------
include/linux/pagemap.h | 13 +++++++------
kernel/futex.c | 3 +--
mm/hugetlb.c | 5 +----
4 files changed, 9 insertions(+), 28 deletions(-)
--- a/include/linux/hugetlb.h~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/include/linux/hugetlb.h
@@ -741,17 +741,6 @@ static inline int hstate_index(struct hs
return h - hstates;
}
-pgoff_t __basepage_index(struct page *page);
-
-/* Return page->index in PAGE_SIZE units */
-static inline pgoff_t basepage_index(struct page *page)
-{
- if (!PageCompound(page))
- return page->index;
-
- return __basepage_index(page);
-}
-
extern int dissolve_free_huge_page(struct page *page);
extern int dissolve_free_huge_pages(unsigned long start_pfn,
unsigned long end_pfn);
@@ -988,11 +977,6 @@ static inline int hstate_index(struct hs
return 0;
}
-static inline pgoff_t basepage_index(struct page *page)
-{
- return page->index;
-}
-
static inline int dissolve_free_huge_page(struct page *page)
{
return 0;
--- a/include/linux/pagemap.h~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/include/linux/pagemap.h
@@ -516,7 +516,7 @@ static inline struct page *read_mapping_
}
/*
- * Get index of the page with in radix-tree
+ * Get index of the page within radix-tree (but not for hugetlb pages).
* (TODO: remove once hugetlb pages will have ->index in PAGE_SIZE)
*/
static inline pgoff_t page_to_index(struct page *page)
@@ -535,15 +535,16 @@ static inline pgoff_t page_to_index(stru
return pgoff;
}
+extern pgoff_t hugetlb_basepage_index(struct page *page);
+
/*
- * Get the offset in PAGE_SIZE.
- * (TODO: hugepage should have ->index in PAGE_SIZE)
+ * Get the offset in PAGE_SIZE (even for hugetlb pages).
+ * (TODO: hugetlb pages should have ->index in PAGE_SIZE)
*/
static inline pgoff_t page_to_pgoff(struct page *page)
{
- if (unlikely(PageHeadHuge(page)))
- return page->index << compound_order(page);
-
+ if (unlikely(PageHuge(page)))
+ return hugetlb_basepage_index(page);
return page_to_index(page);
}
--- a/kernel/futex.c~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/kernel/futex.c
@@ -35,7 +35,6 @@
#include <linux/jhash.h>
#include <linux/pagemap.h>
#include <linux/syscalls.h>
-#include <linux/hugetlb.h>
#include <linux/freezer.h>
#include <linux/memblock.h>
#include <linux/fault-inject.h>
@@ -650,7 +649,7 @@ again:
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
key->shared.i_seq = get_inode_sequence_number(inode);
- key->shared.pgoff = basepage_index(tail);
+ key->shared.pgoff = page_to_pgoff(tail);
rcu_read_unlock();
}
--- a/mm/hugetlb.c~mm-futex-fix-shared-futex-pgoff-on-shmem-huge-page
+++ a/mm/hugetlb.c
@@ -1588,15 +1588,12 @@ struct address_space *hugetlb_page_mappi
return NULL;
}
-pgoff_t __basepage_index(struct page *page)
+pgoff_t hugetlb_basepage_index(struct page *page)
{
struct page *page_head = compound_head(page);
pgoff_t index = page_index(page_head);
unsigned long compound_idx;
- if (!PageHuge(page_head))
- return page_index(page);
-
if (compound_order(page_head) >= MAX_ORDER)
compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
else
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
has been removed from the -mm tree. Its filename was
kthread-prevent-deadlock-when-kthread_mod_delayed_work-races-with-kthread_cancel_delayed_work_sync.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Petr Mladek <pmladek(a)suse.com>
Subject: kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
The system might hang with the following backtrace:
schedule+0x80/0x100
schedule_timeout+0x48/0x138
wait_for_common+0xa4/0x134
wait_for_completion+0x1c/0x2c
kthread_flush_work+0x114/0x1cc
kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
kthread_cancel_delayed_work_sync+0x18/0x2c
xxxx_pm_notify+0xb0/0xd8
blocking_notifier_call_chain_robust+0x80/0x194
pm_notifier_call_chain_robust+0x28/0x4c
suspend_prepare+0x40/0x260
enter_state+0x80/0x3f4
pm_suspend+0x60/0xdc
state_store+0x108/0x144
kobj_attr_store+0x38/0x88
sysfs_kf_write+0x64/0xc0
kernfs_fop_write_iter+0x108/0x1d0
vfs_write+0x2f4/0x368
ksys_write+0x7c/0xec
It is caused by the following race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync():
CPU0 CPU1
Context: Thread A Context: Thread B
kthread_mod_delayed_work()
spin_lock()
__kthread_cancel_work()
spin_unlock()
del_timer_sync()
kthread_cancel_delayed_work_sync()
spin_lock()
__kthread_cancel_work()
spin_unlock()
del_timer_sync()
spin_lock()
work->canceling++
spin_unlock
spin_lock()
queue_delayed_work()
// dwork is put into the worker->delayed_work_list
spin_unlock()
kthread_flush_work()
// flush_work is put at the tail of the dwork
wait_for_completion()
Context: IRQ
kthread_delayed_work_timer_fn()
spin_lock()
list_del_init(&work->node);
spin_unlock()
BANG: flush_work is not longer linked and will never get proceed.
The problem is that kthread_mod_delayed_work() checks work->canceling flag
before canceling the timer.
A simple solution is to (re)check work->canceling after
__kthread_cancel_work(). But then it is not clear what should be returned
when __kthread_cancel_work() removed the work from the queue (list) and it
can't queue it again with the new @delay.
The return value might be used for reference counting. The caller has to
know whether a new work has been queued or an existing one was replaced.
The proper solution is that kthread_mod_delayed_work() will remove the
work from the queue (list) _only_ when work->canceling is not set. The
flag must be checked after the timer is stopped and the remaining
operations can be done under worker->lock.
Note that kthread_mod_delayed_work() could remove the timer and then bail
out. It is fine. The other canceling caller needs to cancel the timer as
well. The important thing is that the queue (list) manipulation is done
atomically under worker->lock.
Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work")
Signed-off-by: Petr Mladek <pmladek(a)suse.com>
Reported-by: Martin Liu <liumartin(a)google.com>
Cc: <jenhaochen(a)google.com>
Cc: Minchan Kim <minchan(a)google.com>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/kthread.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)
--- a/kernel/kthread.c~kthread-prevent-deadlock-when-kthread_mod_delayed_work-races-with-kthread_cancel_delayed_work_sync
+++ a/kernel/kthread.c
@@ -1120,8 +1120,11 @@ static void kthread_cancel_delayed_work_
}
/*
- * This function removes the work from the worker queue. Also it makes sure
- * that it won't get queued later via the delayed work's timer.
+ * This function removes the work from the worker queue.
+ *
+ * It is called under worker->lock. The caller must make sure that
+ * the timer used by delayed work is not running, e.g. by calling
+ * kthread_cancel_delayed_work_timer().
*
* The work might still be in use when this function finishes. See the
* current_work proceed by the worker.
@@ -1129,13 +1132,8 @@ static void kthread_cancel_delayed_work_
* Return: %true if @work was pending and successfully canceled,
* %false if @work was not pending
*/
-static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork,
- unsigned long *flags)
+static bool __kthread_cancel_work(struct kthread_work *work)
{
- /* Try to cancel the timer if exists. */
- if (is_dwork)
- kthread_cancel_delayed_work_timer(work, flags);
-
/*
* Try to remove the work from a worker list. It might either
* be from worker->work_list or from worker->delayed_work_list.
@@ -1188,11 +1186,23 @@ bool kthread_mod_delayed_work(struct kth
/* Work must not be used with >1 worker, see kthread_queue_work() */
WARN_ON_ONCE(work->worker != worker);
- /* Do not fight with another command that is canceling this work. */
+ /*
+ * Temporary cancel the work but do not fight with another command
+ * that is canceling the work as well.
+ *
+ * It is a bit tricky because of possible races with another
+ * mod_delayed_work() and cancel_delayed_work() callers.
+ *
+ * The timer must be canceled first because worker->lock is released
+ * when doing so. But the work can be removed from the queue (list)
+ * only when it can be queued again so that the return value can
+ * be used for reference counting.
+ */
+ kthread_cancel_delayed_work_timer(work, &flags);
if (work->canceling)
goto out;
+ ret = __kthread_cancel_work(work);
- ret = __kthread_cancel_work(work, true, &flags);
fast_queue:
__kthread_queue_delayed_work(worker, dwork, delay);
out:
@@ -1214,7 +1224,10 @@ static bool __kthread_cancel_work_sync(s
/* Work must not be used with >1 worker, see kthread_queue_work(). */
WARN_ON_ONCE(work->worker != worker);
- ret = __kthread_cancel_work(work, is_dwork, &flags);
+ if (is_dwork)
+ kthread_cancel_delayed_work_timer(work, &flags);
+
+ ret = __kthread_cancel_work(work);
if (worker->current_work != work)
goto out_fast;
_
Patches currently in -mm which might be from pmladek(a)suse.com are
kthread_worker-fix-return-value-when-kthread_mod_delayed_work-races-with-kthread_cancel_delayed_work_sync.patch
The patch titled
Subject: kthread_worker: split code for canceling the delayed work timer
has been removed from the -mm tree. Its filename was
kthread_worker-split-code-for-canceling-the-delayed-work-timer.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Petr Mladek <pmladek(a)suse.com>
Subject: kthread_worker: split code for canceling the delayed work timer
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync()".
This patchset fixes the race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync() including proper return value handling.
This patch (of 2):
Simple code refactoring as a preparation step for fixing a race between
kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().
It does not modify the existing behavior.
Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek(a)suse.com>
Cc: <jenhaochen(a)google.com>
Cc: Martin Liu <liumartin(a)google.com>
Cc: Minchan Kim <minchan(a)google.com>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/kthread.c | 46 ++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 17 deletions(-)
--- a/kernel/kthread.c~kthread_worker-split-code-for-canceling-the-delayed-work-timer
+++ a/kernel/kthread.c
@@ -1093,6 +1093,33 @@ void kthread_flush_work(struct kthread_w
EXPORT_SYMBOL_GPL(kthread_flush_work);
/*
+ * Make sure that the timer is neither set nor running and could
+ * not manipulate the work list_head any longer.
+ *
+ * The function is called under worker->lock. The lock is temporary
+ * released but the timer can't be set again in the meantime.
+ */
+static void kthread_cancel_delayed_work_timer(struct kthread_work *work,
+ unsigned long *flags)
+{
+ struct kthread_delayed_work *dwork =
+ container_of(work, struct kthread_delayed_work, work);
+ struct kthread_worker *worker = work->worker;
+
+ /*
+ * del_timer_sync() must be called to make sure that the timer
+ * callback is not running. The lock must be temporary released
+ * to avoid a deadlock with the callback. In the meantime,
+ * any queuing is blocked by setting the canceling counter.
+ */
+ work->canceling++;
+ raw_spin_unlock_irqrestore(&worker->lock, *flags);
+ del_timer_sync(&dwork->timer);
+ raw_spin_lock_irqsave(&worker->lock, *flags);
+ work->canceling--;
+}
+
+/*
* This function removes the work from the worker queue. Also it makes sure
* that it won't get queued later via the delayed work's timer.
*
@@ -1106,23 +1133,8 @@ static bool __kthread_cancel_work(struct
unsigned long *flags)
{
/* Try to cancel the timer if exists. */
- if (is_dwork) {
- struct kthread_delayed_work *dwork =
- container_of(work, struct kthread_delayed_work, work);
- struct kthread_worker *worker = work->worker;
-
- /*
- * del_timer_sync() must be called to make sure that the timer
- * callback is not running. The lock must be temporary released
- * to avoid a deadlock with the callback. In the meantime,
- * any queuing is blocked by setting the canceling counter.
- */
- work->canceling++;
- raw_spin_unlock_irqrestore(&worker->lock, *flags);
- del_timer_sync(&dwork->timer);
- raw_spin_lock_irqsave(&worker->lock, *flags);
- work->canceling--;
- }
+ if (is_dwork)
+ kthread_cancel_delayed_work_timer(work, flags);
/*
* Try to remove the work from a worker list. It might either
_
Patches currently in -mm which might be from pmladek(a)suse.com are
kthread_worker-fix-return-value-when-kthread_mod_delayed_work-races-with-kthread_cancel_delayed_work_sync.patch
The patch titled
Subject: mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
has been removed from the -mm tree. Its filename was
mm-thp-another-pvmw_sync-fix-in-page_vma_mapped_walk.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case? That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.
Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Tested-by: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/mm/page_vma_mapped.c~mm-thp-another-pvmw_sync-fix-in-page_vma_mapped_walk
+++ a/mm/page_vma_mapped.c
@@ -276,6 +276,10 @@ next_pte:
goto restart;
}
pvmw->pte++;
+ if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
+ pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+ spin_lock(pvmw->ptl);
+ }
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
has been removed from the -mm tree. Its filename was
mm-thp-fix-page_vma_mapped_walk-if-thp-mapped-by-ptes.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
Running certain tests with a DEBUG_VM kernel would crash within hours, on
the total_mapcount BUG() in split_huge_page_to_list(), while trying to
free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which, on
a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
Crash dumps showed two tail pages of a shmem huge page remained mapped by
pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a
long unmapped range; and no page table had yet been allocated for the head
of the huge page to be mapped into.
Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in the
next.
Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before. Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times less
often: because of the way redundant levels are folded together, but folded
differently in different configurations, it was just too difficult to use
them correctly; and step_forward() is simpler anyway.
Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
--- a/mm/page_vma_mapped.c~mm-thp-fix-page_vma_mapped_walk-if-thp-mapped-by-ptes
+++ a/mm/page_vma_mapped.c
@@ -116,6 +116,13 @@ static bool check_pte(struct page_vma_ma
return pfn_is_match(pvmw->page, pfn);
}
+static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
+{
+ pvmw->address = (pvmw->address + size) & ~(size - 1);
+ if (!pvmw->address)
+ pvmw->address = ULONG_MAX;
+}
+
/**
* page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
* @pvmw->address
@@ -183,16 +190,22 @@ bool page_vma_mapped_walk(struct page_vm
if (pvmw->pte)
goto next_pte;
restart:
- {
+ do {
pgd = pgd_offset(mm, pvmw->address);
- if (!pgd_present(*pgd))
- return false;
+ if (!pgd_present(*pgd)) {
+ step_forward(pvmw, PGDIR_SIZE);
+ continue;
+ }
p4d = p4d_offset(pgd, pvmw->address);
- if (!p4d_present(*p4d))
- return false;
+ if (!p4d_present(*p4d)) {
+ step_forward(pvmw, P4D_SIZE);
+ continue;
+ }
pud = pud_offset(p4d, pvmw->address);
- if (!pud_present(*pud))
- return false;
+ if (!pud_present(*pud)) {
+ step_forward(pvmw, PUD_SIZE);
+ continue;
+ }
pvmw->pmd = pmd_offset(pud, pvmw->address);
/*
@@ -239,7 +252,8 @@ restart:
spin_unlock(ptl);
}
- return false;
+ step_forward(pvmw, PMD_SIZE);
+ continue;
}
if (!map_pte(pvmw))
goto next_pte;
@@ -269,7 +283,9 @@ next_pte:
spin_lock(pvmw->ptl);
}
goto this_pte;
- }
+ } while (pvmw->address < end);
+
+ return false;
}
/**
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): get vma_address_end() earlier
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-get-vma_address_end-earlier.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): get vma_address_end() earlier
page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start,
rather than later at next_pte. It's a little unnecessary overhead on the
first call, but makes for a simpler loop in the following commit.
Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-get-vma_address_end-earlier
+++ a/mm/page_vma_mapped.c
@@ -171,6 +171,15 @@ bool page_vma_mapped_walk(struct page_vm
return true;
}
+ /*
+ * Seek to next pte only makes sense for THP.
+ * But more important than that optimization, is to filter out
+ * any PageKsm page: whose page->index misleads vma_address()
+ * and vma_address_end() to disaster.
+ */
+ end = PageTransCompound(page) ?
+ vma_address_end(page, pvmw->vma) :
+ pvmw->address + PAGE_SIZE;
if (pvmw->pte)
goto next_pte;
restart:
@@ -238,10 +247,6 @@ this_pte:
if (check_pte(pvmw))
return true;
next_pte:
- /* Seek to next pte only makes sense for THP */
- if (!PageTransHuge(page))
- return not_found(pvmw);
- end = vma_address_end(page, pvmw->vma);
do {
pvmw->address += PAGE_SIZE;
if (pvmw->address >= end)
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): use goto instead of while (1)
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-use-goto-instead-of-while-1.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): use goto instead of while (1)
page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
and use "goto this_pte", in place of the "while (1)" loop at the end.
Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-goto-instead-of-while-1
+++ a/mm/page_vma_mapped.c
@@ -144,6 +144,7 @@ bool page_vma_mapped_walk(struct page_vm
{
struct mm_struct *mm = pvmw->vma->vm_mm;
struct page *page = pvmw->page;
+ unsigned long end;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -233,10 +234,7 @@ restart:
}
if (!map_pte(pvmw))
goto next_pte;
- }
- while (1) {
- unsigned long end;
-
+this_pte:
if (check_pte(pvmw))
return true;
next_pte:
@@ -265,6 +263,7 @@ next_pte:
pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
spin_lock(pvmw->ptl);
}
+ goto this_pte;
}
}
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): add a level of indentation
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-add-a-level-of-indentation.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): add a level of indentation
page_vma_mapped_walk() cleanup: add a level of indentation to much of the
body, making no functional change in this commit, but reducing the later
diff when this is all converted to a loop.
[hughd(a)google.com: : page_vma_mapped_walk(): add a level of indentation fix]
Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com
Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 105 +++++++++++++++++++++--------------------
1 file changed, 55 insertions(+), 50 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-add-a-level-of-indentation
+++ a/mm/page_vma_mapped.c
@@ -173,62 +173,67 @@ bool page_vma_mapped_walk(struct page_vm
if (pvmw->pte)
goto next_pte;
restart:
- pgd = pgd_offset(mm, pvmw->address);
- if (!pgd_present(*pgd))
- return false;
- p4d = p4d_offset(pgd, pvmw->address);
- if (!p4d_present(*p4d))
- return false;
- pud = pud_offset(p4d, pvmw->address);
- if (!pud_present(*pud))
- return false;
- pvmw->pmd = pmd_offset(pud, pvmw->address);
- /*
- * Make sure the pmd value isn't cached in a register by the
- * compiler and used as a stale value after we've observed a
- * subsequent update.
- */
- pmde = READ_ONCE(*pvmw->pmd);
- if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
- pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- pmde = *pvmw->pmd;
- if (likely(pmd_trans_huge(pmde))) {
- if (pvmw->flags & PVMW_MIGRATION)
- return not_found(pvmw);
- if (pmd_page(pmde) != page)
- return not_found(pvmw);
- return true;
- }
- if (!pmd_present(pmde)) {
- swp_entry_t entry;
+ {
+ pgd = pgd_offset(mm, pvmw->address);
+ if (!pgd_present(*pgd))
+ return false;
+ p4d = p4d_offset(pgd, pvmw->address);
+ if (!p4d_present(*p4d))
+ return false;
+ pud = pud_offset(p4d, pvmw->address);
+ if (!pud_present(*pud))
+ return false;
- if (!thp_migration_supported() ||
- !(pvmw->flags & PVMW_MIGRATION))
- return not_found(pvmw);
- entry = pmd_to_swp_entry(pmde);
- if (!is_migration_entry(entry) ||
- migration_entry_to_page(entry) != page)
- return not_found(pvmw);
- return true;
- }
- /* THP pmd was split under us: handle on pte level */
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
- } else if (!pmd_present(pmde)) {
+ pvmw->pmd = pmd_offset(pud, pvmw->address);
/*
- * If PVMW_SYNC, take and drop THP pmd lock so that we
- * cannot return prematurely, while zap_huge_pmd() has
- * cleared *pmd but not decremented compound_mapcount().
+ * Make sure the pmd value isn't cached in a register by the
+ * compiler and used as a stale value after we've observed a
+ * subsequent update.
*/
- if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
- spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+ pmde = READ_ONCE(*pvmw->pmd);
+
+ if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+ pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+ pmde = *pvmw->pmd;
+ if (likely(pmd_trans_huge(pmde))) {
+ if (pvmw->flags & PVMW_MIGRATION)
+ return not_found(pvmw);
+ if (pmd_page(pmde) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ if (!pmd_present(pmde)) {
+ swp_entry_t entry;
- spin_unlock(ptl);
+ if (!thp_migration_supported() ||
+ !(pvmw->flags & PVMW_MIGRATION))
+ return not_found(pvmw);
+ entry = pmd_to_swp_entry(pmde);
+ if (!is_migration_entry(entry) ||
+ migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ /* THP pmd was split under us: handle on pte level */
+ spin_unlock(pvmw->ptl);
+ pvmw->ptl = NULL;
+ } else if (!pmd_present(pmde)) {
+ /*
+ * If PVMW_SYNC, take and drop THP pmd lock so that we
+ * cannot return prematurely, while zap_huge_pmd() has
+ * cleared *pmd but not decremented compound_mapcount().
+ */
+ if ((pvmw->flags & PVMW_SYNC) &&
+ PageTransCompound(page)) {
+ spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+
+ spin_unlock(ptl);
+ }
+ return false;
}
- return false;
+ if (!map_pte(pvmw))
+ goto next_pte;
}
- if (!map_pte(pvmw))
- goto next_pte;
while (1) {
unsigned long end;
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): crossing page table boundary
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-crossing-page-table-boundary.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): crossing page table boundary
page_vma_mapped_walk() cleanup: adjust the test for crossing page table
boundary - I believe pvmw->address is always page-aligned, but nothing
else here assumed that; and remember to reset pvmw->pte to NULL after
unmapping the page table, though I never saw any bug from that.
Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-crossing-page-table-boundary
+++ a/mm/page_vma_mapped.c
@@ -244,16 +244,16 @@ next_pte:
if (pvmw->address >= end)
return not_found(pvmw);
/* Did we cross page table boundary? */
- if (pvmw->address % PMD_SIZE == 0) {
- pte_unmap(pvmw->pte);
+ if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
if (pvmw->ptl) {
spin_unlock(pvmw->ptl);
pvmw->ptl = NULL;
}
+ pte_unmap(pvmw->pte);
+ pvmw->pte = NULL;
goto restart;
- } else {
- pvmw->pte++;
}
+ pvmw->pte++;
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-prettify-pvmw_migration-block.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
follow the same "return not_found, return not_found, return true" pattern
as the block above it (note: returning not_found there is never premature,
since existence or prior existence of huge pmd guarantees good alignment).
Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 30 ++++++++++++++----------------
1 file changed, 14 insertions(+), 16 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-prettify-pvmw_migration-block
+++ a/mm/page_vma_mapped.c
@@ -198,24 +198,22 @@ restart:
if (pmd_page(pmde) != page)
return not_found(pvmw);
return true;
- } else if (!pmd_present(pmde)) {
- if (thp_migration_supported()) {
- if (!(pvmw->flags & PVMW_MIGRATION))
- return not_found(pvmw);
- if (is_migration_entry(pmd_to_swp_entry(pmde))) {
- swp_entry_t entry = pmd_to_swp_entry(pmde);
+ }
+ if (!pmd_present(pmde)) {
+ swp_entry_t entry;
- if (migration_entry_to_page(entry) != page)
- return not_found(pvmw);
- return true;
- }
- }
- return not_found(pvmw);
- } else {
- /* THP pmd was split under us: handle on pte level */
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
+ if (!thp_migration_supported() ||
+ !(pvmw->flags & PVMW_MIGRATION))
+ return not_found(pvmw);
+ entry = pmd_to_swp_entry(pmde);
+ if (!is_migration_entry(entry) ||
+ migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
}
+ /* THP pmd was split under us: handle on pte level */
+ spin_unlock(pvmw->ptl);
+ pvmw->ptl = NULL;
} else if (!pmd_present(pmde)) {
/*
* If PVMW_SYNC, take and drop THP pmd lock so that we
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-use-pmde-for-pvmw-pmd.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
use it in subsequent tests, instead of repeatedly dereferencing pointer.
Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-pmde-for-pvmw-pmd
+++ a/mm/page_vma_mapped.c
@@ -191,18 +191,19 @@ restart:
pmde = READ_ONCE(*pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- if (likely(pmd_trans_huge(*pvmw->pmd))) {
+ pmde = *pvmw->pmd;
+ if (likely(pmd_trans_huge(pmde))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
- if (pmd_page(*pvmw->pmd) != page)
+ if (pmd_page(pmde) != page)
return not_found(pvmw);
return true;
- } else if (!pmd_present(*pvmw->pmd)) {
+ } else if (!pmd_present(pmde)) {
if (thp_migration_supported()) {
if (!(pvmw->flags & PVMW_MIGRATION))
return not_found(pvmw);
- if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
- swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+ if (is_migration_entry(pmd_to_swp_entry(pmde))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmde);
if (migration_entry_to_page(entry) != page)
return not_found(pvmw);
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): settle PageHuge on entry
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-settle-pagehuge-on-entry.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): settle PageHuge on entry
page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the
way at the start, so no need to worry about it later.
Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-settle-pagehuge-on-entry
+++ a/mm/page_vma_mapped.c
@@ -153,10 +153,11 @@ bool page_vma_mapped_walk(struct page_vm
if (pvmw->pmd && !pvmw->pte)
return not_found(pvmw);
- if (pvmw->pte)
- goto next_pte;
-
if (unlikely(PageHuge(page))) {
+ /* The only possible mapping was handled on last iteration */
+ if (pvmw->pte)
+ return not_found(pvmw);
+
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
if (!pvmw->pte)
@@ -168,6 +169,9 @@ bool page_vma_mapped_walk(struct page_vm
return not_found(pvmw);
return true;
}
+
+ if (pvmw->pte)
+ goto next_pte;
restart:
pgd = pgd_offset(mm, pvmw->address);
if (!pgd_present(*pgd))
@@ -233,7 +237,7 @@ restart:
return true;
next_pte:
/* Seek to next pte only makes sense for THP */
- if (!PageTransHuge(page) || PageHuge(page))
+ if (!PageTransHuge(page))
return not_found(pvmw);
end = vma_address_end(page, pvmw->vma);
do {
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: page_vma_mapped_walk(): use page for pvmw->page
has been removed from the -mm tree. Its filename was
mm-page_vma_mapped_walk-use-page-for-pvmw-page.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: page_vma_mapped_walk(): use page for pvmw->page
Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".
I've marked all of these for stable: many are merely cleanups,
but I think they are much better before the main fix than after.
This patch (of 11):
page_vma_mapped_walk() cleanup: sometimes the local copy of pvwm->page was
used, sometimes pvmw->page itself: use the local copy "page" throughout.
Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com
Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Alistair Popple <apopple(a)nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Peter Xu <peterx(a)redhat.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_vma_mapped.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
--- a/mm/page_vma_mapped.c~mm-page_vma_mapped_walk-use-page-for-pvmw-page
+++ a/mm/page_vma_mapped.c
@@ -156,7 +156,7 @@ bool page_vma_mapped_walk(struct page_vm
if (pvmw->pte)
goto next_pte;
- if (unlikely(PageHuge(pvmw->page))) {
+ if (unlikely(PageHuge(page))) {
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
if (!pvmw->pte)
@@ -217,8 +217,7 @@ restart:
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
- if ((pvmw->flags & PVMW_SYNC) &&
- PageTransCompound(pvmw->page)) {
+ if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
spin_unlock(ptl);
@@ -234,9 +233,9 @@ restart:
return true;
next_pte:
/* Seek to next pte only makes sense for THP */
- if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
+ if (!PageTransHuge(page) || PageHuge(page))
return not_found(pvmw);
- end = vma_address_end(pvmw->page, pvmw->vma);
+ end = vma_address_end(page, pvmw->vma);
do {
pvmw->address += PAGE_SIZE;
if (pvmw->address >= end)
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
has been removed from the -mm tree. Its filename was
mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Yang Shi <shy828301(a)gmail.com>
Subject: mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount() however
it may miss the failure when head page is unmapped but other subpage is
mapped. Then the second DEBUG_VM BUG() that check total mapcount would
catch it. This may incur some confusion. And this is not a fatal issue,
so consolidate the two DEBUG_VM checks into one VM_WARN_ON_ONCE_PAGE().
[1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
Signed-off-by: Yang Shi <shy828301(a)gmail.com>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jue Wang <juew(a)google.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)
--- a/mm/huge_memory.c~mm-thp-replace-debug_vm-bug-with-vm_warn-when-unmap-fails-for-split
+++ a/mm/huge_memory.c
@@ -2352,15 +2352,15 @@ static void unmap_page(struct page *page
{
enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
- bool unmap_success;
VM_BUG_ON_PAGE(!PageHead(page), page);
if (PageAnon(page))
ttu_flags |= TTU_SPLIT_FREEZE;
- unmap_success = try_to_unmap(page, ttu_flags);
- VM_BUG_ON_PAGE(!unmap_success, page);
+ try_to_unmap(page, ttu_flags);
+
+ VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
}
static void remap_page(struct page *page, unsigned int nr)
@@ -2671,7 +2671,7 @@ int split_huge_page_to_list(struct page
struct deferred_split *ds_queue = get_deferred_split_queue(head);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
- int count, mapcount, extra_pins, ret;
+ int extra_pins, ret;
pgoff_t end;
VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2730,7 +2730,6 @@ int split_huge_page_to_list(struct page
}
unmap_page(head);
- VM_BUG_ON_PAGE(compound_mapcount(head), head);
/* block interrupt reentry in xa_lock and spinlock */
local_irq_disable();
@@ -2748,9 +2747,7 @@ int split_huge_page_to_list(struct page
/* Prevent deferred_split_scan() touching ->_refcount */
spin_lock(&ds_queue->split_queue_lock);
- count = page_count(head);
- mapcount = total_mapcount(head);
- if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
+ if (page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
ds_queue->split_queue_len--;
list_del(page_deferred_list(head));
@@ -2770,16 +2767,9 @@ int split_huge_page_to_list(struct page
__split_huge_page(page, list, end);
ret = 0;
} else {
- if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
- pr_alert("total_mapcount: %u, page_count(): %u\n",
- mapcount, count);
- if (PageTail(page))
- dump_page(head, NULL);
- dump_page(page, "total_mapcount(head) > 0");
- BUG();
- }
spin_unlock(&ds_queue->split_queue_lock);
-fail: if (mapping)
+fail:
+ if (mapping)
xa_unlock(&mapping->i_pages);
local_irq_enable();
remap_page(head, thp_nr_pages(head));
_
Patches currently in -mm which might be from shy828301(a)gmail.com are
mm-mempolicy-dont-have-to-split-pmd-for-huge-zero-page.patch
mm-memory-add-orig_pmd-to-struct-vm_fault.patch
mm-memory-make-numa_migrate_prep-non-static.patch
mm-thp-refactor-numa-fault-handling.patch
mm-migrate-account-thp-numa-migration-counters-correctly.patch
mm-migrate-dont-split-thp-for-misplaced-numa-page.patch
mm-migrate-check-mapcount-for-thp-instead-of-refcount.patch
mm-thp-skip-make-pmd-prot_none-if-thp-migration-is-not-supported.patch
mm-rmap-make-try_to_unmap-void-function.patch
The patch titled
Subject: mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
has been removed from the -mm tree. Its filename was
mm-thp-unmap_mapping_page-to-fix-thp-truncate_cleanup_page.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it,
but before its page_remove_rmap() gets to decrement compound_mapcount:
generating false "BUG: Bad page cache" reports that the page is still
mapped when deleted. This commit fixes that, but not in the way I hoped.
The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has often
been an annoyance that we usually call unmap_mapping_range() with no pages
locked, but there apply it to a single locked page. try_to_unmap() looks
more suitable for a single locked page.
However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it
is used to insert THP migration entries, but not used to unmap THPs. Copy
zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are
different, I'm too ignorant of the DAX cases, and couldn't decide how far
to go for anon+swap. Set that aside.
The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end. Nice. But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage. It would need serious help to get working on powerpc (with a minor
optimization issue on s390 too). Set that aside.
Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.
This successful iteration introduces "unmap_mapping_page(page)" instead of
try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with
an addition to details. Then zap_pmd_range() watches for this case, and
does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now
does in the PVMW_SYNC case. Not pretty, but safe.
Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert
its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not. Along these lines, in invalidate_inode_pages2_range() move
the initial unmap_mapping_range() out from under page lock, before then
calling unmap_mapping_page() under page lock if still mapped.
Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jue Wang <juew(a)google.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 3 +++
mm/memory.c | 41 +++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 43 +++++++++++++++++++------------------------
3 files changed, 63 insertions(+), 24 deletions(-)
--- a/include/linux/mm.h~mm-thp-unmap_mapping_page-to-fix-thp-truncate_cleanup_page
+++ a/include/linux/mm.h
@@ -1719,6 +1719,7 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
+ struct page *single_page; /* Locked page to be unmapped */
};
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1766,6 +1767,7 @@ extern vm_fault_t handle_mm_fault(struct
extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
+void unmap_mapping_page(struct page *page);
void unmap_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t nr, bool even_cows);
void unmap_mapping_range(struct address_space *mapping,
@@ -1786,6 +1788,7 @@ static inline int fixup_user_fault(struc
BUG();
return -EFAULT;
}
+static inline void unmap_mapping_page(struct page *page) { }
static inline void unmap_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t nr, bool even_cows) { }
static inline void unmap_mapping_range(struct address_space *mapping,
--- a/mm/memory.c~mm-thp-unmap_mapping_page-to-fix-thp-truncate_cleanup_page
+++ a/mm/memory.c
@@ -1361,7 +1361,18 @@ static inline unsigned long zap_pmd_rang
else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
+ } else if (details && details->single_page &&
+ PageTransCompound(details->single_page) &&
+ next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) {
+ spinlock_t *ptl = pmd_lock(tlb->mm, pmd);
+ /*
+ * Take and drop THP pmd lock so that we cannot return
+ * prematurely, while zap_huge_pmd() has cleared *pmd,
+ * but not yet decremented compound_mapcount().
+ */
+ spin_unlock(ptl);
}
+
/*
* Here there can be other concurrent MADV_DONTNEED or
* trans huge page faults running, and if the pmd is
@@ -3237,6 +3248,36 @@ static inline void unmap_mapping_range_t
}
/**
+ * unmap_mapping_page() - Unmap single page from processes.
+ * @page: The locked page to be unmapped.
+ *
+ * Unmap this page from any userspace process which still has it mmaped.
+ * Typically, for efficiency, the range of nearby pages has already been
+ * unmapped by unmap_mapping_pages() or unmap_mapping_range(). But once
+ * truncation or invalidation holds the lock on a page, it may find that
+ * the page has been remapped again: and then uses unmap_mapping_page()
+ * to unmap it finally.
+ */
+void unmap_mapping_page(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct zap_details details = { };
+
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageTail(page));
+
+ details.check_mapping = mapping;
+ details.first_index = page->index;
+ details.last_index = page->index + thp_nr_pages(page) - 1;
+ details.single_page = page;
+
+ i_mmap_lock_write(mapping);
+ if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
+ unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ i_mmap_unlock_write(mapping);
+}
+
+/**
* unmap_mapping_pages() - Unmap pages from processes.
* @mapping: The address space containing pages to be unmapped.
* @start: Index of first page to be unmapped.
--- a/mm/truncate.c~mm-thp-unmap_mapping_page-to-fix-thp-truncate_cleanup_page
+++ a/mm/truncate.c
@@ -167,13 +167,10 @@ void do_invalidatepage(struct page *page
* its lock, b) when a concurrent invalidate_mapping_pages got there first and
* c) when tmpfs swizzles a page between a tmpfs inode and swapper_space.
*/
-static void
-truncate_cleanup_page(struct address_space *mapping, struct page *page)
+static void truncate_cleanup_page(struct page *page)
{
- if (page_mapped(page)) {
- unsigned int nr = thp_nr_pages(page);
- unmap_mapping_pages(mapping, page->index, nr, false);
- }
+ if (page_mapped(page))
+ unmap_mapping_page(page);
if (page_has_private(page))
do_invalidatepage(page, 0, thp_size(page));
@@ -218,7 +215,7 @@ int truncate_inode_page(struct address_s
if (page->mapping != mapping)
return -EIO;
- truncate_cleanup_page(mapping, page);
+ truncate_cleanup_page(page);
delete_from_page_cache(page);
return 0;
}
@@ -325,7 +322,7 @@ void truncate_inode_pages_range(struct a
index = indices[pagevec_count(&pvec) - 1] + 1;
truncate_exceptional_pvec_entries(mapping, &pvec, indices);
for (i = 0; i < pagevec_count(&pvec); i++)
- truncate_cleanup_page(mapping, pvec.pages[i]);
+ truncate_cleanup_page(pvec.pages[i]);
delete_from_page_cache_batch(mapping, &pvec);
for (i = 0; i < pagevec_count(&pvec); i++)
unlock_page(pvec.pages[i]);
@@ -639,6 +636,16 @@ int invalidate_inode_pages2_range(struct
continue;
}
+ if (!did_range_unmap && page_mapped(page)) {
+ /*
+ * If page is mapped, before taking its lock,
+ * zap the rest of the file in one hit.
+ */
+ unmap_mapping_pages(mapping, index,
+ (1 + end - index), false);
+ did_range_unmap = 1;
+ }
+
lock_page(page);
WARN_ON(page_to_index(page) != index);
if (page->mapping != mapping) {
@@ -646,23 +653,11 @@ int invalidate_inode_pages2_range(struct
continue;
}
wait_on_page_writeback(page);
- if (page_mapped(page)) {
- if (!did_range_unmap) {
- /*
- * Zap the rest of the file in one hit.
- */
- unmap_mapping_pages(mapping, index,
- (1 + end - index), false);
- did_range_unmap = 1;
- } else {
- /*
- * Just zap this page
- */
- unmap_mapping_pages(mapping, index,
- 1, false);
- }
- }
+
+ if (page_mapped(page))
+ unmap_mapping_page(page);
BUG_ON(page_mapped(page));
+
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm/thp: fix page_address_in_vma() on file THP tails
has been removed from the -mm tree. Its filename was
mm-thp-fix-page_address_in_vma-on-file-thp-tails.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Jue Wang <juew(a)google.com>
Subject: mm/thp: fix page_address_in_vma() on file THP tails
Anon THP tails were already supported, but memory-failure may need to use
page_address_in_vma() on file THP tails, which its page->mapping check did
not permit: fix it.
hughd adds: no current usage is known to hit the issue, but this does fix
a subtle trap in a general helper: best fixed in stable sooner than later.
Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew(a)google.com>
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/rmap.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/rmap.c~mm-thp-fix-page_address_in_vma-on-file-thp-tails
+++ a/mm/rmap.c
@@ -716,11 +716,11 @@ unsigned long page_address_in_vma(struct
if (!vma->anon_vma || !page__anon_vma ||
vma->anon_vma->root != page__anon_vma->root)
return -EFAULT;
- } else if (page->mapping) {
- if (!vma->vm_file || vma->vm_file->f_mapping != page->mapping)
- return -EFAULT;
- } else
+ } else if (!vma->vm_file) {
+ return -EFAULT;
+ } else if (vma->vm_file->f_mapping != compound_head(page)->mapping) {
return -EFAULT;
+ }
return vma_address(page, vma);
}
_
Patches currently in -mm which might be from juew(a)google.com are
The patch titled
Subject: mm/thp: fix vma_address() if virtual address below file offset
has been removed from the -mm tree. Its filename was
mm-thp-fix-vma_address-if-virtual-address-below-file-offset.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: fix vma_address() if virtual address below file offset
Running certain tests with a DEBUG_VM kernel would crash within hours, on
the total_mapcount BUG() in split_huge_page_to_list(), while trying to
free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which, on
a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap().
vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low, but vm_pgoff not so low: vma_address()
chooses max(start, vma->vm_start), but that decides on the wrong address,
because start has become almost ULONG_MAX.
Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can be
safely used from page_mapped_in_vma() and page_address_in_vma() too.
Add vma_address_end() to apply similar care to end address calculation, in
page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.
An irritation is that their apparent generality breaks down on KSM pages,
which cannot be located by the page->index that page_to_pgoff() uses: as
4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages")
once discovered. I dithered over the best thing to do about that, and
have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and
vma_address_end(); though the only place in danger of using it on them was
try_to_unmap_one().
Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured. try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never mattered.
Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jue Wang <juew(a)google.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/internal.h | 51 ++++++++++++++++++++++++++++++-----------
mm/page_vma_mapped.c | 16 ++++--------
mm/rmap.c | 16 ++++++------
3 files changed, 52 insertions(+), 31 deletions(-)
--- a/mm/internal.h~mm-thp-fix-vma_address-if-virtual-address-below-file-offset
+++ a/mm/internal.h
@@ -384,27 +384,52 @@ static inline void mlock_migrate_page(st
extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
/*
- * At what user virtual address is page expected in @vma?
+ * At what user virtual address is page expected in vma?
+ * Returns -EFAULT if all of the page is outside the range of vma.
+ * If page is a compound head, the entire compound page is considered.
*/
static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address(struct page *page, struct vm_area_struct *vma)
{
- pgoff_t pgoff = page_to_pgoff(page);
- return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ pgoff_t pgoff;
+ unsigned long address;
+
+ VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */
+ pgoff = page_to_pgoff(page);
+ if (pgoff >= vma->vm_pgoff) {
+ address = vma->vm_start +
+ ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ /* Check for address beyond vma (or wrapped through 0?) */
+ if (address < vma->vm_start || address >= vma->vm_end)
+ address = -EFAULT;
+ } else if (PageHead(page) &&
+ pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
+ /* Test above avoids possibility of wrap to 0 on 32-bit */
+ address = vma->vm_start;
+ } else {
+ address = -EFAULT;
+ }
+ return address;
}
+/*
+ * Then at what user virtual address will none of the page be found in vma?
+ * Assumes that vma_address() already returned a good starting address.
+ * If page is a compound head, the entire compound page is considered.
+ */
static inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+vma_address_end(struct page *page, struct vm_area_struct *vma)
{
- unsigned long start, end;
-
- start = __vma_address(page, vma);
- end = start + thp_size(page) - PAGE_SIZE;
-
- /* page should be within @vma mapping range */
- VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
+ pgoff_t pgoff;
+ unsigned long address;
- return max(start, vma->vm_start);
+ VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */
+ pgoff = page_to_pgoff(page) + compound_nr(page);
+ address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+ /* Check for address beyond vma (or wrapped through 0?) */
+ if (address < vma->vm_start || address > vma->vm_end)
+ address = vma->vm_end;
+ return address;
}
static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
--- a/mm/page_vma_mapped.c~mm-thp-fix-vma_address-if-virtual-address-below-file-offset
+++ a/mm/page_vma_mapped.c
@@ -228,18 +228,18 @@ restart:
if (!map_pte(pvmw))
goto next_pte;
while (1) {
+ unsigned long end;
+
if (check_pte(pvmw))
return true;
next_pte:
/* Seek to next pte only makes sense for THP */
if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
return not_found(pvmw);
+ end = vma_address_end(pvmw->page, pvmw->vma);
do {
pvmw->address += PAGE_SIZE;
- if (pvmw->address >= pvmw->vma->vm_end ||
- pvmw->address >=
- __vma_address(pvmw->page, pvmw->vma) +
- thp_size(pvmw->page))
+ if (pvmw->address >= end)
return not_found(pvmw);
/* Did we cross page table boundary? */
if (pvmw->address % PMD_SIZE == 0) {
@@ -277,14 +277,10 @@ int page_mapped_in_vma(struct page *page
.vma = vma,
.flags = PVMW_SYNC,
};
- unsigned long start, end;
-
- start = __vma_address(page, vma);
- end = start + thp_size(page) - PAGE_SIZE;
- if (unlikely(end < vma->vm_start || start >= vma->vm_end))
+ pvmw.address = vma_address(page, vma);
+ if (pvmw.address == -EFAULT)
return 0;
- pvmw.address = max(start, vma->vm_start);
if (!page_vma_mapped_walk(&pvmw))
return 0;
page_vma_mapped_walk_done(&pvmw);
--- a/mm/rmap.c~mm-thp-fix-vma_address-if-virtual-address-below-file-offset
+++ a/mm/rmap.c
@@ -707,7 +707,6 @@ static bool should_defer_flush(struct mm
*/
unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
{
- unsigned long address;
if (PageAnon(page)) {
struct anon_vma *page__anon_vma = page_anon_vma(page);
/*
@@ -722,10 +721,8 @@ unsigned long page_address_in_vma(struct
return -EFAULT;
} else
return -EFAULT;
- address = __vma_address(page, vma);
- if (unlikely(address < vma->vm_start || address >= vma->vm_end))
- return -EFAULT;
- return address;
+
+ return vma_address(page, vma);
}
pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
@@ -919,7 +916,7 @@ static bool page_mkclean_one(struct page
*/
mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
0, vma, vma->vm_mm, address,
- min(vma->vm_end, address + page_size(page)));
+ vma_address_end(page, vma));
mmu_notifier_invalidate_range_start(&range);
while (page_vma_mapped_walk(&pvmw)) {
@@ -1435,9 +1432,10 @@ static bool try_to_unmap_one(struct page
* Note that the page can not be free in this function as call of
* try_to_unmap() must hold a reference on the page.
*/
+ range.end = PageKsm(page) ?
+ address + PAGE_SIZE : vma_address_end(page, vma);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
- address,
- min(vma->vm_end, address + page_size(page)));
+ address, range.end);
if (PageHuge(page)) {
/*
* If sharing is possible, start and end will be adjusted
@@ -1889,6 +1887,7 @@ static void rmap_walk_anon(struct page *
struct vm_area_struct *vma = avc->vma;
unsigned long address = vma_address(page, vma);
+ VM_BUG_ON_VMA(address == -EFAULT, vma);
cond_resched();
if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
@@ -1943,6 +1942,7 @@ static void rmap_walk_file(struct page *
pgoff_start, pgoff_end) {
unsigned long address = vma_address(page, vma);
+ VM_BUG_ON_VMA(address == -EFAULT, vma);
cond_resched();
if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
has been removed from the -mm tree. Its filename was
mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e. mapcount 0.
And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it
is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and
further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all
indicative of some mapcount difficulty in development here perhaps. But
the !CONFIG_DEBUG_VM path handles the failures correctly and silently.
I believe the problem is that once a racing unmap has cleared pte or pmd,
try_to_unmap_one() may skip taking the page table lock, and emerge from
try_to_unmap() before the racing task has reached decrementing mapcount.
Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC
to the options, and passing that from unmap_page().
When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem. Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated.
Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jue Wang <juew(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/rmap.h | 1 +
mm/huge_memory.c | 2 +-
mm/page_vma_mapped.c | 11 +++++++++++
mm/rmap.c | 17 ++++++++++++++++-
4 files changed, 29 insertions(+), 2 deletions(-)
--- a/include/linux/rmap.h~mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting
+++ a/include/linux/rmap.h
@@ -91,6 +91,7 @@ enum ttu_flags {
TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */
TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */
+ TTU_SYNC = 0x10, /* avoid racy checks with PVMW_SYNC */
TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
TTU_BATCH_FLUSH = 0x40, /* Batch TLB flushes where possible
* and caller guarantees they will
--- a/mm/huge_memory.c~mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting
+++ a/mm/huge_memory.c
@@ -2350,7 +2350,7 @@ void vma_adjust_trans_huge(struct vm_are
static void unmap_page(struct page *page)
{
- enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
+ enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
bool unmap_success;
--- a/mm/page_vma_mapped.c~mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting
+++ a/mm/page_vma_mapped.c
@@ -212,6 +212,17 @@ restart:
pvmw->ptl = NULL;
}
} else if (!pmd_present(pmde)) {
+ /*
+ * If PVMW_SYNC, take and drop THP pmd lock so that we
+ * cannot return prematurely, while zap_huge_pmd() has
+ * cleared *pmd but not decremented compound_mapcount().
+ */
+ if ((pvmw->flags & PVMW_SYNC) &&
+ PageTransCompound(pvmw->page)) {
+ spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+
+ spin_unlock(ptl);
+ }
return false;
}
if (!map_pte(pvmw))
--- a/mm/rmap.c~mm-thp-try_to_unmap-use-ttu_sync-for-safe-splitting
+++ a/mm/rmap.c
@@ -1405,6 +1405,15 @@ static bool try_to_unmap_one(struct page
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ /*
+ * When racing against e.g. zap_pte_range() on another cpu,
+ * in between its ptep_get_and_clear_full() and page_remove_rmap(),
+ * try_to_unmap() may return false when it is about to become true,
+ * if page table locking is skipped: use TTU_SYNC to wait for that.
+ */
+ if (flags & TTU_SYNC)
+ pvmw.flags = PVMW_SYNC;
+
/* munlock has nothing to gain from examining un-locked vmas */
if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
return true;
@@ -1777,7 +1786,13 @@ bool try_to_unmap(struct page *page, enu
else
rmap_walk(page, &rwc);
- return !page_mapcount(page) ? true : false;
+ /*
+ * When racing against e.g. zap_pte_range() on another cpu,
+ * in between its ptep_get_and_clear_full() and page_remove_rmap(),
+ * try_to_unmap() may return false when it is about to become true,
+ * if page table locking is skipped: use TTU_SYNC to wait for that.
+ */
+ return !page_mapcount(page);
}
/**
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm/thp: make is_huge_zero_pmd() safe and quicker
has been removed from the -mm tree. Its filename was
mm-thp-make-is_huge_zero_pmd-safe-and-quicker.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: make is_huge_zero_pmd() safe and quicker
Most callers of is_huge_zero_pmd() supply a pmd already verified present;
but a few (notably zap_huge_pmd()) do not - it might be a pmd migration
entry, in which the pfn is encoded differently from a present pmd: which
might pass the is_huge_zero_pmd() test (though not on x86, since L1TF
forced us to protect against that); or perhaps even crash in pmd_page()
applied to a swap-like entry.
Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself;
and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd()
will not need to do that pmd_page() lookup each time.
__split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
but is unnecessary now that is_huge_zero_pmd() checks present.
Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jue Wang <juew(a)google.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/huge_mm.h | 8 +++++++-
mm/huge_memory.c | 5 ++++-
2 files changed, 11 insertions(+), 2 deletions(-)
--- a/include/linux/huge_mm.h~mm-thp-make-is_huge_zero_pmd-safe-and-quicker
+++ a/include/linux/huge_mm.h
@@ -286,6 +286,7 @@ struct page *follow_devmap_pud(struct vm
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
extern struct page *huge_zero_page;
+extern unsigned long huge_zero_pfn;
static inline bool is_huge_zero_page(struct page *page)
{
@@ -294,7 +295,7 @@ static inline bool is_huge_zero_page(str
static inline bool is_huge_zero_pmd(pmd_t pmd)
{
- return is_huge_zero_page(pmd_page(pmd));
+ return READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd) && pmd_present(pmd);
}
static inline bool is_huge_zero_pud(pud_t pud)
@@ -439,6 +440,11 @@ static inline bool is_huge_zero_page(str
{
return false;
}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return false;
+}
static inline bool is_huge_zero_pud(pud_t pud)
{
--- a/mm/huge_memory.c~mm-thp-make-is_huge_zero_pmd-safe-and-quicker
+++ a/mm/huge_memory.c
@@ -62,6 +62,7 @@ static struct shrinker deferred_split_sh
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
@@ -98,6 +99,7 @@ retry:
__free_pages(zero_page, compound_order(zero_page));
goto retry;
}
+ WRITE_ONCE(huge_zero_pfn, page_to_pfn(zero_page));
/* We take additional reference here. It will be put back by shrinker */
atomic_set(&huge_zero_refcount, 2);
@@ -147,6 +149,7 @@ static unsigned long shrink_huge_zero_pa
if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
struct page *zero_page = xchg(&huge_zero_page, NULL);
BUG_ON(zero_page == NULL);
+ WRITE_ONCE(huge_zero_pfn, ~0UL);
__free_pages(zero_page, compound_order(zero_page));
return HPAGE_PMD_NR;
}
@@ -2071,7 +2074,7 @@ static void __split_huge_pmd_locked(stru
return;
}
- if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
+ if (is_huge_zero_pmd(*pmd)) {
/*
* FIXME: Do we want to invalidate secondary mmu by calling
* mmu_notifier_invalidate_range() see comments below inside
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
has been removed from the -mm tree. Its filename was
mm-thp-fix-__split_huge_pmd_locked-on-shmem-migration-entry.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
Here is v2 batch of long-standing THP bug fixes that I had not got around
to sending before, but prompted now by Wang Yugui's report
https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and they
have done no harm, but have *not* fixed that issue: something more is
needed and I have no idea of what.
This patch (of 7):
Stressing huge tmpfs page migration racing hole punch often crashed on the
VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
kernel; or shortly afterwards, on a bad dereference in
__split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd
migration entries in the non-anonymous case.
Full disclosure: those particular experiments were on a kernel with more
relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
vanilla kernel: it is conceivable that stricter locking happens to avoid
those cases, or makes them less likely; but __split_huge_pmd_locked()
already allowed for pmd migration entries when handling anonymous THPs, so
this commit brings the shmem and file THP handling into line.
And while there: use old_pmd rather than _pmd, as in the following blocks;
and make it clearer to the eye that the !vma_is_anonymous() block is
self-contained, making an early return after accounting for unmapping.
Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Wang Yugui <wangyugui(a)e16-tech.com>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Ralph Campbell <rcampbell(a)nvidia.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Jue Wang <juew(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 27 ++++++++++++++++++---------
mm/pgtable-generic.c | 5 ++---
2 files changed, 20 insertions(+), 12 deletions(-)
--- a/mm/huge_memory.c~mm-thp-fix-__split_huge_pmd_locked-on-shmem-migration-entry
+++ a/mm/huge_memory.c
@@ -2044,7 +2044,7 @@ static void __split_huge_pmd_locked(stru
count_vm_event(THP_SPLIT_PMD);
if (!vma_is_anonymous(vma)) {
- _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
/*
* We are going to unmap this huge page. So
* just go ahead and zap it
@@ -2053,16 +2053,25 @@ static void __split_huge_pmd_locked(stru
zap_deposited_table(mm, pmd);
if (vma_is_special_huge(vma))
return;
- page = pmd_page(_pmd);
- if (!PageDirty(page) && pmd_dirty(_pmd))
- set_page_dirty(page);
- if (!PageReferenced(page) && pmd_young(_pmd))
- SetPageReferenced(page);
- page_remove_rmap(page, true);
- put_page(page);
+ if (unlikely(is_pmd_migration_entry(old_pmd))) {
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(old_pmd);
+ page = migration_entry_to_page(entry);
+ } else {
+ page = pmd_page(old_pmd);
+ if (!PageDirty(page) && pmd_dirty(old_pmd))
+ set_page_dirty(page);
+ if (!PageReferenced(page) && pmd_young(old_pmd))
+ SetPageReferenced(page);
+ page_remove_rmap(page, true);
+ put_page(page);
+ }
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
- } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
+ }
+
+ if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
* FIXME: Do we want to invalidate secondary mmu by calling
* mmu_notifier_invalidate_range() see comments below inside
--- a/mm/pgtable-generic.c~mm-thp-fix-__split_huge_pmd_locked-on-shmem-migration-entry
+++ a/mm/pgtable-generic.c
@@ -135,9 +135,8 @@ pmd_t pmdp_huge_clear_flush(struct vm_ar
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(!pmd_present(*pmdp));
- /* Below assumes pmd_present() is true */
- VM_BUG_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
+ !pmd_devmap(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-thp-remap_page-is-only-needed-on-anonymous-thp.patch
mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync.patch
The patch titled
Subject: mm, thp: use head page in __migration_entry_wait()
has been removed from the -mm tree. Its filename was
mm-thp-use-head-page-in-__migration_entry_wait.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Xu Yu <xuyu(a)linux.alibaba.com>
Subject: mm, thp: use head page in __migration_entry_wait()
We notice that hung task happens in a corner but practical scenario when
CONFIG_PREEMPT_NONE is enabled, as follows.
Process 0 Process 1 Process 2..Inf
split_huge_page_to_list
unmap_page
split_huge_pmd_address
__migration_entry_wait(head)
__migration_entry_wait(tail)
remap_page (roll back)
remove_migration_ptes
rmap_walk_anon
cond_resched
Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
copy_to_user in fstat, which will immediately fault again without
rescheduling, and thus occupy the cpu fully.
When there are too many processes performing __migration_entry_wait on
tail page, remap_page will never be done after cond_resched.
This makes __migration_entry_wait operate on the compound head page, thus
waits for remap_page to complete, whether the THP is split successfully or
roll back.
Note that put_and_wait_on_page_locked helps to drop the page reference
acquired with get_page_unless_zero, as soon as the page is on the wait
queue, before actually waiting. So splitting the THP is only prevented
for a brief interval.
Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.16231440…
Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
Suggested-by: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Gang Deng <gavin.dg(a)linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu(a)linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/migrate.c~mm-thp-use-head-page-in-__migration_entry_wait
+++ a/mm/migrate.c
@@ -295,6 +295,7 @@ void __migration_entry_wait(struct mm_st
goto out;
page = migration_entry_to_page(entry);
+ page = compound_head(page);
/*
* Once page cache replacement of page migration started, page_count
_
Patches currently in -mm which might be from xuyu(a)linux.alibaba.com are
The patch titled
Subject: mm/slub.c: include swab.h
has been removed from the -mm tree. Its filename was
mm-slubc-include-swabh.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Andrew Morton <akpm(a)linux-foundation.org>
Subject: mm/slub.c: include swab.h
Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y.
Hopefully. But it's the right thing to do anwyay.
Fixes: 1ad53d9fa3f61 ("slub: improve bit diffusion for freelist ptr obfuscation")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417
Reported-by: <vannguye(a)cisco.com>
Acked-by: Kees Cook <keescook(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slub.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/slub.c~mm-slubc-include-swabh
+++ a/mm/slub.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/bit_spinlock.h>
#include <linux/interrupt.h>
+#include <linux/swab.h>
#include <linux/bitops.h>
#include <linux/slab.h>
#include "slab.h"
_
Patches currently in -mm which might be from akpm(a)linux-foundation.org are
mm-page_alloc-fix-memory-map-initialization-for-descending-nodes-checkpatch-fixes.patch
mm.patch
mm-slub-kunit-add-a-kunit-test-for-slub-debugging-functionality-fix-2.patch
slub-force-on-no_hash_pointers-when-slub_debug-is-enabled-fix.patch
fs-remove-noop_set_page_dirty-fix.patch
mm-gup-pack-has_pinned-in-mmf_has_pinned-checkpatch-fixes.patch
mm-memcg-optimize-user-context-object-stock-access-checkpatch-fixes.patch
mm-memcg-slab-create-a-new-set-of-kmalloc-cg-n-caches-v5-fix.patch
binfmt-remove-in-tree-usage-of-map_executable-fix.patch
mm-mmap-introduce-unlock_range-for-code-cleanup-fix.patch
lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable-fix.patch
mm-page_alloc-convert-per-cpu-list-protection-to-local_lock-fix-checkpatch-fixes.patch
mm-memory_hotplug-disable-memmap_on_memory-when-hugetlb_free_vmemmap-enabled-fix.patch
nommu-remove-__gfp_highmem-in-vmalloc-vzalloc-checkpatch-fixes.patch
mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables-checkpatch-fixes.patch
mmmemory_hotplug-drop-unneeded-locking-fix.patch
fs-proc-kcorec-add-mmap-interface-fix.patch
kernelh-split-out-panic-and-oops-helpers-fix.patch
lib-math-rational-add-kunit-test-cases-fix.patch
lib-decompressors-remove-set-but-not-used-variabled-level-fix.patch
ipc-utilc-use-binary-search-for-max_idx-fix.patch
linux-next-pre.patch
linux-next-post.patch
linux-next-rejects.patch
linux-next-git-rejects.patch
kernel-cgroup-cgroupc-dont-export-cgroup_get_e_css-twice.patch
mm-introduce-memfd_secret-system-call-to-create-secret-memory-areas-fix.patch
module-add-printk-formats-to-add-module-build-id-to-stacktraces-fix.patch
module-add-printk-formats-to-add-module-build-id-to-stacktraces-fix-fix.patch
kernel-forkc-export-kernel_thread-to-modules.patch
The patch titled
Subject: crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo
has been removed from the -mm tree. Its filename was
crash_core-vmcoreinfo-append-section_size_bits-to-vmcoreinfo.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Pingfan Liu <kernelfans(a)gmail.com>
Subject: crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the
formula:
#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
PAGES_PER_SECTION in makedumpfile just like kernel.
Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
SECTION_SIZE_BITS"). But user space wants a stable interface to get this
info. Such info is impossible to be deduced from a crashdump vmcore.
Hence append SECTION_SIZE_BITS to vmcoreinfo.
Link: https://lkml.kernel.org/r/20210608103359.84907-1-kernelfans@gmail.com
Link: http://lists.infradead.org/pipermail/kexec/2021-June/022676.html
Signed-off-by: Pingfan Liu <kernelfans(a)gmail.com>
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: Bhupesh Sharma <bhupesh.sharma(a)linaro.org>
Cc: Kazuhito Hagio <k-hagio(a)ab.jp.nec.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Boris Petkov <bp(a)alien8.de>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: James Morse <james.morse(a)arm.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Dave Anderson <anderson(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/crash_core.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/crash_core.c~crash_core-vmcoreinfo-append-section_size_bits-to-vmcoreinfo
+++ a/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
+ VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
#endif
VMCOREINFO_STRUCT_SIZE(page);
_
Patches currently in -mm which might be from kernelfans(a)gmail.com are
The patch titled
Subject: mm/hugetlb: expand restore_reserve_on_error functionality
has been removed from the -mm tree. Its filename was
mm-hugetlb-expand-restore_reserve_on_error-functionality.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Subject: mm/hugetlb: expand restore_reserve_on_error functionality
The routine restore_reserve_on_error is called to restore reservation
information when an error occurs after page allocation. The routine
alloc_huge_page modifies the mapping reserve map and potentially the
reserve count during allocation. If code calling alloc_huge_page
encounters an error after allocation and needs to free the page, the
reservation information needs to be adjusted.
Currently, restore_reserve_on_error only takes action on pages for which
the reserve count was adjusted(HPageRestoreReserve flag). There is
nothing wrong with these adjustments. However, alloc_huge_page ALWAYS
modifies the reserve map during allocation even if the reserve count is
not adjusted. This can cause issues as observed during development of
this patch [1].
One specific series of operations causing an issue is:
- Create a shared hugetlb mapping
Reservations for all pages created by default
- Fault in a page in the mapping
Reservation exists so reservation count is decremented
- Punch a hole in the file/mapping at index previously faulted
Reservation and any associated pages will be removed
- Allocate a page to fill the hole
No reservation entry, so reserve count unmodified
Reservation entry added to map by alloc_huge_page
- Error after allocation and before instantiating the page
Reservation entry remains in map
- Allocate a page to fill the hole
Reservation entry exists, so decrement reservation count
This will cause a reservation count underflow as the reservation count was
decremented twice for the same index.
A user would observe a very large number for HugePages_Rsvd in
/proc/meminfo. This would also likely cause subsequent allocations of
hugetlb pages to fail as it would 'appear' that all pages are reserved.
This sequence of operations is unlikely to happen, however they were
easily reproduced and observed using hacked up code as described in [1].
Address the issue by having the routine restore_reserve_on_error take
action on pages where HPageRestoreReserve is not set. In this case, we
need to remove any reserve map entry created by alloc_huge_page. A new
helper routine vma_del_reservation assists with this operation.
There are three callers of alloc_huge_page which do not currently call
restore_reserve_on error before freeing a page on error paths. Add those
missing calls.
[1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.…
Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths"
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Mina Almasry <almasrymina(a)google.com>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 1
include/linux/hugetlb.h | 2
mm/hugetlb.c | 122 ++++++++++++++++++++++++++++++--------
3 files changed, 101 insertions(+), 24 deletions(-)
--- a/fs/hugetlbfs/inode.c~mm-hugetlb-expand-restore_reserve_on_error-functionality
+++ a/fs/hugetlbfs/inode.c
@@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct f
__SetPageUptodate(page);
error = huge_add_to_page_cache(page, mapping, index);
if (unlikely(error)) {
+ restore_reserve_on_error(h, &pseudo_vma, addr, page);
put_page(page);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
goto out;
--- a/include/linux/hugetlb.h~mm-hugetlb-expand-restore_reserve_on_error-functionality
+++ a/include/linux/hugetlb.h
@@ -610,6 +610,8 @@ struct page *alloc_huge_page_vma(struct
unsigned long address);
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
+void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
+ unsigned long address, struct page *page);
/* arch callback */
int __init __alloc_bootmem_huge_page(struct hstate *h);
--- a/mm/hugetlb.c~mm-hugetlb-expand-restore_reserve_on_error-functionality
+++ a/mm/hugetlb.c
@@ -2121,12 +2121,18 @@ out:
* be restored when a newly allocated huge page must be freed. It is
* to be called after calling vma_needs_reservation to determine if a
* reservation exists.
+ *
+ * vma_del_reservation is used in error paths where an entry in the reserve
+ * map was created during huge page allocation and must be removed. It is to
+ * be called after calling vma_needs_reservation to determine if a reservation
+ * exists.
*/
enum vma_resv_mode {
VMA_NEEDS_RESV,
VMA_COMMIT_RESV,
VMA_END_RESV,
VMA_ADD_RESV,
+ VMA_DEL_RESV,
};
static long __vma_reservation_common(struct hstate *h,
struct vm_area_struct *vma, unsigned long addr,
@@ -2170,11 +2176,21 @@ static long __vma_reservation_common(str
ret = region_del(resv, idx, idx + 1);
}
break;
+ case VMA_DEL_RESV:
+ if (vma->vm_flags & VM_MAYSHARE) {
+ region_abort(resv, idx, idx + 1, 1);
+ ret = region_del(resv, idx, idx + 1);
+ } else {
+ ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
+ /* region_add calls of range 1 should never fail. */
+ VM_BUG_ON(ret < 0);
+ }
+ break;
default:
BUG();
}
- if (vma->vm_flags & VM_MAYSHARE)
+ if (vma->vm_flags & VM_MAYSHARE || mode == VMA_DEL_RESV)
return ret;
/*
* We know private mapping must have HPAGE_RESV_OWNER set.
@@ -2222,25 +2238,39 @@ static long vma_add_reservation(struct h
return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV);
}
+static long vma_del_reservation(struct hstate *h,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ return __vma_reservation_common(h, vma, addr, VMA_DEL_RESV);
+}
+
/*
- * This routine is called to restore a reservation on error paths. In the
- * specific error paths, a huge page was allocated (via alloc_huge_page)
- * and is about to be freed. If a reservation for the page existed,
- * alloc_huge_page would have consumed the reservation and set
- * HPageRestoreReserve in the newly allocated page. When the page is freed
- * via free_huge_page, the global reservation count will be incremented if
- * HPageRestoreReserve is set. However, free_huge_page can not adjust the
- * reserve map. Adjust the reserve map here to be consistent with global
- * reserve count adjustments to be made by free_huge_page.
- */
-static void restore_reserve_on_error(struct hstate *h,
- struct vm_area_struct *vma, unsigned long address,
- struct page *page)
+ * This routine is called to restore reservation information on error paths.
+ * It should ONLY be called for pages allocated via alloc_huge_page(), and
+ * the hugetlb mutex should remain held when calling this routine.
+ *
+ * It handles two specific cases:
+ * 1) A reservation was in place and the page consumed the reservation.
+ * HPageRestoreReserve is set in the page.
+ * 2) No reservation was in place for the page, so HPageRestoreReserve is
+ * not set. However, alloc_huge_page always updates the reserve map.
+ *
+ * In case 1, free_huge_page later in the error path will increment the
+ * global reserve count. But, free_huge_page does not have enough context
+ * to adjust the reservation map. This case deals primarily with private
+ * mappings. Adjust the reserve map here to be consistent with global
+ * reserve count adjustments to be made by free_huge_page. Make sure the
+ * reserve map indicates there is a reservation present.
+ *
+ * In case 2, simply undo reserve map modifications done by alloc_huge_page.
+ */
+void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
+ unsigned long address, struct page *page)
{
- if (unlikely(HPageRestoreReserve(page))) {
- long rc = vma_needs_reservation(h, vma, address);
+ long rc = vma_needs_reservation(h, vma, address);
- if (unlikely(rc < 0)) {
+ if (HPageRestoreReserve(page)) {
+ if (unlikely(rc < 0))
/*
* Rare out of memory condition in reserve map
* manipulation. Clear HPageRestoreReserve so that
@@ -2253,16 +2283,57 @@ static void restore_reserve_on_error(str
* accounting of reserve counts.
*/
ClearHPageRestoreReserve(page);
- } else if (rc) {
- rc = vma_add_reservation(h, vma, address);
- if (unlikely(rc < 0))
+ else if (rc)
+ (void)vma_add_reservation(h, vma, address);
+ else
+ vma_end_reservation(h, vma, address);
+ } else {
+ if (!rc) {
+ /*
+ * This indicates there is an entry in the reserve map
+ * added by alloc_huge_page. We know it was added
+ * before the alloc_huge_page call, otherwise
+ * HPageRestoreReserve would be set on the page.
+ * Remove the entry so that a subsequent allocation
+ * does not consume a reservation.
+ */
+ rc = vma_del_reservation(h, vma, address);
+ if (rc < 0)
/*
- * See above comment about rare out of
- * memory condition.
+ * VERY rare out of memory condition. Since
+ * we can not delete the entry, set
+ * HPageRestoreReserve so that the reserve
+ * count will be incremented when the page
+ * is freed. This reserve will be consumed
+ * on a subsequent allocation.
*/
- ClearHPageRestoreReserve(page);
+ SetHPageRestoreReserve(page);
+ } else if (rc < 0) {
+ /*
+ * Rare out of memory condition from
+ * vma_needs_reservation call. Memory allocation is
+ * only attempted if a new entry is needed. Therefore,
+ * this implies there is not an entry in the
+ * reserve map.
+ *
+ * For shared mappings, no entry in the map indicates
+ * no reservation. We are done.
+ */
+ if (!(vma->vm_flags & VM_MAYSHARE))
+ /*
+ * For private mappings, no entry indicates
+ * a reservation is present. Since we can
+ * not add an entry, set SetHPageRestoreReserve
+ * on the page so reserve count will be
+ * incremented when freed. This reserve will
+ * be consumed on a subsequent allocation.
+ */
+ SetHPageRestoreReserve(page);
} else
- vma_end_reservation(h, vma, address);
+ /*
+ * No reservation present, do nothing
+ */
+ vma_end_reservation(h, vma, address);
}
}
@@ -4037,6 +4108,8 @@ again:
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
if (!pte_same(src_pte_old, entry)) {
+ restore_reserve_on_error(h, vma, addr,
+ new);
put_page(new);
/* dst_entry won't change as in child */
goto again;
@@ -5006,6 +5079,7 @@ out_release_unlock:
if (vm_shared || is_continue)
unlock_page(page);
out_release_nounlock:
+ restore_reserve_on_error(h, dst_vma, dst_addr, page);
put_page(page);
goto out;
}
_
Patches currently in -mm which might be from mike.kravetz(a)oracle.com are
mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page-fix.patch
hugetlb-remove-prep_compound_huge_page-cleanup.patch
hugetlb-address-ref-count-racing-in-prep_compound_gigantic_page.patch
The patch titled
Subject: mm/slub: actually fix freelist pointer vs redzoning
has been removed from the -mm tree. Its filename was
mm-slub-actually-fix-freelist-pointer-vs-redzoning.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Kees Cook <keescook(a)chromium.org>
Subject: mm/slub: actually fix freelist pointer vs redzoning
It turns out that SLUB redzoning ("slub_debug=Z") checks from
s->object_size rather than from s->inuse (which is normally bumped to make
room for the freelist pointer), so a cache created with an object size
less than 24 would have the freelist pointer written beyond
s->object_size, causing the redzone to be corrupted by the freelist
pointer. This was very visible with "slub_debug=ZF":
BUG test (Tainted: G B ): Right Redzone overwritten
-----------------------------------------------------------------------------
INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620
Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
Object (____ptrval____): 00 00 00 00 00 f6 f4 a5 ........
Redzone (____ptrval____): 40 1d e8 1a aa @....
Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........
Adjust the offset to stay within s->object_size.
(Note that no caches of in this size range are known to exist in the
kernel currently.)
Link: https://lkml.kernel.org/r/20210608183955.280836-4-keescook@chromium.org
Link: https://lore.kernel.org/linux-mm/20200807160627.GA1420741@elver.google.com/
Link: https://lore.kernel.org/lkml/0f7dd7b2-7496-5e2d-9488-2ec9f8e90441@suse.cz/F…: 89b83f282d8b (slub: avoid redzone when choosing freepointer location)
Link: https://lore.kernel.org/lkml/CANpmjNOwZ5VpKQn+SYWovTkFB4VsT-RPwyENBmaK0dLcp…
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Reported-by: Marco Elver <elver(a)google.com>
Reported-by: "Lin, Zhenpeng" <zplin(a)psu.edu>
Tested-by: Marco Elver <elver(a)google.com>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slub.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
--- a/mm/slub.c~mm-slub-actually-fix-freelist-pointer-vs-redzoning
+++ a/mm/slub.c
@@ -3689,7 +3689,6 @@ static int calculate_sizes(struct kmem_c
{
slab_flags_t flags = s->flags;
unsigned int size = s->object_size;
- unsigned int freepointer_area;
unsigned int order;
/*
@@ -3698,13 +3697,6 @@ static int calculate_sizes(struct kmem_c
* the possible location of the free pointer.
*/
size = ALIGN(size, sizeof(void *));
- /*
- * This is the area of the object where a freepointer can be
- * safely written. If redzoning adds more to the inuse size, we
- * can't use that portion for writing the freepointer, so
- * s->offset must be limited within this for the general case.
- */
- freepointer_area = size;
#ifdef CONFIG_SLUB_DEBUG
/*
@@ -3730,7 +3722,7 @@ static int calculate_sizes(struct kmem_c
/*
* With that we have determined the number of bytes in actual use
- * by the object. This is the potential offset to the free pointer.
+ * by the object and redzoning.
*/
s->inuse = size;
@@ -3753,13 +3745,13 @@ static int calculate_sizes(struct kmem_c
*/
s->offset = size;
size += sizeof(void *);
- } else if (freepointer_area > sizeof(void *)) {
+ } else {
/*
* Store freelist pointer near middle of object to keep
* it away from the edges of the object to avoid small
* sized over/underflows from neighboring allocations.
*/
- s->offset = ALIGN(freepointer_area / 2, sizeof(void *));
+ s->offset = ALIGN_DOWN(s->object_size / 2, sizeof(void *));
}
#ifdef CONFIG_SLUB_DEBUG
_
Patches currently in -mm which might be from keescook(a)chromium.org are
The patch titled
Subject: mm/slub: fix redzoning for small allocations
has been removed from the -mm tree. Its filename was
mm-slub-fix-redzoning-for-small-allocations.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Kees Cook <keescook(a)chromium.org>
Subject: mm/slub: fix redzoning for small allocations
The redzone area for SLUB exists between s->object_size and s->inuse
(which is at least the word-aligned object_size). If a cache were created
with an object_size smaller than sizeof(void *), the in-object stored
freelist pointer would overwrite the redzone (e.g. with boot param
"slub_debug=ZF"):
BUG test (Tainted: G B ): Right Redzone overwritten
-----------------------------------------------------------------------------
INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620
Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
Redzone (____ptrval____): 1a aa ..
Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........
Store the freelist pointer out of line when object_size is smaller than
sizeof(void *) and redzoning is enabled.
Additionally remove the "smaller than sizeof(void *)" check under
CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant:
SLAB and SLOB both handle small sizes.
(Note that no caches within this size range are known to exist in the
kernel currently.)
Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org
Fixes: 81819f0fc828 ("SLUB core")
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Lameter <cl(a)linux.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: "Lin, Zhenpeng" <zplin(a)psu.edu>
Cc: Marco Elver <elver(a)google.com>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab_common.c | 3 +--
mm/slub.c | 8 +++++---
2 files changed, 6 insertions(+), 5 deletions(-)
--- a/mm/slab_common.c~mm-slub-fix-redzoning-for-small-allocations
+++ a/mm/slab_common.c
@@ -97,8 +97,7 @@ EXPORT_SYMBOL(kmem_cache_size);
#ifdef CONFIG_DEBUG_VM
static int kmem_cache_sanity_check(const char *name, unsigned int size)
{
- if (!name || in_interrupt() || size < sizeof(void *) ||
- size > KMALLOC_MAX_SIZE) {
+ if (!name || in_interrupt() || size > KMALLOC_MAX_SIZE) {
pr_err("kmem_cache_create(%s) integrity check failed\n", name);
return -EINVAL;
}
--- a/mm/slub.c~mm-slub-fix-redzoning-for-small-allocations
+++ a/mm/slub.c
@@ -3734,15 +3734,17 @@ static int calculate_sizes(struct kmem_c
*/
s->inuse = size;
- if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
- s->ctor)) {
+ if ((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
+ ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
+ s->ctor) {
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free.
*
* This is the case if we do RCU, have a constructor or
- * destructor or are poisoning the objects.
+ * destructor, are poisoning the objects, or are
+ * redzoning an object smaller than sizeof(void *).
*
* The assumption that s->offset >= s->inuse means free
* pointer is outside of the object is used in the
_
Patches currently in -mm which might be from keescook(a)chromium.org are
The patch titled
Subject: mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare
has been removed from the -mm tree. Its filename was
mm-swap-fix-pte_same_as_swp-not-removing-uffd-wp-bit-when-compare.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Peter Xu <peterx(a)redhat.com>
Subject: mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare
I found it by pure code review, that pte_same_as_swp() of unuse_vma()
didn't take uffd-wp bit into account when comparing ptes.
pte_same_as_swp() returning false negative could cause failure to swapoff
swap ptes that was wr-protected by userfaultfd.
Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Peter Xu <peterx(a)redhat.com>
Acked-by: Hugh Dickins <hughd(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [5.7+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/swapops.h | 15 +++++++++++----
mm/swapfile.c | 2 +-
2 files changed, 12 insertions(+), 5 deletions(-)
--- a/include/linux/swapops.h~mm-swap-fix-pte_same_as_swp-not-removing-uffd-wp-bit-when-compare
+++ a/include/linux/swapops.h
@@ -23,6 +23,16 @@
#define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)
+/* Clear all flags but only keep swp_entry_t related information */
+static inline pte_t pte_swp_clear_flags(pte_t pte)
+{
+ if (pte_swp_soft_dirty(pte))
+ pte = pte_swp_clear_soft_dirty(pte);
+ if (pte_swp_uffd_wp(pte))
+ pte = pte_swp_clear_uffd_wp(pte);
+ return pte;
+}
+
/*
* Store a type+offset into a swp_entry_t in an arch-independent format
*/
@@ -66,10 +76,7 @@ static inline swp_entry_t pte_to_swp_ent
{
swp_entry_t arch_entry;
- if (pte_swp_soft_dirty(pte))
- pte = pte_swp_clear_soft_dirty(pte);
- if (pte_swp_uffd_wp(pte))
- pte = pte_swp_clear_uffd_wp(pte);
+ pte = pte_swp_clear_flags(pte);
arch_entry = __pte_to_swp_entry(pte);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
--- a/mm/swapfile.c~mm-swap-fix-pte_same_as_swp-not-removing-uffd-wp-bit-when-compare
+++ a/mm/swapfile.c
@@ -1900,7 +1900,7 @@ unsigned int count_swap_pages(int type,
static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
{
- return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
+ return pte_same(pte_swp_clear_flags(pte), swp_pte);
}
/*
_
Patches currently in -mm which might be from peterx(a)redhat.com are
mm-gup_benchmark-support-threading.patch
mm-gup-pack-has_pinned-in-mmf_has_pinned-fix.patch
userfaultfd-selftests-use-user-mode-only.patch
userfaultfd-selftests-remove-the-time-check-on-delayed-uffd.patch
userfaultfd-selftests-dropping-verify-check-in-locking_thread.patch
userfaultfd-selftests-only-dump-counts-if-mode-enabled.patch
userfaultfd-selftests-unify-error-handling.patch
mm-thp-simplify-copying-of-huge-zero-page-pmd-when-fork.patch
mm-userfaultfd-fix-uffd-wp-special-cases-for-fork.patch
mm-userfaultfd-fix-a-few-thp-pmd-missing-uffd-wp-bit.patch
mm-userfaultfd-fail-uffd-wp-registeration-if-not-supported.patch
mm-pagemap-export-uffd-wp-protection-information.patch
userfaultfd-selftests-add-pagemap-uffd-wp-test.patch
userfaultfd-selftests-reinitialize-test-context-in-each-test-fix.patch
The patch titled
Subject: mm,hwpoison: fix race with hugetlb page allocation
has been removed from the -mm tree. Its filename was
mmhwpoison-fix-race-with-hugetlb-page-allocation.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Subject: mm,hwpoison: fix race with hugetlb page allocation
When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
race:
CPU0: CPU1:
gather_surplus_pages()
page = alloc_surplus_huge_page()
memory_failure_hugetlb()
get_hwpoison_page(page)
__get_hwpoison_page(page)
get_page_unless_zero(page)
zero = put_page_testzero(page)
VM_BUG_ON_PAGE(!zero, page)
enqueue_huge_page(h, page)
put_page(page)
__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.
So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful to
prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).
Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reported-by: Muchun Song <songmuchun(a)bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Tony Luck <tony.luck(a)intel.com>
Cc: <stable(a)vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 15 +++++++++++++++
mm/memory-failure.c | 29 +++++++++++++++++++++++++++--
3 files changed, 48 insertions(+), 2 deletions(-)
--- a/include/linux/hugetlb.h~mmhwpoison-fix-race-with-hugetlb-page-allocation
+++ a/include/linux/hugetlb.h
@@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
@@ -339,6 +340,11 @@ static inline bool isolate_huge_page(str
return false;
}
+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+ return 0;
+}
+
static inline void putback_active_hugepage(struct page *page)
{
}
--- a/mm/hugetlb.c~mmhwpoison-fix-race-with-hugetlb-page-allocation
+++ a/mm/hugetlb.c
@@ -5857,6 +5857,21 @@ unlock:
return ret;
}
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+ int ret = 0;
+
+ *hugetlb = false;
+ spin_lock_irq(&hugetlb_lock);
+ if (PageHeadHuge(page)) {
+ *hugetlb = true;
+ if (HPageFreed(page) || HPageMigratable(page))
+ ret = get_page_unless_zero(page);
+ }
+ spin_unlock_irq(&hugetlb_lock);
+ return ret;
+}
+
void putback_active_hugepage(struct page *page)
{
spin_lock_irq(&hugetlb_lock);
--- a/mm/memory-failure.c~mmhwpoison-fix-race-with-hugetlb-page-allocation
+++ a/mm/memory-failure.c
@@ -949,6 +949,17 @@ static int page_action(struct page_state
return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
}
+/*
+ * Return true if a page type of a given page is supported by hwpoison
+ * mechanism (while handling could fail), otherwise false. This function
+ * does not return true for hugetlb or device memory pages, so it's assumed
+ * to be called only in the context where we never have such pages.
+ */
+static inline bool HWPoisonHandlable(struct page *page)
+{
+ return PageLRU(page) || __PageMovable(page);
+}
+
/**
* __get_hwpoison_page() - Get refcount for memory error handling:
* @page: raw error page (hit by memory error)
@@ -959,8 +970,22 @@ static int page_action(struct page_state
static int __get_hwpoison_page(struct page *page)
{
struct page *head = compound_head(page);
+ int ret = 0;
+ bool hugetlb = false;
+
+ ret = get_hwpoison_huge_page(head, &hugetlb);
+ if (hugetlb)
+ return ret;
+
+ /*
+ * This check prevents from calling get_hwpoison_unless_zero()
+ * for any unsupported type of page in order to reduce the risk of
+ * unexpected races caused by taking a page refcount.
+ */
+ if (!HWPoisonHandlable(head))
+ return 0;
- if (!PageHuge(head) && PageTransHuge(head)) {
+ if (PageTransHuge(head)) {
/*
* Non anonymous thp exists only in allocation/free time. We
* can't handle such a case correctly, so let's give it up.
@@ -1017,7 +1042,7 @@ try_again:
ret = -EIO;
}
} else {
- if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
+ if (PageHuge(p) || HWPoisonHandlable(p)) {
ret = 1;
} else {
/*
_
Patches currently in -mm which might be from naoya.horiguchi(a)nec.com are
mmhwpoison-send-sigbus-with-error-virutal-address.patch
mmhwpoison-send-sigbus-with-error-virutal-address-fix.patch
mmhwpoison-make-get_hwpoison_page-call-get_any_page.patch
mm-hwpoison-disable-pcp-for-page_handle_poison.patch
commit 827a746f405d upstream.
It's not sufficient to skip reading when the pos is beyond the EOF.
There may be data at the head of the page that we need to fill in
before the write.
Add a new helper function that corrects and clarifies the logic of
when we can skip reads, and have it only zero out the part of the page
that won't have data copied in for the write.
Cc: <stable(a)vger.kernel.org> # v5.10+
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Fixes: 1cc1699070bd ("ceph: fold ceph_update_writeable_page into ceph_write_begin")
Reported-by: Andrew W Elble <aweits(a)rit.edu>
Signed-off-by: Jeff Layton <jlayton(a)kernel.org>
Signed-off-by: David Howells <dhowells(a)redhat.com>
---
fs/ceph/addr.c | 54 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 13 deletions(-)
This bug was originally in ceph, and then got replicated in the new
netfs helper code in v5.13. This patch is a backport of the netfs patch
for ceph. It should be applied to 5.10.y - 5.12.y.
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 26e66436f005..c000fe338f7e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1302,6 +1302,45 @@ ceph_find_incompatible(struct page *page)
return NULL;
}
+/**
+ * prep_noread_page - prep a page for writing without reading first
+ * @page: page being prepared
+ * @pos: starting position for the write
+ * @len: length of write
+ *
+ * In some cases, write_begin doesn't need to read at all:
+ * - full page write
+ * - file is currently zero-length
+ * - write that lies in a page that is completely beyond EOF
+ * - write that covers the the page from start to EOF or beyond it
+ *
+ * If any of these criteria are met, then zero out the unwritten parts
+ * of the page and return true. Otherwise, return false.
+ */
+static bool skip_page_read(struct page *page, loff_t pos, size_t len)
+{
+ struct inode *inode = page->mapping->host;
+ loff_t i_size = i_size_read(inode);
+ size_t offset = offset_in_page(pos);
+
+ /* Full page write */
+ if (offset == 0 && len >= PAGE_SIZE)
+ return true;
+
+ /* pos beyond last page in the file */
+ if (pos - offset >= i_size)
+ goto zero_out;
+
+ /* write that covers the whole page from start to EOF or beyond it */
+ if (offset == 0 && (pos + len) >= i_size)
+ goto zero_out;
+
+ return false;
+zero_out:
+ zero_user_segments(page, 0, offset, offset + len, PAGE_SIZE);
+ return true;
+}
+
/*
* We are only allowed to write into/dirty the page if the page is
* clean, or already dirty within the same snap context.
@@ -1315,7 +1354,6 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
struct ceph_snap_context *snapc;
struct page *page = NULL;
pgoff_t index = pos >> PAGE_SHIFT;
- int pos_in_page = pos & ~PAGE_MASK;
int r = 0;
dout("write_begin file %p inode %p page %p %d~%d\n", file, inode, page, (int)pos, (int)len);
@@ -1350,19 +1388,9 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
break;
}
- /*
- * In some cases we don't need to read at all:
- * - full page write
- * - write that lies completely beyond EOF
- * - write that covers the the page from start to EOF or beyond it
- */
- if ((pos_in_page == 0 && len == PAGE_SIZE) ||
- (pos >= i_size_read(inode)) ||
- (pos_in_page == 0 && (pos + len) >= i_size_read(inode))) {
- zero_user_segments(page, 0, pos_in_page,
- pos_in_page + len, PAGE_SIZE);
+ /* No need to read in some cases */
+ if (skip_page_read(page, pos, len))
break;
- }
/*
* We need to read it. If we get back -EINPROGRESS, then the page was
--
2.31.1
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek(a)suse.com>
Date: Thu, 24 Jun 2021 18:39:45 -0700
Subject: [PATCH] kthread_worker: split code for canceling the delayed work
timer
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync()".
This patchset fixes the race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync() including proper return value
handling.
This patch (of 2):
Simple code refactoring as a preparation step for fixing a race between
kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().
It does not modify the existing behavior.
Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek(a)suse.com>
Cc: <jenhaochen(a)google.com>
Cc: Martin Liu <liumartin(a)google.com>
Cc: Minchan Kim <minchan(a)google.com>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..121a0e1fc659 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1092,6 +1092,33 @@ void kthread_flush_work(struct kthread_work *work)
}
EXPORT_SYMBOL_GPL(kthread_flush_work);
+/*
+ * Make sure that the timer is neither set nor running and could
+ * not manipulate the work list_head any longer.
+ *
+ * The function is called under worker->lock. The lock is temporary
+ * released but the timer can't be set again in the meantime.
+ */
+static void kthread_cancel_delayed_work_timer(struct kthread_work *work,
+ unsigned long *flags)
+{
+ struct kthread_delayed_work *dwork =
+ container_of(work, struct kthread_delayed_work, work);
+ struct kthread_worker *worker = work->worker;
+
+ /*
+ * del_timer_sync() must be called to make sure that the timer
+ * callback is not running. The lock must be temporary released
+ * to avoid a deadlock with the callback. In the meantime,
+ * any queuing is blocked by setting the canceling counter.
+ */
+ work->canceling++;
+ raw_spin_unlock_irqrestore(&worker->lock, *flags);
+ del_timer_sync(&dwork->timer);
+ raw_spin_lock_irqsave(&worker->lock, *flags);
+ work->canceling--;
+}
+
/*
* This function removes the work from the worker queue. Also it makes sure
* that it won't get queued later via the delayed work's timer.
@@ -1106,23 +1133,8 @@ static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork,
unsigned long *flags)
{
/* Try to cancel the timer if exists. */
- if (is_dwork) {
- struct kthread_delayed_work *dwork =
- container_of(work, struct kthread_delayed_work, work);
- struct kthread_worker *worker = work->worker;
-
- /*
- * del_timer_sync() must be called to make sure that the timer
- * callback is not running. The lock must be temporary released
- * to avoid a deadlock with the callback. In the meantime,
- * any queuing is blocked by setting the canceling counter.
- */
- work->canceling++;
- raw_spin_unlock_irqrestore(&worker->lock, *flags);
- del_timer_sync(&dwork->timer);
- raw_spin_lock_irqsave(&worker->lock, *flags);
- work->canceling--;
- }
+ if (is_dwork)
+ kthread_cancel_delayed_work_timer(work, flags);
/*
* Try to remove the work from a worker list. It might either
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek(a)suse.com>
Date: Thu, 24 Jun 2021 18:39:45 -0700
Subject: [PATCH] kthread_worker: split code for canceling the delayed work
timer
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync()".
This patchset fixes the race between kthread_mod_delayed_work() and
kthread_cancel_delayed_work_sync() including proper return value
handling.
This patch (of 2):
Simple code refactoring as a preparation step for fixing a race between
kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().
It does not modify the existing behavior.
Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek(a)suse.com>
Cc: <jenhaochen(a)google.com>
Cc: Martin Liu <liumartin(a)google.com>
Cc: Minchan Kim <minchan(a)google.com>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Nick Desaulniers <ndesaulniers(a)google.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..121a0e1fc659 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1092,6 +1092,33 @@ void kthread_flush_work(struct kthread_work *work)
}
EXPORT_SYMBOL_GPL(kthread_flush_work);
+/*
+ * Make sure that the timer is neither set nor running and could
+ * not manipulate the work list_head any longer.
+ *
+ * The function is called under worker->lock. The lock is temporary
+ * released but the timer can't be set again in the meantime.
+ */
+static void kthread_cancel_delayed_work_timer(struct kthread_work *work,
+ unsigned long *flags)
+{
+ struct kthread_delayed_work *dwork =
+ container_of(work, struct kthread_delayed_work, work);
+ struct kthread_worker *worker = work->worker;
+
+ /*
+ * del_timer_sync() must be called to make sure that the timer
+ * callback is not running. The lock must be temporary released
+ * to avoid a deadlock with the callback. In the meantime,
+ * any queuing is blocked by setting the canceling counter.
+ */
+ work->canceling++;
+ raw_spin_unlock_irqrestore(&worker->lock, *flags);
+ del_timer_sync(&dwork->timer);
+ raw_spin_lock_irqsave(&worker->lock, *flags);
+ work->canceling--;
+}
+
/*
* This function removes the work from the worker queue. Also it makes sure
* that it won't get queued later via the delayed work's timer.
@@ -1106,23 +1133,8 @@ static bool __kthread_cancel_work(struct kthread_work *work, bool is_dwork,
unsigned long *flags)
{
/* Try to cancel the timer if exists. */
- if (is_dwork) {
- struct kthread_delayed_work *dwork =
- container_of(work, struct kthread_delayed_work, work);
- struct kthread_worker *worker = work->worker;
-
- /*
- * del_timer_sync() must be called to make sure that the timer
- * callback is not running. The lock must be temporary released
- * to avoid a deadlock with the callback. In the meantime,
- * any queuing is blocked by setting the canceling counter.
- */
- work->canceling++;
- raw_spin_unlock_irqrestore(&worker->lock, *flags);
- del_timer_sync(&dwork->timer);
- raw_spin_lock_irqsave(&worker->lock, *flags);
- work->canceling--;
- }
+ if (is_dwork)
+ kthread_cancel_delayed_work_timer(work, flags);
/*
* Try to remove the work from a worker list. It might either
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3de218ff39b9e3f0d453fe3154f12a174de44b25 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross(a)suse.com>
Date: Wed, 23 Jun 2021 15:09:13 +0200
Subject: [PATCH] xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Cc: stable(a)vger.kernel.org
Reported-by: Julien Grall <julien(a)xen.org>
Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time")
Tested-by: Julien Grall <julien(a)xen.org>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky(a)oracle.com>
Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 7bbfd58958bc..d7e361fb0548 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -642,6 +642,9 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
}
info->eoi_time = 0;
+
+ /* is_active hasn't been reset yet, do it now. */
+ smp_store_release(&info->is_active, 0);
do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
}
@@ -811,6 +814,7 @@ static void xen_evtchn_close(evtchn_port_t port)
BUG();
}
+/* Not called for lateeoi events. */
static void event_handler_exit(struct irq_info *info)
{
smp_store_release(&info->is_active, 0);
@@ -1883,7 +1887,12 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
if (VALID_EVTCHN(evtchn)) {
do_mask(info, EVT_MASK_REASON_EOI_PENDING);
- event_handler_exit(info);
+ /*
+ * Don't call event_handler_exit().
+ * Need to keep is_active non-zero in order to ignore re-raised
+ * events after cpu affinity changes while a lateeoi is pending.
+ */
+ clear_evtchn(evtchn);
}
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3de218ff39b9e3f0d453fe3154f12a174de44b25 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross(a)suse.com>
Date: Wed, 23 Jun 2021 15:09:13 +0200
Subject: [PATCH] xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Cc: stable(a)vger.kernel.org
Reported-by: Julien Grall <julien(a)xen.org>
Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time")
Tested-by: Julien Grall <julien(a)xen.org>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky(a)oracle.com>
Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 7bbfd58958bc..d7e361fb0548 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -642,6 +642,9 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
}
info->eoi_time = 0;
+
+ /* is_active hasn't been reset yet, do it now. */
+ smp_store_release(&info->is_active, 0);
do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
}
@@ -811,6 +814,7 @@ static void xen_evtchn_close(evtchn_port_t port)
BUG();
}
+/* Not called for lateeoi events. */
static void event_handler_exit(struct irq_info *info)
{
smp_store_release(&info->is_active, 0);
@@ -1883,7 +1887,12 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
if (VALID_EVTCHN(evtchn)) {
do_mask(info, EVT_MASK_REASON_EOI_PENDING);
- event_handler_exit(info);
+ /*
+ * Don't call event_handler_exit().
+ * Need to keep is_active non-zero in order to ignore re-raised
+ * events after cpu affinity changes while a lateeoi is pending.
+ */
+ clear_evtchn(evtchn);
}
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3de218ff39b9e3f0d453fe3154f12a174de44b25 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross(a)suse.com>
Date: Wed, 23 Jun 2021 15:09:13 +0200
Subject: [PATCH] xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Cc: stable(a)vger.kernel.org
Reported-by: Julien Grall <julien(a)xen.org>
Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time")
Tested-by: Julien Grall <julien(a)xen.org>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky(a)oracle.com>
Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 7bbfd58958bc..d7e361fb0548 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -642,6 +642,9 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
}
info->eoi_time = 0;
+
+ /* is_active hasn't been reset yet, do it now. */
+ smp_store_release(&info->is_active, 0);
do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
}
@@ -811,6 +814,7 @@ static void xen_evtchn_close(evtchn_port_t port)
BUG();
}
+/* Not called for lateeoi events. */
static void event_handler_exit(struct irq_info *info)
{
smp_store_release(&info->is_active, 0);
@@ -1883,7 +1887,12 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
if (VALID_EVTCHN(evtchn)) {
do_mask(info, EVT_MASK_REASON_EOI_PENDING);
- event_handler_exit(info);
+ /*
+ * Don't call event_handler_exit().
+ * Need to keep is_active non-zero in order to ignore re-raised
+ * events after cpu affinity changes while a lateeoi is pending.
+ */
+ clear_evtchn(evtchn);
}
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3de218ff39b9e3f0d453fe3154f12a174de44b25 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross(a)suse.com>
Date: Wed, 23 Jun 2021 15:09:13 +0200
Subject: [PATCH] xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Cc: stable(a)vger.kernel.org
Reported-by: Julien Grall <julien(a)xen.org>
Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time")
Tested-by: Julien Grall <julien(a)xen.org>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky(a)oracle.com>
Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 7bbfd58958bc..d7e361fb0548 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -642,6 +642,9 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
}
info->eoi_time = 0;
+
+ /* is_active hasn't been reset yet, do it now. */
+ smp_store_release(&info->is_active, 0);
do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
}
@@ -811,6 +814,7 @@ static void xen_evtchn_close(evtchn_port_t port)
BUG();
}
+/* Not called for lateeoi events. */
static void event_handler_exit(struct irq_info *info)
{
smp_store_release(&info->is_active, 0);
@@ -1883,7 +1887,12 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
if (VALID_EVTCHN(evtchn)) {
do_mask(info, EVT_MASK_REASON_EOI_PENDING);
- event_handler_exit(info);
+ /*
+ * Don't call event_handler_exit().
+ * Need to keep is_active non-zero in order to ignore re-raised
+ * events after cpu affinity changes while a lateeoi is pending.
+ */
+ clear_evtchn(evtchn);
}
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3de218ff39b9e3f0d453fe3154f12a174de44b25 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross(a)suse.com>
Date: Wed, 23 Jun 2021 15:09:13 +0200
Subject: [PATCH] xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Cc: stable(a)vger.kernel.org
Reported-by: Julien Grall <julien(a)xen.org>
Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time")
Tested-by: Julien Grall <julien(a)xen.org>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky(a)oracle.com>
Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 7bbfd58958bc..d7e361fb0548 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -642,6 +642,9 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
}
info->eoi_time = 0;
+
+ /* is_active hasn't been reset yet, do it now. */
+ smp_store_release(&info->is_active, 0);
do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
}
@@ -811,6 +814,7 @@ static void xen_evtchn_close(evtchn_port_t port)
BUG();
}
+/* Not called for lateeoi events. */
static void event_handler_exit(struct irq_info *info)
{
smp_store_release(&info->is_active, 0);
@@ -1883,7 +1887,12 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
if (VALID_EVTCHN(evtchn)) {
do_mask(info, EVT_MASK_REASON_EOI_PENDING);
- event_handler_exit(info);
+ /*
+ * Don't call event_handler_exit().
+ * Need to keep is_active non-zero in order to ignore re-raised
+ * events after cpu affinity changes while a lateeoi is pending.
+ */
+ clear_evtchn(evtchn);
}
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3de218ff39b9e3f0d453fe3154f12a174de44b25 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross(a)suse.com>
Date: Wed, 23 Jun 2021 15:09:13 +0200
Subject: [PATCH] xen/events: reset active flag for lateeoi events later
In order to avoid a race condition for user events when changing
cpu affinity reset the active flag only when EOI-ing the event.
This is working fine as all user events are lateeoi events. Note that
lateeoi_ack_mask_dynirq() is not modified as there is no explicit call
to xen_irq_lateeoi() expected later.
Cc: stable(a)vger.kernel.org
Reported-by: Julien Grall <julien(a)xen.org>
Fixes: b6622798bc50b62 ("xen/events: avoid handling the same event on two cpus at the same time")
Tested-by: Julien Grall <julien(a)xen.org>
Signed-off-by: Juergen Gross <jgross(a)suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrvsky(a)oracle.com>
Link: https://lore.kernel.org/r/20210623130913.9405-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 7bbfd58958bc..d7e361fb0548 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -642,6 +642,9 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
}
info->eoi_time = 0;
+
+ /* is_active hasn't been reset yet, do it now. */
+ smp_store_release(&info->is_active, 0);
do_unmask(info, EVT_MASK_REASON_EOI_PENDING);
}
@@ -811,6 +814,7 @@ static void xen_evtchn_close(evtchn_port_t port)
BUG();
}
+/* Not called for lateeoi events. */
static void event_handler_exit(struct irq_info *info)
{
smp_store_release(&info->is_active, 0);
@@ -1883,7 +1887,12 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
if (VALID_EVTCHN(evtchn)) {
do_mask(info, EVT_MASK_REASON_EOI_PENDING);
- event_handler_exit(info);
+ /*
+ * Don't call event_handler_exit().
+ * Need to keep is_active non-zero in order to ignore re-raised
+ * events after cpu affinity changes while a lateeoi is pending.
+ */
+ clear_evtchn(evtchn);
}
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 9e3d62d55bf455d4f9fdf2ede5c8756410c64102 Mon Sep 17 00:00:00 2001
From: Sven Schnelle <svens(a)linux.ibm.com>
Date: Tue, 15 Jun 2021 15:05:22 +0200
Subject: [PATCH] s390/topology: clear thread/group maps for offline cpus
The current code doesn't clear the thread/group maps for offline
CPUs. This may cause kernel crashes like the one bewlow in common
code that assumes if a CPU has sibblings it is online.
Unable to handle kernel pointer dereference in virtual kernel address space
Call Trace:
[<000000013a4b8c3c>] blk_mq_map_swqueue+0x10c/0x388
([<000000013a4b8bcc>] blk_mq_map_swqueue+0x9c/0x388)
[<000000013a4b9300>] blk_mq_init_allocated_queue+0x448/0x478
[<000000013a4b9416>] blk_mq_init_queue+0x4e/0x90
[<000003ff8019d3e6>] loop_add+0x106/0x278 [loop]
[<000003ff801b8148>] loop_init+0x148/0x1000 [loop]
[<0000000139de4924>] do_one_initcall+0x3c/0x1e0
[<0000000139ef449a>] do_init_module+0x6a/0x2a0
[<0000000139ef61bc>] __do_sys_finit_module+0xa4/0xc0
[<0000000139de9e6e>] do_syscall+0x7e/0xd0
[<000000013a8e0aec>] __do_syscall+0xbc/0x110
[<000000013a8ee2e8>] system_call+0x78/0xa0
Fixes: 52aeda7accb6 ("s390/topology: remove offline CPUs from CPU topology masks")
Cc: <stable(a)kernel.org> # 5.7+
Reported-by: Marius Hillenbrand <mhillen(a)linux.ibm.com>
Signed-off-by: Sven Schnelle <svens(a)linux.ibm.com>
Reviewed-by: Heiko Carstens <hca(a)linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor(a)linux.ibm.com>
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index bfcc327acc6b..26aa2614ee35 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -66,7 +66,10 @@ static void cpu_group_map(cpumask_t *dst, struct mask_info *info, unsigned int c
{
static cpumask_t mask;
- cpumask_copy(&mask, cpumask_of(cpu));
+ cpumask_clear(&mask);
+ if (!cpu_online(cpu))
+ goto out;
+ cpumask_set_cpu(cpu, &mask);
switch (topology_mode) {
case TOPOLOGY_MODE_HW:
while (info) {
@@ -83,10 +86,10 @@ static void cpu_group_map(cpumask_t *dst, struct mask_info *info, unsigned int c
default:
fallthrough;
case TOPOLOGY_MODE_SINGLE:
- cpumask_copy(&mask, cpumask_of(cpu));
break;
}
cpumask_and(&mask, &mask, cpu_online_mask);
+out:
cpumask_copy(dst, &mask);
}
@@ -95,7 +98,10 @@ static void cpu_thread_map(cpumask_t *dst, unsigned int cpu)
static cpumask_t mask;
int i;
- cpumask_copy(&mask, cpumask_of(cpu));
+ cpumask_clear(&mask);
+ if (!cpu_online(cpu))
+ goto out;
+ cpumask_set_cpu(cpu, &mask);
if (topology_mode != TOPOLOGY_MODE_HW)
goto out;
cpu -= cpu % (smp_cpu_mtid + 1);
Trying to get the AB8500 charging driver working I ran into a bit
of bitrot: we haven't used the driver for a while so errors in
refactorings won't be noticed.
This one is pretty self evident: use argument to the macro or we
end up with a random pointer to something else.
Cc: stable(a)vger.kernel.org
Cc: Krzysztof Kozlowski <krzk(a)kernel.org>
Cc: Marcus Cooper <codekipper(a)gmail.com>
Fixes: 297d716f6260 ("power_supply: Change ownership from driver to core")
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
---
drivers/power/supply/ab8500-chargalg.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/power/supply/ab8500-chargalg.h b/drivers/power/supply/ab8500-chargalg.h
index 94a6f9068bc5..07e6ff50084f 100644
--- a/drivers/power/supply/ab8500-chargalg.h
+++ b/drivers/power/supply/ab8500-chargalg.h
@@ -15,7 +15,7 @@
* - POWER_SUPPLY_TYPE_USB,
* because only them store as drv_data pointer to struct ux500_charger.
*/
-#define psy_to_ux500_charger(x) power_supply_get_drvdata(psy)
+#define psy_to_ux500_charger(x) power_supply_get_drvdata(x)
/* Forward declaration */
struct ux500_charger;
--
2.31.1
Hi,
Earlier this year I noticed the CBUS UART, a discrete TI16C550C part
wired directly to the system controller's device bus and supposed to come
up as ttyS2 in addition to ttyS0 and ttyS1 ports from a Super I/O device
behind the PCI southbridge, is not recognised with my MIPS Malta board
booting big-endian.
I got to the bottom of the problem now and as it turns out we have two
long-standing bugs causing it, one in generic 8250 code and another in
Malta platform code, and this has never worked in the big-endian mode.
Here's v2 of the series, addressing minor issues with 1/2 pointed out in
the review.
Please apply.
Maciej
Using two distinct DMA zones turned out to be problematic. Here's an
attempt go back to a saner default.
Ard Biesheuvel (1):
arm64: mm: Set ZONE_DMA size based on early IORT scan
Nicolas Saenz Julienne (6):
arm64: mm: Move reserve_crashkernel() into mem_init()
arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
of/address: Introduce of_dma_get_max_cpu_address()
of: unittest: Add test for of_dma_get_max_cpu_address()
arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
mm: Remove examples from enum zone_type comment
arch/arm64/mm/init.c | 22 +++++++++-------
drivers/acpi/arm64/iort.c | 55 +++++++++++++++++++++++++++++++++++++++
drivers/of/address.c | 42 ++++++++++++++++++++++++++++++
drivers/of/unittest.c | 18 +++++++++++++
include/linux/acpi_iort.h | 4 +++
include/linux/mmzone.h | 20 --------------
include/linux/of.h | 7 +++++
7 files changed, 139 insertions(+), 29 deletions(-)
--
2.25.1
The direction of the pipe argument must match the request-type direction
bit or control requests may fail depending on the host-controller-driver
implementation.
Control transfers without a data stage are treated as OUT requests by
the USB stack and should be using usb_sndctrlpipe(). Failing to do so
will now trigger a warning.
Fix the OSIFI2C_SET_BIT_RATE and OSIFI2C_STOP requests which erroneously
used the osif_usb_read() helper and set the IN direction bit.
Reported-by: syzbot+9d7dadd15b8819d73f41(a)syzkaller.appspotmail.com
Fixes: 83e53a8f120f ("i2c: Add bus driver for for OSIF USB i2c device.")
Cc: stable(a)vger.kernel.org # 3.14
Cc: Andrew Lunn <andrew(a)lunn.ch>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/i2c/busses/i2c-robotfuzz-osif.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/i2c/busses/i2c-robotfuzz-osif.c b/drivers/i2c/busses/i2c-robotfuzz-osif.c
index a39f7d092797..66dfa211e736 100644
--- a/drivers/i2c/busses/i2c-robotfuzz-osif.c
+++ b/drivers/i2c/busses/i2c-robotfuzz-osif.c
@@ -83,7 +83,7 @@ static int osif_xfer(struct i2c_adapter *adapter, struct i2c_msg *msgs,
}
}
- ret = osif_usb_read(adapter, OSIFI2C_STOP, 0, 0, NULL, 0);
+ ret = osif_usb_write(adapter, OSIFI2C_STOP, 0, 0, NULL, 0);
if (ret) {
dev_err(&adapter->dev, "failure sending STOP\n");
return -EREMOTEIO;
@@ -153,7 +153,7 @@ static int osif_probe(struct usb_interface *interface,
* Set bus frequency. The frequency is:
* 120,000,000 / ( 16 + 2 * div * 4^prescale).
* Using dev = 52, prescale = 0 give 100KHz */
- ret = osif_usb_read(&priv->adapter, OSIFI2C_SET_BIT_RATE, 52, 0,
+ ret = osif_usb_write(&priv->adapter, OSIFI2C_SET_BIT_RATE, 52, 0,
NULL, 0);
if (ret) {
dev_err(&interface->dev, "failure sending bit rate");
--
2.26.3
From: Bhaumik Bhatt <bbhatt(a)codeaurora.org>
MHI reads the channel ID from the event ring element sent by the
device which can be any value between 0 and 255. In order to
prevent any out of bound accesses, add a check against the maximum
number of channels supported by the controller and those channels
not configured yet so as to skip processing of that event ring
element.
Cc: stable(a)vger.kernel.org
Fixes: 1d3173a3bae7 ("bus: mhi: core: Add support for processing events from client device")
Signed-off-by: Bhaumik Bhatt <bbhatt(a)codeaurora.org>
Reviewed-by: Hemant Kumar <hemantk(a)codeaurora.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Reviewed-by: Jeffrey Hugo <quic_jhugo(a)quicinc.com>
Link: https://lore.kernel.org/r/1624558141-11045-1-git-send-email-bbhatt@codeauro…
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
drivers/bus/mhi/core/main.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 69cd9dcde6d8..26bbc812121d 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -773,11 +773,18 @@ static void mhi_process_cmd_completion(struct mhi_controller *mhi_cntrl,
cmd_pkt = mhi_to_virtual(mhi_ring, ptr);
chan = MHI_TRE_GET_CMD_CHID(cmd_pkt);
- mhi_chan = &mhi_cntrl->mhi_chan[chan];
- write_lock_bh(&mhi_chan->lock);
- mhi_chan->ccs = MHI_TRE_GET_EV_CODE(tre);
- complete(&mhi_chan->completion);
- write_unlock_bh(&mhi_chan->lock);
+
+ if (chan < mhi_cntrl->max_chan &&
+ mhi_cntrl->mhi_chan[chan].configured) {
+ mhi_chan = &mhi_cntrl->mhi_chan[chan];
+ write_lock_bh(&mhi_chan->lock);
+ mhi_chan->ccs = MHI_TRE_GET_EV_CODE(tre);
+ complete(&mhi_chan->completion);
+ write_unlock_bh(&mhi_chan->lock);
+ } else {
+ dev_err(&mhi_cntrl->mhi_dev->dev,
+ "Completion packet for invalid channel ID: %d\n", chan);
+ }
mhi_del_ring_element(mhi_cntrl, mhi_ring);
}
--
2.25.1
This is a note to let you know that I've just added the patch titled
bus: mhi: pci-generic: Add missing
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-next branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will also be merged in the next major kernel release
during the merge window.
If you have any questions about this process, please let me know.
>From a25d144fb883c73506ba384de476bbaff8220a95 Mon Sep 17 00:00:00 2001
From: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Date: Mon, 21 Jun 2021 21:46:13 +0530
Subject: bus: mhi: pci-generic: Add missing
'pci_disable_pcie_error_reporting()' calls
If an error occurs after a 'pci_enable_pcie_error_reporting()' call, it
must be undone by a corresponding 'pci_disable_pcie_error_reporting()'
call
Add the missing call in the error handling path of the probe and in the
remove function.
Cc: <stable(a)vger.kernel.org>
Fixes: b012ee6bfe2a ("mhi: pci_generic: Add PCI error handlers")
Signed-off-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Link: https://lore.kernel.org/r/f70c14701f4922d67e717633c91b6c481b59f298.16234453…
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
Link: https://lore.kernel.org/r/20210621161616.77524-6-manivannan.sadhasivam@lina…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/bus/mhi/pci_generic.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/bus/mhi/pci_generic.c b/drivers/bus/mhi/pci_generic.c
index b3357a8a2fdb..ca3bc40427f8 100644
--- a/drivers/bus/mhi/pci_generic.c
+++ b/drivers/bus/mhi/pci_generic.c
@@ -665,7 +665,7 @@ static int mhi_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
err = mhi_register_controller(mhi_cntrl, mhi_cntrl_config);
if (err)
- return err;
+ goto err_disable_reporting;
/* MHI bus does not power up the controller by default */
err = mhi_prepare_for_power_up(mhi_cntrl);
@@ -699,6 +699,8 @@ static int mhi_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
mhi_unprepare_after_power_down(mhi_cntrl);
err_unregister:
mhi_unregister_controller(mhi_cntrl);
+err_disable_reporting:
+ pci_disable_pcie_error_reporting(pdev);
return err;
}
@@ -721,6 +723,7 @@ static void mhi_pci_remove(struct pci_dev *pdev)
pm_runtime_get_noresume(&pdev->dev);
mhi_unregister_controller(mhi_cntrl);
+ pci_disable_pcie_error_reporting(pdev);
}
static void mhi_pci_shutdown(struct pci_dev *pdev)
--
2.32.0
From: Hyun Kwon <hyun.kwon(a)xilinx.com>
Enable PCIe reference clock. There is no remove function that's why
this should be enough for simple operation.
Normally this clock is enabled by default by firmware but there are
usecases where this clock should be enabled by driver itself.
It is also good that PCIe clock is recorded in a clock framework.
Fixes: ab597d35ef11 ("PCI: xilinx-nwl: Add support for Xilinx NWL PCIe Host Controller")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hyun Kwon <hyun.kwon(a)xilinx.com>
Signed-off-by: Bharat Kumar Gogada <bharat.kumar.gogada(a)xilinx.com>
Signed-off-by: Michal Simek <michal.simek(a)xilinx.com>
---
Changes in v3:
- use PCIe instead of pcie
- add stable cc
- update commit message - reported by Krzysztof
Changes in v2:
- Update commit message - reported by Krzysztof
- Check return value from clk_prepare_enable() - reported by Krzysztof
drivers/pci/controller/pcie-xilinx-nwl.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/pci/controller/pcie-xilinx-nwl.c b/drivers/pci/controller/pcie-xilinx-nwl.c
index 8689311c5ef6..1c3d5b87ef20 100644
--- a/drivers/pci/controller/pcie-xilinx-nwl.c
+++ b/drivers/pci/controller/pcie-xilinx-nwl.c
@@ -6,6 +6,7 @@
* (C) Copyright 2014 - 2015, Xilinx, Inc.
*/
+#include <linux/clk.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
#include <linux/irq.h>
@@ -169,6 +170,7 @@ struct nwl_pcie {
u8 last_busno;
struct nwl_msi msi;
struct irq_domain *legacy_irq_domain;
+ struct clk *clk;
raw_spinlock_t leg_mask_lock;
};
@@ -823,6 +825,16 @@ static int nwl_pcie_probe(struct platform_device *pdev)
return err;
}
+ pcie->clk = devm_clk_get(dev, NULL);
+ if (IS_ERR(pcie->clk))
+ return PTR_ERR(pcie->clk);
+
+ err = clk_prepare_enable(pcie->clk);
+ if (err) {
+ dev_err(dev, "can't enable PCIe ref clock\n");
+ return err;
+ }
+
err = nwl_pcie_bridge_init(pcie);
if (err) {
dev_err(dev, "HW Initialization failed\n");
--
2.32.0
From: Nick Desaulniers <ndesaulniers(a)google.com>
commit 311bea3cb9ee20ef150ca76fc60a592bf6b159f5 upstream.
With GNU binutils 2.35+, linking with BFD produces warnings for vmlinux:
aarch64-linux-gnu-ld: warning: -z norelro ignored
BFD can produce this warning when the target emulation mode does not
support RELRO program headers, and -z relro or -z norelro is passed.
Alan Modra clarifies:
The default linker emulation for an aarch64-linux ld.bfd is
-maarch64linux, the default for an aarch64-elf linker is
-maarch64elf. They are not equivalent. If you choose -maarch64elf
you get an emulation that doesn't support -z relro.
The ARCH=arm64 kernel prefers -maarch64elf, but may fall back to
-maarch64linux based on the toolchain configuration.
LLD will always create RELRO program header regardless of target
emulation.
To avoid the above warning when linking with BFD, pass -z norelro only
when linking with LLD or with -maarch64linux.
Fixes: 3b92fa7485eb ("arm64: link with -z norelro regardless of CONFIG_RELOCATABLE")
Fixes: 3bbd3db86470 ("arm64: relocatable: fix inconsistencies in linker script and options")
Cc: <stable(a)vger.kernel.org> # 5.0.x-
Reported-by: kernelci.org bot <bot(a)kernelci.org>
Reported-by: Quentin Perret <qperret(a)google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
Reviewed-by: Nathan Chancellor <natechancellor(a)gmail.com>
Acked-by: Ard Biesheuvel <ardb(a)kernel.org>
Cc: Alan Modra <amodra(a)gmail.com>
Cc: Fāng-ruì Sòng <maskray(a)google.com>
Link: https://lore.kernel.org/r/20201218002432.788499-1-ndesaulniers@google.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
---
arch/arm64/Makefile | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index cd8f3cdabfd0..d227cf87c48f 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -10,7 +10,7 @@
#
# Copyright (C) 1995-2001 by Russell King
-LDFLAGS_vmlinux :=--no-undefined -X -z norelro
+LDFLAGS_vmlinux :=--no-undefined -X
CPPFLAGS_vmlinux.lds = -DTEXT_OFFSET=$(TEXT_OFFSET)
GZFLAGS :=-9
@@ -82,17 +82,21 @@ CHECKFLAGS += -D__AARCH64EB__
AS += -EB
# Prefer the baremetal ELF build target, but not all toolchains include
# it so fall back to the standard linux version if needed.
-KBUILD_LDFLAGS += -EB $(call ld-option, -maarch64elfb, -maarch64linuxb)
+KBUILD_LDFLAGS += -EB $(call ld-option, -maarch64elfb, -maarch64linuxb -z norelro)
UTS_MACHINE := aarch64_be
else
KBUILD_CPPFLAGS += -mlittle-endian
CHECKFLAGS += -D__AARCH64EL__
AS += -EL
# Same as above, prefer ELF but fall back to linux target if needed.
-KBUILD_LDFLAGS += -EL $(call ld-option, -maarch64elf, -maarch64linux)
+KBUILD_LDFLAGS += -EL $(call ld-option, -maarch64elf, -maarch64linux -z norelro)
UTS_MACHINE := aarch64
endif
+ifeq ($(CONFIG_LD_IS_LLD), y)
+KBUILD_LDFLAGS += -z norelro
+endif
+
CHECKFLAGS += -D__aarch64__
ifeq ($(CONFIG_ARM64_MODULE_PLTS),y)
--
2.32.0
Hi Greg,
Following the conversation with Marc, we discovered that an
important fix for ARM PMU is not backported to 4.14.y tree, it affects
counter value and give out nonsensical result.
Can you please include the following commit?
```
commit 3cce50dfec4a5b0414c974190940f47dd32c6dee
Author: Suzuki K Poulose <suzuki.poulose(a)arm.com>
Date: Tue Jul 10 09:58:03 2018 +0100
arm64: perf: Disable PMU while processing counter overflows
The arm64 PMU updates the event counters and reprograms the
counters in the overflow IRQ handler without disabling the
PMU. This could potentially cause skews in for group counters,
where the overflowed counters may potentially loose some event
counts, while they are reprogrammed. To prevent this, disable
the PMU while we process the counter overflows and enable it
right back when we are done.
This patch also moves the PMU stop/start routines to avoid a
forward declaration.
Suggested-by: Mark Rutland <mark.rutland(a)arm.com>
Cc: Will Deacon <will.deacon(a)arm.com>
Acked-by: Mark Rutland <mark.rutland(a)arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
```
For more details:
https://lists.cs.columbia.edu/pipermail/kvmarm/2021-June/047471.html
Regards,
Aman Priyadarshi
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879