From: Neil Armstrong <narmstrong(a)baylibre.com>
commit 4be9bd10e22dfc7fc101c5cf5969ef2d3a042d8a upstream.
Since "drm/fb: Stop leaking physical address", the default behaviour of
the DRM fbdev emulation is to set the smem_base to 0 and pass the new
FBINFO_HIDE_SMEM_START flag.
The main reason is to avoid leaking physical addresse to user-space, and
it follows a general move over the kernel code to avoid user-space to
manipulate physical addresses and then use some other mechanisms like
dma-buf to transfer physical buffer handles over multiple subsystems.
But, a lot of devices depends on closed sources binaries to enable
OpenGL hardware acceleration that uses this smem_start value to
pass physical addresses to out-of-tree modules in order to render
into these physical adresses. These should use dma-buf buffers allocated
from the DRM display device instead and stop relying on fbdev overallocation
to gather DMA memory (some HW vendors delivers GBM and Wayland capable
binaries, but older unsupported devices won't have these new binaries
and are doomed until an Open Source solution like Lima finalizes).
Since these devices heavily depends on this kind of software and because
the smem_start population was available for years, it's a breakage to
stop leaking smem_start without any alternative solutions.
This patch adds a Kconfig depending on the EXPERT config and an unsafe
kernel module parameter tainting the kernel when enabled.
A clear comment and Kconfig help text was added to clarify why and when
this patch should be reverted, but in the meantime it's a necessary
feature to keep.
Cc: Dave Airlie <airlied(a)gmail.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie(a)samsung.com>
Cc: Noralf Trønnes <noralf(a)tronnes.org>
Cc: Maxime Ripard <maxime.ripard(a)bootlin.com>
Cc: Eric Anholt <eric(a)anholt.net>
Cc: Lucas Stach <l.stach(a)pengutronix.de>
Cc: Rob Clark <robdclark(a)gmail.com>
Cc: Ben Skeggs <skeggsb(a)gmail.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Neil Armstrong <narmstrong(a)baylibre.com>
Reviewed-by: Maxime Ripard <maxime.ripard(a)bootlin.com>
Tested-by: Maxime Ripard <maxime.ripard(a)bootlin.com>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Acked-by: Dave Airlie <airlied(a)gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1538136355-15383-1-git-send-e…
Signed-off-by: Maxime Ripard <maxime.ripard(a)bootlin.com>
---
Hi,
This is a backport of a patch fixing a regression introduced in 4.19, and
merged in 4.20. Therefore, it targets 4.19 only.
Thanks!
Maxime
---
drivers/gpu/drm/Kconfig | 20 ++++++++++++++++++++
drivers/gpu/drm/drm_fb_helper.c | 25 +++++++++++++++++++++++++
2 files changed, 45 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index cb88528e7b10..e44e567bd789 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -110,6 +110,26 @@ config DRM_FBDEV_OVERALLOC
is 100. Typical values for double buffering will be 200,
triple buffering 300.
+config DRM_FBDEV_LEAK_PHYS_SMEM
+ bool "Shamelessly allow leaking of fbdev physical address (DANGEROUS)"
+ depends on DRM_FBDEV_EMULATION && EXPERT
+ default n
+ help
+ In order to keep user-space compatibility, we want in certain
+ use-cases to keep leaking the fbdev physical address to the
+ user-space program handling the fbdev buffer.
+ This affects, not only, Amlogic, Allwinner or Rockchip devices
+ with ARM Mali GPUs using an userspace Blob.
+ This option is not supported by upstream developers and should be
+ removed as soon as possible and be considered as a broken and
+ legacy behaviour from a modern fbdev device driver.
+
+ Please send any bug reports when using this to your proprietary
+ software vendor that requires this.
+
+ If in doubt, say "N" or spread the word to your closed source
+ library vendor.
+
config DRM_LOAD_EDID_FIRMWARE
bool "Allow to specify an EDID data set instead of probing for it"
depends on DRM
diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index 9628dd617826..c52e3c80e9e6 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -56,6 +56,25 @@ MODULE_PARM_DESC(drm_fbdev_overalloc,
"Overallocation of the fbdev buffer (%) [default="
__MODULE_STRING(CONFIG_DRM_FBDEV_OVERALLOC) "]");
+/*
+ * In order to keep user-space compatibility, we want in certain use-cases
+ * to keep leaking the fbdev physical address to the user-space program
+ * handling the fbdev buffer.
+ * This is a bad habit essentially kept into closed source opengl driver
+ * that should really be moved into open-source upstream projects instead
+ * of using legacy physical addresses in user space to communicate with
+ * other out-of-tree kernel modules.
+ *
+ * This module_param *should* be removed as soon as possible and be
+ * considered as a broken and legacy behaviour from a modern fbdev device.
+ */
+#if IS_ENABLED(CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM)
+static bool drm_leak_fbdev_smem = false;
+module_param_unsafe(drm_leak_fbdev_smem, bool, 0600);
+MODULE_PARM_DESC(fbdev_emulation,
+ "Allow unsafe leaking fbdev physical smem address [default=false]");
+#endif
+
static LIST_HEAD(kernel_fb_helper_list);
static DEFINE_MUTEX(kernel_fb_helper_lock);
@@ -3038,6 +3057,12 @@ int drm_fb_helper_generic_probe(struct drm_fb_helper *fb_helper,
fbi->screen_size = fb->height * fb->pitches[0];
fbi->fix.smem_len = fbi->screen_size;
fbi->screen_buffer = buffer->vaddr;
+ /* Shamelessly leak the physical address to user-space */
+#if IS_ENABLED(CONFIG_DRM_FBDEV_LEAK_PHYS_SMEM)
+ if (drm_leak_fbdev_smem && fbi->fix.smem_start == 0)
+ fbi->fix.smem_start =
+ page_to_phys(virt_to_page(fbi->screen_buffer));
+#endif
strcpy(fbi->fix.id, "DRM emulated");
drm_fb_helper_fill_fix(fbi, fb->pitches[0], fb->format->depth);
--
2.20.1
From: Eric Biggers <ebiggers(a)google.com>
Hi Greg, please consider applying this to 4.4-stable and 3.18-stable.
It's a minimal fix for a bug that was fixed incidentally by a large
refactoring in v4.8.
>8------------------------------------------------------8<
In the CTS template, when the input length is <= one block cipher block
(e.g. <= 16 bytes for AES) pass the correct length to the underlying CBC
transform rather than one block. This matches the upstream behavior and
makes the encryption/decryption operation correctly return -EINVAL when
1 <= nbytes < bsize or succeed when nbytes == 0, rather than crashing.
This was fixed upstream incidentally by a large refactoring,
commit 0605c41cc53c ("crypto: cts - Convert to skcipher"). But
syzkaller easily trips over this when running on older kernels, as it's
easily reachable via AF_ALG. Therefore, this patch makes the minimal
fix for older kernels.
Cc: linux-crypto(a)vger.kernel.org
Fixes: 76cb9521795a ("[CRYPTO] cts: Add CTS mode required for Kerberos AES support")
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
crypto/cts.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/crypto/cts.c b/crypto/cts.c
index e467ec0acf9f0..e65688d6a4caa 100644
--- a/crypto/cts.c
+++ b/crypto/cts.c
@@ -137,8 +137,8 @@ static int crypto_cts_encrypt(struct blkcipher_desc *desc,
lcldesc.info = desc->info;
lcldesc.flags = desc->flags;
- if (tot_blocks == 1) {
- err = crypto_blkcipher_encrypt_iv(&lcldesc, dst, src, bsize);
+ if (tot_blocks <= 1) {
+ err = crypto_blkcipher_encrypt_iv(&lcldesc, dst, src, nbytes);
} else if (nbytes <= bsize * 2) {
err = cts_cbc_encrypt(ctx, desc, dst, src, 0, nbytes);
} else {
@@ -232,8 +232,8 @@ static int crypto_cts_decrypt(struct blkcipher_desc *desc,
lcldesc.info = desc->info;
lcldesc.flags = desc->flags;
- if (tot_blocks == 1) {
- err = crypto_blkcipher_decrypt_iv(&lcldesc, dst, src, bsize);
+ if (tot_blocks <= 1) {
+ err = crypto_blkcipher_decrypt_iv(&lcldesc, dst, src, nbytes);
} else if (nbytes <= bsize * 2) {
err = cts_cbc_decrypt(ctx, desc, dst, src, 0, nbytes);
} else {
--
2.20.1.97.g81188d93c3-goog
commit 1f82de10d6b1 ("PCI/x86: don't assume prefetchable ranges are 64bit")
added probing of bridge support for 64 bit memory each time bridge is
re-enumerated.
Unfortunately this probing is destructive if any device behind
the bridge is in use at this time.
This was observed in the field, see
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
and specifically
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg02082.html
There's no real need to re-probe the bridge features as the
registers in question never change - detect that using
the memory flag being set (it's always set on the 1st pass since
all PCI2PCI bridges support memory forwarding) and skip the probing.
Thus, only the first call will perform the disruptive probing and sets
the resource flags as required - which we can be reasonably sure happens
before any devices have been configured.
Avoiding repeated calls to pci_bridge_check_ranges might be even nicer.
Unfortunately I couldn't come up with a clean way to do it without a
major probing code refactoring.
Reported-by: xuyandong <xuyandong2(a)huawei.com>
Tested-by: xuyandong <xuyandong2(a)huawei.com>
Cc: stable(a)vger.kernel.org
Cc: Yinghai Lu <yinghai(a)kernel.org>
Cc: Jesse Barnes <jbarnes(a)virtuousgeek.org>
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
---
Please review and consider for stable.
changes from v1:
comment and commit log updates to address comments by Bjorn.
drivers/pci/setup-bus.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ed960436df5e..d5c25d465d97 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -741,6 +741,16 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
struct resource *b_res;
b_res = &bridge->resource[PCI_BRIDGE_RESOURCES];
+
+ /*
+ * Don't re-check after this was called once already:
+ * important since bridge might be in use.
+ * Note: this is only reliable because as per spec all PCI to PCI
+ * bridges support memory unconditionally so IORESOURCE_MEM is set.
+ */
+ if (b_res[1].flags & IORESOURCE_MEM)
+ return;
+
b_res[1].flags |= IORESOURCE_MEM;
pci_read_config_word(bridge, PCI_IO_BASE, &io);
--
MST
The _DSM function number validation only happens to succeed when the
generic Linux command number translation corresponds with a
DSM-family-specific function number. This breaks NVDIMM-N
implementations that correctly implement _LSR, _LSW, and _LSI, but do
not happen to publish support for DSM function numbers 4, 5, and 6.
Recall that the support for _LS{I,R,W} family of methods results in the
DIMM being marked as supporting those command numbers at
acpi_nfit_register_dimms() time. The DSM function mask is only used for
ND_CMD_CALL support of non-NVDIMM_FAMILY_INTEL devices.
Fixes: 31eca76ba2fc ("nfit, libnvdimm: limited/whitelisted dimm command...")
Cc: <stable(a)vger.kernel.org>
Link: https://github.com/pmem/ndctl/issues/78
Reported-by: Sujith Pandel <sujith_pandel(a)dell.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
Sujith, this is a larger change than what you originally tested, but it
should behave the same. I wanted to consolidate all the code that
handles Linux command number to DIMM _DSM function number translation.
If you have a chance to re-test with this it would be much appreciated.
Thanks for the report!
drivers/acpi/nfit/core.c | 43 +++++++++++++++++++++++++++++--------------
1 file changed, 29 insertions(+), 14 deletions(-)
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 790691d9a982..d5d64e90ae71 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -409,6 +409,29 @@ static bool payload_dumpable(struct nvdimm *nvdimm, unsigned int func)
return true;
}
+static int cmd_to_func(struct nvdimm *nvdimm, unsigned int cmd,
+ struct nd_cmd_pkg *call_pkg)
+{
+ struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+
+ if (cmd == ND_CMD_CALL) {
+ int i;
+
+ if (call_pkg && nfit_mem->family != call_pkg->nd_family)
+ return -ENOTTY;
+
+ for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++)
+ if (call_pkg->nd_reserved2[i])
+ return -EINVAL;
+ return call_pkg->nd_command;
+ }
+
+ /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */
+ if (nfit_mem->family == NVDIMM_FAMILY_INTEL)
+ return cmd;
+ return 0;
+}
+
int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc)
{
@@ -422,30 +445,21 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
unsigned long cmd_mask, dsm_mask;
u32 offset, fw_status = 0;
acpi_handle handle;
- unsigned int func;
const guid_t *guid;
- int rc, i;
+ int func, rc, i;
if (cmd_rc)
*cmd_rc = -EINVAL;
- func = cmd;
- if (cmd == ND_CMD_CALL) {
- call_pkg = buf;
- func = call_pkg->nd_command;
-
- for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++)
- if (call_pkg->nd_reserved2[i])
- return -EINVAL;
- }
if (nvdimm) {
struct acpi_device *adev = nfit_mem->adev;
if (!adev)
return -ENOTTY;
- if (call_pkg && nfit_mem->family != call_pkg->nd_family)
- return -ENOTTY;
+ func = cmd_to_func(nvdimm, cmd, buf);
+ if (func < 0)
+ return func;
dimm_name = nvdimm_name(nvdimm);
cmd_name = nvdimm_cmd_name(cmd);
cmd_mask = nvdimm_cmd_mask(nvdimm);
@@ -456,6 +470,7 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
} else {
struct acpi_device *adev = to_acpi_dev(acpi_desc);
+ func = cmd;
cmd_name = nvdimm_bus_cmd_name(cmd);
cmd_mask = nd_desc->cmd_mask;
dsm_mask = cmd_mask;
@@ -470,7 +485,7 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
if (!desc || (cmd && (desc->out_num + desc->in_num == 0)))
return -ENOTTY;
- if (!test_bit(cmd, &cmd_mask) || !test_bit(func, &dsm_mask))
+ if (!test_bit(cmd, &cmd_mask) && !test_bit(func, &dsm_mask))
return -ENOTTY;
in_obj.type = ACPI_TYPE_PACKAGE;
The patch below does not apply to the 4.20-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4d80273976bf880c4bed9359b8f2d45663140c86 Mon Sep 17 00:00:00 2001
From: Lyude Paul <lyude(a)redhat.com>
Date: Mon, 8 Oct 2018 19:24:30 -0400
Subject: [PATCH] drm/atomic_helper: Disallow new modesets on unregistered
connectors
With the exception of modesets which would switch the DPMS state of a
connector from on to off, we want to make sure that we disallow all
modesets which would result in enabling a new monitor or a new mode
configuration on a monitor if the connector for the display in question
is no longer registered. This allows us to stop userspace from trying to
enable new displays on connectors for an MST topology that were just
removed from the system, without preventing userspace from disabling
DPMS on those connectors.
Changes since v5:
- Fix typo in comment, nothing else
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: stable(a)vger.kernel.org
Link: https://patchwork.freedesktop.org/patch/msgid/20181008232437.5571-2-lyude@r…
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 3cf1aa132778..c1a35078b2b9 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -308,6 +308,26 @@ update_connector_routing(struct drm_atomic_state *state,
return 0;
}
+ crtc_state = drm_atomic_get_new_crtc_state(state,
+ new_connector_state->crtc);
+ /*
+ * For compatibility with legacy users, we want to make sure that
+ * we allow DPMS On->Off modesets on unregistered connectors. Modesets
+ * which would result in anything else must be considered invalid, to
+ * avoid turning on new displays on dead connectors.
+ *
+ * Since the connector can be unregistered at any point during an
+ * atomic check or commit, this is racy. But that's OK: all we care
+ * about is ensuring that userspace can't do anything but shut off the
+ * display on a connector that was destroyed after its been notified,
+ * not before.
+ */
+ if (!READ_ONCE(connector->registered) && crtc_state->active) {
+ DRM_DEBUG_ATOMIC("[CONNECTOR:%d:%s] is not registered\n",
+ connector->base.id, connector->name);
+ return -EINVAL;
+ }
+
funcs = connector->helper_private;
if (funcs->atomic_best_encoder)
@@ -352,7 +372,6 @@ update_connector_routing(struct drm_atomic_state *state,
set_best_encoder(state, new_connector_state, new_encoder);
- crtc_state = drm_atomic_get_new_crtc_state(state, new_connector_state->crtc);
crtc_state->connectors_changed = true;
DRM_DEBUG_ATOMIC("[CONNECTOR:%d:%s] using [ENCODER:%d:%s] on [CRTC:%d:%s]\n",
Hello,
We ran automated tests on a patchset that was proposed for merging into this
kernel tree. The patches were applied to:
Kernel repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Commit: 8aab2b4410a2 Linux 4.20.2
The results of these automated tests are provided below.
Overall result: FAILED (see details below)
Patch merge: OK
Compile: FAILED
We attempted to compile the kernel for multiple architectures, but the compile
failed on one or more architectures:
s390x: FAILED (build log attached: build_s390.log.gz)
powerpc64le: FAILED (build log attached: build_powerpc.log.gz)
aarch64: FAILED (build log attached: build_arm64.log.gz)
x86_64: FAILED (build log attached: build_x86_64.log.gz)
We hope that these logs can help you find the problem quickly. For the full
detail on our testing procedures, please scroll to the bottom of this message.
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Merge testing
-------------
We cloned this repository and checked out a ref:
Repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Ref: 8aab2b4410a2 Linux 4.20.2
We then merged the following patches with `git am`:
powerpc-tm-unset-msr-if-not-recheckpointing.patch
btrfs-fix-deadlock-when-using-free-space-tree-due-to-block-group-creation.patch
usbcore-select-only-first-configuration-for-non-uac3-compliant-devices.patch
staging-rtl8188eu-fix-module-loading-from-tasklet-for-ccmp-encryption.patch
staging-rtl8188eu-fix-module-loading-from-tasklet-for-wep-encryption.patch
cpufreq-scpi-scmi-fix-freeing-of-dynamic-opps.patch
cpufreq-scmi-fix-frequency-invariance-in-slow-path.patch
x86-modpost-replace-last-remnants-of-retpoline-with-config_retpoline.patch
alsa-hda-realtek-support-dell-headset-mode-for-new-aio-platform.patch
alsa-hda-realtek-add-unplug-function-into-unplug-state-of-headset-mode-for-alc225.patch
alsa-hda-realtek-disable-headset-mic-vref-for-headset-mode-of-alc225.patch
cifs-fix-adjustment-of-credits-for-mtu-requests.patch
cifs-do-not-set-credits-to-1-if-the-server-didn-t-grant-anything.patch
cifs-do-not-hide-eintr-after-sending-network-packets.patch
cifs-fix-credit-computation-for-compounded-requests.patch
cifs-fix-potential-oob-access-of-lock-element-array.patch
cifs-check-kzalloc-return.patch
arm-davinci-dm355-evm-fix-label-names-in-gpio-lookup-entries.patch
arm-davinci-da850-evm-fix-label-names-in-gpio-lookup-entries.patch
arm-davinci-omapl138-hawk-fix-label-names-in-gpio-lookup-entries.patch
arm-davinci-dm644x-evm-fix-label-names-in-gpio-lookup-entries.patch
arm-davinci-da830-evm-fix-label-names-in-gpio-lookup-entries.patch
usb-cdc-acm-send-zlp-for-telit-3g-intel-based-modems.patch
usb-storage-don-t-insert-sane-sense-for-spc3-when-bad-sense-specified.patch
usb-storage-add-quirk-for-smi-sm3350.patch
usb-add-usb_quirk_delay_ctrl_msg-quirk-for-corsair-k70-rgb.patch
fork-memcg-fix-cached_stacks-case.patch
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
mm-usercopy.c-no-check-page-span-for-stack-objects.patch
mm-memcg-fix-reclaim-deadlock-with-writeback.patch
mm-page_mapped-don-t-assume-compound-page-is-huge-or-thp.patch
Compile testing
---------------
We compiled the kernel for 4 architectures:
s390x:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
powerpc64le:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
aarch64:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
x86_64:
make options: make INSTALL_MOD_STRIP=1 -j64 targz-pkg -j64
configuration:
Hardware testing
----------------
We booted each kernel and ran the following tests:
s390:
powerpc:
arm64:
x86_64:
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 63f3655f950186752236bb88a22f8252c11ce394 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko(a)suse.com>
Date: Tue, 8 Jan 2019 15:23:07 -0800
Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
wait_on_page_bit+0x82/0xa0
shrink_page_list+0x907/0x960
shrink_inactive_list+0x2c7/0x680
shrink_node_memcg+0x404/0x830
shrink_node+0xd8/0x300
do_try_to_free_pages+0x10d/0x330
try_to_free_mem_cgroup_pages+0xd5/0x1b0
try_charge+0x14d/0x720
memcg_kmem_charge_memcg+0x3c/0xa0
memcg_kmem_charge+0x7e/0xd0
__alloc_pages_nodemask+0x178/0x260
alloc_pages_current+0x95/0x140
pte_alloc_one+0x17/0x40
__pte_alloc+0x1e/0x110
alloc_set_pte+0x5fe/0xc20
do_fault+0x103/0x970
handle_mm_fault+0x61e/0xd10
__do_page_fault+0x252/0x4d0
do_page_fault+0x30/0x80
page_fault+0x28/0x30
task2:
__lock_page+0x86/0xa0
mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
ext4_writepages+0x479/0xd60
do_writepages+0x1e/0x30
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x272/0x600
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x268/0x300
wb_workfn+0xb4/0x390
process_one_work+0x189/0x420
worker_thread+0x4e/0x4b0
kthread+0xe6/0x100
ret_from_fork+0x41/0x50
He adds
"task1 is waiting for the PageWriteback bit of the page that task2 has
collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
LOCKED bit the page which tasks1 has locked"
More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path. So
this is essentially ABBA deadlock:
lock_page(A)
SetPageWriteback(A)
unlock_page(A)
lock_page(B)
lock_page(B)
pte_alloc_pne
shrink_page_list
wait_on_page_writeback(A)
SetPageWriteback(B)
unlock_page(B)
# flush A, B to clear the writeback
This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.
Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.
There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare. I am not
aware of a better solution unfortunately.
[akpm(a)linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm(a)linux-foundation.org: coding-style fixes]
[mhocko(a)kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Reported-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu(a)linux.alibaba.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Dave Chinner <david(a)fromorbit.com>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com>
Cc: Shakeel Butt <shakeelb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/memory.c b/mm/memory.c
index a52663c0612d..5e46836714dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2994,6 +2994,28 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret;
+ /*
+ * Preallocate pte before we take page_lock because this might lead to
+ * deadlocks for memcg reclaim which waits for pages under writeback:
+ * lock_page(A)
+ * SetPageWriteback(A)
+ * unlock_page(A)
+ * lock_page(B)
+ * lock_page(B)
+ * pte_alloc_pne
+ * shrink_page_list
+ * wait_on_page_writeback(A)
+ * SetPageWriteback(B)
+ * unlock_page(B)
+ * # flush A, B to clear the writeback
+ */
+ if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+ vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
+ if (!vmf->prealloc_pte)
+ return VM_FAULT_OOM;
+ smp_wmb(); /* See comment in __pte_alloc() */
+ }
+
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))