From: Bartosz Golaszewski <bgolaszewski(a)baylibre.com>
This patch reverts commit 3243ff2a05ec ("net: ethernet: davinci_emac:
Deduplicate bus_find_device() by name matching") and adds a comment
which should stop anyone from reintroducing the same "fix" in the future.
We can't use bus_find_device_by_name() here because the device name is
not guaranteed to be 'davinci_mdio'. On some systems it can be
'davinci_mdio.0' so we need to use strncmp() against the first part of
the string to correctly match it.
Fixes: 3243ff2a05ec ("net: ethernet: davinci_emac: Deduplicate bus_find_device() by name matching")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bartosz Golaszewski <bgolaszewski(a)baylibre.com>
---
drivers/net/ethernet/ti/davinci_emac.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
index 06d7c9e4dcda..a1a6445b5a7e 100644
--- a/drivers/net/ethernet/ti/davinci_emac.c
+++ b/drivers/net/ethernet/ti/davinci_emac.c
@@ -1385,6 +1385,11 @@ static int emac_devioctl(struct net_device *ndev, struct ifreq *ifrq, int cmd)
return -EOPNOTSUPP;
}
+static int match_first_device(struct device *dev, void *data)
+{
+ return !strncmp(dev_name(dev), "davinci_mdio", 12);
+}
+
/**
* emac_dev_open - EMAC device open
* @ndev: The DaVinci EMAC network adapter
@@ -1484,8 +1489,14 @@ static int emac_dev_open(struct net_device *ndev)
/* use the first phy on the bus if pdata did not give us a phy id */
if (!phydev && !priv->phy_id) {
- phy = bus_find_device_by_name(&mdio_bus_type, NULL,
- "davinci_mdio");
+ /* NOTE: we can't use bus_find_device_by_name() here because
+ * the device name is not guaranteed to be 'davinci_mdio'. On
+ * some systems it can be 'davinci_mdio.0' so we need to use
+ * strncmp() against the first part of the string to correctly
+ * match it.
+ */
+ phy = bus_find_device(&mdio_bus_type, NULL, NULL,
+ match_first_device);
if (phy) {
priv->phy_id = dev_name(phy);
if (!priv->phy_id || !*priv->phy_id)
--
2.17.1
Changes since v2 [1]:
* Rebased on v4.18-rc1
* Collect Logan's reviewed-by for "mm, devm_memremap_pages: Add
MEMORY_DEVICE_PRIVATE support"
* Convert __put_devmap_managed_page and devmap_managed_key to
EXPORT_SYMBOL otherwise put_page() becomes limited to GPL-only
modules.
* Clarify some of the changelogs.
[1]: https://lkml.org/lkml/2018/5/23/24
---
Hi Andrew, here's v3 to replace these 5 currently in mm:
mm-devm_memremap_pages-mark-devm_memremap_pages-export_symbol_gpl.patch
mm-devm_memremap_pages-handle-errors-allocating-final-devres-action.patch
mm-hmm-use-devm-semantics-for-hmm_devmem_add-remove.patch
mm-hmm-replace-hmm_devmem_pages_create-with-devm_memremap_pages.patch
mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch
For maintainability, as ZONE_DEVICE continues to attract new users,
it is useful to keep all users consolidated on devm_memremap_pages() as
the interface for create "device pages".
The devm_memremap_pages() implementation was recently reworked to make
it more generic for arbitrary users, like the proposed peer-to-peer
PCI-E enabling. HMM pre-dated this rework and opted to duplicate
devm_memremap_pages() as hmm_devmem_pages_create().
Rework HMM to be a consumer of devm_memremap_pages() directly and fix up
the licensing on the exports given the deep dependencies on the mm.
Patches based on v4.18-rc1 where there are no upstream consumers of the
HMM functionality.
---
Dan Williams (8):
mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL
mm, devm_memremap_pages: Kill mapping "System RAM" support
mm, devm_memremap_pages: Fix shutdown handling
mm, devm_memremap_pages: Add MEMORY_DEVICE_PRIVATE support
mm, hmm: Use devm semantics for hmm_devmem_{add,remove}
mm, hmm: Replace hmm_devmem_pages_create() with devm_memremap_pages()
mm, hmm: Mark hmm_devmem_{add,add_resource} EXPORT_SYMBOL_GPL
mm: Fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
drivers/dax/pmem.c | 10 -
drivers/nvdimm/pmem.c | 18 +-
include/linux/hmm.h | 4
include/linux/memremap.h | 7 +
kernel/memremap.c | 89 +++++++----
mm/hmm.c | 307 +++++--------------------------------
tools/testing/nvdimm/test/iomap.c | 21 ++-
7 files changed, 132 insertions(+), 324 deletions(-)
SPC5r17 states that the contents of the ADDITIONAL LENGTH field are not
altered based on the allocation length, so always calculate and pack the
full key list length even if the list itself is truncated.
According to Maged:
Yes it fixes the "Storage Spaces Persistent Reservation" test in the
Windows 2016 Server Failover Cluster validation suites when having
many connections that result in more than 8 registrations. I tested
your patch on 4.17 with iblock.
This behaviour can be tested using the libiscsi PrinReadKeys.Truncate
test.
Cc: stable(a)vger.kernel.org
Signed-off-by: David Disseldorp <ddiss(a)suse.de>
Reviewed-by: Mike Christie <mchristi(a)redhat.com>
Tested-by: Maged Mokhtar <mmokhtar(a)petasan.org>
---
drivers/target/target_core_pr.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
Changes since v1:
* CC stable
* mention Maged's Windows PR test fix comment in commit message
* add Reviewed-by and Tested-by tags
diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c
index 01ac306131c1..2e865fdaa362 100644
--- a/drivers/target/target_core_pr.c
+++ b/drivers/target/target_core_pr.c
@@ -3727,11 +3727,16 @@ core_scsi3_pri_read_keys(struct se_cmd *cmd)
* Check for overflow of 8byte PRI READ_KEYS payload and
* next reservation key list descriptor.
*/
- if ((add_len + 8) > (cmd->data_length - 8))
- break;
-
- put_unaligned_be64(pr_reg->pr_res_key, &buf[off]);
- off += 8;
+ if ((off + 8) <= cmd->data_length) {
+ put_unaligned_be64(pr_reg->pr_res_key, &buf[off]);
+ off += 8;
+ }
+ /*
+ * SPC5r17: 6.16.2 READ KEYS service action
+ * The ADDITIONAL LENGTH field indicates the number of bytes in
+ * the Reservation key list. The contents of the ADDITIONAL
+ * LENGTH field are not altered based on the allocation length
+ */
add_len += 8;
}
spin_unlock(&dev->t10_pr.registration_lock);
--
2.13.6
This is the start of the stable review cycle for the 4.4.137 release.
There are 24 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu Jun 14 16:48:07 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.137-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.4.137-rc1
Eric Dumazet <edumazet(a)google.com>
net: metrics: add proper netlink validation
Florian Fainelli <f.fainelli(a)gmail.com>
net: phy: broadcom: Fix bcm_write_exp()
Eric Dumazet <edumazet(a)google.com>
rtnetlink: validate attributes in do_setlink()
Dan Carpenter <dan.carpenter(a)oracle.com>
team: use netdev_features_t instead of u32
Jack Morgenstein <jackm(a)dev.mellanox.co.il>
net/mlx4: Fix irq-unsafe spinlock usage
Shahed Shaikh <shahed.shaikh(a)cavium.com>
qed: Fix mask for physical address in ILT entry
Willem de Bruijn <willemb(a)google.com>
packet: fix reserve calculation
Daniele Palmas <dnlplm(a)gmail.com>
net: usb: cdc_mbim: add flag FLAG_SEND_ZLP
Eric Dumazet <edumazet(a)google.com>
net/packet: refine check for priv area size
Cong Wang <xiyou.wangcong(a)gmail.com>
netdev-FAQ: clarify DaveM's position for stable backports
Wenwen Wang <wang6495(a)umn.edu>
isdn: eicon: fix a missing-check bug
Willem de Bruijn <willemb(a)google.com>
ipv4: remove warning in ip_recv_error
Sabrina Dubroca <sd(a)queasysnail.net>
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
Govindarajulu Varadarajan <gvaradar(a)cisco.com>
enic: set DMA mask to 47 bit
Alexey Kodanev <alexey.kodanev(a)oracle.com>
dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()
Julia Lawall <Julia.Lawall(a)lip6.fr>
bnx2x: use the right constant
Stefan Wahren <stefan.wahren(a)i2se.com>
brcmfmac: Fix check for ISO3166 code
Dave Airlie <airlied(a)redhat.com>
drm: set FMODE_UNSIGNED_OFFSET for drm files
Amir Goldstein <amir73il(a)gmail.com>
xfs: fix incorrect log_flushed on fsync
Nathan Chancellor <natechancellor(a)gmail.com>
kconfig: Avoid format overflow warning from GCC 8.1
Linus Torvalds <torvalds(a)linux-foundation.org>
mmap: relax file size limit for regular files
Linus Torvalds <torvalds(a)linux-foundation.org>
mmap: introduce sane default mmap limits
Chris Chiu <chiu(a)endlessm.com>
tpm: self test failure should not cause suspend to fail
Enric Balletbo i Serra <enric.balletbo(a)collabora.com>
tpm: do not suspend/resume if power stays on
-------------
Diffstat:
Documentation/networking/netdev-FAQ.txt | 9 ++++++
Makefile | 4 +--
drivers/char/tpm/tpm-chip.c | 13 +++++++++
drivers/char/tpm/tpm-interface.c | 7 +++++
drivers/char/tpm/tpm.h | 1 +
drivers/gpu/drm/drm_fops.c | 1 +
drivers/isdn/hardware/eicon/diva.c | 22 ++++++++++-----
drivers/isdn/hardware/eicon/diva.h | 5 ++--
drivers/isdn/hardware/eicon/divasmain.c | 18 +++++++-----
drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 2 +-
drivers/net/ethernet/cisco/enic/enic_main.c | 8 +++---
drivers/net/ethernet/mellanox/mlx4/qp.c | 4 +--
drivers/net/ethernet/qlogic/qed/qed_cxt.c | 2 +-
drivers/net/phy/bcm-cygnus.c | 6 ++--
drivers/net/phy/bcm-phy-lib.h | 7 +++++
drivers/net/phy/bcm7xxx.c | 4 +--
drivers/net/team/team.c | 3 +-
drivers/net/usb/cdc_mbim.c | 2 +-
drivers/net/wireless/brcm80211/brcmfmac/cfg80211.c | 2 +-
fs/xfs/xfs_log.c | 7 -----
mm/mmap.c | 32 ++++++++++++++++++++++
net/core/rtnetlink.c | 8 +++---
net/dccp/proto.c | 2 --
net/ipv4/fib_semantics.c | 2 ++
net/ipv4/ip_sockglue.c | 2 --
net/ipv6/ip6mr.c | 3 +-
net/packet/af_packet.c | 4 +--
scripts/kconfig/confdata.c | 2 +-
28 files changed, 129 insertions(+), 53 deletions(-)
From: Sandy Huang <hjc(a)rock-chips.com>
The vop irq is shared between vop and iommu and irq probing in the
iommu driver moved to the probe function recently. This can in some
cases lead to a stall if the irq is triggered while the vop driver
still has it disabled, but the vop irq handler gets called.
But there is no real need to disable the irq, as the vop can simply
also track its enabled state and ignore irqs in that case.
For this we can simply check the power-domain state of the vop,
similar to how the iommu driver does it.
So remove the enable/disable handling and add appropriate condition
to the irq handler.
changes in v2:
- move to just check the power-domain state
- add clock handling
changes in v3:
- clarify comment to speak of runtime-pm not power-domain
changes in v4:
- address Marc's comments (clk-enable WARN_ON and style improvement)
Fixes: d0b912bd4c23 ("iommu/rockchip: Request irqs in rk_iommu_probe()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sandy Huang <hjc(a)rock-chips.com>
Signed-off-by: Heiko Stuebner <heiko(a)sntech.de>
Tested-by: Ezequiel Garcia <ezequiel(a)collabora.com>
---
drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 25 ++++++++++++++-------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
index 9a1f272e41c7..d105e984cf09 100644
--- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
+++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
@@ -573,8 +573,6 @@ static int vop_enable(struct drm_crtc *crtc)
spin_unlock(&vop->reg_lock);
- enable_irq(vop->irq);
-
drm_crtc_vblank_on(crtc);
return 0;
@@ -618,8 +616,6 @@ static void vop_crtc_atomic_disable(struct drm_crtc *crtc,
vop_dsp_hold_valid_irq_disable(vop);
- disable_irq(vop->irq);
-
vop->is_enabled = false;
/*
@@ -1195,6 +1191,18 @@ static irqreturn_t vop_isr(int irq, void *data)
uint32_t active_irqs;
int ret = IRQ_NONE;
+ /*
+ * The irq is shared with the iommu. If the runtime-pm state of the
+ * vop-device is disabled the irq has to be targetted at the iommu.
+ */
+ if (!pm_runtime_get_if_in_use(vop->dev))
+ return IRQ_NONE;
+
+ if (vop_core_clks_enable(vop)) {
+ DRM_DEV_ERROR_RATELIMITED(vop->dev, "couldn't enable clocks\n");
+ goto out;
+ }
+
/*
* interrupt register has interrupt status, enable and clear bits, we
* must hold irq_lock to avoid a race with enable/disable_vblank().
@@ -1210,7 +1218,7 @@ static irqreturn_t vop_isr(int irq, void *data)
/* This is expected for vop iommu irqs, since the irq is shared */
if (!active_irqs)
- return IRQ_NONE;
+ goto out_disable;
if (active_irqs & DSP_HOLD_VALID_INTR) {
complete(&vop->dsp_hold_completion);
@@ -1236,6 +1244,10 @@ static irqreturn_t vop_isr(int irq, void *data)
DRM_DEV_ERROR(vop->dev, "Unknown VOP IRQs: %#02x\n",
active_irqs);
+out_disable:
+ vop_core_clks_disable(vop);
+out:
+ pm_runtime_put(vop->dev);
return ret;
}
@@ -1614,9 +1626,6 @@ static int vop_bind(struct device *dev, struct device *master, void *data)
if (ret)
goto err_disable_pm_runtime;
- /* IRQ is initially disabled; it gets enabled in power_on */
- disable_irq(vop->irq);
-
return 0;
err_disable_pm_runtime:
--
2.17.0
From: Mike Snitzer <snitzer(a)redhat.com>
Use of bio_clone_bioset() is inefficient if there is no need to clone
the original bio's bio_vec array. Best to use the bio_clone_fast()
variant. Also, just using bio_advance() is only part of what is needed
to properly setup the clone -- it doesn't account for the various
bio_integrity() related work that also needs to be performed (see
bio_split).
Address both of these issues by switching from bio_clone_bioset() to
bio_split().
Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable(a)vger.kernel.org # 4.15+, requires removal of '&' before md->queue->bio_split
Reported-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: NeilBrown <neilb(a)suse.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
---
drivers/md/dm.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e65429a29c06..a3b103e8e3ce 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1606,10 +1606,9 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
* the usage of io->orig_bio in dm_remap_zone_report()
* won't be affected by this reassignment.
*/
- struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
- &md->queue->bio_split);
+ struct bio *b = bio_split(bio, bio_sectors(bio) - ci.sector_count,
+ GFP_NOIO, &md->queue->bio_split);
ci.io->orig_bio = b;
- bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
bio_chain(b, bio);
ret = generic_make_request(bio);
break;
--
2.17.1
gregkh(a)linuxfoundation.org wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> powerpc/trace/syscalls: Update syscall name matching logic
>
> to the 3.18-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> powerpc-trace-syscalls-update-syscall-name-matching-logic.patch
> and it can be found in the queue-3.18 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Please drop this patch from v3.18 as this depends on commit
e145242ea0df6, which I don't see backported to v3.18.
Thanks,
Naveen
>
>
> From foo@baz Sun Jun 17 13:19:44 CEST 2018
> From: "Naveen N. Rao" <naveen.n.rao(a)linux.vnet.ibm.com>
> Date: Fri, 4 May 2018 18:44:24 +0530
> Subject: powerpc/trace/syscalls: Update syscall name matching logic
>
> From: "Naveen N. Rao" <naveen.n.rao(a)linux.vnet.ibm.com>
>
> [ Upstream commit 0b7758aaf6543b9a10c8671db559e9d374a3fd95 ]
>
> On powerpc64 ABIv1, we are enabling syscall tracing for only ~20
> syscalls. This is due to commit e145242ea0df6 ("syscalls/core,
> syscalls/x86: Clean up syscall stub naming convention") which has
> changed the syscall entry wrapper prefix from "SyS" to "__se_sys".
>
> Update the logic for ABIv1 to not just skip the initial dot, but also
> the "__se_sys" prefix.
>
> Fixes: commit e145242ea0df6 ("syscalls/core, syscalls/x86: Clean up
> syscall stub naming convention")
> Reported-by: Michael Ellerman <mpe(a)ellerman.id.au>
> Signed-off-by: Naveen N. Rao <naveen.n.rao(a)linux.vnet.ibm.com>
> Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
> Signed-off-by: Sasha Levin <alexander.levin(a)microsoft.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> ---
> arch/powerpc/include/asm/ftrace.h | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -65,13 +65,9 @@ struct dyn_arch_ftrace {
> #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
> static inline bool arch_syscall_match_sym_name(const char *sym, const char *name)
> {
> - /*
> - * Compare the symbol name with the system call name. Skip the .sys or .SyS
> - * prefix from the symbol name and the sys prefix from the system call name and
> - * just match the rest. This is only needed on ppc64 since symbol names on
> - * 32bit do not start with a period so the generic function will work.
> - */
> - return !strcmp(sym + 4, name + 3);
> + /* We need to skip past the initial dot, and the __se_sys alias */
> + return !strcmp(sym + 1, name) ||
> + (!strncmp(sym, ".__se_sys", 9) && !strcmp(sym + 6, name));
> }
> #endif
> #endif /* CONFIG_FTRACE_SYSCALLS && CONFIG_PPC64 && !__ASSEMBLY__ */
>
>
> Patches currently in stable-queue which might be from naveen.n.rao(a)linux.vnet.ibm.com are
>
> queue-3.18/powerpc-trace-syscalls-update-syscall-name-matching-logic.patch
>
>
The patch titled
Subject: x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
has been added to the -mm tree. Its filename is
x86-e820-put-e820_type_ram-regions-into-memblockreserved.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/x86-e820-put-e820_type_ram-regions…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/x86-e820-put-e820_type_ram-regions…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Subject: x86/e820: put !E820_TYPE_RAM regions into memblock.reserved
There is a kernel panic that is triggered when reading /proc/kpageflags on
the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
which changes how struct pages are initialized.
Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
default (no memmap= given) memblock layout is like below:
MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...
If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:
MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...
This causes shrinking node 0's pfn range because it is calculated by the
address range of memblock.memory. So some of struct pages in the gap
range are left uninitialized.
We have a function zero_resv_unavail() which does zeroing the struct pages
within the reserved unavailable range (i.e. memblock.memory &&
!memblock.reserved). This patch utilizes it to cover all unavailable
ranges by putting them into memblock.reserved.
Link: http://lkml.kernel.org/r/20180615072947.GB23273@hori1.linux.bs1.fc.nec.co.jp
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Tested-by: Oscar Salvador <osalvador(a)suse.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Cc: Steven Sistare <steven.sistare(a)oracle.com>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/kernel/e820.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff -puN arch/x86/kernel/e820.c~x86-e820-put-e820_type_ram-regions-into-memblockreserved arch/x86/kernel/e820.c
--- a/arch/x86/kernel/e820.c~x86-e820-put-e820_type_ram-regions-into-memblockreserved
+++ a/arch/x86/kernel/e820.c
@@ -1248,6 +1248,7 @@ void __init e820__memblock_setup(void)
{
int i;
u64 end;
+ u64 addr = 0;
/*
* The bootstrap memblock region count maximum is 128 entries
@@ -1264,13 +1265,21 @@ void __init e820__memblock_setup(void)
struct e820_entry *entry = &e820_table->entries[i];
end = entry->addr + entry->size;
+ if (addr < entry->addr)
+ memblock_reserve(addr, entry->addr - addr);
+ addr = end;
if (end != (resource_size_t)end)
continue;
+ /*
+ * all !E820_TYPE_RAM ranges (including gap ranges) are put
+ * into memblock.reserved to make sure that struct pages in
+ * such regions are not left uninitialized after bootup.
+ */
if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
- continue;
-
- memblock_add(entry->addr, entry->size);
+ memblock_reserve(entry->addr, entry->size);
+ else
+ memblock_add(entry->addr, entry->size);
}
/* Throw away partial pages: */
_
Patches currently in -mm which might be from n-horiguchi(a)ah.jp.nec.com are
x86-e820-put-e820_type_ram-regions-into-memblockreserved.patch