The patch titled
Subject: mm: zero remaining unavailable struct pages
has been added to the -mm tree. Its filename is
mm-zero-remaining-unavailable-struct-pages-re-kernel-panic-in-reading-proc-kpageflags-when-enabling-ram-simulated-pmem.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-zero-remaining-unavailable-stru…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-zero-remaining-unavailable-stru…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Subject: mm: zero remaining unavailable struct pages
There is a kernel panic that is triggered when reading /proc/kpageflags on
the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
BUG: unable to handle kernel paging request at fffffffffffffffe
PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
RIP: 0010:stable_page_flags+0x27/0x3c0
Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
Call Trace:
kpageflags_read+0xc7/0x120
proc_reg_read+0x3c/0x60
__vfs_read+0x36/0x170
vfs_read+0x89/0x130
ksys_pread64+0x71/0x90
do_syscall_64+0x5b/0x160
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7efc42e75e23
Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
According to kernel bisection, this problem became visible due to commit
f7f99100d8d9 which changes how struct pages are initialized.
Memblock layout affects the pfn ranges covered by node/zone. Consider
that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
the default (no memmap= given) memblock layout is like below:
MEMBLOCK configuration:
memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x4
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...
If you give memmap=1G!4G (so it just covers memory[0x2]),
the range [0x100000000-0x13fffffff] is gone:
MEMBLOCK configuration:
memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
memory.cnt = 0x3
memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
...
This causes shrinking node 0's pfn range because it is calculated by
the address range of memblock.memory. So some of struct pages in the
gap range are left uninitialized.
We have a function zero_resv_unavail() which does zeroing the struct
pages outside memblock.memory, but currently it covers only the reserved
unavailable range (i.e. memblock.memory && !memblock.reserved).
This patch extends it to cover all unavailable range, which fixes
the reported issue.
Link: http://lkml.kernel.org/r/20180613054107.GA5329@hori1.linux.bs1.fc.nec.co.jp
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Tested-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Cc: Steven Sistare <steven.sistare(a)oracle.com>
Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com>
Cc: "Bob Picco" <bob.picco(a)oracle.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
diff -puN include/linux/memblock.h~mm-zero-remaining-unavailable-struct-pages-re-kernel-panic-in-reading-proc-kpageflags-when-enabling-ram-simulated-pmem include/linux/memblock.h
--- a/include/linux/memblock.h~mm-zero-remaining-unavailable-struct-pages-re-kernel-panic-in-reading-proc-kpageflags-when-enabling-ram-simulated-pmem
+++ a/include/linux/memblock.h
@@ -236,22 +236,6 @@ void __next_mem_pfn_range(int *idx, int
for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \
nid, flags, p_start, p_end, p_nid)
-/**
- * for_each_resv_unavail_range - iterate through reserved and unavailable memory
- * @i: u64 used as loop variable
- * @flags: pick from blocks based on memory attributes
- * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
- * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
- *
- * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
- * Available as soon as memblock is initialized.
- * Note: because this memory does not belong to any physical node, flags and
- * nid arguments do not make sense and thus not exported as arguments.
- */
-#define for_each_resv_unavail_range(i, p_start, p_end) \
- for_each_mem_range(i, &memblock.reserved, &memblock.memory, \
- NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
-
static inline void memblock_set_region_flags(struct memblock_region *r,
unsigned long flags)
{
diff -puN mm/page_alloc.c~mm-zero-remaining-unavailable-struct-pages-re-kernel-panic-in-reading-proc-kpageflags-when-enabling-ram-simulated-pmem mm/page_alloc.c
--- a/mm/page_alloc.c~mm-zero-remaining-unavailable-struct-pages-re-kernel-panic-in-reading-proc-kpageflags-when-enabling-ram-simulated-pmem
+++ a/mm/page_alloc.c
@@ -6390,25 +6390,40 @@ void __paginginit free_area_init_node(in
* struct pages which are reserved in memblock allocator and their fields
* may be accessed (for example page_to_pfn() on some configuration accesses
* flags). We must explicitly zero those struct pages.
+ *
+ * This function also addresses a similar issue where struct pages are left
+ * uninitialized because the physical address range is not covered by
+ * memblock.memory or memblock.reserved. That could happen when memblock
+ * layout is manually configured via memmap=.
*/
void __paginginit zero_resv_unavail(void)
{
phys_addr_t start, end;
unsigned long pfn;
u64 i, pgcnt;
+ phys_addr_t next = 0;
/*
- * Loop through ranges that are reserved, but do not have reported
- * physical memory backing.
+ * Loop through unavailable ranges not covered by memblock.memory.
*/
pgcnt = 0;
- for_each_resv_unavail_range(i, &start, &end) {
- for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
- if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
- continue;
- mm_zero_struct_page(pfn_to_page(pfn));
- pgcnt++;
+ for_each_mem_range(i, &memblock.memory, NULL,
+ NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
+ if (next < start) {
+ for (pfn = PFN_DOWN(next); pfn < PFN_UP(start); pfn++) {
+ if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
+ continue;
+ mm_zero_struct_page(pfn_to_page(pfn));
+ pgcnt++;
+ }
}
+ next = end;
+ }
+ for (pfn = PFN_DOWN(next); pfn < max_pfn; pfn++) {
+ if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
+ continue;
+ mm_zero_struct_page(pfn_to_page(pfn));
+ pgcnt++;
}
/*
@@ -6419,7 +6434,7 @@ void __paginginit zero_resv_unavail(void
* this code can be removed.
*/
if (pgcnt)
- pr_info("Reserved but unavailable: %lld pages", pgcnt);
+ pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
}
#endif /* CONFIG_HAVE_MEMBLOCK */
_
Patches currently in -mm which might be from n-horiguchi(a)ah.jp.nec.com are
mm-zero-remaining-unavailable-struct-pages-re-kernel-panic-in-reading-proc-kpageflags-when-enabling-ram-simulated-pmem.patch
Hello,
Syzkaller has reported a crash here[1] for a slab OOB read in pfkey_add.
Could the following patch be applied to stable kernels for 4.14, 4.4, 3.18, 3.14, 3.10 and 3.8?
4b66af2d("af_key: Always verify length of provided sadb_key")
[1] https://syzkaller.appspot.com/bug?id=26cb120b31cd24d984fc16da67f50fb375c432…
Thanks,
- Zubin
This is the start of the stable review cycle for the 3.18.113 release.
There are 21 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu Jun 14 16:48:15 UTC 2018.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.113-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-3.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 3.18.113-rc1
Eric Dumazet <edumazet(a)google.com>
rtnetlink: validate attributes in do_setlink()
Dan Carpenter <dan.carpenter(a)oracle.com>
team: use netdev_features_t instead of u32
Jack Morgenstein <jackm(a)dev.mellanox.co.il>
net/mlx4: Fix irq-unsafe spinlock usage
Daniele Palmas <dnlplm(a)gmail.com>
net: usb: cdc_mbim: add flag FLAG_SEND_ZLP
Eric Dumazet <edumazet(a)google.com>
net/packet: refine check for priv area size
Wenwen Wang <wang6495(a)umn.edu>
isdn: eicon: fix a missing-check bug
Sabrina Dubroca <sd(a)queasysnail.net>
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
Govindarajulu Varadarajan <gvaradar(a)cisco.com>
enic: set DMA mask to 47 bit
Alexey Kodanev <alexey.kodanev(a)oracle.com>
dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()
Julia Lawall <Julia.Lawall(a)lip6.fr>
bnx2x: use the right constant
Dave Airlie <airlied(a)redhat.com>
drm: set FMODE_UNSIGNED_OFFSET for drm files
Linus Torvalds <torvalds(a)linux-foundation.org>
mmap: relax file size limit for regular files
Linus Torvalds <torvalds(a)linux-foundation.org>
mmap: introduce sane default mmap limits
Hugh Dickins <hughd(a)google.com>
mm: fix the NULL mapping case in __isolate_lru_page()
Al Viro <viro(a)zeniv.linux.org.uk>
fix io_destroy()/aio_complete() race
Ondrej Zary <linux(a)rainbow-software.org>
drm/i915: Disable LVDS on Radiant P845
Maciej W. Rozycki <macro(a)mips.com>
MIPS: ptrace: Fix PTRACE_PEEKUSR requests for 64-bit FGRs
Eric Dumazet <edumazet(a)google.com>
tcp: avoid integer overflows in tcp_rcv_space_adjust()
Eric Biggers <ebiggers(a)google.com>
cfg80211: further limit wiphy names to 64 bytes
Sachin Grover <sgrover(a)codeaurora.org>
selinux: KASAN: slab-out-of-bounds in xattr_getsecurity
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tracing: Fix crash when freeing instances with event triggers
-------------
Diffstat:
Makefile | 4 +--
arch/mips/kernel/ptrace.c | 2 +-
arch/mips/kernel/ptrace32.c | 2 +-
drivers/gpu/drm/drm_fops.c | 1 +
drivers/gpu/drm/i915/intel_lvds.c | 8 ++++++
drivers/isdn/hardware/eicon/diva.c | 22 ++++++++++------
drivers/isdn/hardware/eicon/diva.h | 5 ++--
drivers/isdn/hardware/eicon/divasmain.c | 18 +++++++------
drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 2 +-
drivers/net/ethernet/cisco/enic/enic_main.c | 8 +++---
drivers/net/ethernet/mellanox/mlx4/qp.c | 4 +--
drivers/net/team/team.c | 3 ++-
drivers/net/usb/cdc_mbim.c | 2 +-
fs/aio.c | 3 +--
include/linux/tcp.h | 2 +-
include/uapi/linux/nl80211.h | 2 +-
kernel/trace/trace_events_trigger.c | 5 ++--
mm/mmap.c | 32 ++++++++++++++++++++++++
mm/vmscan.c | 2 +-
net/core/rtnetlink.c | 8 +++---
net/dccp/proto.c | 2 --
net/ipv4/tcp_input.c | 10 +++++---
net/ipv6/ip6mr.c | 3 ++-
net/packet/af_packet.c | 2 +-
security/selinux/ss/services.c | 2 +-
25 files changed, 105 insertions(+), 49 deletions(-)
Hello stable kernel maintainers,
Please backport patch 327ea4adcfa3 ("blkdev_report_zones_ioctl():
Use vmalloc() to allocate large buffers") to at least the v4.17.x and
v4.14.y stable kernel series. That patch fixes a bug introduced in
kernel v4.10. The entire patch is shown below.
Thanks,
Bart.
commit cf0110698846fc5a93df89eb20ac7cc70a860c17
Author: Bart Van Assche <bart.vanassche(a)wdc.com>
Date: Tue May 22 08:27:22 2018 -0700
blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
Avoid that complaints similar to the following appear in the kernel log
if the number of zones is sufficiently large:
fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
Call Trace:
dump_stack+0x63/0x88
warn_alloc+0xf5/0x190
__alloc_pages_slowpath+0x8f0/0xb0d
__alloc_pages_nodemask+0x242/0x260
alloc_pages_current+0x6a/0xb0
kmalloc_order+0x18/0x50
kmalloc_order_trace+0x26/0xb0
__kmalloc+0x20e/0x220
blkdev_report_zones_ioctl+0xa5/0x1a0
blkdev_ioctl+0x1ba/0x930
block_ioctl+0x41/0x50
do_vfs_ioctl+0xaa/0x610
SyS_ioctl+0x79/0x90
do_syscall_64+0x79/0x1b0
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche <bart.vanassche(a)wdc.com>
Cc: Shaun Tancheff <shaun.tancheff(a)seagate.com>
Cc: Damien Le Moal <damien.lemoal(a)hgst.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Martin K. Petersen <martin.petersen(a)oracle.com>
Cc: Hannes Reinecke <hare(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 08e84ef2bc05..3d08dc84db16 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -328,7 +328,11 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
if (!rep.nr_zones)
return -EINVAL;
- zones = kcalloc(rep.nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+ if (rep.nr_zones > INT_MAX / sizeof(struct blk_zone))
+ return -ERANGE;
+
+ zones = kvmalloc(rep.nr_zones * sizeof(struct blk_zone),
+ GFP_KERNEL | __GFP_ZERO);
if (!zones)
return -ENOMEM;
@@ -350,7 +354,7 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
}
out:
- kfree(zones);
+ kvfree(zones);
return ret;
}
Hi,
On the latest 4.9 stable active-passive bonding does not always
failover to the passive slave when carrier is lost on the active
slave. It seems that the issue stems from the backport of
c4adfc822bf5d8e97660b6114b5a8892530ce8cb, bonding: make speed, duplex
setting consistent with link state. There were subsequent patches
which resolved issues with the change to bond_update_speed_duplex
which were not backported. The three commits which seem to resolve the
issue are b5bf0f5b16b9c316c34df9f31d4be8729eb86845,
3f3c278c94dd994fe0d9f21679ae19b9c0a55292 and
ad729bc9acfb7c47112964b4877ef5404578ed13. There are other commits in
mainline which also revolve around
c4adfc822bf5d8e97660b6114b5a8892530ce8cb but are not necessary to
resolving the active-passive failover problems.
Would it be possible to queue up the three commits for backporting to
4.9 stable:
b5bf0f5b16b9c316c34df9f31d4be8729eb86845 bonding: correctly update
link status during mii-commit
3f3c278c94dd994fe0d9f21679ae19b9c0a55292 bonding: fix active-backup transition
ad729bc9acfb7c47112964b4877ef5404578ed13 bonding: require speed/duplex
only for 802.3ad, alb and tlb
All of those commits apply cleanly to 4.9.107.
Thanks,
-nate