If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can be triggered, e.g., on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (in fact, it is one of the main reasons sub-section hotadd of devmem was added).
The issue is that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online": pfn_to_online_page() will succeed, but the memmap contains garbage.
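For reference, a simplified sketch of the SPARSEMEM_VMEMMAP pfn_valid() logic (condensed from include/linux/mmzone.h of this era, not the verbatim code) shows why every pfn of an early section - even one past max_pfn - is reported valid:

/*
 * Simplified sketch of pfn_valid() with CONFIG_SPARSEMEM_VMEMMAP
 * (condensed, not verbatim). Early (boot) sections always have a
 * memmap for the complete section, so every pfn in the section is
 * reported valid - including pfns past max_pfn whose struct pages
 * were never initialized.
 */
static inline int pfn_valid(unsigned long pfn)
{
	struct mem_section *ms;

	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	ms = __nr_to_section(pfn_to_section_nr(pfn));
	if (!valid_section(ms))
		return 0;
	/* Only hotplugged sections track validity per sub-section. */
	return early_section(ms) || pfn_section_valid(ms, pfn);
}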
E.g., doing a "cat /proc/kpageflags > /dev/null" results in
[  303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
[  303.218899] #PF: supervisor read access in kernel mode
[  303.219344] #PF: error_code(0x0000) - not-present page
[  303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
[  303.220266] Oops: 0000 [#1] SMP NOPTI
[  303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
[  303.221169] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  303.222140] RIP: 0010:stable_page_flags+0x4d/0x410
[  303.222554] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  303.224135] RSP: 0018:ffff9f5980187e58 EFLAGS: 00010202
[  303.224576] RAX: fffffffffffffffe RBX: ffffda1285004000 RCX: ffff9f5980187dd4
[  303.225178] RDX: 0000000000000001 RSI: ffffffff92662420 RDI: 0000000000000246
[  303.225789] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  303.226405] R10: 0000000000000000 R11: 0000000000000000 R12: 00007f31d070e000
[  303.227012] R13: 0000000000140100 R14: 00007f31d070e800 R15: ffffda1285004000
[  303.227629] FS:  00007f31d08f6580(0000) GS:ffff90a6bba00000(0000) knlGS:0000000000000000
[  303.228329] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  303.228820] CR2: fffffffffffffffe CR3: 00000001332a2000 CR4: 00000000000006f0
[  303.229438] Call Trace:
[  303.229654]  kpageflags_read.cold+0x57/0xf0
[  303.230016]  proc_reg_read+0x3c/0x60
[  303.230332]  vfs_read+0xc2/0x170
[  303.230614]  ksys_read+0x65/0xe0
[  303.230898]  do_syscall_64+0x5c/0xa0
[  303.231216]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
This patch fixes that by at least zeroing out that memmap (so, e.g., page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages") tried to fix a similar issue, but forgot to consider this special case.
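For context, zero_pfn_range() in mm/page_alloc.c is roughly the following (simplified sketch, not verbatim): it walks the pfn range, skips pageblocks that have no memmap at all, and zeroes every struct page it finds:

/*
 * Roughly what zero_pfn_range() does around this kernel version
 * (simplified): zero every struct page in [spfn, epfn) that actually
 * has a memmap, returning the number of pages touched.
 */
static u64 zero_pfn_range(unsigned long spfn, unsigned long epfn)
{
	unsigned long pfn;
	u64 pgcnt = 0;

	for (pfn = spfn; pfn < epfn; pfn++) {
		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
			/* No memmap here - skip the rest of this pageblock. */
			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
				+ pageblock_nr_pages - 1;
			continue;
		}
		mm_zero_struct_page(pfn_to_page(pfn));
		pgcnt++;
	}

	return pgcnt;
}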
After this patch, there are still problems to solve: e.g., not all pages falling into such a memory hole will actually get initialized later and have PageReserved set - they are only zeroed out - but at least the immediate crashes are gone. A follow-up patch will take care of this.
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Cc: stable@vger.kernel.org # v4.15+
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/page_alloc.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62dcd6b76c80..1eb2ce7c79e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6932,7 +6932,8 @@ static u64 zero_pfn_range(unsigned long spfn, unsigned long epfn)
  * This function also addresses a similar issue where struct pages are left
  * uninitialized because the physical address range is not covered by
  * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=.
+ * layout is manually configured via memmap=, or when the highest physical
+ * address (max_pfn) does not end on a section boundary.
  */
 void __init zero_resv_unavail(void)
 {
@@ -6950,7 +6951,16 @@ void __init zero_resv_unavail(void)
 			pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
 		next = end;
 	}
-	pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);
+
+	/*
+	 * Early sections always have a fully populated memmap for the whole
+	 * section - see pfn_valid(). If the last section has holes at the
+	 * end and that section is marked "online", the memmap will be
+	 * considered initialized. Make sure that memmap has a well defined
+	 * state.
+	 */
+	pgcnt += zero_pfn_range(PFN_DOWN(next),
+				round_up(max_pfn, PAGES_PER_SECTION));
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
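To make the new upper bound concrete, here is a minimal userspace sketch of the arithmetic with hypothetical values (4160MB of 4KB pages, ignoring e820 holes); the round_up() macro mirrors the kernel's power-of-two rounding:

#include <stdio.h>

/* Mirrors the kernel's round_up() for a power-of-two step. */
#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	/* Hypothetical values: 4160MB of 4KB pages, ignoring e820 holes. */
	unsigned long long max_pfn = 4160ULL * 1024 * 1024 / 4096;  /* 1064960 */
	unsigned long long pages_per_section = 32768;               /* 128MB / 4KB */

	printf("max_pfn     = %llu\n", max_pfn);
	printf("section end = %llu\n", round_up(max_pfn, pages_per_section));
	/*
	 * Prints 1081344: the memmap for pfns [1064960, 1081344) - the
	 * second half of the last 128MB section - now gets zeroed instead
	 * of being left uninitialized.
	 */
	return 0;
}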
Hi David,
On Mon, Dec 09, 2019 at 06:48:34PM +0100, David Hildenbrand wrote:
> If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can be triggered, e.g., on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (in fact, it is one of the main reasons sub-section hotadd of devmem was added).
>
> The issue is that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online": pfn_to_online_page() will succeed, but the memmap contains garbage.
>
> E.g., doing a "cat /proc/kpageflags > /dev/null" results in
>
> [  303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
> [  303.218899] #PF: supervisor read access in kernel mode
> [  303.219344] #PF: error_code(0x0000) - not-present page
> [  303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
> [  303.220266] Oops: 0000 [#1] SMP NOPTI
> [  303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
I can't reproduce this on x86-64 qemu, next-20191128 or mainline, with either memory size. What config are you using? How often are you hitting it?
It may not have anything to do with the config, and I may be getting lucky with the garbage in my memory.
On 09.12.19 22:15, Daniel Jordan wrote:
> Hi David,
>
> On Mon, Dec 09, 2019 at 06:48:34PM +0100, David Hildenbrand wrote:
>> If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can be triggered, e.g., on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (in fact, it is one of the main reasons sub-section hotadd of devmem was added).
>>
>> The issue is that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online": pfn_to_online_page() will succeed, but the memmap contains garbage.
>>
>> E.g., doing a "cat /proc/kpageflags > /dev/null" results in
>>
>> [  303.218313] BUG: unable to handle page fault for address: fffffffffffffffe
>> [  303.218899] #PF: supervisor read access in kernel mode
>> [  303.219344] #PF: error_code(0x0000) - not-present page
>> [  303.219787] PGD 12614067 P4D 12614067 PUD 12616067 PMD 0
>> [  303.220266] Oops: 0000 [#1] SMP NOPTI
>> [  303.220587] CPU: 0 PID: 424 Comm: cat Not tainted 5.4.0-next-20191128+ #17
Hi Daniel,
> I can't reproduce this on x86-64 qemu, next-20191128 or mainline, with either memory size. What config are you using? How often are you hitting it?
Thanks for verifying! Hah, there is one piece missing to reproduce via "cat /proc/kpageflags > /dev/null" that I had ignored on my QEMU cmdline (see below).
I can reproduce it reliably (QEMU with "-m 4160M") via
[root@localhost ~]# uname -a
Linux localhost 5.5.0-rc1-next-20191209 #93 SMP Tue Dec 10 10:46:19 CET 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# ./page-types -r -a 0x144001
[  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[  200.477500] #PF: supervisor read access in kernel mode
[  200.478334] #PF: error_code(0x0000) - not-present page
[  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[  200.479557] Oops: 0000 [#4] SMP NOPTI
[  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
[  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[  200.488897] Call Trace:
[  200.489115]  kpageflags_read+0xe9/0x140
[  200.489447]  proc_reg_read+0x3c/0x60
[  200.489755]  vfs_read+0xc2/0x170
[  200.490037]  ksys_pread64+0x65/0xa0
[  200.490352]  do_syscall_64+0x5c/0xa0
[  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
(tool located in tools/vm/page-types.c, see also patch #2)
To reproduce via "cat /proc/kpageflags > /dev/null", you have to hot/coldplug one DIMM to move max_pfn beyond the garbage memmap (see also patch #2). My QEMU cmdline with Fedora 31:
qemu-system-x86_64 \
    --enable-kvm \
    -m 4160M,slots=4,maxmem=8G \
    -hda Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \
    -machine pc \
    -nographic \
    -nodefaults \
    -chardev stdio,id=serial,signal=off \
    -device isa-serial,chardev=serial \
    -object memory-backend-ram,id=mem0,size=1024M \
    -device pc-dimm,id=dimm0,memdev=mem0
[root@localhost ~]# uname -a
Linux localhost 5.3.7-301.fc31.x86_64 #1 SMP Mon Oct 21 19:18:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# cat /proc/kpageflags > /dev/null
[  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[  111.517907] #PF: supervisor read access in kernel mode
[  111.518333] #PF: error_code(0x0000) - not-present page
[  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
> It may not have anything to do with the config, and I may be getting lucky with the garbage in my memory.
Some things that might be relevant from my config.
# CONFIG_PAGE_POISONING is not set
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
The F31 default config should make it trigger.
Will update this patch description - thanks!
...
On Tue, Dec 10, 2019 at 11:11:03AM +0100, David Hildenbrand wrote:
> Some things that might be relevant from my config.
>
> # CONFIG_PAGE_POISONING is not set
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
> CONFIG_SPARSEMEM_EXTREME=y
> CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
> CONFIG_SPARSEMEM_VMEMMAP=y
> CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
> CONFIG_MEMORY_HOTPLUG=y
> CONFIG_MEMORY_HOTPLUG_SPARSE=y
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y
Thanks for all that. After some poking around, it turns out that enabling DEBUG_VM, with its page poisoning, let me hit it right away, which makes me wonder how often someone would see this without it.
Anyway, fix looks good to me.
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>