It is hard to follow your reply as your email client is not quoting properly. Let me try to reconstruct
On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
On 02.11.21 08:47, Michal Hocko wrote:
[...]
CPU2 has been hot-added BUG: unable to handle page fault for address: 0000000000001608 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 1 Comm: systemd Tainted: G E 5.15.0-rc7+ #11 Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
RIP: 0010:__alloc_pages+0x127/0x290
Could you resolve this into a specific line of the source code please?
This got probably unnoticed. I would be really curious whether this is a broken zonelist or something else.
backtrace (including inline functions) pcpu_alloc_pages() alloc_pages_node() __alloc_pages_node() __alloc_pages() prepare_alloc_pages() node_zonelist()
Panic happens in node_zonelist(), dereferencing NULL pointer of NODE_DATA(nid) in include/linux/gfp.h:514 512 static inline struct zonelist *node_zonelist(int nid, gfp_t flags) 513 { 514 return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags); 515 }
Node can be in one of the following states:
- not present (nid == NUMA_NO_NODE)
- present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0, NODE_DATA(nid) == NULL)
- present and online (nid > NUMA_NO_NODE, node_online(nid) > 0, NODE_DATA(nid) != NULL)
alloc_page_{bulk_array}node() functions verify for nid validity only and do not check if nid is online. Enhanced verification check allows to handle page allocation when node is in 2nd state.
I do not think this is a correct approach. We should make sure that the proper fallback node is used instead. This means that the zone list is initialized properly. IIRC this has been a problem in the past and it has been fixed. The initialization code is quite subtle though so it is possible that this got broken again.
This approach behaves in the same way as CPU was not yet added. (state #1). So, we can think of state #2 as state #1 when CPU is not present.
I'm a little confused:
In add_memory_resource() we hotplug the new node if required and set it online. Memory might get onlined later, via online_pages().
You are correct. In case of memory hot add, it is true. But in case of adding CPU with memoryless node, try_node_online() will be called only during CPU onlining, see cpu_up().
Is there any reason why try_online_node() resides in cpu_up() and not in add_cpu()? I think it would be correct to online node during the CPU hot add to align with memory hot add.
I am not familiar with cpu hotplug, but this doesn't seem to be anything new so how come this became problem only now?
This is not CPU only hotplug, but CPU + NUMA node, and this new node is with no memory. We accidentally found it by not unlining the CPU immediately.
So after add_memory_resource()->__try_online_node() succeeded, we have an online pgdat -- essentially 3.
This patch detects if we're past 3. but says that it reproduced by disabling *memory* onlining. This is the hot adding of both new CPU and new _memoryless_ node (with CPU only) And onlining CPU makes its node online. Disabling CPU onlining puts new node into state #2, which leads to repro.
Before we online memory for a hotplugged node, all zones are !populated. So once we online memory for a !populated zone in online_pages(), we trigger setup_zone_pageset().
The confusing part is that this patch checks for 3. but says it can be reproduced by not onlining *memory*. There seems to be something missing.
Do we maybe need a proper populated_zone() check before accessing zone data?
No, we need them initialize properly.
Thanks, —Alexey