On Thu, Mar 15, 2018 at 12:50 PM, Michal Hocko mhocko@kernel.org wrote:
On Thu 15-03-18 02:30:41, Daniel Vacek wrote:
On Wed, Mar 14, 2018 at 3:17 PM, Michal Hocko mhocko@kernel.org wrote:
On Tue 13-03-18 23:42:40, Daniel Vacek wrote:
On some architectures (reported on arm64) commit 864b75f9d6b01 ("mm/page_alloc: fix memmap_init_zone pageblock alignment") causes a boot hang. This patch fixes the hang making sure the alignment never steps back.
I am sorry to be complaining again, but the code is so obscure that I
No worries, I'm glad for any review. Which code exactly you do find obscure? This patch or my former fix or the original commit introducing memblock_next_valid_pfn()? Coz I'd agree the original commit looks pretty obscure...
As mentioned in the other email, the whole going back and forth in the same loop is just too ugly to live.
It's not really supposed to go back, but I guess you understand.
would _really_ appreciate some more information about what is going on here. memblock_next_valid_pfn will most likely return a pfn within the same memblock and the alignment will move it before the old pfn which is not valid - so the block has some holes. Is that correct?
I do not understand what you mean by 'pfn within the same memblock'?
Sorry, I should have said in the same pageblock
And by 'the block has some holes'?
memblock_next_valid_pfn clearly returns pfn which is within a pageblock and that is why we do not initialize pages in the begining of the block while move_freepages_block does really expect the full pageblock to be initialized properly. That is the fundamental problem, right?
Yes, that's correct.
memblock has types 'memory' (as usable memory) and 'reserved' (for unusable mem), if I understand correctly.
We might not have struct pages for invalid pfns. That really depends on the memory mode. Sure sparse mem model will usually allocate struct pages for whole memory sections but that is not universally true and adding such a suble assumption is simply wrong.
This is gray area for me. But if I understand correctly this assumption comes from the code. It was already there and got broken hence I was trying to keep it. If anything needs redesigning I'm all for it. But I was just calming the fire here. I only didn't test on arm, which seems to be the only one different.
I suspect you are making strong assumptions based on a very specific implementation which might be not true in general. That was the feeling I've had since the patch was proposed for the first time. This is such a cluttered area that I am not really sure myself, thoug.
I understand. And again this is likely correct. I'll be glad for any assistance here. My limited knowledge is the primary cause for lack of relevant details I guess. What I checked looks like pfn_valid is a generic function used by all arches but arm, which seems to be the only one to implement CONFIG_HAVE_ARCH_PFN_VALID if I didn't miss anything. So if this config is enabled on arm, it uses it's own version of pfn_valid(). If not, I'd expect all arches behave the same. That's where my assumption comes from.
-- Michal Hocko SUSE Labs