Hi Linus,
On Fri, May 19, 2023 at 10:34 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
On Fri, May 19, 2023 at 3:52 PM Joel Fernandes <joel@joelfernandes.org> wrote:
I *suspect* that the test is literally just for the stack movement case by execve, where it catches the case where we're doing the movement entirely within the one vma we set up.
Yes that's right, the test is only for the stack movement case. For the regular mremap case, I don't think there is a way for it to trigger.
So I feel the test is simply redundant.
For the regular mremap case, it never triggers.
Unfortunately, I just found that mremap-ing a range purely within a VMA can actually cause the old and new VMA passed to move_page_tables() to be the same.
I added a printk to the beginning of move_page_tables() that prints all the args:

	printk("move_page_tables(vma=(%lx,%lx), old_addr=%lx, new_vma=(%lx,%lx), new_addr=%lx, len=%lx)\n",
	       vma->vm_start, vma->vm_end, old_addr,
	       new_vma->vm_start, new_vma->vm_end, new_addr, len);
Then I wrote a simple test to move 1MB purely within a 10MB range and I found on running the test that the old and new vma passed to move_page_tables() are exactly the same.
[ 19.697596] move_page_tables(vma=(7f1f985f7000,7f1f98ff7000), old_addr=7f1f987f7000, new_vma=(7f1f985f7000,7f1f98ff7000), new_addr=7f1f98af7000, len=100000)
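A userspace test along these lines could look like the sketch below. It is not the actual test that produced the log above; the helper name move_within_mapping() and the 2MB/5MB offsets are assumptions chosen to match the addresses in the printk output (old_addr = start + 2MB, new_addr = start + 5MB, len = 1MB). It only checks that the data arrives at the destination. Whether the kernel passes the same VMA as old and new has to be observed on the kernel side, e.g. with the printk.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define MB (1024UL * 1024UL)

/*
 * Map 10MB of anonymous memory, then move 1MB from offset 2MB to
 * offset 5MB inside that same mapping.
 * Returns 0 if the data arrived intact, -1 on any failure.
 */
static int move_within_mapping(void)
{
	unsigned char *base, *src, *dst;
	void *ret;
	size_t i;

	base = mmap(NULL, 10 * MB, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	src = base + 2 * MB;
	dst = base + 5 * MB;
	memset(src, 0xab, MB);

	/* MREMAP_FIXED must be combined with MREMAP_MAYMOVE. */
	ret = mremap(src, MB, MB, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (ret == MAP_FAILED) {
		munmap(base, 10 * MB);
		return -1;
	}
	/* With MREMAP_FIXED, success means the range landed exactly at dst. */
	assert(ret == (void *)dst);

	for (i = 0; i < MB; i++) {
		if (dst[i] != 0xab) {
			munmap(base, 10 * MB);
			return -1;
		}
	}

	/* The source range is now a hole; munmap() over the span is still fine. */
	munmap(base, 10 * MB);
	return 0;
}
```

Calling move_within_mapping() from main() and checking for a zero return is enough to trigger the within-mapping move.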
That is a bit counterintuitive, as I really thought we'd be splitting the VMAs with such a move. Any idea what I am missing?
Also, such a use case will break with my patch, as we may accidentally overwrite parts of a range that were not part of the mremap request. Maybe I should just turn off the optimization if vma == new_vma. However, that would also turn it off for the stack move, so another option is to special-case stack moves in move_page_tables().
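To put numbers on that concern, here is a small userspace sketch of the alignment arithmetic using the addresses from the log above. It assumes a 2MB PMD size (x86-64 with 4K pages); TOY_PMD_SIZE, TOY_PMD_MASK and both helpers are hypothetical names for illustration, not kernel code.

```c
#include <stdint.h>

/* Assumption: 2MB PMD size, as on x86-64 with 4K pages. */
#define TOY_PMD_SIZE	(UINT64_C(1) << 21)
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

/* Bytes between the PMD-aligned-down address and the requested address. */
static uint64_t extra_bytes_below(uint64_t addr)
{
	return addr - (addr & TOY_PMD_MASK);
}

/*
 * Nonzero if the aligned-down address is still at or above vm_start,
 * i.e. the extra bytes belong to the very VMA being moved from.
 */
static int aligned_start_inside_vma(uint64_t vm_start, uint64_t addr)
{
	return vm_start <= (addr & TOY_PMD_MASK);
}
```

For old_addr=7f1f987f7000 the aligned-down address is 7f1f98600000, which is above vm_start=7f1f985f7000. So the "masked address is within vma" check passes, yet the 0x1f7000 bytes below old_addr are live parts of the same mapping that the mremap request never asked to move.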
So this means I have to go back to the drawing board a bit on this patch, and also add more tests in mremap_test.c to test such within-VMA moving. I believe there are no such existing tests... More work to do for me. :-)
And for the stack movement case by execve, I don't think it matters if you just were to change the logic of the subsequent checks a bit.
In particular, you do this:
	/* If the masked address is within vma, there is no prev mapping of concern. */
	if (vma->vm_start <= addr_masked)
		return false;

	/*
	 * Attempt to find vma before prev that contains the address.
	 * On any issue, assume the address is within a previous mapping.
	 * @mmap write lock is held here, so the lookup is safe.
	 */
	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
	if (!cur || cur != vma || !prev)
		return true;

	/* The masked address fell within a previous mapping. */
	if (prev->vm_end > addr_masked)
		return true;

	return false;
And I think that
	if (!cur || cur != vma || !prev)
		return true;
is actively wrong, because if there is no 'prev', then you should return false.
During my tests, I observed that there was always an existing, unrelated memory mapping present prior to the new memory region allocated by mmap. Based on this observation, I concluded that if there is no previous mapping (i.e., if prev is NULL), it indicates a potential issue with find_vma_prev(). Therefore, I designed this function to return here indicating that the masked address is not suitable for optimization, whenever prev is NULL.
That's obviously confusing so I'll try to rewrite this part of the patch a bit better with appropriate comments.
So I *think* all of the above could just be replaced with this instead:
	find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
	return prev && prev->vm_end > addr_masked;
because only if we have a 'prev', and the prev is into that masked address, do we need to avoid doing the masking.
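The behaviour of that simplified check can be modelled in userspace as below. This is only a sketch: struct toy_vma and prev_covers_masked() are hypothetical stand-ins, and the real code would operate on struct vm_area_struct with prev filled in by find_vma_prev().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two vm_area_struct fields used here. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;	/* exclusive, as in the kernel */
};

/*
 * True only if a previous mapping exists and extends past the masked
 * address, in which case aligning down would step into that mapping.
 */
static bool prev_covers_masked(const struct toy_vma *prev,
			       unsigned long addr_masked)
{
	return prev && prev->vm_end > addr_masked;
}
```

With no prev at all the function returns false, which is the case the original "!prev" test treated as true. Because vm_end is exclusive, a prev that ends exactly at the masked address does not count as overlapping.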
With that simplified test, do you even care about that whole "the masked address was already in the vma"? Not that I can see.
And we don't even care about the return value of 'find_vma_prev()', because it had better be 'vma'. We're giving it 'vma->vm_start' as an address, for chrissake!
So if you *really* wanted to, you could do something like
	cur = find_vma_prev(..);
	if (WARN_ON_ONCE(cur != vma))
		return true;
but even that WARN_ON_ONCE() seems pretty bogus. If it triggers, we have some serious corruption going on.
So I still find that whole "vma->vm_start <= addr_masked" test a bit confusing, since it seems entirely redundant.
Is it just because you wanted to avoid calling "find_vma_prev()" at all? Maybe just say that in the comment.
Yes exactly, I did not want to run find_vma_prev() unnecessarily. I will add such clarifications in the comments.
Thanks for all the comments so far, I will continue to work on this.
- Joel