Hi all,
I have found a problem in recent kernels (6.9 and later) when running under the Xen hypervisor as a PV dom0. In my setup (PV dom0 with 4G RAM, using XFS), I shrink its memory to 256M via the balloon driver after boot:
xl mem-set Domain-0 256m
Once memory is reduced, even running a simple command like ls usually triggers the following warning:
[ 27.963562] [ T2553] page: refcount:88 mapcount:21 mapping:ffff88813ff6f6a8 index:0x110 pfn:0x10085c
[ 27.963564] [ T2553] head: order:2 mapcount:83 entire_mapcount:0 nr_pages_mapped:4 pincount:0
[ 27.963565] [ T2553] memcg:ffff888003573000
[ 27.963567] [ T2553] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6"
[ 27.963570] [ T2553] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2)
[ 27.963573] [ T2553] raw: 002000000000416c ffffea0004021e88 ffffea0004021908 ffff88813ff6f6a8
[ 27.963574] [ T2553] raw: 0000000000000110 ffff88811bf06b60 0000005800000014 ffff888003573000
[ 27.963576] [ T2553] head: 002000000000416c ffffea0004021e88 ffffea0004021908 ffff88813ff6f6a8
[ 27.963577] [ T2553] head: 0000000000000110 ffff88811bf06b60 0000005800000014 ffff888003573000
[ 27.963578] [ T2553] head: 0020000000000202 ffffea0004021701 0000000400000052 00000000ffffffff
[ 27.963580] [ T2553] head: 0000000300000003 8000000300000002 0000000000000014 0000000000000004
[ 27.963581] [ T2553] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio)
[ 27.963590] [ T2553] ------------[ cut here ]------------
[ 27.963591] [ T2553] WARNING: CPU: 1 PID: 2553 at include/linux/rmap.h:427 __folio_rmap_sanity_checks+0x1a0/0x270
[ 27.963596] [ T2553] CPU: 1 UID: 0 PID: 2553 Comm: ls Not tainted 6.15.0-rc4-00002-ge0363f942651 #270 PREEMPT(undef)
[ 27.963599] [ T2553] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-20240910_120124-localhost 04/01/2014
[ 27.963601] [ T2553] RIP: e030:__folio_rmap_sanity_checks+0x1a0/0x270
[ 27.963603] [ T2553] Code: 89 df e8 b3 7d fd ff 0f 0b e9 eb fe ff ff 48 8d 42 ff 48 39 c3 0f 84 0e ff ff ff 48 c7 c6 e8 49 5b 82 48 89 df e8 90 7d fd ff <0f> 0b e9 f8 fe ff ff 41 f7 c4 ff 0f 00 00 0f 85 b2 fe ff ff 49 8b
[ 27.963605] [ T2553] RSP: e02b:ffffc90040f8fac0 EFLAGS: 00010246
[ 27.963608] [ T2553] RAX: 00000000000000e5 RBX: ffffea0004021700 RCX: 0000000000000000
[ 27.963609] [ T2553] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[ 27.963610] [ T2553] RBP: ffffc90040f8fae0 R08: 00000000ffffdfff R09: ffffffff82929308
[ 27.963612] [ T2553] R10: ffffffff82879360 R11: 0000000000000002 R12: ffffea0004021700
[ 27.963613] [ T2553] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000005
[ 27.963625] [ T2553] FS: 00007ff06dafe740(0000) GS:ffff8880b7ccb000(0000) knlGS:0000000000000000
[ 27.963627] [ T2553] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.963628] [ T2553] CR2: 000055deb932f630 CR3: 0000000002844000 CR4: 0000000000050660
[ 27.963630] [ T2553] Call Trace:
[ 27.963632] [ T2553] <TASK>
[ 27.963633] [ T2553] folio_remove_rmap_ptes+0x24/0x2b0
[ 27.963635] [ T2553] unmap_page_range+0x132e/0x17e0
[ 27.963638] [ T2553] unmap_single_vma+0x81/0xd0
[ 27.963640] [ T2553] unmap_vmas+0xb5/0x180
[ 27.963642] [ T2553] exit_mmap+0x10c/0x460
[ 27.963644] [ T2553] mmput+0x59/0x120
[ 27.963647] [ T2553] do_exit+0x2d1/0xbd0
[ 27.963649] [ T2553] do_group_exit+0x2f/0x90
[ 27.963651] [ T2553] __x64_sys_exit_group+0x13/0x20
[ 27.963652] [ T2553] x64_sys_call+0x126b/0x1d70
[ 27.963655] [ T2553] do_syscall_64+0x54/0x120
[ 27.963657] [ T2553] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 27.963659] [ T2553] RIP: 0033:0x7ff06dbdbe9d
[ 27.963660] [ T2553] Code: Unable to access opcode bytes at 0x7ff06dbdbe73.
[ 27.963661] [ T2553] RSP: 002b:00007ffde44f9ac8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[ 27.963664] [ T2553] RAX: ffffffffffffffda RBX: 00007ff06dce4fa8 RCX: 00007ff06dbdbe9d
[ 27.963665] [ T2553] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000000
[ 27.963666] [ T2553] RBP: 0000000000000000 R08: 00007ffde44f9a70 R09: 00007ffde44f99ff
[ 27.963667] [ T2553] R10: 00007ffde44f9980 R11: 0000000000000206 R12: 00007ff06dce3680
[ 27.963668] [ T2553] R13: 00007ff06dd010e0 R14: 0000000000000002 R15: 00007ff06dce4fc0
[ 27.963670] [ T2553] </TASK>
[ 27.963671] [ T2553] ---[ end trace 0000000000000000 ]---
Later on, bigger problems start to appear:
[ 69.035059] [ T2593] BUG: Bad page map in process ls pte:10006c125 pmd:1003dc067
[ 69.035061] [ T2593] page: refcount:8 mapcount:20 mapping:ffff88813fd736a8 index:0x110 pfn:0x10006c
[ 69.035064] [ T2593] head: order:2 mapcount:-2 entire_mapcount:0 nr_pages_mapped:8388532 pincount:0
[ 69.035066] [ T2593] memcg:ffff888003573000
[ 69.035067] [ T2593] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6"
[ 69.035069] [ T2593] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2)
[ 69.035072] [ T2593] raw: 002000000000416c ffffea0004002348 ffffea0004001d08 ffff88813fd736a8
[ 69.035074] [ T2593] raw: 0000000000000110 ffff888100d19860 0000000800000013 ffff888003573000
[ 69.035076] [ T2593] head: 002000000000416c ffffea0004002348 ffffea0004001d08 ffff88813fd736a8
[ 69.035078] [ T2593] head: 0000000000000110 ffff888100d19860 0000000800000013 ffff888003573000
[ 69.035079] [ T2593] head: 0020000000000202 ffffea0004001b01 ffffffb4fffffffd 00000000ffffffff
[ 69.035081] [ T2593] head: 0000000300000003 0000000500000002 0000000000000013 0000000000000004
[ 69.035082] [ T2593] page dumped because: bad pte
[ 69.035083] [ T2593] addr:00007f6524b96000 vm_flags:00000075 anon_vma:0000000000000000 mapping:ffff88813fd736a8 index:110
[ 69.035086] [ T2593] file:libc.so.6 fault:xfs_filemap_fault mmap:xfs_file_mmap read_folio:xfs_vm_read_folio
The system eventually becomes unusable and typically crashes.
I was able to bisect this issue to the commit: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP").
If I understand it correctly, the folio from the first warning has 4 pages, but folio_pte_batch() incorrectly counts 5, because when the walk reaches the non-present entry both pte and expected_pte are zero-valued PTEs, pte_same() returns true for them, and the loop therefore does not break.
[ 27.963266] [ T2553] folio_pte_batch debug: printing PTE values starting at addr=0x7ff06dc11000
[ 27.963268] [ T2553] PTE[ 0] = 000000010085c125
[ 27.963272] [ T2553] PTE[ 1] = 000000010085d125
[ 27.963274] [ T2553] PTE[ 2] = 000000010085e125
[ 27.963276] [ T2553] PTE[ 3] = 000000010085f125
[ 27.963277] [ T2553] PTE[ 4] = 0000000000000000 <-- not present
[ 27.963279] [ T2553] PTE[ 5] = 0000000102e47125
[ 27.963281] [ T2553] PTE[ 6] = 0000000102e48125
[ 27.963283] [ T2553] PTE[ 7] = 0000000102e49125
As a consequence, zap_present_folio_ptes() is called with nr = 5, which in turn calls folio_remove_rmap_ptes(), where __folio_rmap_sanity_checks() emits the warning shown above.
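For reference, this is roughly how I read the relevant part of the loop in folio_pte_batch() in mm/internal.h. It is a heavily simplified sketch (the flag/batch-hint handling and the writable/young/dirty tracking are left out), not the actual kernel code:

	while (ptep < end_ptep) {
		pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);

		/*
		 * At PTE[4] in the dump above, pte is the zero-valued
		 * (non-present) entry and expected_pte has apparently also
		 * become zero, so pte_same() compares equal, the loop does
		 * not stop, and the batch grows to 5 pages for a 4-page
		 * folio.
		 */
		if (!pte_same(pte, expected_pte))
			break;

		expected_pte = pte_advance_pfn(expected_pte, 1);
		ptep++;
	}
	nr = ptep - start_ptep;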
The patch in the following message fixes the problem for me. It adds a check that breaks the loop when a non-present PTE is encountered, before the !pte_same() check.
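Schematically, relative to the sketch above, the new check would sit like this (just an illustration of the idea, not the literal diff):

		pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);

		/*
		 * A non-present PTE can never continue the batch, so stop
		 * here instead of relying on pte_same(), which also
		 * "matches" when both pte and expected_pte are zero.
		 */
		if (!pte_present(pte))
			break;

		if (!pte_same(pte, expected_pte))
			break;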
I still don't fully understand how the zero-valued PTE appears there, or why the issue is triggered only when the xen-balloon driver shrinks the memory; I wasn't able to reproduce it without shrinking.
Cheers, Petr
Petr Vaněk (1):
  mm: Fix folio_pte_batch() overcount with zero PTEs
 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)