mkfs -t ext4 invokes the oom-killer on an i386 kernel running on an x86_64 device. This started happening on the linux-next master branch, at kernel tags next-20200430 and next-20200501. We did not bisect this problem.
metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
  git describe: next-20200430
  make_kernelversion: 5.7.0-rc3
  kernel-config: https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
Steps to reproduce (always reproducible):
---------------------------
mkfs -t ext4 <external-STORAGE_DEV>
Test log: ------------ + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 05e8451c-1dd6-4d94-b030-0f806653e4b4 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 34.739137] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0 [ 34.748889] CPU: 0 PID: 393 Comm: mkfs.ext4 Not tainted 5.7.0-rc3-next-20200430 #1 [ 34.756450] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 34.763844] Call Trace: [ 34.766305] dump_stack+0x54/0x6e [ 34.769629] dump_header+0x3d/0x1c6 [ 34.773126] ? oom_badness.part.0+0x10/0x120 [ 34.777397] ? ___ratelimit+0x8f/0xdc [ 34.781056] oom_kill_process.cold+0x9/0xe [ 34.785152] out_of_memory+0x1ab/0x260 [ 34.788898] __alloc_pages_nodemask+0xe0e/0xec0 [ 34.793430] pagecache_get_page+0xae/0x260 [ 34.797521] grab_cache_page_write_begin+0x1c/0x30 [ 34.802303] block_write_begin+0x1e/0x90 [ 34.806222] blkdev_write_begin+0x1e/0x20 [ 34.810225] ? bdev_evict_inode+0xd0/0xd0 [ 34.814230] generic_perform_write+0x97/0x180 [ 34.818579] __generic_file_write_iter+0x140/0x1f0 [ 34.823365] blkdev_write_iter+0xc0/0x190 [ 34.827376] __vfs_write+0x132/0x1e0 [ 34.830947] ? 
__audit_syscall_entry+0xa8/0xe0 [ 34.835385] vfs_write+0xa1/0x1a0 [ 34.838696] ksys_pwrite64+0x50/0x80 [ 34.842267] __ia32_sys_ia32_pwrite64+0x16/0x20 [ 34.846798] do_fast_syscall_32+0x6b/0x270 [ 34.850890] entry_SYSENTER_32+0xa5/0xf8 [ 34.854805] EIP: 0xb7f0d549 [ 34.857596] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 [ 34.876334] EAX: ffffffda EBX: 00000003 ECX: b7801010 EDX: 00400000 [ 34.882591] ESI: 38400000 EDI: 00000074 EBP: 07438400 ESP: bfd266f0 [ 34.888847] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246 [ 34.895630] Mem-Info: [ 34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0 [ 34.897923] active_file:4151 inactive_file:212494 isolated_file:0 [ 34.897923] unevictable:0 dirty:16505 writeback:6520 unstable:0 [ 34.897923] slab_reclaimable:5855 slab_unreclaimable:3531 [ 34.897923] mapped:6321 shmem:2236 pagetables:178 bounce:0 [ 34.897923] free:264202 free_pcp:1082 free_cma:0 [ 34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB active_file:16604kB inactive_file:849976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
yes [ 34.955523] DMA free:3356kB min:68kB low:84kB high:100kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:11964kB unevictable:0kB writepending:11980kB present:15964kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 34.983385] lowmem_reserve[]: 0 825 1947 825 [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1096kB inactive_file:786400kB unevictable:0kB writepending:65432kB present:884728kB managed:845576kB mlocked:0kB kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB local_pcp:500kB free_cma:0kB [ 35.017427] lowmem_reserve[]: 0 0 8980 0 [ 35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB active_file:15508kB inactive_file:51612kB unevictable:0kB writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB local_pcp:292kB free_cma:0kB [ 35.051717] lowmem_reserve[]: 0 0 0 0 [ 35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB 0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB = 3384kB [ 35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U) 4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U) 0*4096kB = 4452kB [ 35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB (U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) = 1049496kB [ 35.094634] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=4096kB [ 35.103059] 218892 total pagecache pages [ 35.106985] 0 pages in swap cache [ 35.110303] Swap cache stats: add 0, delete 0, find 0/0 [ 35.115519] Free swap = 0kB [ 35.118396] Total swap = 0kB [ 35.121274] 512558 pages RAM [ 35.124151] 287385 pages HighMem/MovableOnly [ 35.128418] 9810 pages reserved [ 35.131563] Tasks state 
(memory values in pages): [ 35.136260] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ 35.144866] [ 224] 0 224 3425 1273 28672 0 0 systemd-journal [ 35.153932] [ 241] 0 241 3260 828 20480 0 -1000 systemd-udevd [ 35.162797] [ 244] 994 244 3929 456 24576 0 0 systemd-timesyn [ 35.171837] [ 277] 993 277 1569 786 20480 0 0 systemd-network [ 35.180891] [ 279] 992 279 1729 825 20480 0 0 systemd-resolve [ 35.189948] [ 283] 0 283 2032 1087 24576 0 0 haveged [ 35.198312] [ 284] 0 284 810 457 16384 0 0 crond [ 35.206485] [ 285] 996 285 1175 812 20480 0 -900 dbus-daemon [ 35.215177] [ 286] 0 286 11786 2558 49152 0 0 NetworkManager [ 35.224121] [ 287] 0 287 922 174 12288 0 0 klogd [ 35.232293] [ 288] 0 288 1468 1001 20480 0 0 systemd-logind [ 35.241247] [ 289] 995 289 1213 791 20480 0 0 avahi-daemon [ 35.250026] [ 290] 0 290 677 435 16384 0 0 atd [ 35.258040] [ 302] 0 302 921 420 16384 0 0 syslogd [ 35.266380] [ 303] 0 303 5638 1558 32768 0 0 thermald [ 35.274828] [ 305] 995 305 1182 58 20480 0 0 avahi-daemon [ 35.283659] [ 306] 0 306 594 16 16384 0 0 acpid [ 35.291848] [ 320] 0 320 1347 334 20480 0 0 systemd-hostnam [ 35.300906] [ 336] 65534 336 729 32 16384 0 0 dnsmasq [ 35.309253] [ 337] 0 337 666 443 16384 0 0 agetty [ 35.317528] [ 338] 0 338 947 710 16384 0 0 login [ 35.325693] [ 339] 0 339 666 458 16384 0 0 agetty [ 35.333994] [ 350] 998 350 19521 2816 73728 0 0 polkitd [ 35.342330] [ 358] 0 358 1892 1149 20480 0 0 systemd [ 35.350668] [ 359] 0 359 2341 329 20480 0 0 (sd-pam) [ 35.359093] [ 363] 0 363 971 711 16384 0 0 sh [ 35.367023] [ 367] 0 367 920 627 20480 0 0 su [ 35.374937] [ 368] 0 368 971 668 16384 0 0 sh [ 35.382864] [ 373] 0 373 903 613 16384 0 0 lava-test-runne [ 35.391897] [ 383] 0 383 903 518 16384 0 0 lava-test-shell [ 35.400935] [ 384] 0 384 903 612 16384 0 0 sh [ 35.408847] [ 393] 0 393 1976 1713 20480 0 0 mkfs.ext4 [ 35.417384] 
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=polkitd,pid=350,uid=998 [ 35.429982] Out of memory: Killed process 350 (polkitd) total-vm:78084kB, anon-rss:2976kB, file-rss:8288kB, shmem-rss:0kB, UID:998 pgtables:72kB oom_score_adj:0 [ 35.444646] oom_reaper: reaped process 350 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 35.444648] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0 [ 35.463429] CPU: 0 PID: 393 Comm: mkfs.ext4 Not tainted 5.7.0-rc3-next-20200430 #1 [ 35.470991] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.478377] Call Trace: [ 35.480822] dump_stack+0x54/0x6e [ 35.484139] dump_header+0x3d/0x1c6 [ 35.487634] ? oom_badness.part.0+0x10/0x120 [ 35.491922] ? ___ratelimit+0x8f/0xdc [ 35.495578] oom_kill_process.cold+0x9/0xe [ 35.499669] out_of_memory+0x1ab/0x260 [ 35.503414] __alloc_pages_nodemask+0xe0e/0xec0 [ 35.507939] pagecache_get_page+0xae/0x260
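As an aside for readers decoding the report, the gfp_mask in the oom line can be expanded into individual flags. The sketch below hard-codes the ___GFP_* bit values from include/linux/gfp.h around v5.7; treat them as assumptions to be checked against the exact tree under test:

```shell
# Decode a gfp_mask into flag names. Bit values are copied by hand from
# v5.7 include/linux/gfp.h and may differ in other trees (assumption).
decode_gfp() {
  mask=$(( $1 ))
  out=""
  for pair in \
      0x40:__GFP_IO 0x80:__GFP_FS \
      0x400:__GFP_DIRECT_RECLAIM 0x800:__GFP_KSWAPD_RECLAIM \
      0x1000:__GFP_WRITE 0x100000:__GFP_HARDWALL; do
    bit=$(( ${pair%%:*} ))
    name=${pair#*:}
    if [ $(( mask & bit )) -ne 0 ]; then out="$out $name"; fi
  done
  echo "$out"
}

# The mask from the report: decodes to GFP_USER (= __GFP_RECLAIM |
# __GFP_IO | __GFP_FS | __GFP_HARDWALL) plus __GFP_WRITE, matching the
# kernel's own GFP_USER|__GFP_WRITE summary.
decode_gfp 0x101cc0
```

Note that neither __GFP_HIGHMEM nor __GFP_MOVABLE is in the mask, which matters for the zone discussion later in the thread.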
Git log of recent changes in fs and mm:
# fs$ git log --oneline ext4 | head
5868dada23f7 ext4: pass the inode to ext4_mpage_readpages
0c855f1fc999 ext4: convert from readpages to readahead
ebc0198b60e9 mm: add page_cache_readahead_unbounded
907ea529fc4c ext4: convert BUG_ON's to WARN_ON's in mballoc.c
a17a9d935dc4 ext4: increase wait time needed before reuse of deleted inode numbers
648814111af2 ext4: remove set but not used variable 'es' in ext4_jbd2.c
05ca87c149ae ext4: remove set but not used variable 'es'
801674f34ecf ext4: do not zeroout extents beyond i_disksize
9033783c8cfd ext4: fix return-value types in several function comments
d87f639258a6 ext4: use non-movable memory for superblock readahead
# fs/f2fs$ git log --oneline . | head
a4928e314c45 Merge branch 'akpm-current/current'
f1c6758147a8 f2fs: pass the inode to f2fs_mpage_readpages
272e45338126 f2fs: convert from readpages to readahead
ebc0198b60e9 mm: add page_cache_readahead_unbounded
435cbab95e39 f2fs: fix quota_sync failure due to f2fs_lock_op
8b83ac81f428 f2fs: support read iostat
df4233997575 f2fs: Fix the accounting of dcc->undiscard_blks
ce4c638cdd52 f2fs: fix to handle error path of f2fs_ra_meta_pages()
3fa6a8c5b55d f2fs: report the discard cmd errors properly
141af6ba5216 f2fs: fix long latency due to discard during umount
Full test log links:
https://lkft.validation.linaro.org/scheduler/job/1406110#L1223
https://lkft.validation.linaro.org/scheduler/job/1408508#L1250
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
mkfs -t ext4 invokes the oom-killer on an i386 kernel running on an x86_64 device. This started happening on the linux-next master branch, at kernel tags next-20200430 and next-20200501. We did not bisect this problem.
It would be wonderful if you could do so, please. I can't immediately see any MM change in this area which might cause this.
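Since the regression window is only a day or two of -next tags, a bisection should converge quickly. The sketch below demonstrates the git-bisect mechanics on a throwaway repository, because each real step needs a kernel build plus a boot of the affected board; the tag names in the comments are assumptions based on the report (next-20200430 first bad; next-20200429 presumed good):

```shell
# Demo of git-bisect mechanics on a throwaway repo. For the real report:
#   git bisect start
#   git bisect bad  next-20200430   # first failing -next tag per the report
#   git bisect good next-20200429   # assumed last good tag
#   ...at each step: build, boot the board, run "mkfs -t ext4 <dev>",
#   then mark the result good/bad.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "You"
for i in 1 2 3 4 5; do
  echo "$i" > state
  git add state
  git commit -qm "commit $i"
done
# Pretend commit 4 introduced the regression ("bad" when state >= 4).
git bisect start HEAD HEAD~4        # bad = HEAD, good = 4 commits back
git bisect run sh -c 'test "$(cat state)" -lt 4'
git bisect log | tail -n 1          # names the first bad commit
```

With an automatable reproducer, `git bisect run` does the whole walk unattended; with a manual boot test, replace it with explicit `git bisect good`/`git bisect bad` calls.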
metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
  git describe: next-20200430
  make_kernelversion: 5.7.0-rc3
  kernel-config: https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
Steps to reproduce: (always reproducible)
Reproducibility helps!
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
[ 34.793430] pagecache_get_page+0xae/0x260
[ 34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0 [ 34.897923] active_file:4151 inactive_file:212494 isolated_file:0 [ 34.897923] unevictable:0 dirty:16505 writeback:6520 unstable:0
[ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1096kB inactive_file:786400kB unevictable:0kB writepending:65432kB present:884728kB managed:845576kB mlocked:0kB kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB local_pcp:500kB free_cma:0kB
ZONE_NORMAL has a huge amount of clean pagecache stuck on the inactive list, not being reclaimed.
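The zone report also shows why the allocator could not make progress despite ~1GB free in HighMem: a GFP_USER request is confined to lowmem, and both lowmem zones fail their watermark checks. The DMA case only fails once the printed lowmem_reserve entry (825, in pages, which is how the watermark check applies it for a Normal-class request) is added. A sketch with the numbers transcribed from the i386 log:

```shell
# Numbers transcribed from the i386 oom report (kB). GFP_USER cannot fall
# back to HighMem, so only DMA + Normal count toward this allocation.
normal_free=3948; normal_min=7732
dma_free=3356;    dma_min=68
dma_reserve_pages=825   # lowmem_reserve[] entry protecting DMA from
                        # Normal-class requests, in pages (4kB each here)

if [ "$normal_free" -lt "$normal_min" ]; then
  echo "Normal fails: ${normal_free}kB < min ${normal_min}kB"
fi
dma_needed=$(( dma_min + dma_reserve_pages * 4 ))
if [ "$dma_free" -lt "$dma_needed" ]; then
  echo "DMA fails too: ${dma_free}kB < min+reserve ${dma_needed}kB"
fi
```

Both checks fail, so with reclaim reporting all_unreclaimable the only remaining option was the OOM killer.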
Thanks for looking into this problem.
On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
mkfs -t ext4 invokes the oom-killer on an i386 kernel running on an x86_64 device. This started happening on the linux-next master branch, at kernel tags next-20200430 and next-20200501. We did not bisect this problem.
It would be wonderful if you could do so, please. I can't immediately see any MM change in this area which might cause this.
We are planning to bisect this problem soon.
metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
  git describe: next-20200430
  make_kernelversion: 5.7.0-rc3
  kernel-config: https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
Steps to reproduce: (always reproducible)
Reproducibility helps!
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
[ 34.793430] pagecache_get_page+0xae/0x260
[ 34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0 [ 34.897923] active_file:4151 inactive_file:212494 isolated_file:0 [ 34.897923] unevictable:0 dirty:16505 writeback:6520 unstable:0
[ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1096kB inactive_file:786400kB unevictable:0kB writepending:65432kB present:884728kB managed:845576kB mlocked:0kB kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB local_pcp:500kB free_cma:0kB
ZONE_NORMAL has a huge amount of clean pagecache stuck on the inactive list, not being reclaimed.
FYI, this issue was already reported here. The problem is now happening, and easily reproducible, on i386 and on arm BeagleBoard-X15 devices.
mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190703A01414 mke2fs 1.43.8 (1-Jan-2018) Discarding device blocks: 4096/29306880 2625536/29306880 9441280/29306880 16257024/29306880 23072768/29306880 done Creating filesystem with 29306880 4k blocks and 7331840 inodes Filesystem UUID: a838d994-0a1e-403a-88d5-444d75aecc5a Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872 Allocating group tables: 0/895 done Writing inode tables: 0/895 done Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0 [ 31.261172] CPU: 0 PID: 397 Comm: mkfs.ext4 Not tainted 5.7.0-rc6-next-20200518 #1 [ 31.268771] Hardware name: Generic DRA74X (Flattened Device Tree) [ 31.274904] [<c0411500>] (unwind_backtrace) from [<c040b66c>] (show_stack+0x10/0x14) [ 31.282685] [<c040b66c>] (show_stack) from [<c08b1b14>] (dump_stack+0xc4/0xd8) [ 31.289940] [<c08b1b14>] (dump_stack) from [<c0547bf8>] (dump_header+0x54/0x1ec) [ 31.297367] [<c0547bf8>] (dump_header) from [<c0547008>] (oom_kill_process+0x18c/0x198) [ 31.305405] [<c0547008>] (oom_kill_process) from [<c0547a0c>] (out_of_memory+0x250/0x368) [ 31.313619] [<c0547a0c>] (out_of_memory) from [<c0599d80>] (__alloc_pages_nodemask+0xce8/0x10bc) [ 31.322445] [<c0599d80>] (__alloc_pages_nodemask) from [<c0541bb4>] (pagecache_get_page+0x128/0x358) [ 31.331619] [<c0541bb4>] (pagecache_get_page) from [<c0543a8c>] (grab_cache_page_write_begin+0x18/0x2c) [ 31.341054] [<c0543a8c>] (grab_cache_page_write_begin) from [<c0619fb0>] (block_write_begin+0x20/0xc4) [ 31.350401] [<c0619fb0>] (block_write_begin) from [<c053e718>] (generic_perform_write+0xb8/0x1d8) [ 31.359312] [<c053e718>] (generic_perform_write) from [<c054496c>] (__generic_file_write_iter+0x164/0x1ec) [ 31.369007] [<c054496c>] (__generic_file_write_iter) from [<c061c8a4>] (blkdev_write_iter+0xc8/0x1a4) [ 
31.378269] [<c061c8a4>] (blkdev_write_iter) from [<c05d50d0>] (__vfs_write+0x13c/0x1cc) [ 31.386397] [<c05d50d0>] (__vfs_write) from [<c05d81d4>] (vfs_write+0xb0/0x1bc) [ 31.393738] [<c05d81d4>] (vfs_write) from [<c05d85e4>] (ksys_pwrite64+0x60/0x8c) [ 31.401167] [<c05d85e4>] (ksys_pwrite64) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c) [ 31.409115] Exception stack(0xe810dfa8 to 0xe810dff0) [ 31.414185] dfa0: a2000000 0000000d 00000003 b6952008 00400000 00000000 [ 31.422395] dfc0: a2000000 0000000d a2000000 000000b5 00400000 0003b768 b6952008 00da2000 [ 31.430604] dfe0: 00000064 beb891b8 b6f85108 b6e38f2c [ 31.435809] Mem-Info: [ 31.438098] active_anon:5813 inactive_anon:4129 isolated_anon:0 [ 31.438098] active_file:6080 inactive_file:118548 isolated_file:0 [ 31.438098] unevictable:0 dirty:13674 writeback:7440 unstable:0 [ 31.438098] slab_reclaimable:5651 slab_unreclaimable:4566 [ 31.438098] mapped:5585 shmem:4468 pagetables:182 bounce:0 [ 31.438098] free:347556 free_pcp:608 free_cma:57235 [ 31.472362] Node 0 active_anon:23252kB inactive_anon:16516kB active_file:24320kB inactive_file:474192kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:22340kB dirty:54696kB writeback:11196kB shmem:17872kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
yes [ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4736kB inactive_file:431688kB unevictable:0kB writepending:62020kB present:783360kB managed:668264kB mlocked:0kB kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB local_pcp:216kB free_cma:163840kB [ 31.531339] lowmem_reserve[]: 0 0 1216 0 [ 31.535289] HighMem free:1203904kB min:512kB low:11592kB high:22672kB reserved_highatomic:0KB active_anon:23252kB inactive_anon:16516kB active_file:19584kB inactive_file:42420kB unevictable:0kB writepending:0kB present:1310720kB managed:1310720kB mlocked:0kB kernel_stack:0kB pagetables:728kB bounce:0kB free_pcp:1584kB local_pcp:1232kB free_cma:65100kB [ 31.566540] lowmem_reserve[]: 0 0 0 0 [ 31.570244] DMA: 87*4kB (UME) 53*8kB (UME) 26*16kB (UE) 6*32kB (UM) 1*64kB (E) 1*128kB (U) 5*256kB (ME) 5*512kB (ME) 4*1024kB (ME) 5*2048kB (M) 1*4096kB (M) 20*8192kB (C) = 187684kB [ 31.586520] HighMem: 2*4kB (MC) 1*8kB (C) 1*16kB (M) 5*32kB (UM) 4*64kB (UMC) 2*128kB (UM) 2*256kB (UM) 1*512kB (C) 2*1024kB (MC) 2*2048kB (MC) 2*4096kB (UC) 145*8192kB (MC) = 1203904kB [ 31.603150] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 31.611637] 129102 total pagecache pages [ 31.615577] 0 pages in swap cache [ 31.618902] Swap cache stats: add 0, delete 0, find 0/0 [ 31.624162] Free swap = 0kB [ 31.627053] Total swap = 0kB [ 31.629955] 523520 pages RAM [ 31.632846] 327680 pages HighMem/MovableOnly [ 31.637128] 28774 pages reserved [ 31.640381] 57344 pages cma reserved [ 31.643971] Tasks state (memory values in pages): [ 31.648691] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ 31.657367] [ 183] 0 183 7370 1082 36864 0 0 systemd-journal [ 31.666466] [ 209] 994 209 3742 326 40960 0 0 systemd-timesyn [ 31.675570] [ 217] 0 217 3398 817 32768 0 -1000 systemd-udevd [ 31.684498] [ 230] 993 230 1411 737 32768 0 0 systemd-network [ 31.693598] [ 231] 
992 231 1496 712 32768 0 0 systemd-resolve [ 31.702702] [ 236] 996 236 1112 742 24576 0 -900 dbus-daemon [ 31.711454] [ 241] 0 241 1895 1045 36864 0 0 haveged [ 31.719857] [ 242] 0 242 1362 906 28672 0 0 systemd-logind [ 31.728855] [ 243] 0 243 13412 2571 69632 0 0 NetworkManager [ 31.737867] [ 244] 995 244 1197 608 28672 0 0 avahi-daemon [ 31.746707] [ 245] 995 245 1164 59 28672 0 0 avahi-daemon [ 31.755545] [ 246] 0 246 594 332 28672 0 0 atd [ 31.763601] [ 248] 0 248 699 99 24576 0 0 syslogd [ 31.772001] [ 251] 0 251 699 102 24576 0 0 klogd [ 31.780231] [ 252] 0 252 676 365 24576 0 0 crond [ 31.788443] [ 254] 0 254 1172 240 32768 0 0 systemd-hostnam [ 31.797547] [ 264] 65534 264 605 32 24576 0 0 dnsmasq [ 31.805948] [ 265] 0 265 556 357 28672 0 0 agetty [ 31.814262] [ 266] 0 266 1131 613 32768 0 0 login [ 31.822492] [ 268] 998 268 18201 2629 81920 0 0 polkitd [ 31.830895] [ 350] 0 350 1840 1161 32768 0 0 systemd [ 31.839286] [ 351] 0 351 2403 473 36864 0 0 (sd-pam) [ 31.847774] [ 355] 0 355 827 611 24576 0 0 sh [ 31.855742] [ 364] 0 364 7341 1145 53248 0 0 nm-dispatcher [ 31.864667] [ 377] 0 377 711 510 28672 0 0 lava-test-runne [ 31.873770] [ 387] 0 387 711 138 20480 0 0 lava-test-shell [ 31.882869] [ 388] 0 388 711 523 20480 0 0 sh [ 31.890837] [ 397] 0 397 1785 1518 36864 0 0 mkfs.ext4 [ 31.899397] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=polkitd,pid=268,uid=998 [ 31.910012] Out of memory: Killed process 268 (polkitd) total-vm:72804kB, anon-rss:2948kB, file-rss:7568kB, shmem-rss:0kB, UID:998 pgtables:80kB oom_score_adj:0 [ 31.927948] oom_reaper: reaped process 268 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 31.937461] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0 [ 31.947273] CPU: 1 PID: 397 Comm: mkfs.ext4 Not tainted 5.7.0-rc6-next-20200518 #1 [ 31.954871] Hardware name: Generic DRA74X (Flattened Device Tree) [ 31.961000] [<c0411500>] (unwind_backtrace) 
from [<c040b66c>] (show_stack+0x10/0x14) [ 31.968778] [<c040b66c>] (show_stack) from [<c08b1b14>] (dump_stack+0xc4/0xd8) [ 31.976032] [<c08b1b14>] (dump_stack) from [<c0547bf8>] (dump_header+0x54/0x1ec) [ 31.983458] [<c0547bf8>] (dump_header) from [<c0547008>] (oom_kill_process+0x18c/0x198) [ 31.991495] [<c0547008>] (oom_kill_process) from [<c0547a0c>] (out_of_memory+0x250/0x368) [ 31.999706] [<c0547a0c>] (out_of_memory) from [<c0599d80>] (__alloc_pages_nodemask+0xce8/0x10bc) [ 32.008532] [<c0599d80>] (__alloc_pages_nodemask) from [<c0541bb4>] (pagecache_get_page+0x128/0x358) [ 32.017704] [<c0541bb4>] (pagecache_get_page) from [<c0543a8c>] (grab_cache_page_write_begin+0x18/0x2c) [ 32.027138] [<c0543a8c>] (grab_cache_page_write_begin) from [<c0619fb0>] (block_write_begin+0x20/0xc4) [ 32.036484] [<c0619fb0>] (block_write_begin) from [<c053e718>] (generic_perform_write+0xb8/0x1d8) [ 32.045395] [<c053e718>] (generic_perform_write) from [<c054496c>] (__generic_file_write_iter+0x164/0x1ec) [ 32.055090] [<c054496c>] (__generic_file_write_iter) from [<c061c8a4>] (blkdev_write_iter+0xc8/0x1a4) [ 32.064350] [<c061c8a4>] (blkdev_write_iter) from [<c05d50d0>] (__vfs_write+0x13c/0x1cc) [ 32.072476] [<c05d50d0>] (__vfs_write) from [<c05d81d4>] (vfs_write+0xb0/0x1bc) [ 32.079814] [<c05d81d4>] (vfs_write) from [<c05d85e4>] (ksys_pwrite64+0x60/0x8c) [ 32.087241] [<c05d85e4>] (ksys_pwrite64) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c) [ 32.095187] Exception stack(0xe810dfa8 to 0xe810dff0) [ 32.100256] dfa0: a2000000 0000000d 00000003 b6952008 00400000 00000000 [ 32.108466] dfc0: a2000000 0000000d a2000000 000000b5 00400000 0003b768 b6952008 00da2000 [ 32.116673] dfe0: 00000064 beb891b8 b6f85108 b6e38f2c [ 32.121786] Mem-Info: [ 32.124070] active_anon:5056 inactive_anon:4129 isolated_anon:0 [ 32.124070] active_file:6289 inactive_file:118790 isolated_file:0 [ 32.124070] unevictable:0 dirty:14118 writeback:6 unstable:0 [ 32.124070] slab_reclaimable:5653 slab_unreclaimable:4209 [ 
32.124070] mapped:4839 shmem:4468 pagetables:165 bounce:0 [ 32.124070] free:348249 free_pcp:562 free_cma:57235 [ 32.158031] Node 0 active_anon:20224kB inactive_anon:16516kB active_file:25156kB inactive_file:475160kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:19356kB dirty:56472kB writeback:24kB shmem:17872kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [ 32.186324] DMA free:186320kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4736kB inactive_file:433580kB unevictable:0kB writepending:56468kB present:783360kB managed:668264kB mlocked:0kB kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:420kB local_pcp:220kB free_cma:163840kB [ 32.216693] lowmem_reserve[]: 0 0 1216 0 [ 32.220652] HighMem free:1206676kB min:512kB low:11592kB high:22672kB reserved_highatomic:0KB active_anon:20224kB inactive_anon:16516kB active_file:20420kB inactive_file:41584kB unevictable:0kB writepending:0kB present:1310720kB managed:1310720kB mlocked:0kB kernel_stack:0kB pagetables:660kB bounce:0kB free_pcp:1816kB local_pcp:340kB free_cma:65100kB [ 32.251805] lowmem_reserve[]: 0 0 0 0 [ 32.255482] DMA: 2*4kB (UM) 3*8kB (UME) 1*16kB (U) 1*32kB (M) 0*64kB 1*128kB (U) 5*256kB (ME) 5*512kB (ME) 4*1024kB (ME) 5*2048kB (M) 1*4096kB (M) 20*8192kB (C) = 186320kB [ 32.270871] HighMem: 183*4kB (UMC) 65*8kB (UMC) 21*16kB (M) 11*32kB (UM) 6*64kB (UMC) 3*128kB (UM) 3*256kB (UM) 2*512kB (MC) 2*1024kB (MC) 2*2048kB (MC) 2*4096kB (UC) 145*8192kB (MC) = 1206676kB [ 32.288273] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 32.296751] 129546 total pagecache pages [ 32.300695] 0 pages in swap cache [ 32.304019] Swap cache stats: add 0, delete 0, find 0/0 [ 32.309260] Free swap = 0kB [ 32.312155] Total swap = 0kB [ 32.315045] 523520 pages RAM [ 32.317932] 327680 pages HighMem/MovableOnly [ 32.322221] 28774 pages reserved [ 32.325457] 57344 
pages cma reserved [ 32.329043] Tasks state (memory values in pages): [ 32.333771] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ 32.342436] [ 183] 0 183 7370 1082 36864 0 0 systemd-journal [ 32.351529] [ 209] 994 209 3742 326 40960 0 0 systemd-timesyn [ 32.360620] [ 217] 0 217 3398 817 32768 0 -1000 systemd-udevd [ 32.369528] [ 230] 993 230 1411 737 32768 0 0 systemd-network [ 32.378620] [ 231] 992 231 1496 712 32768 0 0 systemd-resolve [ 32.387713] [ 236] 996 236 1112 742 24576 0 -900 dbus-daemon [ 32.396456] [ 241] 0 241 1895 1045 36864 0 0 haveged [ 32.404850] [ 242] 0 242 1362 906 28672 0 0 systemd-logind [ 32.413852] [ 243] 0 243 13412 2571 69632 0 0 NetworkManager [ 32.422858] [ 244] 995 244 1197 608 28672 0 0 avahi-daemon [ 32.431687] [ 245] 995 245 1164 59 28672 0 0 avahi-daemon [ 32.440518] [ 246] 0 246 594 332 28672 0 0 atd [ 32.448553] [ 248] 0 248 699 99 24576 0 0 syslogd [ 32.456945] [ 251] 0 251 699 102 24576 0 0 klogd [ 32.465171] [ 252] 0 252 676 365 24576 0 0 crond [ 32.473390] [ 254] 0 254 1172 240 32768 0 0 systemd-hostnam [ 32.482481] [ 264] 65534 264 605 32 24576 0 0 dnsmasq [ 32.490876] [ 265] 0 265 556 357 28672 0 0 agetty [ 32.499175] [ 266] 0 266 1131 613 32768 0 0 login [ 32.507394] [ 350] 0 350 1840 1161 32768 0 0 systemd [ 32.515788] [ 351] 0 351 2403 473 36864 0 0 (sd-pam) [ 32.524268] [ 355] 0 355 827 611 24576 0 0 sh [ 32.532227] [ 364] 0 364 7341 1145 53248 0 0 nm-dispatcher [ 32.541142] [ 377] 0 377 711 510 28672 0 0 lava-test-runne [ 32.550234] [ 387] 0 387 711 138 20480 0 0 lava-test-shell [ 32.559316] [ 388] 0 388 711 523 20480 0 0 sh [ 32.567273] [ 397] 0 397 1785 1518 36864 0 0 mkfs.ext4
Ref:
https://lkft.validation.linaro.org/scheduler/job/1436647#L4261
https://lkft.validation.linaro.org/scheduler/job/1436562#L1247
-- Linaro LKFT https://lkft.linaro.org
On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
Thanks for looking into this problem.
On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
mkfs -t ext4 invokes the oom-killer on an i386 kernel running on an x86_64 device. This started happening on the linux-next master branch, at kernel tags next-20200430 and next-20200501. We did not bisect this problem.
[...]
Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[...]
[ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4736kB inactive_file:431688kB unevictable:0kB writepending:62020kB present:783360kB managed:668264kB mlocked:0kB kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB local_pcp:216kB free_cma:163840kB
This is really unexpected. You are saying this is a regular i386, where the DMA zone should be the bottom 16MB, yet yours is ~780MB, and the rest of low memory, which should be in the Normal zone, is completely missing here. How did you get to that configuration? I have to say I haven't seen anything like this on i386.
The failing request is GFP_USER, so highmem is not really allowed, but free pages are way above the watermarks, so the allocation should have just succeeded.
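To spell out the "highmem is not really allowed" part: gfp_zone() only selects ZONE_HIGHMEM (or ZONE_MOVABLE) when __GFP_HIGHMEM or __GFP_MOVABLE is set, and GFP_USER sets neither. A minimal sketch, with the flag bits assumed from v5.7 include/linux/gfp.h:

```shell
# Which zones may serve the failing request? Flag bit values are an
# assumption taken from v5.7 include/linux/gfp.h.
mask=$(( 0x101cc0 ))       # GFP_USER | __GFP_WRITE from the report
GFP_HIGHMEM=$(( 0x02 ))    # __GFP_HIGHMEM
GFP_MOVABLE=$(( 0x08 ))    # __GFP_MOVABLE
if [ $(( mask & (GFP_HIGHMEM | GFP_MOVABLE) )) -eq 0 ]; then
  echo "lowmem-only request: the ~1.2GB free in HighMem cannot be used"
fi
```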
On Tue, May 19, 2020 at 9:52 AM Michal Hocko <mhocko@kernel.org> wrote:
On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
Thanks for looking into this problem.
On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
mkfs -t ext4 invokes the oom-killer on an i386 kernel running on an x86_64 device. This started happening on the linux-next master branch, at kernel tags next-20200430 and next-20200501. We did not bisect this problem.
[...]
Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[...]
[ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4736kB inactive_file:431688kB unevictable:0kB writepending:62020kB present:783360kB managed:668264kB mlocked:0kB kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB local_pcp:216kB free_cma:163840kB
This is really unexpected. You are saying this is a regular i386, where the DMA zone should be the bottom 16MB, yet yours is ~780MB, and the rest of low memory, which should be in the Normal zone, is completely missing here. How did you get to that configuration? I have to say I haven't seen anything like this on i386.
I think that line comes from an ARM32 BeagleBoard-X15 machine showing the same symptom. The i386 lines from the log file that Naresh linked at https://lkft.validation.linaro.org/scheduler/job/1406110#L1223 are less unusual:
[ 34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB active_file:16604kB inactive_file:849976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [ 34.955523] DMA free:3356kB min:68kB low:84kB high:100kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:11964kB unevictable:0kB writepending:11980kB present:15964kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 34.983385] lowmem_reserve[]: 0 825 1947 825 [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1096kB inactive_file:786400kB unevictable:0kB writepending:65432kB present:884728kB managed:845576kB mlocked:0kB kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB local_pcp:500kB free_cma:0kB [ 35.017427] lowmem_reserve[]: 0 0 8980 0 [ 35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB active_file:15508kB inactive_file:51612kB unevictable:0kB writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB local_pcp:292kB free_cma:0kB [ 35.051717] lowmem_reserve[]: 0 0 0 0 [ 35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB 0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB = 3384kB [ 35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U) 4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U) 0*4096kB = 4452kB [ 35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB (U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) = 1049496kB
Arnd
On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
On Tue, May 19, 2020 at 9:52 AM Michal Hocko <mhocko@kernel.org> wrote:
On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
Thanks for looking into this problem.
On Sat, 2 May 2020 at 02:28, Andrew Morton akpm@linux-foundation.org wrote:
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
mkfs -t ext4 invoked the oom-killer on an i386 kernel running on an x86_64 device; this started happening on the linux-next master branch at kernel tags next-20200430 and next-20200501. We did not bisect this problem.
[...]
Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[...]
[ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4736kB inactive_file:431688kB unevictable:0kB writepending:62020kB present:783360kB managed:668264kB mlocked:0kB kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB local_pcp:216kB free_cma:163840kB
This is really unexpected. You are saying this is a regular i386, where the DMA zone should be the bottom 16MB, yet yours is 780MB, and the rest of the low memory, which should be in the Normal zone, is completely missing here. How did you get to that configuration? I have to say I haven't seen anything like that on i386.
I think that line comes from an ARM32 beaglebone-X15 machine showing the same symptom. The i386 line from the log file that Naresh linked to at https://lkft.validation.linaro.org/scheduler/job/1406110#L1223 is less unusual:
OK, that makes more sense! At least for the memory layout.
[ 34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB active_file:16604kB inactive_file:849976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [ 34.955523] DMA free:3356kB min:68kB low:84kB high:100kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:11964kB unevictable:0kB writepending:11980kB present:15964kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 34.983385] lowmem_reserve[]: 0 825 1947 825 [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1096kB inactive_file:786400kB unevictable:0kB writepending:65432kB present:884728kB managed:845576kB mlocked:0kB kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB local_pcp:500kB free_cma:0kB
The lowmem is really low (way below the min watermark), so even the memory reserves for high-priority and atomic requests are depleted. There is still 786MB of inactive page cache to be reclaimed. It doesn't seem to be dirty or under writeback, but it might still be pinned by the filesystem. I would suggest watching the vmscan reclaim tracepoints and checking why the reclaim fails to reclaim anything.
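For reference, the vmscan reclaim tracepoints can be watched roughly like this (a minimal sketch, assuming root and tracefs mounted at the usual debugfs location; the exact event list is up to taste):

```shell
# Enable a few vmscan reclaim tracepoints (root required; tracefs
# assumed at /sys/kernel/debug/tracing).
cd /sys/kernel/debug/tracing
echo 1 > events/vmscan/mm_vmscan_direct_reclaim_begin/enable
echo 1 > events/vmscan/mm_vmscan_direct_reclaim_end/enable
echo 1 > events/vmscan/mm_vmscan_lru_shrink_inactive/enable
echo > trace                      # clear the ring buffer

# Reproduce in another shell, e.g.:
#   mkfs -t ext4 <external-STORAGE_DEV>

# Then check how many pages each shrink pass actually reclaimed:
grep shrink_inactive trace | tail -n 50
```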
FYI,
This issue is specific to the 32-bit architectures i386 and arm on the linux-next tree. As per the test results history, this problem started between these tags:
Bad : next-20200430
Good : next-20200429
Steps to reproduce:
dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573 of=/dev/null bs=1M count=2048
or
mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
Problem: [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0
i386 crash log: https://pastebin.com/Hb8U89vU arm crash log: https://pastebin.com/BD9t3JTm
On Tue, 19 May 2020 at 14:15, Michal Hocko mhocko@kernel.org wrote:
On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
On Tue, May 19, 2020 at 9:52 AM Michal Hocko mhocko@kernel.org wrote:
On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
Thanks for looking into this problem.
On Sat, 2 May 2020 at 02:28, Andrew Morton akpm@linux-foundation.org wrote:
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device and started happening on linux -next master branch kernel tag next-20200430 and next-20200501. We did not bisect this problem.
[...]
Creating journal (131072 blocks): [ 31.251333] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[...]
[ 31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:4736kB inactive_file:431688kB unevictable:0kB writepending:62020kB present:783360kB managed:668264kB mlocked:0kB kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB local_pcp:216kB free_cma:163840kB
This is really unexpected. You are saying this is a regular i386 and DMA should be bottom 16MB while yours is 780MB and the rest of the low mem is in the Normal zone which is completely missing here. How have you got to that configuration? I have to say I haven't seen anything like that on i386.
I think that line comes from an ARM32 beaglebone-X15 machine showing the same symptom. The i386 line from the log file that Naresh linked to at https://lkft.validation.linaro.org/scheduler/job/1406110#L1223 is less unusual:
OK, that makes more sense! At least for the memory layout.
[...]
The lowmem is really low (way below the min watermark), so even the memory reserves for high-priority and atomic requests are depleted. There is still 786MB of inactive page cache to be reclaimed. It doesn't seem to be dirty or under writeback, but it might still be pinned by the filesystem. I would suggest watching the vmscan reclaim tracepoints and checking why the reclaim fails to reclaim anything. -- Michal Hocko SUSE Labs
On Wed, 20 May 2020 at 17:26, Naresh Kamboju naresh.kamboju@linaro.org wrote:
This issue is specific on 32-bit architectures i386 and arm on linux-next tree. As per the test results history this problem started happening from Bad : next-20200430 Good : next-20200429
steps to reproduce: dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573 of=/dev/null bs=1M count=2048 or mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
Problem: [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0
As part of the investigation into this issue, my LKFT teammate Anders Roxell bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519; I retested the reproduction steps and confirmed that the mkfs -t ext4 test case passes (the oom-killer invocation is gone now):
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
i386 test log shows mkfs -t ext4 pass https://lkft.validation.linaro.org/scheduler/job/1443405#L1200
ref: https://lore.kernel.org/linux-mm/cover.1588092152.git.chris@chrisdown.name/ https://lore.kernel.org/linux-mm/CA+G9fYvzLm7n1BE7AJXd8_49fOgPgWWTiQ7sXkVre_...
Hi Naresh,
Naresh Kamboju writes:
As a part of investigation on this issue LKFT teammate Anders Roxell git bisected the problem and found bad commit(s) which caused this problem.
The following two patches have been reverted on next-20200519 and retested the reproducible steps and confirmed the test case mkfs -t ext4 got PASS. ( invoked oom-killer is gone now)
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Since you have i386 hardware available, and I don't, could you please apply only "avoid stale protection" again and check if it only happens with that commit, or requires both? That would help narrow down the suspects.
Do you use any memcg protections in these tests?
Thank you!
Chris
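One quick way to answer Chris's question above is to scan the cgroup hierarchy for non-default protection values; a sketch, assuming cgroup v2 mounted at /sys/fs/cgroup:

```shell
# List any cgroups with non-default memory protections
# (assumes cgroup v2 mounted at /sys/fs/cgroup).
find /sys/fs/cgroup \( -name memory.min -o -name memory.low \) |
while read -r f; do
    v=$(cat "$f")
    [ "$v" = "0" ] || echo "$f: $v"
done
```

No output means no memcg {min, low} protection is set anywhere.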
On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 20 May 2020 at 17:26, Naresh Kamboju naresh.kamboju@linaro.org wrote:
This issue is specific on 32-bit architectures i386 and arm on linux-next tree. As per the test results history this problem started happening from Bad : next-20200430 Good : next-20200429
steps to reproduce: dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573 of=/dev/null bs=1M count=2048 or mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
Problem: [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0
As a part of investigation on this issue LKFT teammate Anders Roxell git bisected the problem and found bad commit(s) which caused this problem.
The following two patches have been reverted on next-20200519 and retested the reproducible steps and confirmed the test case mkfs -t ext4 got PASS. ( invoked oom-killer is gone now)
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
My guess is that commit "mm, memcg: decouple e{low,min} state mutations from protection checks" makes the same mistake: it reads a stale memcg protection value in mem_cgroup_below_low() and mem_cgroup_below_min().
Below is a possible fix:
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a2c56fc..6591b71 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -391,20 +391,28 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 				     struct mem_cgroup *memcg);
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+					struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return false;
 
+	if (root == memcg)
+		return false;
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
 
-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+					struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return false;
 
+	if (root == memcg)
+		return false;
+
 	return READ_ONCE(memcg->memory.emin) >=
 		page_counter_read(&memcg->memory);
 }
@@ -896,12 +904,14 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+					struct mem_cgroup *memcg)
 {
 	return false;
 }
 
-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+					struct mem_cgroup *memcg)
 {
 	return false;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c71660e..fdcdd88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,13 +2637,13 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
-		if (mem_cgroup_below_min(memcg)) {
+		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(memcg)) {
+		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
i386 test log shows mkfs -t ext4 pass https://lkft.validation.linaro.org/scheduler/job/1443405#L1200
ref: https://lore.kernel.org/linux-mm/cover.1588092152.git.chris@chrisdown.name/ https://lore.kernel.org/linux-mm/CA+G9fYvzLm7n1BE7AJXd8_49fOgPgWWTiQ7sXkVre_...
-- Linaro LKFT https://lkft.linaro.org
-- Thanks Yafang
On Thu, 21 May 2020 at 08:10, Yafang Shao laoar.shao@gmail.com wrote:
On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 20 May 2020 at 17:26, Naresh Kamboju naresh.kamboju@linaro.org wrote:
This issue is specific to the 32-bit architectures i386 and arm on the linux-next tree. As per the test results history this problem started happening from [...] mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
Problem: [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0
My guess is that we made the same mistake in commit "mm, memcg: decouple e{low,min} state mutations from protection checks" that it read a stale memcg protection in mem_cgroup_below_low() and mem_cgroup_below_min().
Below is a possible fix,
Sorry, the proposed fix did not work. I took your patch, applied it on top of the linux-next master branch, and retested; mkfs -t ext4 still invoked the oom-killer.
After patch applied test log link, https://lkft.validation.linaro.org/scheduler/job/1443936#L1168
test log, + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: ab107250-bf18-4357-a06a-67f2bfcc1048 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 34.423940] mkfs.ext4 invoked oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0 [ 34.433694] CPU: 0 PID: 402 Comm: mkfs.ext4 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 34.441342] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 34.448734] Call Trace: [ 34.451196] dump_stack+0x54/0x76 [ 34.454517] dump_header+0x40/0x1f0 [ 34.458008] ? oom_badness+0x1f/0x120 [ 34.461673] ? ___ratelimit+0x6c/0xe0 [ 34.465332] oom_kill_process+0xc9/0x110 [ 34.469255] out_of_memory+0xd7/0x2f0 [ 34.472916] __alloc_pages_nodemask+0xdd1/0xe90 [ 34.477446] ? set_bh_page+0x33/0x50 [ 34.481016] ? __xa_set_mark+0x4d/0x70 [ 34.484762] pagecache_get_page+0xbe/0x250 [ 34.488859] grab_cache_page_write_begin+0x1a/0x30 [ 34.493645] block_write_begin+0x25/0x90 [ 34.497569] blkdev_write_begin+0x1e/0x20 [ 34.501574] ? 
bdev_evict_inode+0xc0/0xc0 [ 34.505578] generic_perform_write+0x95/0x190 [ 34.509927] __generic_file_write_iter+0xe0/0x1a0 [ 34.514626] blkdev_write_iter+0xbf/0x1c0 [ 34.518630] __vfs_write+0x122/0x1e0 [ 34.522200] vfs_write+0x8f/0x1b0 [ 34.525510] ksys_pwrite64+0x60/0x80 [ 34.529081] __ia32_sys_ia32_pwrite64+0x16/0x20 [ 34.533604] do_fast_syscall_32+0x66/0x240 [ 34.537697] entry_SYSENTER_32+0xa5/0xf8 [ 34.541613] EIP: 0xb7f3c549 [ 34.544403] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 [ 34.563140] EAX: ffffffda EBX: 00000003 ECX: b7830010 EDX: 00400000 [ 34.569397] ESI: 38400000 EDI: 00000074 EBP: 07438400 ESP: bff1e650 [ 34.575654] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246 [ 34.582453] Mem-Info: [ 34.584732] active_anon:5713 inactive_anon:2169 isolated_anon:0 [ 34.584732] active_file:4040 inactive_file:211204 isolated_file:0 [ 34.584732] unevictable:0 dirty:17270 writeback:6240 unstable:0 [ 34.584732] slab_reclaimable:5856 slab_unreclaimable:3439 [ 34.584732] mapped:6192 shmem:2258 pagetables:178 bounce:0 [ 34.584732] free:265105 free_pcp:1330 free_cma:0 [ 34.618483] Node 0 active_anon:22852kB inactive_anon:8676kB active_file:16160kB inactive_file:844816kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:24768kB dirty:69080kB writeback:19628kB shmem:9032kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
yes [ 34.642354] DMA free:3588kB min:68kB low:84kB high:100kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:11848kB unevictable:0kB writepending:11856kB present:15964kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 34.670194] lowmem_reserve[]: 0 824 1947 824 [ 34.674483] Normal free:4228kB min:3636kB low:4544kB high:5452kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:1136kB inactive_file:786456kB unevictable:0kB writepending:68084kB present:884728kB managed:845324kB mlocked:0kB kernel_stack:1104kB pagetables:0kB bounce:0kB free_pcp:3056kB local_pcp:388kB free_cma:0kB [ 34.704243] lowmem_reserve[]: 0 0 8980 0 [ 34.708189] HighMem free:1053028kB min:512kB low:1748kB high:2984kB reserved_highatomic:0KB active_anon:22852kB inactive_anon:8676kB active_file:15024kB inactive_file:46596kB unevictable:0kB writepending:0kB present:1149544kB managed:1149544kB mlocked:0kB kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:2160kB local_pcp:736kB free_cma:0kB [ 34.738563] lowmem_reserve[]: 0 0 0 0 [ 34.742245] DMA: 23*4kB (U) 2*8kB (U) 3*16kB (U) 2*32kB (UE) 2*64kB (U) 1*128kB (U) 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB = 3804kB [ 34.755479] Normal: 25*4kB (UM) 27*8kB (UME) 16*16kB (UME) 14*32kB (UME) 7*64kB (UME) 2*128kB (UM) 1*256kB (E) 1*512kB (E) 0*1024kB 1*2048kB (M) 0*4096kB = 4540kB [ 34.770004] HighMem: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (M) 2*128kB (UM) 2*256kB (UM) 1*512kB (U) 1*1024kB (U) 1*2048kB (U) 256*4096kB (M) = 1053028kB [ 34.784010] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=4096kB [ 34.792466] 217507 total pagecache pages [ 34.796387] 0 pages in swap cache [ 34.799704] Swap cache stats: add 0, delete 0, find 0/0 [ 34.804923] Free swap = 0kB [ 34.807834] Total swap = 0kB [ 34.810738] 512559 pages RAM [ 34.813640] 287386 pages HighMem/MovableOnly [ 34.817931] 9873 pages 
reserved
- Naresh
On Thu, 21 May 2020 at 00:39, Chris Down chris@chrisdown.name wrote:
Hi Naresh,
Naresh Kamboju writes:
As a part of investigation on this issue LKFT teammate Anders Roxell git bisected the problem and found bad commit(s) which caused this problem.
The following two patches have been reverted on next-20200519 and retested the reproducible steps and confirmed the test case mkfs -t ext4 got PASS. ( invoked oom-killer is gone now)
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Since you have i386 hardware available, and I don't, could you please apply only "avoid stale protection" again and check if it only happens with that commit, or requires both? That would help narrow down the suspects.
Not both. The bad commit is "mm, memcg: decouple e{low,min} state mutations from protection checks"
Do you use any memcg protections in these tests?
I see three MEMCG configs enabled; please find the kernel config link for more details.
CONFIG_MEMCG=y CONFIG_MEMCG_SWAP=y CONFIG_MEMCG_KMEM=y
kernel config link, https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config
- Naresh
On Thu, May 21, 2020 at 11:22 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Thu, 21 May 2020 at 00:39, Chris Down chris@chrisdown.name wrote:
Since you have i386 hardware available, and I don't, could you please apply only "avoid stale protection" again and check if it only happens with that commit, or requires both? That would help narrow down the suspects.
Note that Naresh is running an i386 kernel on regular 64-bit hardware that most people have access to.
kernel config link, https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config
Do you know if the same bug shows up running a kernel with that configuration in qemu? I would expect it to, and that would make it much easier to reproduce.
I would also not be surprised if it happens on all architectures but only shows up on the 32-bit arm and x86 machines first because they have a rather limited amount of lowmem. Maybe booting a 64-bit kernel with "mem=512M" and then running "dd if=/dev/sda of=/dev/null bs=1M" will also trigger it. I did not attempt to run this myself.
Arnd
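Arnd's suggested reproducer might look roughly like this; completely untested, and the kernel image and root filesystem paths are placeholders:

```shell
# Boot an x86-64 kernel with usable memory clamped to 512MB, then
# stream a block device through the page cache from inside the guest.
qemu-system-x86_64 -m 2048 -smp 2 -nographic \
    -kernel arch/x86/boot/bzImage \
    -drive file=rootfs.img,format=raw,if=virtio \
    -append "root=/dev/vda console=ttyS0 mem=512M"

# Inside the guest:
#   dd if=/dev/vda of=/dev/null bs=1M
```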
On Thu, May 21, 2020 at 4:59 PM Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Thu, 21 May 2020 at 08:10, Yafang Shao laoar.shao@gmail.com wrote:
On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Wed, 20 May 2020 at 17:26, Naresh Kamboju naresh.kamboju@linaro.org wrote:
This issue is specific to the 32-bit architectures i386 and arm on the linux-next tree. As per the test results history this problem started happening from [...] mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
Problem: [ 38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0
My guess is that we made the same mistake in commit "mm, memcg: decouple e{low,min} state mutations from protection checks" that it read a stale memcg protection in mem_cgroup_below_low() and mem_cgroup_below_min().
Below is a possible fix,
Sorry, the proposed fix did not work. I took your patch, applied it on top of the linux-next master branch, and retested; mkfs -t ext4 still invoked the oom-killer.
After patch applied test log link, https://lkft.validation.linaro.org/scheduler/job/1443936#L1168
test log,
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
[...]
- Naresh
Thanks for your work. I just noticed that this is a system oom rather than a memcg oom, while my patch is against the memcg oom path.
As you have verified that this oom is caused only by commit "mm, memcg: decouple e{low,min} state mutations from protection checks", this commit really does introduce the issue of using a stale protection value, but I haven't thought deeply about why it occurs. The issue can only occur when memcg {min, low} protection is set, but unfortunately the memcg {min, low} settings aren't shown in the oom log.
I would appreciate it if you could check the memcg {min, low} protection settings. If they are set, I think the workaround below can avoid this issue:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 474815a..f6f794a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6380,6 +6380,9 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (mem_cgroup_disabled())
 		return;
 
+	memcg->memory.elow = 0;
+	memcg->memory.emin = 0;
+
 	if (!root)
 		root = root_mem_cgroup;
But I think the right thing to do now is to revert the bad commit: the usage of memory.{emin, elow} is very subtle, and we shouldn't scatter accesses to them here and there at the risk of reading a stale value.
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As a part of investigation on this issue LKFT teammate Anders Roxell git bisected the problem and found bad commit(s) which caused this problem.
The following two patches have been reverted on next-20200519 and retested the reproducible steps and confirmed the test case mkfs -t ext4 got PASS. ( invoked oom-killer is gone now)
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop. Btw. do you see the problem when booting with cgroup_disable=memory kernel command line parameter?
I suspect that something might be initialized for memcg incorrectly and the patch just makes it more visible for some reason.
On Thu, 21 May 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As a part of investigation on this issue LKFT teammate Anders Roxell git bisected the problem and found bad commit(s) which caused this problem.
The following two patches have been reverted on next-20200519 and retested the reproducible steps and confirmed the test case mkfs -t ext4 got PASS. ( invoked oom-killer is gone now)
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop. Btw. do you see the problem when booting with cgroup_disable=memory kernel command line parameter?
With the extra kernel command line parameter cgroup_disable=memory, I now see a different problem.
+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60 [ 35.544724] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb 75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27 8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b 8a 98 [ 35.563461] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000 [ 35.569718] ESI: 00000000 EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04 [ 35.575976] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207 [ 35.582751] CR0: 80050033 CR2: 000000c8 CR3: 0bef4000 CR4: 003406d0 [ 35.589010] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 35.595266] DR6: fffe0ff0 DR7: 00000400 [ 35.599096] Call Trace: [ 35.601544] shrink_lruvec+0x447/0x630 [ 35.605294] ? newidle_balance.isra.100+0x8e/0x3f0 [ 35.610080] ? pick_next_task_fair+0x3a/0x320 [ 35.614437] ? deactivate_task+0xcf/0x100 [ 35.618442] ? put_prev_entity+0x1a/0xd0 [ 35.622359] ? deactivate_task+0xcf/0x100 [ 35.626363] shrink_node+0x1be/0x640 [ 35.629932] ? shrink_node+0x1be/0x640 [ 35.633676] kswapd+0x32c/0x890 [ 35.636815] ? 
deactivate_task+0xcf/0x100 [ 35.640820] kthread+0xf1/0x110 [ 35.643963] ? do_try_to_free_pages+0x3b0/0x3b0 [ 35.648489] ? kthread_park+0xa0/0xa0 [ 35.652147] ret_from_fork+0x1c/0x28 [ 35.655726] Modules linked in: x86_pkg_temp_thermal [ 35.660605] CR2: 00000000000000c8 [ 35.663916] ---[ end trace d85b8564ea55fb0d ]--- [ 35.663917] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.663918] #PF: supervisor read access in kernel mode [ 35.668534] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60 [ 35.674792] #PF: error_code(0x0000) - not-present page [ 35.674792] *pde = 00000000 [ 35.679921] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb 75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27 8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b 8a 98 [ 35.685140] Oops: 0000 [#2] SMP [ 35.685142] CPU: 2 PID: 391 Comm: mkfs.ext4 Tainted: G D 5.7.0-rc6-next-20200519+ #1 [ 35.690278] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000 [ 35.690279] ESI: 00000000 EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04 [ 35.693155] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.693158] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60 [ 35.711893] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207 [ 35.711894] CR0: 80050033 CR2: 000000c8 CR3: 0bef4000 CR4: 003406d0 [ 35.715031] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb 75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27 8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b 8a 98 [ 35.724061] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 35.730317] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000 [ 35.730318] ESI: 00000000 EDI: f2d73c14 EBP: f2d73b78 ESP: f2d73b6c [ 35.736576] DR6: fffe0ff0 DR7: 00000400 [ 35.803603] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207 [ 35.810380] CR0: 80050033 CR2: 000000c8 CR3: 33241000 CR4: 003406d0 [ 35.816636] DR0: 00000000 DR1: 00000000 DR2: 00000000 
DR3: 00000000 [ 35.822893] DR6: fffe0ff0 DR7: 00000400 [ 35.826725] Call Trace: [ 35.829171] shrink_lruvec+0x447/0x630 [ 35.832921] ? check_preempt_curr+0x75/0x80 [ 35.837100] shrink_node+0x1be/0x640 [ 35.840670] ? shrink_node+0x1be/0x640 [ 35.844412] do_try_to_free_pages+0xc1/0x3b0 [ 35.848677] try_to_free_pages+0xba/0x1d0 [ 35.852683] __alloc_pages_nodemask+0x573/0xe90 [ 35.857232] ? set_bh_page+0x33/0x50 [ 35.860829] ? xas_load+0xf/0x70 [ 35.864050] ? __xa_set_mark+0x4d/0x70 [ 35.867795] ? find_get_entry+0x47/0x110 [ 35.871714] pagecache_get_page+0xbe/0x250 [ 35.875805] grab_cache_page_write_begin+0x1a/0x30 [ 35.880588] block_write_begin+0x25/0x90 [ 35.884504] blkdev_write_begin+0x1e/0x20 [ 35.888507] ? bdev_evict_inode+0xc0/0xc0 [ 35.892513] generic_perform_write+0x95/0x190 [ 35.896863] __generic_file_write_iter+0xe0/0x1a0 [ 35.901562] blkdev_write_iter+0xbf/0x1c0 [ 35.905564] __vfs_write+0x122/0x1e0 [ 35.909136] vfs_write+0x8f/0x1b0 [ 35.912454] ksys_pwrite64+0x60/0x80 [ 35.916024] __ia32_sys_ia32_pwrite64+0x16/0x20 [ 35.920549] do_fast_syscall_32+0x66/0x240 [ 35.924641] entry_SYSENTER_32+0xa5/0xf8 [ 35.928567] EIP: 0xb7f72549 [ 35.931357] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 [ 35.950093] EAX: ffffffda EBX: 00000003 ECX: b7866010 EDX: 00400000 [ 35.956351] ESI: 39000000 EDI: 00000074 EBP: 07439000 ESP: bf973700 [ 35.962607] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246 [ 35.969384] Modules linked in: x86_pkg_temp_thermal [ 35.974269] CR2: 00000000000000c8 [ 35.977582] ---[ end trace d85b8564ea55fb0e ]--- [ 35.977583] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.977584] #PF: supervisor read access in kernel mode [ 35.982193] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60 [ 35.982195] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb 75 48 55 89 e5 57 56 53 3e 8d 74 26 
00 8b 1d 88 b5 e1 cb 31 f6 eb 27 8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b 8a 98 [ 35.988450] #PF: error_code(0x0000) - not-present page [ 35.988451] *pde = 00000000
Full test log link: https://lkft.validation.linaro.org/scheduler/job/1443939#L1170
- Naresh
On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
On Thu, 21 May 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation of this issue, LKFT teammate Anders Roxell bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519, the reproduction steps were retested, and the mkfs -t ext4 test case was confirmed to PASS (the oom-killer invocation is gone now).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop. Btw. do you see the problem when booting with cgroup_disable=memory kernel command line parameter?
With the extra kernel command line parameter cgroup_disable=memory, I have noticed a different problem now.
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Could you get faddr2line for this offset?
On Thu, 21 May 2020, Michal Hocko wrote:
On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
On Thu, 21 May 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation of this issue, LKFT teammate Anders Roxell bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519, the reproduction steps were retested, and the mkfs -t ext4 test case was confirmed to PASS (the oom-killer invocation is gone now).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop. Btw. do you see the problem when booting with cgroup_disable=memory kernel command line parameter?
With the extra kernel command line parameter cgroup_disable=memory, I have noticed a different problem now.
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Could you get faddr2line for this offset?
No need for that, I can help with the "cgroup_disable=memory" crash: I've been happily running with the fixup below, but haven't got around to sending it in yet (and wouldn't normally be reading mail at this time!) because I've been busy chasing a couple of other bugs (not necessarily mm); and maybe the fix would be better with an explicit mem_cgroup_disabled() test, or maybe that should be where cgroup_memory_noswap is decided - up to Johannes.
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
--- 5.7-rc6-mm1/mm/memcontrol.c	2020-05-20 12:21:56.109693740 -0700
+++ linux/mm/memcontrol.c	2020-05-20 12:26:15.500478753 -0700
@@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg || cgroup_memory_noswap ||
+	    !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
On Thu, 21 May 2020, Michal Hocko wrote:
On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
On Thu, 21 May 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation of this issue, LKFT teammate Anders Roxell bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519, the reproduction steps were retested, and the mkfs -t ext4 test case was confirmed to PASS (the oom-killer invocation is gone now).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop. Btw. do you see the problem when booting with cgroup_disable=memory kernel command line parameter?
With the extra kernel command line parameter cgroup_disable=memory, I have noticed a different problem now.
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Could you get faddr2line for this offset?
No need for that, I can help with the "cgroup_disable=memory" crash: I've been happily running with the fixup below, but haven't got around to sending it in yet (and wouldn't normally be reading mail at this time!) because I've been busy chasing a couple of other bugs (not necessarily mm); and maybe the fix would be better with an explicit mem_cgroup_disabled() test, or maybe that should be where cgroup_memory_noswap is decided - up to Johannes.
Thanks Hugh. I can see what the problem is now. I was looking at Linus' tree, and we have different code there:
	long nr_swap_pages = get_nr_swap_pages();

	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return nr_swap_pages;
which would be impossible to crash, so I was really wondering what was going on here. But there are other changes in mmotm which I haven't reviewed yet. Looking at the next tree now, it is fallout from "mm: memcontrol: prepare swap controller setup for integration".
The !memcg check is slightly more cryptic than an explicit mem_cgroup_disabled(), but I would just leave that to Johannes as well.
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
--- 5.7-rc6-mm1/mm/memcontrol.c	2020-05-20 12:21:56.109693740 -0700
+++ linux/mm/memcontrol.c	2020-05-20 12:26:15.500478753 -0700
@@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg || cgroup_memory_noswap ||
+	    !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
On Thu 21-05-20 11:55:16, Michal Hocko wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation of this issue, LKFT teammate Anders Roxell bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519, the reproduction steps were retested, and the mkfs -t ext4 test case was confirmed to PASS (the oom-killer invocation is gone now).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop.
I was staring into the code and do not see anything. Could you give the following debugging patch a try and see whether it triggers?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..df2e8df0eb71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * sc->priority further than desirable.
 			 */
 			scan = max(scan, SWAP_CLUSTER_MAX);
+
+			trace_printk("scan:%lu protection:%lu\n", scan, protection);
 		} else {
 			scan = lruvec_size;
 		}
@@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
 		if (mem_cgroup_below_min(memcg)) {
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
@@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
 				continue;
On Thu, 21 May 2020 at 22:04, Michal Hocko mhocko@kernel.org wrote:
On Thu 21-05-20 11:55:16, Michal Hocko wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation of this issue, LKFT teammate Anders Roxell bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519, the reproduction steps were retested, and the mkfs -t ext4 test case was confirmed to PASS (the oom-killer invocation is gone now).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop.
I was staring into the code and did not see anything. Could you give the following debugging patch a try and see whether it triggers?
These code paths were not hit, it seems, but we still see the reported problem. Please find the detailed test log output at [1].
And one more test log with cgroup_disable=memory at [2].
Test log links:
[1] https://pastebin.com/XJU7We1g
[2] https://pastebin.com/BZ0BMUVt
On Thu, May 21, 2020 at 02:44:44PM +0200, Michal Hocko wrote:
On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
On Thu, 21 May 2020, Michal Hocko wrote:
On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
On Thu, 21 May 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
> As part of the investigation of this issue, LKFT teammate Anders Roxell
> bisected the problem and found the bad commit(s) that caused it.
>
> The following two patches were reverted on next-20200519, the reproduction
> steps were retested, and the mkfs -t ext4 test case was confirmed to PASS
> (the oom-killer invocation is gone now).
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
> Revert "mm, memcg: decouple e{low,min} state mutations from protection
> checks"
> This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop. Btw. do you see the problem when booting with cgroup_disable=memory kernel command line parameter?
With the extra kernel command line parameter cgroup_disable=memory, I have noticed a different problem now.
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Could you get faddr2line for this offset?
No need for that, I can help with the "cgroup_disable=memory" crash: I've been happily running with the fixup below, but haven't got around to sending it in yet (and wouldn't normally be reading mail at this time!) because I've been busy chasing a couple of other bugs (not necessarily mm); and maybe the fix would be better with an explicit mem_cgroup_disabled() test, or maybe that should be where cgroup_memory_noswap is decided - up to Johannes.
Thanks Hugh. I can see what the problem is now. I was looking at Linus' tree, and we have different code there:
	long nr_swap_pages = get_nr_swap_pages();

	if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return nr_swap_pages;
which would be impossible to crash, so I was really wondering what was going on here. But there are other changes in mmotm which I haven't reviewed yet. Looking at the next tree now, it is fallout from "mm: memcontrol: prepare swap controller setup for integration".
The !memcg check is slightly more cryptic than an explicit mem_cgroup_disabled(), but I would just leave that to Johannes as well.
Very much appreciate you guys tracking it down so quickly. Sorry about the breakage.
I think mem_cgroup_disabled() checks are pretty good markers of public entry points to the memcg API, so I'd prefer that even if a bit more verbose. What do you think?
---
From cd373ec232942a9bc43ee5e7d2171352019a58fb Mon Sep 17 00:00:00 2001
From: Hugh Dickins hughd@google.com
Date: Thu, 21 May 2020 14:58:36 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration fix
Fix crash with cgroup_disable=memory:
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
do_memsw_account() used to be automatically false when the cgroup controller was disabled. Now that it's replaced by cgroup_memory_noswap, for which this isn't true, make the mem_cgroup_disabled() checks explicit in the swap control API.
[hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org
Debugged-by: Hugh Dickins hughd@google.com
Debugged-by: Michal Hocko mhocko@kernel.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
---
 mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..850bca380562 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6811,6 +6811,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
@@ -6876,6 +6879,10 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
+	if (mem_cgroup_disabled())
+		return 0;
+
+	/* Only cgroup2 has swap.max */
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
@@ -6920,6 +6927,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
+	if (mem_cgroup_disabled())
+		return;
+
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
@@ -6940,12 +6950,25 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return nr_swap_pages;
+	if (mem_cgroup_disabled())
+		goto out;
+
+	/* Swap control disabled */
+	if (cgroup_memory_noswap)
+		goto out;
+
+	/*
+	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+	 * which does not place restrictions specifically on swap.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
+
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
 				      READ_ONCE(memcg->swap.max) -
 				      page_counter_read(&memcg->swap));
+out:
 	return nr_swap_pages;
 }
 
@@ -6957,18 +6980,30 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return false;
+
+	if (mem_cgroup_disabled())
+		goto out;
+
+	/* Swap control disabled */
+	if (cgroup_memory_noswap)
+		goto out;
+
+	/*
+	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+	 * which does not place restrictions specifically on swap.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
 
 	memcg = page->mem_cgroup;
 	if (!memcg)
-		return false;
+		goto out;
 
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		if (page_counter_read(&memcg->swap) * 2 >=
 		    READ_ONCE(memcg->swap.max))
 			return true;
-
+out:
 	return false;
 }
On Thu, 21 May 2020, Johannes Weiner wrote:
Very much appreciate you guys tracking it down so quickly. Sorry about the breakage.
I think mem_cgroup_disabled() checks are pretty good markers of public entry points to the memcg API, so I'd prefer that even if a bit more verbose. What do you think?
An explicit mem_cgroup_disabled() check would be fine, but I must admit, the patch below is rather too verbose for my own taste. Your call.
From cd373ec232942a9bc43ee5e7d2171352019a58fb Mon Sep 17 00:00:00 2001
From: Hugh Dickins hughd@google.com
Date: Thu, 21 May 2020 14:58:36 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration fix
Fix crash with cgroup_disable=memory:
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
do_memsw_account() used to be automatically false when the cgroup controller was disabled. Now that it's replaced by cgroup_memory_noswap, for which this isn't true, make the mem_cgroup_disabled() checks explicit in the swap control API.
[hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org
Debugged-by: Hugh Dickins hughd@google.com
Debugged-by: Michal Hocko mhocko@kernel.org
Signed-off-by: Johannes Weiner hannes@cmpxchg.org
 mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)
I'm certainly not against a mem_cgroup_disabled() check in the only place that's been observed to need it, as a fixup to merge into your original patch; but this seems rather an over-reaction - and I'm a little surprised that setting mem_cgroup_disabled() doesn't just force cgroup_memory_noswap, saving repetitious checks elsewhere (perhaps there's a difficulty in that, I haven't looked).
Historically, I think we've added mem_cgroup_disabled() checks (accessing a cacheline we'd rather avoid) where they're necessary, rather than at every "interface".
And you seem to be in a very "goto out" mood today - we all have our "goto out" days, alternating with our "return 0" days :)
Hugh
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..850bca380562 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6811,6 +6811,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
@@ -6876,6 +6879,10 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
+	if (mem_cgroup_disabled())
+		return 0;
+
+	/* Only cgroup2 has swap.max */
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
@@ -6920,6 +6927,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
+	if (mem_cgroup_disabled())
+		return;
+
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
@@ -6940,12 +6950,25 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return nr_swap_pages;
+	if (mem_cgroup_disabled())
+		goto out;
+
+	/* Swap control disabled */
+	if (cgroup_memory_noswap)
+		goto out;
+
+	/*
+	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+	 * which does not place restrictions specifically on swap.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
+
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
 				      READ_ONCE(memcg->swap.max) -
 				      page_counter_read(&memcg->swap));
+out:
 	return nr_swap_pages;
 }
 
@@ -6957,18 +6980,30 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return false;
+
+	if (mem_cgroup_disabled())
+		goto out;
+
+	/* Swap control disabled */
+	if (cgroup_memory_noswap)
+		goto out;
+
+	/*
+	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+	 * which does not place restrictions specifically on swap.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
 
 	memcg = page->mem_cgroup;
 	if (!memcg)
-		return false;
+		goto out;
 
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		if (page_counter_read(&memcg->swap) * 2 >=
 		    READ_ONCE(memcg->swap.max))
 			return true;
-
+out:
 	return false;
 }
-- 
2.26.2
My apologies! As per the test results history, this problem started happening from:
Bad: next-20200430 (still reproducible on next-20200519)
Good: next-20200429
The git tree/tag used for testing is the linux-next next-20200430 tag; with the following three patches reverted, the oom-killer problem is fixed.
Revert "mm, memcg: avoid stale protection values when cgroup is above protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
Ref tree: https://github.com/roxell/linux/commits/my-next-20200430
Build images: https://builds.tuxbuild.com/whyTLI1O8s5HiILwpLTLtg/
Test log: https://lkft.validation.linaro.org/scheduler/job/1444321#L1164
- Naresh
On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
On Thu, 21 May 2020, Johannes Weiner wrote:
do_memsw_account() used to be automatically false when the cgroup controller was disabled. Now that it's replaced by cgroup_memory_noswap, for which this isn't true, make the mem_cgroup_disabled() checks explicit in the swap control API.
[hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions] Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Debugged-by: Hugh Dickins hughd@google.com Debugged-by: Michal Hocko mhocko@kernel.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org
mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 41 insertions(+), 6 deletions(-)
I'm certainly not against a mem_cgroup_disabled() check in the only place that's been observed to need it, as a fixup to merge into your original patch; but this seems rather an over-reaction - and I'm a little surprised that setting mem_cgroup_disabled() doesn't just force cgroup_memory_noswap, saving repetitious checks elsewhere (perhaps there's a difficulty in that, I haven't looked).
Fair enough, I changed it to set the flag at initialization time if mem_cgroup_disabled(). I was never a fan of the old flags, where it was never clear what was commandline, and what was internal runtime state - do_swap_account? really_do_swap_account? But I think it's straight-forward in this case now.
Historically, I think we've added mem_cgroup_disabled() checks (accessing a cacheline we'd rather avoid) where they're necessary, rather than at every "interface".
To me that always seemed like bugs waiting to happen. Like this one!
It's a jump label nowadays, so I've been liberal with these to avoid subtle bugs.
And you seem to be in a very "goto out" mood today - we all have our "goto out" days, alternating with our "return 0" days :)
:-)
But I agree, best to keep this fixup self-contained and defer anything else to separate cleanup patches.
How about the below? It survives a swaptest with cgroup_disable=memory for me.
Hugh, I started with your patch, which is why I kept you as the author, but as the patch now (and arguably the previous one) is sufficiently different, I dropped that now. I hope that's okay.
---
From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner hannes@cmpxchg.org Date: Thu, 21 May 2020 17:44:25 -0400 Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration fix
Fix crash with cgroup_disable=memory:
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Swap accounting used to be implied-disabled when the cgroup controller was disabled. Restore that for the new cgroup_memory_noswap, so that we bail out of this function instead of dereferencing a NULL memcg.
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Debugged-by: Hugh Dickins hughd@google.com Debugged-by: Michal Hocko mhocko@kernel.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org --- mm/memcontrol.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..e3b785d6e771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
 
 static int __init mem_cgroup_swap_init(void)
 {
-	if (mem_cgroup_disabled() || cgroup_memory_noswap)
+	/* No memory control -> no swap control */
+	if (mem_cgroup_disabled())
+		cgroup_memory_noswap = true;
+
+	if (cgroup_memory_noswap)
 		return 0;
 
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
On Thu, 21 May 2020, Johannes Weiner wrote:
On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
On Thu, 21 May 2020, Johannes Weiner wrote:
do_memsw_account() used to be automatically false when the cgroup controller was disabled. Now that it's replaced by cgroup_memory_noswap, for which this isn't true, make the mem_cgroup_disabled() checks explicit in the swap control API.
[hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions] Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Debugged-by: Hugh Dickins hughd@google.com Debugged-by: Michal Hocko mhocko@kernel.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org
mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 41 insertions(+), 6 deletions(-)
I'm certainly not against a mem_cgroup_disabled() check in the only place that's been observed to need it, as a fixup to merge into your original patch; but this seems rather an over-reaction - and I'm a little surprised that setting mem_cgroup_disabled() doesn't just force cgroup_memory_noswap, saving repetitious checks elsewhere (perhaps there's a difficulty in that, I haven't looked).
Fair enough, I changed it to set the flag at initialization time if mem_cgroup_disabled(). I was never a fan of the old flags, where it was never clear what was commandline, and what was internal runtime state - do_swap_account? really_do_swap_account? But I think it's straight-forward in this case now.
Historically, I think we've added mem_cgroup_disabled() checks (accessing a cacheline we'd rather avoid) where they're necessary, rather than at every "interface".
To me that always seemed like bugs waiting to happen. Like this one!
It's a jump label nowadays, so I've been liberal with these to avoid subtle bugs.
And you seem to be in a very "goto out" mood today - we all have our "goto out" days, alternating with our "return 0" days :)
:-)
But I agree, best to keep this fixup self-contained and defer anything else to separate cleanup patches.
How about the below? It survives a swaptest with cgroup_disable=memory for me.
I like this version *a lot*, thank you. I got worried for a bit by the "#define cgroup_memory_noswap 1" when #ifndef CONFIG_MEMCG_SWAP, but now realize that fits perfectly.
Hugh, I started with your patch, which is why I kept you as the author, but as the patch now (and arguably the previous one) is sufficiently different, I dropped that now. I hope that's okay.
Absolutely okay, these are yours: I was a little uncomfortable to see me on the From line before, but it also seemed just too petty to insist that my name be removed.
(By the way, off-topic for this particular issue, but advance warning that I hope to post a couple of patches to __read_swap_cache_async() before the end of the day, the first being a fixup to some of your mods - I suspect you got it working well enough, and intended to come back to check a few details later, but never quite got around to that.)
From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001 From: Johannes Weiner hannes@cmpxchg.org Date: Thu, 21 May 2020 17:44:25 -0400 Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration fix
Fix crash with cgroup_disable=memory:
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Swap accounting used to be implied-disabled when the cgroup controller was disabled. Restore that for the new cgroup_memory_noswap, so that we bail out of this function instead of dereferencing a NULL memcg.
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Debugged-by: Hugh Dickins hughd@google.com Debugged-by: Michal Hocko mhocko@kernel.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Acked-by: Hugh Dickins hughd@google.com
mm/memcontrol.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..e3b785d6e771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
 
 static int __init mem_cgroup_swap_init(void)
 {
-	if (mem_cgroup_disabled() || cgroup_memory_noswap)
+	/* No memory control -> no swap control */
+	if (mem_cgroup_disabled())
+		cgroup_memory_noswap = true;
+
+	if (cgroup_memory_noswap)
 		return 0;
 
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
-- 
2.26.2
[Sorry for a late reply - was offline for a few days]
On Thu 21-05-20 17:58:55, Johannes Weiner wrote:
On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
[...]
From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner hannes@cmpxchg.org Date: Thu, 21 May 2020 17:44:25 -0400 Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration fix
Fix crash with cgroup_disable=memory:
- mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 35.502102] BUG: kernel NULL pointer dereference, address: 000000c8 [ 35.508372] #PF: supervisor read access in kernel mode [ 35.513506] #PF: error_code(0x0000) - not-present page [ 35.518638] *pde = 00000000 [ 35.521514] Oops: 0000 [#1] SMP [ 35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted 5.7.0-rc6-next-20200519+ #1 [ 35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.2 05/23/2018 [ 35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
Swap accounting used to be implied-disabled when the cgroup controller was disabled. Restore that for the new cgroup_memory_noswap, so that we bail out of this function instead of dereferencing a NULL memcg.
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org Debugged-by: Hugh Dickins hughd@google.com Debugged-by: Michal Hocko mhocko@kernel.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org
Yes this looks better. I hope to get to your series soon to have the full picture finally.
mm/memcontrol.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..e3b785d6e771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
 
 static int __init mem_cgroup_swap_init(void)
 {
-	if (mem_cgroup_disabled() || cgroup_memory_noswap)
+	/* No memory control -> no swap control */
+	if (mem_cgroup_disabled())
+		cgroup_memory_noswap = true;
+
+	if (cgroup_memory_noswap)
 		return 0;
 
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
-- 
2.26.2
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
My apologies! As per the test results history, this problem started happening from:
Bad: next-20200430 (still reproducible on next-20200519)
Good: next-20200429

The tree used for testing is the linux-next next-20200430 tag; reverting the following three patches fixed the oom-killer problem.

Revert "mm, memcg: avoid stale protection values when cgroup is above protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost, TBH. In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww... you said that none of the added tracing output has triggered. Does this still hold? Because I still have a hard time understanding how those three patches could have the observed effects.
On Thu, 28 May 2020 at 20:33, Michal Hocko mhocko@kernel.org wrote:
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
My apologies! As per the test results history, this problem started happening from:
Bad: next-20200430 (still reproducible on next-20200519)
Good: next-20200429

The tree used for testing is the linux-next next-20200430 tag; reverting the following three patches fixed the oom-killer problem.

Revert "mm, memcg: avoid stale protection values when cgroup is above protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost, TBH. In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww... you said that none of the added tracing output has triggered. Does this still hold? Because I still have a hard time understanding how those three patches could have the observed effects.
This issue was concluded on the other email thread [1].
Yafang wrote on May 22 2020,
Regarding the root cause, my guess is that it makes a mistake similar to the one I tried to fix in the previous patch: the direct reclaimer reads a stale protection value. But I don't think it is worth adding another fix. The best way is to revert this commit.
[1] [PATCH v3 2/2] mm, memcg: Decouple e{low,min} state mutations from protection checks https://lore.kernel.org/linux-mm/CALOAHbArZ3NsuR3mCnx_kbSF8ktpjhUF2kaaTa7Mb7...
- Naresh
-- Michal Hocko SUSE Labs
Naresh Kamboju writes:
On Thu, 28 May 2020 at 20:33, Michal Hocko mhocko@kernel.org wrote:
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
My apologies! As per the test results history, this problem started happening from:
Bad: next-20200430 (still reproducible on next-20200519)
Good: next-20200429

The tree used for testing is the linux-next next-20200430 tag; reverting the following three patches fixed the oom-killer problem.

Revert "mm, memcg: avoid stale protection values when cgroup is above protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost, TBH. In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww... you said that none of the added tracing output has triggered. Does this still hold? Because I still have a hard time understanding how those three patches could have the observed effects.
This issue was concluded on the other email thread [1].
Yafang wrote on May 22 2020,
Regarding the root cause, my guess is that it makes a mistake similar to the one I tried to fix in the previous patch: the direct reclaimer reads a stale protection value. But I don't think it is worth adding another fix. The best way is to revert this commit.
This isn't a conclusion, just a guess (and one I think is unlikely). For this to happen reliably, it would imply that the same race happens the same way each time.
On Fri, May 29, 2020 at 12:41 AM Chris Down chris@chrisdown.name wrote:
Naresh Kamboju writes:
On Thu, 28 May 2020 at 20:33, Michal Hocko mhocko@kernel.org wrote:
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
My apologies! As per the test results history, this problem started happening from:
Bad: next-20200430 (still reproducible on next-20200519)
Good: next-20200429

The tree used for testing is the linux-next next-20200430 tag; reverting the following three patches fixed the oom-killer problem.

Revert "mm, memcg: avoid stale protection values when cgroup is above protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost, TBH. In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww... you said that none of the added tracing output has triggered. Does this still hold? Because I still have a hard time understanding how those three patches could have the observed effects.
This issue was concluded on the other email thread [1].
Yafang wrote on May 22 2020,
Regarding the root cause, my guess is that it makes a mistake similar to the one I tried to fix in the previous patch: the direct reclaimer reads a stale protection value. But I don't think it is worth adding another fix. The best way is to revert this commit.
This isn't a conclusion, just a guess (and one I think is unlikely). For this to happen reliably, it would imply that the same race happens the same way each time.
Hi Chris,
Look at this patch [1] carefully and you will find that it introduces the same issue that I tried to fix in another patch [2]. Even sadder, these two patches are in the same patchset. Although this issue isn't related to the issue found by Naresh, we have to ask ourselves why we keep making the same mistake. One possible answer is that we forget the lifecycle of memory.emin before we read it. memory.emin does not share the memcg's lifecycle; it really shares the reclaimer's. IOW, once a reclaimer begins, the protection value should be set to 0; as we traverse the memcg tree we calculate a protection value for this reclaimer; finally, it disappears when the reclaimer stops. That is why I strongly suggested adding a new protection member to scan_control before.
[1]. https://lore.kernel.org/linux-mm/20200505084127.12923-3-laoar.shao@gmail.com... [2]. https://lore.kernel.org/linux-mm/20200505084127.12923-2-laoar.shao@gmail.com...
Yafang Shao writes:
Look at this patch [1] carefully and you will find that it introduces the same issue that I tried to fix in another patch [2]. Even sadder, these two patches are in the same patchset. Although this issue isn't related to the issue found by Naresh, we have to ask ourselves why we keep making the same mistake. One possible answer is that we forget the lifecycle of memory.emin before we read it. memory.emin does not share the memcg's lifecycle; it really shares the reclaimer's. IOW, once a reclaimer begins, the protection value should be set to 0; as we traverse the memcg tree we calculate a protection value for this reclaimer; finally, it disappears when the reclaimer stops. That is why I strongly suggested adding a new protection member to scan_control before.
I agree with you that the e{min,low} lifecycle is confusing for everyone -- the only thing I've not seen is any confirmed correlation with the i386 oom killer issue. If you've validated that, I'd like to see the data :-)
On Fri 29-05-20 02:56:44, Chris Down wrote:
Yafang Shao writes:
Look at this patch [1] carefully and you will find that it introduces the same issue that I tried to fix in another patch [2]. Even sadder, these two patches are in the same patchset. Although this issue isn't related to the issue found by Naresh, we have to ask ourselves why we keep making the same mistake. One possible answer is that we forget the lifecycle of memory.emin before we read it. memory.emin does not share the memcg's lifecycle; it really shares the reclaimer's. IOW, once a reclaimer begins, the protection value should be set to 0; as we traverse the memcg tree we calculate a protection value for this reclaimer; finally, it disappears when the reclaimer stops. That is why I strongly suggested adding a new protection member to scan_control before.
I agree with you that the e{min,low} lifecycle is confusing for everyone -- the only thing I've not seen is any confirmed correlation with the i386 oom killer issue. If you've validated that, I'd like to see the data :-)
Agreed. Even if e{low,min} might still have some rough edges, I am completely puzzled how we could end up OOMing if none of the protection paths triggers, which the additional debugging should confirm. Maybe my debugging patch is incomplete or used incorrectly (maybe it would be easier to use printk rather than trace_printk?).
On Fri 29-05-20 11:49:20, Michal Hocko wrote:
On Fri 29-05-20 02:56:44, Chris Down wrote:
Yafang Shao writes:
Look at this patch [1] carefully and you will find that it introduces the same issue that I tried to fix in another patch [2]. Even sadder, these two patches are in the same patchset. Although this issue isn't related to the issue found by Naresh, we have to ask ourselves why we keep making the same mistake. One possible answer is that we forget the lifecycle of memory.emin before we read it. memory.emin does not share the memcg's lifecycle; it really shares the reclaimer's. IOW, once a reclaimer begins, the protection value should be set to 0; as we traverse the memcg tree we calculate a protection value for this reclaimer; finally, it disappears when the reclaimer stops. That is why I strongly suggested adding a new protection member to scan_control before.
I agree with you that the e{min,low} lifecycle is confusing for everyone -- the only thing I've not seen is any confirmed correlation with the i386 oom killer issue. If you've validated that, I'd like to see the data :-)
Agreed. Even if e{low,min} might still have some rough edges, I am completely puzzled how we could end up OOMing if none of the protection paths triggers, which the additional debugging should confirm. Maybe my debugging patch is incomplete or used incorrectly (maybe it would be easier to use printk rather than trace_printk?).
It would be really great if we could move forward. While the fix (which has been dropped from mmotm) is not super urgent, I would really like to understand how it could hit the observed behavior. Can we double-check that the debugging patch really doesn't trigger (e.g. s@trace_printk@printk@ in the first step)? I have checked it again but do not see any potential code path which would be affected by the patch yet not trigger any output. But another pair of eyes would be really great.
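For the record, the suggested s@trace_printk@printk@ is a plain sed substitution with @ as the delimiter; here is a runnable sketch on a scratch file (the snippet content is made up for illustration -- on the real tree the same sed would be run over mm/vmscan.c):

```shell
# Create a scratch file standing in for the patched mm/vmscan.c hunk.
cat > /tmp/vmscan_snippet.c <<'EOF'
trace_printk("scan:%lu protection:%lu\n", scan, protection);
trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
EOF

# The suggested substitution, using @ as the sed delimiter so the
# pattern needs no escaping.
sed -i 's@trace_printk@printk@g' /tmp/vmscan_snippet.c

cat /tmp/vmscan_snippet.c
```

After this, the debug output would land in the kernel log (dmesg) instead of the ftrace ring buffer.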
On Thu, 11 Jun 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Fri 29-05-20 11:49:20, Michal Hocko wrote:
On Fri 29-05-20 02:56:44, Chris Down wrote:
Yafang Shao writes:
Agreed. Even if e{low,min} might still have some rough edges, I am completely puzzled how we could end up OOMing if none of the protection paths triggers, which the additional debugging should confirm. Maybe my debugging patch is incomplete or used incorrectly (maybe it would be easier to use printk rather than trace_printk?).
It would be really great if we could move forward. While the fix (which has been dropped from mmotm) is not super urgent, I would really like to understand how it could hit the observed behavior. Can we double-check that the debugging patch really doesn't trigger (e.g. s@trace_printk@printk@ in the first step)?
Please suggest a way to get more debug information, e.g. by providing kernel debug patches and extra kernel configs.
I have applied your debug patch and tested on top of linux-next 20200612, but did not find any printk output while running the mkfs -t ext4 /drive test case.
I have checked it again but do not see any potential code path which would be affected by the patch yet not trigger any output. But another pair of eyes would be really great.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..d13ce7b02de4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2375,6 +2375,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * sc->priority further than desirable.
 			 */
 			scan = max(scan, SWAP_CLUSTER_MAX);
+
+			trace_printk("scan:%lu protection:%lu\n", scan, protection);
 		} else {
 			scan = lruvec_size;
 		}
@@ -2618,6 +2620,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
@@ -2630,6 +2633,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
 				continue;
On Fri 12-06-20 15:13:22, Naresh Kamboju wrote:
On Thu, 11 Jun 2020 at 15:25, Michal Hocko mhocko@kernel.org wrote:
On Fri 29-05-20 11:49:20, Michal Hocko wrote:
On Fri 29-05-20 02:56:44, Chris Down wrote:
Yafang Shao writes:
Agreed. Even if e{low,min} might still have some rough edges, I am completely puzzled how we could end up OOMing if none of the protection paths triggers, which the additional debugging should confirm. Maybe my debugging patch is incomplete or used incorrectly (maybe it would be easier to use printk rather than trace_printk?).
It would be really great if we could move forward. While the fix (which has been dropped from mmotm) is not super urgent I would really like to understand how it could hit the observed behavior. Can we double check that the debugging patch really doesn't trigger (e.g. s@trace_printk@printk in the first step)?
Please suggest to me the way to get more debug information by providing kernel debug patches and extra kernel configs.
I have applied your debug patch and tested on top of linux-next 20200612, but did not find any printk output while running the mkfs -t ext4 /drive test case.
Have you tried s@trace_printk@printk@ in the patch? AFAIK trace_printk doesn't dump anything into the printk ring buffer. You would have to look into the trace ring buffer.
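For completeness, trace_printk() output goes to the ftrace ring buffer rather than the kernel log; a sketch for reading it (the tracefs mount point varies by distro, so both common paths are tried, and the fallback message is only illustrative):

```shell
# trace_printk() lines show up in the ftrace ring buffer exposed via
# the "trace" file under tracefs, not in dmesg.
found=0
for d in /sys/kernel/tracing /sys/kernel/debug/tracing; do
    if [ -r "$d/trace" ]; then
        found=1
        echo "reading $d/trace"
        tail -n 50 "$d/trace"
        break
    fi
done
[ "$found" -eq 1 ] || echo "tracefs not mounted or not readable"
```

Switching the debug patch to printk, as suggested above, sidesteps this entirely because printk output lands in the regular kernel log.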
On Thu, 21 May 2020 at 22:04, Michal Hocko mhocko@kernel.org wrote:
On Thu 21-05-20 11:55:16, Michal Hocko wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation of this issue, LKFT teammate Anders Roxell git-bisected the problem and found the bad commit(s) that caused it.
The following two patches were reverted on next-20200519; we retested the reproduction steps and confirmed that the mkfs -t ext4 test case now passes (the invoked oom-killer is gone).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop.
I was staring into the code and do not see anything. Could you give the following debugging patch a try and see whether it triggers?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..df2e8df0eb71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * sc->priority further than desirable.
 			 */
 			scan = max(scan, SWAP_CLUSTER_MAX);
+
+			trace_printk("scan:%lu protection:%lu\n", scan, protection);
 		} else {
 			scan = lruvec_size;
 		}
@@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
 		if (mem_cgroup_below_min(memcg)) {
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
@@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
 				continue;
As per your suggestion on debugging this problem, trace_printk was replaced with printk in your patch and applied on top of the problematic kernel; here are the test output and log link.
mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0 [ 51.845304] under min:0 emin:0 [ 51.848738] under min:0 emin:0 [ 51.858147] under min:0 emin:0 [ 51.861333] under min:0 emin:0 [ 51.862034] under min:0 emin:0 [ 51.862442] under min:0 emin:0 [ 51.862763] under min:0 emin:0
Full test log link, https://lkft.validation.linaro.org/scheduler/job/1497412#L1451
- Naresh
Naresh Kamboju writes:
mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF mke2fs 1.43.8 (1-Jan-2018) Creating filesystem with 244190646 4k blocks and 61054976 inodes Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891 Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000, 214990848 Allocating group tables: 0/7453 done Writing inode tables: 0/7453 done Creating journal (262144 blocks): [ 51.544525] under min:0 emin:0 [ 51.845304] under min:0 emin:0 [ 51.848738] under min:0 emin:0 [ 51.858147] under min:0 emin:0 [ 51.861333] under min:0 emin:0 [ 51.862034] under min:0 emin:0 [ 51.862442] under min:0 emin:0 [ 51.862763] under min:0 emin:0
Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when min/emin is 0 (which should indeed be the case if you haven't set them in the hierarchy).
My guess is that page_counter_read(&memcg->memory) is 0, which means mem_cgroup_below_min will return 1.
However, I don't know for sure why that should then result in the OOM killer coming along. My guess is that since this memcg has 0 pages to scan anyway, we enter premature OOM under some conditions. I don't know why we wouldn't have hit that with the old version of mem_cgroup_protected that returned MEMCG_PROT_* members, though.
Can you please try the patch with the `>=` checks in mem_cgroup_below_min and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong hint about what's going on here.
Thanks for your help!
On Wed 17-06-20 19:07:20, Naresh Kamboju wrote:
On Thu, 21 May 2020 at 22:04, Michal Hocko mhocko@kernel.org wrote:
On Thu 21-05-20 11:55:16, Michal Hocko wrote:
On Wed 20-05-20 20:09:06, Chris Down wrote:
Hi Naresh,
Naresh Kamboju writes:
As part of the investigation into this issue, my LKFT teammate Anders Roxell bisected the problem and found the bad commits that caused it.
The following two patches were reverted on next-20200519; after retesting the reproduction steps, the mkfs -t ext4 test case passed (the oom-killer invocation is gone now).
Revert "mm, memcg: avoid stale protection values when cgroup is above protection" This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks" This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
Thanks Anders and Naresh for tracking this down and reverting.
I'll take a look tomorrow. I don't see anything immediately obviously wrong in either of those commits from a (very) cursory glance, but they should only be taking effect if protections are set.
Agreed. If memory.{low,min} is not used then the patch should be effectively a nop.
I was staring into the code and do not see anything. Could you give the following debugging patch a try and see whether it triggers?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..df2e8df0eb71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * sc->priority further than desirable.
 			 */
 			scan = max(scan, SWAP_CLUSTER_MAX);
+
+			trace_printk("scan:%lu protection:%lu\n", scan, protection);
 		} else {
 			scan = lruvec_size;
 		}
@@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
 		if (mem_cgroup_below_min(memcg)) {
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
@@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
 				continue;
As per your suggestions for debugging this problem, trace_printk was replaced with printk in your patch, which was then applied on top of the problematic kernel. Here are the test output and link.
mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848

Allocating group tables: 0/7453 done
Writing inode tables: 0/7453 done
Creating journal (262144 blocks):
[ 51.544525] under min:0 emin:0
[ 51.845304] under min:0 emin:0
[ 51.848738] under min:0 emin:0
[ 51.858147] under min:0 emin:0
[ 51.861333] under min:0 emin:0
[ 51.862034] under min:0 emin:0
[ 51.862442] under min:0 emin:0
[ 51.862763] under min:0 emin:0
Full test log link: https://lkft.validation.linaro.org/scheduler/job/1497412#L1451
Thanks a lot. So it is clear that mem_cgroup_below_min got confused and reported the cgroup as protected. Both the effective and the real limits are 0, so there is no garbage in them. The problem is in mem_cgroup_below_* and it is quite obvious.
We are doing the following:

+static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	return READ_ONCE(memcg->memory.emin) >=
+		page_counter_read(&memcg->memory);
+}
and it makes some sense, except for the root memcg, where we do not account any memory. Adding

	if (mem_cgroup_is_root(memcg))
		return false;

should do the trick. The same is the case for mem_cgroup_below_low. Could you give it a try, please, just to confirm?
Michal Hocko writes:
and it makes some sense. Except for the root memcg where we do not account any memory. Adding if (mem_cgroup_is_root(memcg)) return false; should do the trick. The same is the case for mem_cgroup_below_low. Could you give it a try please just to confirm?
Oh, of course :-) This seems more likely than what I proposed, and would be great to test.
[Our emails have crossed]
On Wed 17-06-20 14:57:58, Chris Down wrote:
Naresh Kamboju writes:
[... quoted mkfs log snipped; identical to the output above ...]
Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when min/emin is 0 (which should indeed be the case if you haven't set them in the hierarchy).
My guess is that page_counter_read(&memcg->memory) is 0, which means mem_cgroup_below_min will return 1.
Yes this is the case because this is likely the root memcg which skips all charges.
However, I don't know for sure why that should then result in the OOM killer coming along. My guess is that since this memcg has 0 pages to scan anyway, we enter premature OOM under some conditions. I don't know why we wouldn't have hit that with the old version of mem_cgroup_protected that returned MEMCG_PROT_* members, though.
Not really. There is likely no other memcg to reclaim from and assuming min limit protection will result in no reclaimable memory and thus the OOM killer.
Can you please try the patch with the `>=` checks in mem_cgroup_below_min and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong hint about what's going on here.
This would work but I believe an explicit check for the root memcg would be easier to spot the reasoning.
On Wed, 17 Jun 2020 at 19:41, Michal Hocko mhocko@kernel.org wrote:
[Our emails have crossed]
On Wed 17-06-20 14:57:58, Chris Down wrote:
Naresh Kamboju writes:
[... quoted mkfs log snipped; identical to the output above ...]
Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when min/emin is 0 (which should indeed be the case if you haven't set them in the hierarchy).
My guess is that page_counter_read(&memcg->memory) is 0, which means mem_cgroup_below_min will return 1.
Yes this is the case because this is likely the root memcg which skips all charges.
However, I don't know for sure why that should then result in the OOM killer coming along. My guess is that since this memcg has 0 pages to scan anyway, we enter premature OOM under some conditions. I don't know why we wouldn't have hit that with the old version of mem_cgroup_protected that returned MEMCG_PROT_* members, though.
Not really. There is likely no other memcg to reclaim from and assuming min limit protection will result in no reclaimable memory and thus the OOM killer.
Can you please try the patch with the `>=` checks in mem_cgroup_below_min and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong hint about what's going on here.
This would work but I believe an explicit check for the root memcg would be easier to spot the reasoning.
May I request that you send debugging or proposed fix patches here? I am happy to do more testing.
FYI, Here is my repository for testing. git: https://github.com/nareshkamboju/linux/tree/printk branch: printk
- Naresh
On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
On Wed, 17 Jun 2020 at 19:41, Michal Hocko mhocko@kernel.org wrote:
[Our emails have crossed]
On Wed 17-06-20 14:57:58, Chris Down wrote:
Naresh Kamboju writes:
[... quoted mkfs log snipped; identical to the output above ...]
Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when min/emin is 0 (which should indeed be the case if you haven't set them in the hierarchy).
My guess is that page_counter_read(&memcg->memory) is 0, which means mem_cgroup_below_min will return 1.
Yes this is the case because this is likely the root memcg which skips all charges.
However, I don't know for sure why that should then result in the OOM killer coming along. My guess is that since this memcg has 0 pages to scan anyway, we enter premature OOM under some conditions. I don't know why we wouldn't have hit that with the old version of mem_cgroup_protected that returned MEMCG_PROT_* members, though.
Not really. There is likely no other memcg to reclaim from and assuming min limit protection will result in no reclaimable memory and thus the OOM killer.
Can you please try the patch with the `>=` checks in mem_cgroup_below_min and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong hint about what's going on here.
This would work but I believe an explicit check for the root memcg would be easier to spot the reasoning.
May I request you to send debugging or proposed fix patches here. I am happy to do more testing.
Sure, here is the diff to test.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c74a8f2323f1..6b5a31672fbe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return false;
 
+	/*
+	 * Root memcg doesn't account charges and doesn't support
+	 * protection
+	 */
+	if (mem_cgroup_is_root(memcg))
+		return false;
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
@@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return false;
 
+	/*
+	 * Root memcg doesn't account charges and doesn't support
+	 * protection
+	 */
+	if (mem_cgroup_is_root(memcg))
+		return false;
+
 	return READ_ONCE(memcg->memory.emin) >=
 		page_counter_read(&memcg->memory);
 }
On Wed, 17 Jun 2020 at 21:36, Michal Hocko mhocko@kernel.org wrote:
On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
On Wed, 17 Jun 2020 at 19:41, Michal Hocko mhocko@kernel.org wrote:
[Our emails have crossed]
On Wed 17-06-20 14:57:58, Chris Down wrote:
Naresh Kamboju writes:
[... quoted mkfs log snipped; identical to the output above ...]
Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when min/emin is 0 (which should indeed be the case if you haven't set them in the hierarchy).
My guess is that page_counter_read(&memcg->memory) is 0, which means mem_cgroup_below_min will return 1.
Yes this is the case because this is likely the root memcg which skips all charges.
However, I don't know for sure why that should then result in the OOM killer coming along. My guess is that since this memcg has 0 pages to scan anyway, we enter premature OOM under some conditions. I don't know why we wouldn't have hit that with the old version of mem_cgroup_protected that returned MEMCG_PROT_* members, though.
Not really. There is likely no other memcg to reclaim from and assuming min limit protection will result in no reclaimable memory and thus the OOM killer.
Can you please try the patch with the `>=` checks in mem_cgroup_below_min and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong hint about what's going on here.
This would work but I believe an explicit check for the root memcg would be easier to spot the reasoning.
May I request you to send debugging or proposed fix patches here. I am happy to do more testing.
Sure, here is the diff to test.
[... quoted patch snipped; identical to the memcontrol.h diff above ...]
After applying this patch, the reported issue is fixed.
Test log link: https://lkft.validation.linaro.org/scheduler/job/1505417#L1429
- Naresh
Naresh Kamboju writes:
After this patch applied the reported issue got fixed.
Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
I'll send out a new version tomorrow with the fixes applied and both of you credited in the changelog for the detection and fix.
On Thu, Jun 18, 2020 at 5:09 AM Chris Down chris@chrisdown.name wrote:
Naresh Kamboju writes:
After this patch applied the reported issue got fixed.
Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
I'll send out a new version tomorrow with the fixes applied and both of you credited in the changelog for the detection and fix.
As we have already found, the usage of memory.{emin, elow} has many limitations. I think memory.{emin, elow} should be used internally by the memcg tree only; that is, they should only be used to calculate the protection of a memcg within a specified memcg tree, and should not be exposed to other MM parts.
Yafang Shao writes:
On Thu, Jun 18, 2020 at 5:09 AM Chris Down chris@chrisdown.name wrote:
Naresh Kamboju writes:
After this patch applied the reported issue got fixed.
Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
I'll send out a new version tomorrow with the fixes applied and both of you credited in the changelog for the detection and fix.
As we have already found that the usage around memory.{emin, elow} has many limitations, I think memory.{emin, elow} should be used for memcg-tree internally only, that means they can only be used to calculate the protection of a memcg in a specified memcg-tree but should not be exposed to other MM parts.
I agree that the current semantics are mentally taxing and we should generally avoid exposing the implementation details outside of memcg where possible. Do you have a suggested rework? :-)
On Thu 18-06-20 13:37:43, Chris Down wrote:
Yafang Shao writes:
On Thu, Jun 18, 2020 at 5:09 AM Chris Down chris@chrisdown.name wrote:
Naresh Kamboju writes:
After this patch applied the reported issue got fixed.
Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
I'll send out a new version tomorrow with the fixes applied and both of you credited in the changelog for the detection and fix.
As we have already found that the usage around memory.{emin, elow} has many limitations, I think memory.{emin, elow} should be used for memcg-tree internally only, that means they can only be used to calculate the protection of a memcg in a specified memcg-tree but should not be exposed to other MM parts.
I agree that the current semantics are mentally taxing and we should generally avoid exposing the implementation details outside of memcg where possible. Do you have a suggested rework? :-)
I would really prefer to do that work on top of the fixes we (used to) have in mmotm (with the fixup).
On Thu, Jun 18, 2020 at 8:37 PM Chris Down chris@chrisdown.name wrote:
Yafang Shao writes:
On Thu, Jun 18, 2020 at 5:09 AM Chris Down chris@chrisdown.name wrote:
Naresh Kamboju writes:
After this patch applied the reported issue got fixed.
Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
I'll send out a new version tomorrow with the fixes applied and both of you credited in the changelog for the detection and fix.
As we have already found that the usage around memory.{emin, elow} has many limitations, I think memory.{emin, elow} should be used for memcg-tree internally only, that means they can only be used to calculate the protection of a memcg in a specified memcg-tree but should not be exposed to other MM parts.
I agree that the current semantics are mentally taxing and we should generally avoid exposing the implementation details outside of memcg where possible. Do you have a suggested rework? :-)
My suggestion is to keep mem_cgroup_protected() as-is. In any case, I think it is bad to scatter memory.{emin, elow} here and there. If we don't have any better idea by now, putting all references to memory.{emin, elow} into one wrapper (mem_cgroup_protected()) is the reasonable solution.