On Fri, 10 Jul 2020 at 21:28, Yafang Shao <laoar.shao@gmail.com> wrote:
Recently we found an issue in our production environment: when a memcg OOM is triggered, the OOM killer doesn't choose the process with the largest resident memory but rather the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the OOM killer should choose the process with the largest resident memory.
Below is part of the oom info, which is enough to analyze this issue.

[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We can see that the first scanned process, 5740 (pause), was killed even though its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore negative points and convert all of them to 1. Since the oom_score_adj of all the processes in this targeted memcg has the same value (-998), the points of these processes are all negative, and as a result the first scanned process is killed.
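To make the failure mode concrete, here is a minimal user-space sketch of the pre-patch behaviour. The badness_v1() helper, the rss numbers, and the simplified formula rss + adj * totalpages / 1000 (ignoring the pagetable and swap terms) are illustrative assumptions, not the actual kernel code:

#include <stdio.h>

/* Simplified, hypothetical model of the old oom_badness() arithmetic:
 * any non-positive score is clamped to 1, erasing the ordering between
 * tasks whose oom_score_adj drives all of their scores negative. */
static unsigned long badness_v1(long rss_pages, long oom_score_adj,
				long totalpages)
{
	long points = rss_pages + oom_score_adj * totalpages / 1000;

	if (points <= 0)
		return 1;	/* every negative score collapses to 1 */
	return points;
}

int main(void)
{
	long totalpages = 16777216 / 4;	/* 16 GiB memcg limit, 4 KiB pages */

	/* The pause container holds one rss page; a busy workload in the
	 * same memcg holds 3 GiB. Both run with oom_score_adj = -998. */
	printf("pause:    %lu\n", badness_v1(1, -998, totalpages));
	printf("workload: %lu\n", badness_v1(786432, -998, totalpages));
	return 0;
}

Both tasks print a badness of 1, so the scan cannot tell them apart and the first task it visits wins.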
The oom_score_adj (-998) in this memcg is set by kubelet because it is a Guaranteed pod, which has a higher priority and should be protected from being killed by a system OOM.
To fix this issue, we should make the calculation of oom points more accurate. We can achieve that by converting chosen_points from 'unsigned long' to 'long'.
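Here is the same toy model after the conversion; badness_signed(), the rss values, and the selection loop are again assumptions that only mirror the described logic (signed points, with LONG_MIN as the per-scan starting value and the sentinel for tasks that must not be picked):

#include <limits.h>
#include <stdio.h>

/* The score stays signed instead of being clamped to 1, so even
 * all-negative scores still order the tasks by rss. */
static long badness_signed(long rss_pages, long oom_score_adj,
			   long totalpages)
{
	return rss_pages + oom_score_adj * totalpages / 1000;
}

int main(void)
{
	long totalpages = 16777216 / 4;		/* 16 GiB limit, 4 KiB pages */
	long rss[] = { 1, 786432, 262144 };	/* pause, 3 GiB, 1 GiB tasks */
	long chosen_points = LONG_MIN;		/* initialized for each scan */
	int chosen = -1;

	for (int i = 0; i < 3; i++) {
		long points = badness_signed(rss[i], -998, totalpages);

		if (points > chosen_points) {
			chosen_points = points;
			chosen = i;
		}
	}
	/* Picks task 1, the 3 GiB workload, rather than the first task. */
	printf("victim: task %d, rss %ld pages\n", chosen, rss[chosen]);
	return 0;
}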
[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: added the comment in proc_oom_score()]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
v2 -> v3:
- fix the type of variable 'point' in oom_evaluate_task()
- initialize oom_control->chosen_points in select_bad_process() per Michal
- update the comment in proc_oom_score() per Michal (see the sketch below)
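As for the proc_oom_score() comment: /proc/<pid>/oom_score reports an unsigned value, while the badness is now a signed long that can be negative or LONG_MIN, so the value has to be mapped before it is shown. The clamp below is an assumed, simplified illustration of that concern, not the kernel's exact scaling:

#include <limits.h>

/* Hypothetical mapping of a signed badness to the unsigned value
 * exposed via /proc/<pid>/oom_score; illustration only. */
static unsigned long oom_score_for_proc(long badness)
{
	if (badness == LONG_MIN)	/* unkillable or skipped task */
		return 0;
	if (badness < 0)		/* negative scores floor at 0 */
		return 0;
	return badness;
}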
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
I noticed a kernel panic with the v2 patch while running the LTP mm test suite.
[ 63.451494] Out of memory and no killable processes...
[ 63.456633] Kernel panic - not syncing: System is deadlocked on memory
I then removed the v2 patch, applied the v3 patch below, and re-tested. No regression was noticed with the v3 patch while running LTP mm on x86_64 and arm.
OTOH, the oom01 test case was started with 100 iterations, but runltp itself got killed after the 6th iteration [3]. I think this is expected.
test steps:
- cd /opt/ltp
- ./runltp -s oom01 -I 100 || true
[ 209.052842] Out of memory: Killed process 519 (runltp) total-vm:10244kB, anon-rss:904kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0
[ 209.066782] oom_reaper: reaped process 519 (runltp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
/lava-1558245/0/tests/0_prep-tmp-disk/run.sh: line 21: 519 Killed ./runltp -s oom01 -I 100
 fs/proc/base.c      | 11 ++++++++++-
 include/linux/oom.h |  4 ++--
 mm/oom_kill.c       | 22 ++++++++++------------
 3 files changed, 22 insertions(+), 15 deletions(-)
Reference test jobs:
[1] https://lkft.validation.linaro.org/scheduler/job/1558246#L9189
[2] https://lkft.validation.linaro.org/scheduler/job/1558247#L17213
[3] https://lkft.validation.linaro.org/scheduler/job/1558245#L1407
On Mon, Jul 13, 2020 at 2:34 AM Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
[...]
Thanks for the test, Naresh.