On Fri, 10 Jul 2020 at 21:28, Yafang Shao <laoar.shao@gmail.com> wrote:
Recently we found an issue in our production environment: when a memcg OOM is triggered, the OOM killer doesn't choose the process with the largest resident memory but rather the first scanned process. Note that all processes in this memcg have the same oom_score_adj, so the OOM killer should choose the process with the largest resident memory.
Below is part of the oom info, which is enough to analyze this issue.

[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We can see that the first scanned process, 5740 (pause), was killed even though its rss is only one page. That is because, when we calculate the oom badness in oom_badness(), we always ignore negative points and convert all of them to 1. Since the oom_score_adj of all the processes in this targeted memcg has the same value (-998), the points of these processes are all negative, and as a result the first scanned process is killed.
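To make the failure mode concrete, here is a minimal user-space sketch of the pre-patch behaviour. The badness_v1() helper, the rss numbers, and the simplified formula rss + adj * totalpages / 1000 (ignoring the pagetable and swap terms) are illustrative assumptions, not the actual kernel code:

#include <stdio.h>

/* Simplified, hypothetical model of the old oom_badness() arithmetic:
 * any non-positive score is clamped to 1, erasing the ordering between
 * tasks whose oom_score_adj drives all of their scores negative. */
static unsigned long badness_v1(long rss_pages, long oom_score_adj,
				long totalpages)
{
	long points = rss_pages + oom_score_adj * totalpages / 1000;

	if (points <= 0)
		return 1;	/* every negative score collapses to 1 */
	return points;
}

int main(void)
{
	long totalpages = 16777216 / 4;	/* 16 GiB memcg limit, 4 KiB pages */

	/* The pause container holds one rss page; a busy workload in the
	 * same memcg holds 3 GiB. Both run with oom_score_adj = -998. */
	printf("pause:    %lu\n", badness_v1(1, -998, totalpages));
	printf("workload: %lu\n", badness_v1(786432, -998, totalpages));
	return 0;
}

Both tasks print a badness of 1, so the scan cannot tell them apart and the first task it visits wins.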
The oom_score_adj (-998) in this memcg is set by kubelet because it is a Guaranteed pod, which has a higher priority and should be protected from being killed by a system OOM.
To fix this issue, we should make the calculation of oom points more accurate. We can achieve that by converting chosen_points from 'unsigned long' to 'long'.
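Here is the same toy model after the conversion; badness_signed(), the rss values, and the selection loop are again assumptions that only mirror the described logic (signed points, with LONG_MIN as the per-scan starting value and the sentinel for tasks that must not be picked):

#include <limits.h>
#include <stdio.h>

/* The score stays signed instead of being clamped to 1, so even
 * all-negative scores still order the tasks by rss. */
static long badness_signed(long rss_pages, long oom_score_adj,
			   long totalpages)
{
	return rss_pages + oom_score_adj * totalpages / 1000;
}

int main(void)
{
	long totalpages = 16777216 / 4;		/* 16 GiB limit, 4 KiB pages */
	long rss[] = { 1, 786432, 262144 };	/* pause, 3 GiB, 1 GiB tasks */
	long chosen_points = LONG_MIN;		/* initialized for each scan */
	int chosen = -1;

	for (int i = 0; i < 3; i++) {
		long points = badness_signed(rss[i], -998, totalpages);

		if (points > chosen_points) {
			chosen_points = points;
			chosen = i;
		}
	}
	/* Picks task 1, the 3 GiB workload, rather than the first task. */
	printf("victim: task %d, rss %ld pages\n", chosen, rss[chosen]);
	return 0;
}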
[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: added the comment in proc_oom_score()]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
v2 -> v3:
- fix the type of variable 'point' in oom_evaluate_task()
- initialize oom_control->chosen_points in select_bad_process() per Michal
- update the comment in proc_oom_score() per Michal (see the sketch below)
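As for the proc_oom_score() comment: /proc/<pid>/oom_score reports an unsigned value, while the badness is now a signed long that can be negative or LONG_MIN, so the value has to be mapped before it is shown. The clamp below is an assumed, simplified illustration of that concern, not the kernel's exact scaling:

#include <limits.h>

/* Hypothetical mapping of a signed badness to the unsigned value
 * exposed via /proc/<pid>/oom_score; illustration only. */
static unsigned long oom_score_for_proc(long badness)
{
	if (badness == LONG_MIN)	/* unkillable or skipped task */
		return 0;
	if (badness < 0)		/* negative scores floor at 0 */
		return 0;
	return badness;
}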
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
I noticed a kernel panic with the v2 patch while running the LTP mm test suite.
[ 63.451494] Out of memory and no killable processes...
[ 63.456633] Kernel panic - not syncing: System is deadlocked on memory
I then removed the v2 patch, applied the v3 patch below, and re-tested. No regression was noticed with the v3 patch while running LTP mm on x86_64 and arm.
OTOH, the oom01 test case was started with 100 iterations, but runltp itself got killed after the 6th iteration [3]. I think this is expected.
test steps:
- cd /opt/ltp
- ./runltp -s oom01 -I 100 || true
[ 209.052842] Out of memory: Killed process 519 (runltp) total-vm:10244kB, anon-rss:904kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0
[ 209.066782] oom_reaper: reaped process 519 (runltp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
/lava-1558245/0/tests/0_prep-tmp-disk/run.sh: line 21: 519 Killed ./runltp -s oom01 -I 100
 fs/proc/base.c      | 11 ++++++++++-
 include/linux/oom.h |  4 ++--
 mm/oom_kill.c       | 22 ++++++++++------------
 3 files changed, 22 insertions(+), 15 deletions(-)
Reference test jobs:
[1] https://lkft.validation.linaro.org/scheduler/job/1558246#L9189
[2] https://lkft.validation.linaro.org/scheduler/job/1558247#L17213
[3] https://lkft.validation.linaro.org/scheduler/job/1558245#L1407
On Mon, Jul 13, 2020 at 2:34 AM Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
[...]
Thanks for the test, Naresh.