On Fri, 10 Jul 2020 at 21:28, Yafang Shao <laoar.shao(a)gmail.com> wrote:
>
> Recently we found an issue in our production environment: when a memcg
> oom is triggered, the oom killer doesn't choose the process with the
> largest resident memory but chooses the first scanned process. Note that
> all processes in this memcg have the same oom_score_adj, so the oom killer
> should choose the process with the largest resident memory.
>
> Below is part of the oom info, which is enough to analyze this issue.
> [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
> [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> We can see that the first scanned process 5740 (pause) was killed, even
> though its rss is only one page. That is because, when we calculate the
> oom badness in oom_badness(), we always ignore negative points and
> convert all of them to 1. Since all the processes in the targeted memcg
> have the same oom_score_adj of -998, their points are all negative. As a
> result, the first scanned process is killed.
>
> The oom_score_adj (-998) in this memcg is set by kubelet, because it is
> a Guaranteed pod, which has higher priority so that it is less likely to
> be killed by a system oom.
>
> To fix this issue, we should make the calculation of the oom points more
> accurate. We can achieve this by converting chosen_points from 'unsigned
> long' to 'long'.
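
To make the failure mode concrete, here is a minimal userspace sketch of the two selection rules. It is not the kernel code: struct task, raw_points(), pick_clamped() and pick_signed() are hypothetical names, the score is reduced to rss plus the scaled oom_score_adj, and ties are modelled so that the first scanned task wins, as described in the report. Only pid 5740 and the 16777216kB limit come from the log above; pids 6000 and 6001, their rss values, and the 4 KiB page size are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical, heavily simplified stand-in for a task in the memcg. */
struct task {
	int pid;
	long rss_pages;
	long oom_score_adj;
};

/* 16777216kB limit from the log, assuming 4 KiB pages. */
#define DEMO_TOTALPAGES 4194304L

/* pid 5740 is from the log; the other two tasks are invented. */
static const struct task demo_tasks[] = {
	{ 5740, 1,       -998 },
	{ 6000, 1000000, -998 },
	{ 6001, 200000,  -998 },
};

/* Simplified score: rss plus oom_score_adj scaled by totalpages / 1000. */
static long raw_points(const struct task *t, long totalpages)
{
	return t->rss_pages + t->oom_score_adj * (totalpages / 1000);
}

/* Old behaviour (modelled): negative scores are clamped to 1. */
static int pick_clamped(const struct task *tasks, int n, long totalpages)
{
	unsigned long chosen_points = 0;
	int chosen = -1;

	assert(n > 0);
	for (int i = 0; i < n; i++) {
		long raw = raw_points(&tasks[i], totalpages);
		unsigned long points = raw > 0 ? (unsigned long)raw : 1;

		/* All clamped scores tie at 1, so the first task wins. */
		if (points > chosen_points) {
			chosen_points = points;
			chosen = tasks[i].pid;
		}
	}
	return chosen;
}

/* Fixed behaviour (modelled): signed scores keep their ordering. */
static int pick_signed(const struct task *tasks, int n, long totalpages)
{
	long chosen_points = LONG_MIN;
	int chosen = -1;

	assert(n > 0);
	for (int i = 0; i < n; i++) {
		long points = raw_points(&tasks[i], totalpages);

		if (points > chosen_points) {
			chosen_points = points;
			chosen = tasks[i].pid;
		}
	}
	return chosen;
}
```

With the demo data, pick_clamped(demo_tasks, 3, DEMO_TOTALPAGES) selects pid 5740, because every score is clamped to 1, while pick_signed() selects pid 6000, the task with the largest rss.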
>
> [cai(a)lca.pw: reported an issue in the previous version]
> [mhocko(a)suse.com: fixed the issue reported by Cai]
> [mhocko(a)suse.com: add the comment in proc_oom_score()]
> Signed-off-by: Yafang Shao <laoar.shao(a)gmail.com>
> Acked-by: Michal Hocko <mhocko(a)suse.com>
> Cc: David Rientjes <rientjes(a)google.com>
> Cc: Qian Cai <cai(a)lca.pw>
>
> ---
> v2 -> v3:
> - fix the type of variable 'point' in oom_evaluate_task()
> - initialize oom_control->chosen_points in select_bad_process() per Michal
> - update the comment in proc_oom_score() per Michal
>
> Signed-off-by: Yafang Shao <laoar.shao(a)gmail.com>
Tested-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
I noticed a kernel panic with the v2 patch while running the LTP mm test suite.
[ 63.451494] Out of memory and no killable processes...
[ 63.456633] Kernel panic - not syncing: System is deadlocked on memory
I then removed the v2 patch, applied this v3 patch, and re-tested.
No regressions were seen with the v3 patch while running LTP mm on x86_64 and arm.
OTOH, the oom01 test case was started with 100 iterations, but runltp was
killed after the 6th iteration [3]. I think this is expected.
test steps:
- cd /opt/ltp
- ./runltp -s oom01 -I 100 || true
[ 209.052842] Out of memory: Killed process 519 (runltp) total-vm:10244kB, anon-rss:904kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0
[ 209.066782] oom_reaper: reaped process 519 (runltp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
/lava-1558245/0/tests/0_prep-tmp-disk/run.sh: line 21: 519 Killed ./runltp -s oom01 -I 100
> ---
> fs/proc/base.c | 11 ++++++++++-
> include/linux/oom.h | 4 ++--
> mm/oom_kill.c | 22 ++++++++++------------
> 3 files changed, 22 insertions(+), 15 deletions(-)
Reference test jobs,
[1] https://lkft.validation.linaro.org/scheduler/job/1558246#L9189
[2] https://lkft.validation.linaro.org/scheduler/job/1558247#L17213
[3] https://lkft.validation.linaro.org/scheduler/job/1558245#L1407
This is the start of the stable review cycle for the 5.4.51 release.
There are 65 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 09 Jul 2020 14:57:34 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.51-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.4.51-rc1
Peter Jones <pjones(a)redhat.com>
efi: Make it possible to disable efivar_ssdt entirely
Hou Tao <houtao1(a)huawei.com>
dm zoned: assign max_io_len correctly
Babu Moger <babu.moger(a)amd.com>
x86/resctrl: Fix memory bandwidth counter width for AMD
Vlastimil Babka <vbabka(a)suse.cz>
mm, compaction: make capture control handling safe wrt interrupts
Vlastimil Babka <vbabka(a)suse.cz>
mm, compaction: fully assume capture is not NULL in compact_zone_order()
Marc Zyngier <maz(a)kernel.org>
irqchip/gic: Atomically update affinity
Sumit Semwal <sumit.semwal(a)linaro.org>
dma-buf: Move dma_buf_release() from fops to dentry_ops
Alex Deucher <alexander.deucher(a)amd.com>
drm/amdgpu/atomfirmware: fix vram_info fetching for renoir
Alex Deucher <alexander.deucher(a)amd.com>
drm/amdgpu: use %u rather than %d for sclk/mclk
Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
drm/amd/display: Only revalidate bandwidth on medium and fast updates
Hauke Mehrtens <hauke(a)hauke-m.de>
MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
Martin Blumenstingl <martin.blumenstingl(a)googlemail.com>
MIPS: lantiq: xway: sysctrl: fix the GPHY clock alias names
Zhang Xiaoxu <zhangxiaoxu5(a)huawei.com>
cifs: Fix the target file was deleted when rename failed.
Paul Aurich <paul(a)darkrain42.org>
SMB3: Honor 'handletimeout' flag for multiuser mounts
Paul Aurich <paul(a)darkrain42.org>
SMB3: Honor lease disabling for multiuser mounts
Paul Aurich <paul(a)darkrain42.org>
SMB3: Honor persistent/resilient handle flags for multiuser mounts
Paul Aurich <paul(a)darkrain42.org>
SMB3: Honor 'seal' flag for multiuser mounts
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "ALSA: usb-audio: Improve frames size computation"
J. Bruce Fields <bfields(a)redhat.com>
nfsd: apply umask on fs without ACL support
Krzysztof Kozlowski <krzk(a)kernel.org>
spi: spi-fsl-dspi: Fix external abort on interrupt in resume or exit paths
Wolfram Sang <wsa+renesas(a)sang-engineering.com>
i2c: mlxcpld: check correct size of maximum RECV_LEN packet
Chris Packham <chris.packham(a)alliedtelesis.co.nz>
i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
Kees Cook <keescook(a)chromium.org>
samples/vfs: avoid warning in statx override
Christoph Hellwig <hch(a)lst.de>
nvme: fix a crash in nvme_mpath_add_disk
Sagi Grimberg <sagi(a)grimberg.me>
nvme: fix identify error status silent ignore
Paul Aurich <paul(a)darkrain42.org>
SMB3: Honor 'posix' flag for multiuser mounts
Hou Tao <houtao1(a)huawei.com>
virtio-blk: free vblk-vqs in error path of virtblk_probe()
Chen-Yu Tsai <wens(a)csie.org>
drm: sun4i: hdmi: Remove extra HPD polling
J. Bruce Fields <bfields(a)redhat.com>
nfsd: fix nfsdfs inode reference count leak
J. Bruce Fields <bfields(a)redhat.com>
nfsd4: fix nfsdfs reference count loop
J. Bruce Fields <bfields(a)redhat.com>
nfsd: clients don't need to break their own delegations
J. Bruce Fields <bfields(a)redhat.com>
kthread: save thread function
Dien Pham <dien.pham.ry(a)renesas.com>
thermal/drivers/rcar_gen3: Fix undefined temperature if negative
Michael Kao <michael.kao(a)mediatek.com>
thermal/drivers/mediatek: Fix bank number settings on mt8183
Misono Tomohiro <misono.tomohiro(a)jp.fujitsu.com>
hwmon: (acpi_power_meter) Fix potential memory leak in acpi_power_meter_add()
Chu Lin <linchuyuan(a)google.com>
hwmon: (max6697) Make sure the OVERT mask is set correctly
Rahul Lakkireddy <rahul.lakkireddy(a)chelsio.com>
cxgb4: fix SGE queue dump destination buffer context
Rahul Lakkireddy <rahul.lakkireddy(a)chelsio.com>
cxgb4: use correct type for all-mask IP address comparison
Rahul Lakkireddy <rahul.lakkireddy(a)chelsio.com>
cxgb4: fix endian conversions for L4 ports in filters
Rahul Lakkireddy <rahul.lakkireddy(a)chelsio.com>
cxgb4: parse TC-U32 key values and masks natively
Rahul Lakkireddy <rahul.lakkireddy(a)chelsio.com>
cxgb4: use unaligned conversion for fetching timestamp
Mark Zhang <markz(a)mellanox.com>
RDMA/counter: Query a counter before release
David Howells <dhowells(a)redhat.com>
rxrpc: Fix afs large storage transmission performance drop
Chen Tao <chentao107(a)huawei.com>
drm/msm/dpu: fix error return code in dpu_encoder_init
Herbert Xu <herbert(a)gondor.apana.org.au>
crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock()
James Bottomley <James.Bottomley(a)HansenPartnership.com>
tpm: Fix TIS locality timeout problems
Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
selftests: tpm: Use /bin/sh instead of /bin/bash
Douglas Anderson <dianders(a)chromium.org>
kgdb: Avoid suspicious RCU usage warning
Sagi Grimberg <sagi(a)grimberg.me>
nvme-multipath: fix bogus request queue reference put
Anton Eidelman <anton(a)lightbitslabs.com>
nvme-multipath: fix deadlock due to head->lock
Anton Eidelman <anton(a)lightbitslabs.com>
nvme-multipath: fix deadlock between ana_work and scan_work
Sagi Grimberg <sagi(a)grimberg.me>
nvme: fix possible deadlock when I/O is blocked
Keith Busch <kbusch(a)kernel.org>
nvme-multipath: set bdi capabilities once
Christian Borntraeger <borntraeger(a)de.ibm.com>
s390/debug: avoid kernel warning on too large number of pages
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tools lib traceevent: Handle __attribute__((user)) in field names
Steven Rostedt (VMware) <rostedt(a)goodmis.org>
tools lib traceevent: Add append() function helper for appending strings
Zqiang <qiang.zhang(a)windriver.com>
usb: usbtest: fix missing kfree(dev->buf) in usbtest_disconnect
David Howells <dhowells(a)redhat.com>
rxrpc: Fix race between incoming ACK parser and retransmitter
Qian Cai <cai(a)lca.pw>
mm/slub: fix stack overruns with SLUB_STATS
Dongli Zhang <dongli.zhang(a)oracle.com>
mm/slub.c: fix corrupted freechain in deactivate_slab()
Valentin Schneider <valentin.schneider(a)arm.com>
sched/debug: Make sd->flags sysctl read-only
Tuomas Tynkkynen <tuomas.tynkkynen(a)iki.fi>
usbnet: smsc95xx: Fix use-after-free after removal
Borislav Petkov <bp(a)suse.de>
EDAC/amd64: Read back the scrub rate PCI register on F15h
Hugh Dickins <hughd(a)google.com>
mm: fix swap cache node allocation mask
Jens Axboe <axboe(a)kernel.dk>
io_uring: make sure async workqueue is canceled on exit
-------------
Diffstat:
Documentation/filesystems/locking.rst | 2 +
Makefile | 4 +-
arch/mips/kernel/traps.c | 1 +
arch/mips/lantiq/xway/sysctrl.c | 8 +-
arch/s390/kernel/debug.c | 3 +-
arch/x86/kernel/cpu/resctrl/core.c | 2 +
arch/x86/kernel/cpu/resctrl/internal.h | 3 +
arch/x86/kernel/cpu/resctrl/monitor.c | 3 +-
crypto/af_alg.c | 26 ++--
crypto/algif_aead.c | 9 +-
crypto/algif_hash.c | 9 +-
crypto/algif_skcipher.c | 9 +-
drivers/block/virtio_blk.c | 1 +
drivers/char/tpm/tpm-dev-common.c | 19 ++-
drivers/dma-buf/dma-buf.c | 54 ++++-----
drivers/edac/amd64_edac.c | 2 +
drivers/firmware/efi/Kconfig | 11 ++
drivers/firmware/efi/efi.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_atomfirmware.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 4 +-
drivers/gpu/drm/amd/display/dc/core/dc.c | 10 +-
drivers/gpu/drm/msm/disp/dpu1/dpu_encoder.c | 2 +-
drivers/gpu/drm/sun4i/sun4i_hdmi_enc.c | 5 +-
drivers/hwmon/acpi_power_meter.c | 4 +-
drivers/hwmon/max6697.c | 7 +-
drivers/i2c/algos/i2c-algo-pca.c | 3 +-
drivers/i2c/busses/i2c-mlxcpld.c | 4 +-
drivers/infiniband/core/counters.c | 4 +-
drivers/irqchip/irq-gic.c | 14 +--
drivers/md/dm-zoned-target.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c | 6 +-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c | 25 ++--
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
.../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c | 30 ++---
drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c | 18 +--
.../ethernet/chelsio/cxgb4/cxgb4_tc_u32_parse.h | 122 ++++++++++++-------
drivers/net/ethernet/chelsio/cxgb4/sge.c | 2 +-
drivers/net/usb/smsc95xx.c | 2 +-
drivers/nvme/host/core.c | 13 +-
drivers/nvme/host/multipath.c | 45 +++++--
drivers/nvme/host/nvme.h | 2 +
drivers/spi/spi-fsl-dspi.c | 17 ++-
drivers/thermal/mtk_thermal.c | 5 +-
drivers/thermal/rcar_gen3_thermal.c | 2 +-
drivers/usb/misc/usbtest.c | 1 +
fs/cifs/connect.c | 10 +-
fs/cifs/inode.c | 10 +-
fs/io_uring.c | 63 ++++++++++
fs/locks.c | 3 +
fs/nfsd/nfs4proc.c | 2 +
fs/nfsd/nfs4state.c | 22 +++-
fs/nfsd/nfsctl.c | 23 ++--
fs/nfsd/nfsd.h | 5 +
fs/nfsd/nfssvc.c | 6 +
fs/nfsd/vfs.c | 6 +
include/crypto/if_alg.h | 4 +-
include/linux/fs.h | 1 +
include/linux/kthread.h | 1 +
include/linux/sunrpc/svc.h | 1 +
kernel/debug/debug_core.c | 4 +
kernel/kthread.c | 17 +++
kernel/sched/debug.c | 2 +-
mm/compaction.c | 19 ++-
mm/slub.c | 30 ++++-
mm/swap_state.c | 4 +-
net/rxrpc/call_event.c | 29 ++---
samples/vfs/test-statx.c | 2 +
sound/usb/card.h | 4 -
sound/usb/endpoint.c | 43 +------
sound/usb/endpoint.h | 1 -
sound/usb/pcm.c | 2 -
tools/lib/traceevent/event-parse.c | 133 ++++++++++++---------
tools/testing/selftests/tpm2/test_smoke.sh | 2 +-
tools/testing/selftests/tpm2/test_space.sh | 2 +-
74 files changed, 609 insertions(+), 362 deletions(-)