Naresh Kamboju writes:
On Thu, 28 May 2020 at 20:33, Michal Hocko <mhocko@kernel.org> wrote:
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
My apologies! As per the test results history, this problem started happening from:
Bad: next-20200430 (still reproducible on next-20200519)
Good: next-20200429
The git tree/tag used for testing is the linux next-20200430 tag. After reverting the following three patches, the oom-killer problem is fixed.
Revert "mm, memcg: avoid stale protection values when cgroup is above protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost TBH. In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww... you have said that none of the added tracing output has triggered. Does this still hold? Because I still have a hard time understanding how those three patches could have the observed effects.
On the other email thread [1] this issue is concluded.
Yafang wrote on May 22, 2020:
Regarding the root cause, my guess is that it makes a mistake similar to the one I tried to fix in the previous patch, in which the direct reclaimer reads a stale protection value. But I don't think it is worth adding another fix. The best way is to revert this commit.
This isn't a conclusion, just a guess (and one I think is unlikely). For this to happen reliably, the same race would have to occur the same way each time.