Hi all,

When debugging a memory leak issue (https://github.com/coreos/bugs/issues/2516) with v4.14.11-coreos, we noticed that the same issue may have been fixed recently by Roman in the latest mainline (i.e. Linus's master branch), according to comment #7 of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349, which lists these patches (I'm not sure if the 5-patch list is complete):
010cb21d4ede math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
f77d7a05670d mm: don't miss the last page because of round-off error
d18bf0af683e mm: drain memcg stocks on css offlining
71cd51b2e1ca mm: rework memcg kernel stack accounting
f3a2fccbce15 mm: slowly shrink slabs with a relatively small number of objects
Obviously at least some of the fixes are also needed in the longterm kernels like v4.14.y, but none of the 5 patches has the "Cc: stable@vger.kernel.org" tag. I'm wondering if these patches will be backported to the longterm kernels. BTW, the patches are not in v4.19, but I suppose they will be in v4.19.1-rc1?
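For anyone tracking which of these fixes have landed in a given tree, here is a quick check (a sketch only; it assumes you are inside a linux.git checkout that has the v4.19 tag, and that the abbreviated hashes above resolve there, which may not hold if they are -next hashes):

```shell
#!/bin/sh
# For each commit from the list above, test whether it is an ancestor of
# v4.19. The tag name is an assumption; adjust to the branch you care about.
for c in 010cb21d4ede f77d7a05670d d18bf0af683e 71cd51b2e1ca f3a2fccbce15; do
    if git merge-base --is-ancestor "$c" v4.19 2>/dev/null; then
        echo "$c: already in v4.19"
    else
        echo "$c: not in v4.19 (or hash not found in this tree)"
    fi
done
```

`git merge-base --is-ancestor` exits 0 when the commit is reachable from the given ref, so the same loop works for any stable branch.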
Thanks, -- Dexuan
On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
Hi all, When debugging a memory leak issue (https://github.com/coreos/bugs/issues/2516) with v4.14.11-coreos, we noticed the same issue may have been fixed recently by Roman in the latest mainline (i.e. Linus's master branch) according to comment #7 of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349, which lists these patches (I'm not sure if the 5-patch list is complete):
010cb21d4ede math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
f77d7a05670d mm: don't miss the last page because of round-off error
d18bf0af683e mm: drain memcg stocks on css offlining
71cd51b2e1ca mm: rework memcg kernel stack accounting
f3a2fccbce15 mm: slowly shrink slabs with a relatively small number of objects
Obviously at least some of the fixes are also needed in the longterm kernels like v4.14.y, but none of the 5 patches has the "Cc: stable@vger.kernel.org" tag? I'm wondering if these patches will be backported to the longterm kernels. BTW, the patches are not in v4.19, but I suppose they will be in v4.19.1-rc1?
There was an issue with this series: https://lkml.org/lkml/2018/10/23/586, so it's waiting on a fix to be properly tested.
-- Thanks, Sasha
On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:
Hi all, When debugging a memory leak issue (https://github.com/coreos/bugs/issues/2516) with v4.14.11-coreos, we noticed the same issue may have been fixed recently by Roman in the latest mainline (i.e. Linus's master branch) according to comment #7 of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349, which lists these patches (I'm not sure if the 5-patch list is complete):
010cb21d4ede math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
f77d7a05670d mm: don't miss the last page because of round-off error
d18bf0af683e mm: drain memcg stocks on css offlining
71cd51b2e1ca mm: rework memcg kernel stack accounting
f3a2fccbce15 mm: slowly shrink slabs with a relatively small number of objects
Obviously at least some of the fixes are also needed in the longterm kernels like v4.14.y, but none of the 5 patches has the "Cc: stable@vger.kernel.org" tag? I'm wondering if these patches will be backported to the longterm kernels. BTW, the patches are not in v4.19, but I suppose they will be in v4.19.1-rc1?
Hello, Dexuan!
A couple of issues have been revealed recently; here are the fixes (hashes are from the -next tree):

5f4b04528b5f mm: don't reclaim inodes with many attached pages
5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge() properly

These two patches should be added to the series.
Re stable backporting, I'd really wait for some time. Memory reclaim is a quite complex and fragile area, so even if patches are correct by themselves, they can easily cause a regression by revealing some other issues (as it was with the inode reclaim case).
Thanks!
From: Roman Gushchin guro@fb.com Sent: Thursday, November 1, 2018 17:58
On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:

Hello, Dexuan!
A couple of issues have been revealed recently; here are the fixes (hashes are from the -next tree):

5f4b04528b5f mm: don't reclaim inodes with many attached pages
5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge() properly

These two patches should be added to the series.
Thanks for the new info!
Re stable backporting, I'd really wait for some time. Memory reclaim is a quite complex and fragile area, so even if patches are correct by themselves, they can easily cause a regression by revealing some other issues (as it was with the inode reclaim case).
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with suboptimal performance?
Thanks! --Dexuan
On Fri, Nov 02, 2018 at 02:45:42AM +0000, Dexuan Cui wrote:
From: Roman Gushchin guro@fb.com Sent: Thursday, November 1, 2018 17:58
On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:

Hello, Dexuan!
A couple of issues have been revealed recently; here are the fixes (hashes are from the -next tree):

5f4b04528b5f mm: don't reclaim inodes with many attached pages
5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge() properly

These two patches should be added to the series.
Thanks for the new info!
Re stable backporting, I'd really wait for some time. Memory reclaim is a quite complex and fragile area, so even if patches are correct by themselves, they can easily cause a regression by revealing some other issues (as it was with the inode reclaim case).
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
I don't think there is any, except not using memory cgroups at all. Limiting the number of cgroups which are created and destroyed helps too: a faulty service running under systemd can be especially painful.
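One way to see whether a system is accumulating these zombie memcgs (a sketch, assuming cgroup v2 is mounted at the usual location; cgroup.stat with nr_dying_descendants exists since v4.14):

```shell
#!/bin/sh
# Print the number of dying (offlined but not yet freed) descendant
# cgroups. On a healthy system this should shrink back after reclaim;
# a number that only ever grows suggests pinned offline memcgs.
STAT=/sys/fs/cgroup/cgroup.stat

if [ -r "$STAT" ]; then
    awk '/^nr_dying_descendants/ { print "dying cgroups:", $2 }' "$STAT"
else
    echo "no cgroup v2 cgroup.stat at $STAT (v1-only mount?)" >&2
fi
```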
Thanks!
On Fri 02-11-18 02:45:42, Dexuan Cui wrote: [...]
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
One way would be to disable kmem accounting (the cgroup.memory=nokmem kernel option). That would reduce memory isolation, because quite a lot of memory will not be accounted for, but the primary source of in-flight and hard-to-reclaim memory will be gone.
Another workaround could be the force_empty knob we have in v1, used when removing a cgroup. We do not have it in cgroup v2, though. The file hasn't been added to v2 because we didn't really have any proper usecase. Working around a bug doesn't sound like a _proper_ usecase, but I can imagine workloads that bring in a lot of metadata objects that are not really interesting for later use, so something like a targeted drop_caches...
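A rough sketch of the workarounds discussed above; the cgroup path "example" is a made-up name, force_empty exists only in cgroup v1, and the drop_caches write needs root:

```shell
#!/bin/sh
# Workaround 1: disable kmem accounting at boot by adding
#     cgroup.memory=nokmem
# to the kernel command line in your bootloader configuration.

# Workaround 2 (cgroup v1 only): reclaim a cgroup's pages before removal.
# "example" is a hypothetical cgroup name.
CG=/sys/fs/cgroup/memory/example
if [ -w "$CG/memory.force_empty" ]; then
    echo 0 > "$CG/memory.force_empty"   # push out as many pages as possible
    rmdir "$CG"
fi

# System-wide fallback: drop clean page cache and slab caches (needs root).
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
fi
```

As noted later in the thread, none of these fully closes the leak; they only lower its rate.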
On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
On Fri 02-11-18 02:45:42, Dexuan Cui wrote: [...]
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
One way would be to disable kmem accounting (cgroup.memory=nokmem kernel option). That would reduce the memory isolation because quite a lot of memory will not be accounted for but the primary source of in-flight and hard to reclaim memory will be gone.
In my experience, disabling kmem accounting doesn't really solve the issue (without the patches), but it can lower the rate of the leak.
Another workaround could be to use force_empty knob we have in v1 and use it when removing a cgroup. We do not have it in cgroup v2 though. The file hasn't been added to v2 because we didn't really have any proper usecase. Working around a bug doesn't sound like a _proper_ usecase but I can imagine workloads that bring a lot of metadata objects that are not really interesting for later use so something like a targeted drop_caches...
This can help a bit too, but even using the system-wide drop_caches knob unfortunately doesn't return all the memory back.
Thanks!
On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
On Fri 02-11-18 02:45:42, Dexuan Cui wrote: [...]
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
One way would be to disable kmem accounting (cgroup.memory=nokmem kernel option). That would reduce the memory isolation because quite a lot of memory will not be accounted for but the primary source of in-flight and hard to reclaim memory will be gone.
In my experience disabling the kmem accounting doesn't really solve the issue (without patches), but can lower the rate of the leak.
This is unexpected. 90cbc2508827e was introduced to allow offline memcgs to be reclaimed even when they are small. But maybe you mean that we still leak in the absence of memory pressure? Or what does prevent the memcg from going down?
Another workaround could be to use force_empty knob we have in v1 and use it when removing a cgroup. We do not have it in cgroup v2 though. The file hasn't been added to v2 because we didn't really have any proper usecase. Working around a bug doesn't sound like a _proper_ usecase but I can imagine workloads that bring a lot of metadata objects that are not really interesting for later use so something like a targeted drop_caches...
This can help a bit too, but even using the system-wide drop_caches knob unfortunately doesn't return all the memory back.
Could you be more specific please?
On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
On Fri 02-11-18 02:45:42, Dexuan Cui wrote: [...]
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
One way would be to disable kmem accounting (cgroup.memory=nokmem kernel option). That would reduce the memory isolation because quite a lot of memory will not be accounted for but the primary source of in-flight and hard to reclaim memory will be gone.
In my experience disabling the kmem accounting doesn't really solve the issue (without patches), but can lower the rate of the leak.
This is unexpected. 90cbc2508827e was introduced to address offline memcgs to be reclaim even when they are small. But maybe you mean that we still leak in an absence of the memory pressure. Or what does prevent memcg from going down?
There are 3 independent issues which are contributing to this leak:

1) Kernel stack accounting weirdness: processes can reuse a stack accounted to a different cgroup. So basically any running process can take a reference to any cgroup.

2) We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.

3) We don't apply enough pressure on slab objects.
Because one reference is enough to keep the entire memcg structure in place, we really have to close all three to eliminate the leak. Disabling kmem accounting mitigates only the last one.
Another workaround could be to use force_empty knob we have in v1 and use it when removing a cgroup. We do not have it in cgroup v2 though. The file hasn't been added to v2 because we didn't really have any proper usecase. Working around a bug doesn't sound like a _proper_ usecase but I can imagine workloads that bring a lot of metadata objects that are not really interesting for later use so something like a targeted drop_caches...
This can help a bit too, but even using the system-wide drop_caches knob unfortunately doesn't return all the memory back.
Could you be more specific please?
Sure, because problems 1) and 2) exist, echo 3 > /proc/sys/vm/drop_caches can't reclaim all memcg structures in most cases.
Thanks!
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
On Fri 02-11-18 02:45:42, Dexuan Cui wrote: [...]
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
One way would be to disable kmem accounting (cgroup.memory=nokmem kernel option). That would reduce the memory isolation because quite a lot of memory will not be accounted for but the primary source of in-flight and hard to reclaim memory will be gone.
In my experience disabling the kmem accounting doesn't really solve the issue (without patches), but can lower the rate of the leak.
This is unexpected. 90cbc2508827e was introduced to address offline memcgs to be reclaim even when they are small. But maybe you mean that we still leak in an absence of the memory pressure. Or what does prevent memcg from going down?
There are 3 independent issues which are contributing to this leak:
- Kernel stack accounting weirdness: processes can reuse stack accounted to different cgroups. So basically any running process can take a reference to any cgroup.
Yes, but kmem accounting should rule that out, right? If not, then this is a clear bug and easy to backport, because that would mean adding a missing memcg_kmem_enabled check.
- We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
- We don't apply enough pressure on slab objects.
again kmem accounting disabled should make this moot
Because one reference is enough to keep the entire memcg structure in place, we really have to close all three to eliminate the leak. Disabling kmem accounting mitigates only the last one.
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:13:14PM +0100, Michal Hocko wrote:
On Fri 02-11-18 15:48:57, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 09:03:55AM +0100, Michal Hocko wrote:
On Fri 02-11-18 02:45:42, Dexuan Cui wrote: [...]
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
One way would be to disable kmem accounting (cgroup.memory=nokmem kernel option). That would reduce the memory isolation because quite a lot of memory will not be accounted for but the primary source of in-flight and hard to reclaim memory will be gone.
In my experience disabling the kmem accounting doesn't really solve the issue (without patches), but can lower the rate of the leak.
This is unexpected. 90cbc2508827e was introduced to address offline memcgs to be reclaim even when they are small. But maybe you mean that we still leak in an absence of the memory pressure. Or what does prevent memcg from going down?
There are 3 independent issues which are contributing to this leak:
- Kernel stack accounting weirdness: processes can reuse stack accounted to different cgroups. So basically any running process can take a reference to any cgroup.
yes, but kmem accounting should rule that out, right? If not then this is a clear bug and easy to backport because that would mean to add a missing memcg_kmem_enabled check.
Yes, you're right, disabling kmem accounting should mitigate this problem.
- We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
So, disabling kmem accounting mitigates 2 other issues, but not this one.
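The failure mode can be reproduced with plain integer arithmetic (a userspace sketch of the get_scan_count logic; the fraction/denominator values are illustrative, not taken from a real workload):

```shell
#!/bin/sh
# One page on the LRU of an offline memcg, DEF_PRIORITY == 12.
size=1
priority=12
scan=$(( size >> priority ))                 # 1 >> 12 == 0

# The offline-memcg special case bumps scan back up to min(size, 32)...
if [ "$scan" -eq 0 ]; then
    if [ "$size" -lt 32 ]; then scan=$size; else scan=32; fi   # scan == 1
fi

# ...but the proportional adjustment truncates it back down to zero.
fraction=1
denominator=2
scan=$(( scan * fraction / denominator ))    # 1 * 1 / 2 == 0

echo "pages to scan: $scan"
```

Running it prints "pages to scan: 0", i.e. the single remaining page is never scanned and the offline memcg stays pinned.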
Anyway, I'd prefer to wait a bit for test results, and backport the whole series as a whole.
Thanks!
On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
- We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
My Friday brain hurts when looking at this, but if it doesn't work as advertised then it should be fixed. I do not see any of your patches touching this logic, so how come it would work after they are applied?
On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
- We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
My friday brain hurst when looking at this but if it doesn't work as advertized then it should be fixed. I do not see any of your patches to touch this logic so how come it would work after them applied?
This part works as expected. But the following

        scan = div64_u64(scan * fraction[file], denominator);

reliably turns 1 page to scan into 0 pages to scan.
And this is the issue which my patches do address.
Thanks!
On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
- We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
My friday brain hurst when looking at this but if it doesn't work as advertized then it should be fixed. I do not see any of your patches to touch this logic so how come it would work after them applied?
This part works as expected. But the following

        scan = div64_u64(scan * fraction[file], denominator);

reliably turns 1 page to scan into 0 pages to scan.
OK, 68600f623d69 ("mm: don't miss the last page because of round-off error") sounds like good and safe stable backport material.
On Mon, Nov 05, 2018 at 10:21:23AM +0100, Michal Hocko wrote:
On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
- We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
My friday brain hurst when looking at this but if it doesn't work as advertized then it should be fixed. I do not see any of your patches to touch this logic so how come it would work after them applied?
This part works as expected. But the following

        scan = div64_u64(scan * fraction[file], denominator);

reliably turns 1 page to scan into 0 pages to scan.
OK, 68600f623d69 ("mm: don't miss the last page because of round-off error") sounds like a good and safe stable backport material.
Thanks for this, now queued up.
greg k-h
On Fri, Dec 28, 2018 at 11:50:08AM +0100, Greg KH wrote:
On Mon, Nov 05, 2018 at 10:21:23AM +0100, Michal Hocko wrote:
On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
2) We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.
Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
My friday brain hurst when looking at this but if it doesn't work as advertized then it should be fixed. I do not see any of your patches to touch this logic so how come it would work after them applied?
This part works as expected. But the following

        scan = div64_u64(scan * fraction[file], denominator);

reliably turns 1 page to scan into 0 pages to scan.
OK, 68600f623d69 ("mm: don't miss the last page because of round-off error") sounds like a good and safe stable backport material.
Thanks for this, now queued up.
greg k-h
It seems that 172b06c32b949 ("mm: slowly shrink slabs with a relatively small number of objects") and a76cf1a474d ("mm: don't reclaim inodes with many attached pages") cause a regression reported against the 4.19 stable tree: https://bugzilla.kernel.org/show_bug.cgi?id=202441 .
Given the history and complexity of these (and other patches from that series) it would be nice to understand if this is something that will be fixed soon or should we look into reverting the series for now?
-- Thanks, Sasha
On Tue, Jan 29, 2019 at 07:23:56PM -0500, Sasha Levin wrote:
On Fri, Dec 28, 2018 at 11:50:08AM +0100, Greg KH wrote:
On Mon, Nov 05, 2018 at 10:21:23AM +0100, Michal Hocko wrote:
On Fri 02-11-18 19:38:35, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 06:48:23PM +0100, Michal Hocko wrote:
On Fri 02-11-18 17:25:58, Roman Gushchin wrote:
On Fri, Nov 02, 2018 at 05:51:47PM +0100, Michal Hocko wrote:
On Fri 02-11-18 16:22:41, Roman Gushchin wrote:
[...]
2) We do forget to scan the last page in the LRU list. So if we ended up with a 1-page long LRU, it can stay there basically forever.

Why

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);

in get_scan_count doesn't work for that case?
No, it doesn't. Let's look at the whole picture:
        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
        scan = size >> sc->priority;

        /*
         * If the cgroup's already been deleted, make sure to
         * scrape out the remaining cache.
         */
        if (!scan && !mem_cgroup_online(memcg))
                scan = min(size, SWAP_CLUSTER_MAX);
If size == 1, scan == 0 => scan = min(1, 32) == 1. And after proportional adjustment we'll have 0.
My friday brain hurst when looking at this but if it doesn't work as advertized then it should be fixed. I do not see any of your patches to touch this logic so how come it would work after them applied?
This part works as expected. But the following

        scan = div64_u64(scan * fraction[file], denominator);

reliably turns 1 page to scan into 0 pages to scan.
OK, 68600f623d69 ("mm: don't miss the last page because of round-off error") sounds like a good and safe stable backport material.
Thanks for this, now queued up.
greg k-h
It seems that 172b06c32b949 ("mm: slowly shrink slabs with a relatively small number of objects") and a76cf1a474d ("mm: don't reclaim inodes with many attached pages") cause a regression reported against the 4.19 stable tree: https://bugzilla.kernel.org/show_bug.cgi?id=202441 .
Given the history and complexity of these (and other patches from that series) it would be nice to understand if this is something that will be fixed soon or should we look into reverting the series for now?
In that thread I've just suggested giving a chance to Rik's patch, which hopefully will mitigate or ease the regression (https://lkml.org/lkml/2019/1/28/1865).
Of course, we can simply revert those changes, but this will re-introduce the memory leak, so I'd leave that as a last option.
Thanks!
On Fri, Nov 02, 2018 at 02:45:42AM +0000, Dexuan Cui wrote:
From: Roman Gushchin guro@fb.com Sent: Thursday, November 1, 2018 17:58
On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:

Hello, Dexuan!
A couple of issues have been revealed recently; here are the fixes (hashes are from the -next tree):

5f4b04528b5f mm: don't reclaim inodes with many attached pages
5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge() properly

These two patches should be added to the series.
Thanks for the new info!
Re stable backporting, I'd really wait for some time. Memory reclaim is a quite complex and fragile area, so even if patches are correct by themselves, they can easily cause a regression by revealing some other issues (as it was with the inode reclaim case).
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
I'm not sure what workload you're seeing it on, but if you could merge these 7 patches and confirm that they solve the problem you're seeing without causing any regressions, it would be a useful test for the rest of us.
-- Thanks, Sasha
On Fri, Nov 02, 2018 at 12:01:22PM -0400, Sasha Levin wrote:
On Fri, Nov 02, 2018 at 02:45:42AM +0000, Dexuan Cui wrote:
From: Roman Gushchin guro@fb.com Sent: Thursday, November 1, 2018 17:58
On Fri, Nov 02, 2018 at 12:16:02AM +0000, Dexuan Cui wrote:

Hello, Dexuan!
A couple of issues have been revealed recently; here are the fixes (hashes are from the -next tree):

5f4b04528b5f mm: don't reclaim inodes with many attached pages
5a03b371ad6a mm: handle no memcg case in memcg_kmem_charge() properly

These two patches should be added to the series.
Thanks for the new info!
Re stable backporting, I'd really wait for some time. Memory reclaim is a quite complex and fragile area, so even if patches are correct by themselves, they can easily cause a regression by revealing some other issues (as it was with the inode reclaim case).
I totally agree. I'm now just wondering if there is any temporary workaround, even if that means we have to run the kernel with some features disabled or with a suboptimal performance?
I'm not sure what workload you're seeing it on, but if you could merge these 7 patches and see that it solves the problem you're seeing and doesn't cause any regressions it'll be a useful test for the rest of us.
AFAIK, with Roman's patches backported to the Ubuntu 4.15 kernel, the problem reported at [1] is solved.
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1792349
-- Thanks, Sasha