I've recently discovered that doing infinite loop of systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount linearly increases percpu allocator memory consumption. In several hours, it might lead to system instability by consuming most of the memory.
Bug is not reproducible when the ext4 filesystem is on physical partition, but it is persistent when ext4 is on logical volume.
During debugging it was found that most of active percpu allocations are from /system.slice/<ext4_on_lvm>.mount memory cgroups (created by systemd for each mount). All of these cgroups are in dying state with refcount equal to 2. And most interesting that each mount/umount itera- tion creates exactly one dying memory cgroup.
Tracking down the remaining refcounts showed that it was charged from ext4_fill_super(). And the page is always 0 index in the page cache mapping.
The issue was hidden behind initial super block read using logical blocksize from bdev and adjusting blocksize later after reading actual block size from superblock. If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffers as buffer pointed by bh variable remained busy. So buffer and its page remains in the memory (leak). Super block reread logic does not happen when ext4 filesystem is on physical partition as blocksize is correct for initial superblock read.
brelse(bh), where bh is a buffer head of superblock page, must be called and bh references must be released before kill_bdev(). kill_bdev() subfunctions (see callstack below) won't be able to free not released buffer (even if it's clean) and superblock page won't be freed as well.
callstack: kill_bdev() ->truncate_inode_pages() ->truncate_inode_pages_range() ->truncate_cleanup_page() ->do_invalidatepage ->block_invalidatepage() ->try_to_release_page() == fail to release ->try_to_free_buffers() == fail to free ->drop_buffers() ->buffer_busy() == yes
Incorrect order of brelse() and kill_bdev() in ext4_fill_super() was introduced by commit ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") 13 years ago! Thanks to memory hungry percpu, it was easy to detect this issue now.
Fix this by moving the brelse() before sb_set_blocksize() and add a comment about the dependency.
In addition, fix similar issue under failed_mount: label (in the same function) about incorrect order of ext4_blkdev_remove() vs brelse() introduced by commit ac27a0ec112a ("ext4: initial copy of files from ext3")
Signed-off-by: Alexey Makhalov amakhalov@vmware.com Cc: stable@vger.kernel.org Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Cc: "Theodore Ts'o" tytso@mit.edu Cc: Andreas Dilger adilger.kernel@dilger.ca --- fs/ext4/super.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index b9693680463a..6c8f68309834 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4451,14 +4451,20 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) }
if (sb->s_blocksize != blocksize) { + /* + * bh must be released before kill_bdev(), otherwise + * it won't be freed and its page also. kill_bdev() + * is called by sb_set_blocksize(). + */ + brelse(bh); /* Validate the filesystem blocksize */ if (!sb_set_blocksize(sb, blocksize)) { ext4_msg(sb, KERN_ERR, "bad block size %d", blocksize); + bh = NULL; goto failed_mount; }
- brelse(bh); logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; offset = do_div(logical_sb_block, blocksize); bh = ext4_sb_bread_unmovable(sb, logical_sb_block); @@ -5178,8 +5184,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) kfree(get_qf_name(sb, sbi, i)); #endif fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy); - ext4_blkdev_remove(sbi); + /* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */ brelse(bh); + ext4_blkdev_remove(sbi); out_fail: sb->s_fs_info = NULL; kfree(sbi->s_blockgroup_lock);
On Wed, Apr 28, 2021 at 10:19:28PM +0000, Alexey Makhalov wrote:
I've recently discovered that doing infinite loop of systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount linearly increases percpu allocator memory consumption. In several hours, it might lead to system instability by consuming most of the memory.
Bug is not reproducible when the ext4 filesystem is on physical partition, but it is persistent when ext4 is on logical volume.
Why is this the case? It sounds like we're looking a buffer for each mount where the block size is not 1k. It shouldn't matter whether it is a physical partition or not.
- Ted
Hi Ted,
Good point! This paragraph can be just dropped as the next one describes the issue with superblock re-read. Will send v2 shortly.
Thanks, —Alexey
On May 20, 2021, at 9:43 PM, Theodore Y. Ts'o tytso@mit.edu wrote:
On Wed, Apr 28, 2021 at 10:19:28PM +0000, Alexey Makhalov wrote:
I've recently discovered that doing infinite loop of systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount linearly increases percpu allocator memory consumption. In several hours, it might lead to system instability by consuming most of the memory.
Bug is not reproducible when the ext4 filesystem is on physical partition, but it is persistent when ext4 is on logical volume.
Why is this the case? It sounds like we're looking a buffer for each mount where the block size is not 1k. It shouldn't matter whether it is a physical partition or not.
- Ted
I've recently discovered that doing infinite loop of systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount linearly increases percpu allocator memory consumption. In several hours, it might lead to system instability by consuming most of the memory.
During debugging it was found that most of active percpu allocations are from /system.slice/<ext4_on_lvm>.mount memory cgroups (created by systemd for each mount). All of these cgroups are in dying state with refcount equal to 2. And most interesting that each mount/umount itera- tion creates exactly one dying memory cgroup.
Tracking down the remaining refcounts showed that it was charged from ext4_fill_super(). And the page is always 0 index in the page cache mapping.
The issue was hidden behind initial super block read using logical blocksize from bdev and adjusting blocksize later after reading actual block size from superblock. If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffers as buffer pointed by bh variable remained busy. So buffer and its page remains in the memory (leak). Super block reread logic does not happen when ext4 filesystem is on physical partition as blocksize is correct for initial superblock read.
brelse(bh), where bh is a buffer head of superblock page, must be called and bh references must be released before kill_bdev(). kill_bdev() subfunctions (see callstack below) won't be able to free not released buffer (even if it's clean) and superblock page won't be freed as well.
callstack: kill_bdev() ->truncate_inode_pages() ->truncate_inode_pages_range() ->truncate_cleanup_page() ->do_invalidatepage ->block_invalidatepage() ->try_to_release_page() == fail to release ->try_to_free_buffers() == fail to free ->drop_buffers() ->buffer_busy() == yes
Incorrect order of brelse() and kill_bdev() in ext4_fill_super() was introduced by commit ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") 13 years ago! Thanks to memory hungry percpu, it was easy to detect this issue now.
Fix this by moving the brelse() before sb_set_blocksize() and add a comment about the dependency.
In addition, fix similar issue under failed_mount: label (in the same function) about incorrect order of ext4_blkdev_remove() vs brelse() introduced by commit ac27a0ec112a ("ext4: initial copy of files from ext3")
Signed-off-by: Alexey Makhalov amakhalov@vmware.com Cc: stable@vger.kernel.org Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Cc: "Theodore Ts'o" tytso@mit.edu Cc: Andreas Dilger adilger.kernel@dilger.ca --- fs/ext4/super.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index b9693680463a..6c8f68309834 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4451,14 +4451,20 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) }
if (sb->s_blocksize != blocksize) { + /* + * bh must be released before kill_bdev(), otherwise + * it won't be freed and its page also. kill_bdev() + * is called by sb_set_blocksize(). + */ + brelse(bh); /* Validate the filesystem blocksize */ if (!sb_set_blocksize(sb, blocksize)) { ext4_msg(sb, KERN_ERR, "bad block size %d", blocksize); + bh = NULL; goto failed_mount; }
- brelse(bh); logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; offset = do_div(logical_sb_block, blocksize); bh = ext4_sb_bread_unmovable(sb, logical_sb_block); @@ -5178,8 +5184,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) kfree(get_qf_name(sb, sbi, i)); #endif fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy); - ext4_blkdev_remove(sbi); + /* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */ brelse(bh); + ext4_blkdev_remove(sbi); out_fail: sb->s_fs_info = NULL; kfree(sbi->s_blockgroup_lock);
On Fri, May 21, 2021 at 12:43:46AM -0700, Alexey Makhalov wrote:
Hi Ted,
Good point! This paragraph can be just dropped as the next one describes the issue with superblock re-read. Will send v2 shortly.
Thanks; for what it's worth, I'm going to be editing the commit description anyway; it's really helpful during the patch review to understand how you found the problem, and how to reproduce it. However, the commit description when it lands upstream will include a link to this mail thread on lore.kernel.org, and so including a long stack trace isn't really necessary.
I'm going to cut it down to something like this:
------------------------------ ext4: fix memory leak in ext4_fill_super
Buffer head references must be released before calling kill_bdev(); otherwise the buffer head (and its page referenced by b_data) will not be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffer head, and being busy, they were not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and the bh leak get amplified by a dying memory cgroup that also never gets freed, and memory consumption is much more easily noticed.
Signed-off-by: Alexey Makhalov amakhalov@vmware.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/459B4724-842E-4B47-B2E7-D29805431E69@vmware.com Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Signed-off-by: "Theodore Ts'o" tytso@mit.edu Cc: Andreas Dilger adilger.kernel@dilger.ca ------------------------------
The commit description above is 18 lines (exclusive of trailers and headers) versus 71 lines, and is much more to the point for someone who might be doing code archeology via "git log".
When submitting this as a patch, you can either use a separate cover letter (git format-patch --cover-letter) or including the explanation after the --- in the diff, so that it disappears before the commit is added via "git am". But it's not that hard for me to rework a commit description, so I'll take care of it for this patch; no need to send a V3.
Cheers,
- Ted
Sounds good! Thanks for the guidance and v3) —Alexey
On May 21, 2021, at 7:29 AM, Theodore Y. Ts'o tytso@mit.edu wrote:
On Fri, May 21, 2021 at 12:43:46AM -0700, Alexey Makhalov wrote:
Hi Ted,
Good point! This paragraph can be just dropped as the next one describes the issue with superblock re-read. Will send v2 shortly.
Thanks; for what it's worth, I'm going to be editing the commit description anyway; it's really helpful during the patch review to understand how you found the problem, and how to reproduce it. However, the commit description when it lands upstream will include a link to this mail thread on lore.kernel.org, and so including a long stack trace isn't really necessary.
I'm going to cut it down to something like this:
ext4: fix memory leak in ext4_fill_super
Buffer head references must be released before calling kill_bdev(); otherwise the buffer head (and its page referenced by b_data) will not be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffer head, and being busy, they were not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and the bh leak get amplified by a dying memory cgroup that also never gets freed, and memory consumption is much more easily noticed.
Signed-off-by: Alexey Makhalov amakhalov@vmware.com Cc: stable@vger.kernel.org Link: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne... Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Signed-off-by: "Theodore Ts'o" tytso@mit.edu Cc: Andreas Dilger adilger.kernel@dilger.ca
The commit description above is 18 lines (exclusive of trailers and headers) versus 71 lines, and is much more to the point for someone who might be doing code archeology via "git log".
When submitting this as a patch, you can either use a separate cover letter (git format-patch --cover-letter) or including the explanation after the --- in the diff, so that it disappears before the commit is added via "git am". But it's not that hard for me to rework a commit description, so I'll take care of it for this patch; no need to send a V3.
Cheers,
- Ted
linux-stable-mirror@lists.linaro.org