July 2024 - Linux-stable-mirror

[PATCH v2 1/2] ext4: Forcing subclasses to have same name pointer as their parent class

by botta633

From: Ahmed Ehab <bottaawesome633(a)gmail.com> Preventing lockdep_set_subclass from creating a new instance of the string literal. Hence, we will always have the same class->name among parent and subclasses. This prevents kernel panics when looking up a lock class while comparing class locks and class names. Reported-by: <syzbot+7f4a6f7f7051474e40ad(a)syzkaller.appspotmail.com> Fixes: fd5e3f5fe27 Cc: <stable(a)vger.kernel.org> Signed-off-by: Ahmed Ehab <bottaawesome633(a)gmail.com> --- include/linux/lockdep.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 08b0d1d9d78b..df8fa5929de7 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -173,7 +173,7 @@ static inline void lockdep_init_map(struct lockdep_map *lock, const char *name, (lock)->dep_map.lock_type) #define lockdep_set_subclass(lock, sub) \ - lockdep_init_map_type(&(lock)->dep_map, #lock, (lock)->dep_map.key, sub,\ + lockdep_init_map_type(&(lock)->dep_map, (lock)->dep_map.name, (lock)->dep_map.key, sub,\ (lock)->dep_map.wait_type_inner, \ (lock)->dep_map.wait_type_outer, \ (lock)->dep_map.lock_type) -- 2.45.2

1 year, 5 months

3
3
0 0

Re: [PATCH v2 1/2] ext4: Forcing subclasses to have same name pointer as their parent class

by Boqun Feng

On Mon, Jul 15, 2024 at 12:39:45AM +0300, ahmed Ehab wrote: > Ok, I will. > I just put ext4 because the syzkaller bug was mentioned in the ext4 > subsystem. > Thanks, > Ahmed > Please avoid top-posting. And > On Mon, Jul 15, 2024 at 12:22 AM Waiman Long <longman(a)redhat.com> wrote: > > > On 7/14/24 01:14, botta633 wrote: > > > From: Ahmed Ehab <bottaawesome633(a)gmail.com> > > > > > > Preventing lockdep_set_subclass from creating a new instance of the > > > string literal. Hence, we will always have the same class->name among > > > parent and subclasses. This prevents kernel panics when looking up a > > > lock class while comparing class locks and class names. > > > > > > Reported-by: <syzbot+7f4a6f7f7051474e40ad(a)syzkaller.appspotmail.com> > > > Fixes: fd5e3f5fe27 please add the title of the commit here as well, e.g. Fixes: <sha1> ("<title>") see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… for example. Regards, Boqun > > > Cc: <stable(a)vger.kernel.org> > > > Signed-off-by: Ahmed Ehab <bottaawesome633(a)gmail.com> > > > --- > > > include/linux/lockdep.h | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h > > > index 08b0d1d9d78b..df8fa5929de7 100644 > > > --- a/include/linux/lockdep.h > > > +++ b/include/linux/lockdep.h > > > @@ -173,7 +173,7 @@ static inline void lockdep_init_map(struct > > lockdep_map *lock, const char *name, > > > (lock)->dep_map.lock_type) > > > > > > #define lockdep_set_subclass(lock, sub) > > \ > > > - lockdep_init_map_type(&(lock)->dep_map, #lock, > > (lock)->dep_map.key, sub,\ > > > + lockdep_init_map_type(&(lock)->dep_map, (lock)->dep_map.name, > > (lock)->dep_map.key, sub,\ > > > (lock)->dep_map.wait_type_inner, \ > > > (lock)->dep_map.wait_type_outer, \ > > > (lock)->dep_map.lock_type) > > > > ext4 is a filesystem. It has nothing to do with locking/lockdep. Could > > you resend the patches with the proper prefix of "lockdep:" or > > "locking/lockdep:"? > > > > Thanks, > > Longman > > > >

1 year, 5 months

1
0
0 0

[PATCH net v2] net: netconsole: Disable target before netpoll cleanup

by Breno Leitao

Currently, netconsole cleans up the netpoll structure before disabling the target. This approach can lead to race conditions, as message senders (write_ext_msg() and write_msg()) check if the target is enabled before using netpoll. The sender can validate that the target is enabled, but, the netpoll might be de-allocated already, causing undesired behaviours. This patch reverses the order of operations: 1. Disable the target 2. Clean up the netpoll structure This change eliminates the potential race condition, ensuring that no messages are sent through a partially cleaned-up netpoll structure. Fixes: 2382b15bcc39 ("netconsole: take care of NETDEV_UNREGISTER event") Cc: stable(a)vger.kernel.org Signed-off-by: Breno Leitao <leitao(a)debian.org> --- Changelog: v2: * Targeting "net" instead of "net-dev" (Jakub) v1: * https://lore.kernel.org/all/20240709144403.544099-4-leitao@debian.org/ drivers/net/netconsole.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c index d7070dd4fe73..aa66c923790f 100644 --- a/drivers/net/netconsole.c +++ b/drivers/net/netconsole.c @@ -974,6 +974,7 @@ static int netconsole_netdev_event(struct notifier_block *this, /* rtnl_lock already held * we might sleep in __netpoll_cleanup() */ + nt->enabled = false; spin_unlock_irqrestore(&target_list_lock, flags); __netpoll_cleanup(&nt->np); @@ -981,7 +982,6 @@ static int netconsole_netdev_event(struct notifier_block *this, spin_lock_irqsave(&target_list_lock, flags); netdev_put(nt->np.dev, &nt->np.dev_tracker); nt->np.dev = NULL; - nt->enabled = false; stopped = true; netconsole_target_put(nt); goto restart; -- 2.43.0

1 year, 5 months

3
2
0 0

[PATCH v3] cxl: Fix possible null pointer dereference in read_handle()

by Ma Ke

In read_handle(), of_get_address() may return NULL which is later dereferenced. Fix this by adding NULL check. Based on our customized static analysis tool, extract vulnerability features[1], then match similar vulnerability features in this function. [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit /?id=2d9adecc88ab678785b581ab021f039372c324cb Cc: stable(a)vger.kernel.org Fixes: 14baf4d9c739 ("cxl: Add guest-specific code") Signed-off-by: Ma Ke <make24(a)iscas.ac.cn> --- Changes in v3: - fixed up the changelog text as suggestions. Changes in v2: - added an explanation of how the potential vulnerability was discovered, but not meet the description specification requirements. --- drivers/misc/cxl/of.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/misc/cxl/of.c b/drivers/misc/cxl/of.c index bcc005dff1c0..d8dbb3723951 100644 --- a/drivers/misc/cxl/of.c +++ b/drivers/misc/cxl/of.c @@ -58,7 +58,7 @@ static int read_handle(struct device_node *np, u64 *handle) /* Get address and size of the node */ prop = of_get_address(np, 0, &size, NULL); - if (size) + if (!prop || size) return -EINVAL; /* Helper to read a big number; size is in cells (not bytes) */ -- 2.25.1

1 year, 5 months

2
1
0 0

[PATCH v2 2/2] ext4: Testing lock class and subclass got the same name pointer

by botta633

From: Ahmed Ehab <bottaawesome633(a)gmail.com> Checking if the lockdep_map->name will change when setting the subclass. It shouldn't change so that the lock class and subclass will have the same name Reported-by: <syzbot+7f4a6f7f7051474e40ad(a)syzkaller.appspotmail.com> Fixes: fd5e3f5fe27 Cc: <stable(a)vger.kernel.org> Signed-off-by: botta633 <bottaawesome633(a)gmail.com> --- lib/locking-selftest.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c index 6f6a5fc85b42..1d7885205f36 100644 --- a/lib/locking-selftest.c +++ b/lib/locking-selftest.c @@ -2710,12 +2710,24 @@ static void local_lock_3B(void) } +static void class_subclass_X1_name(void) +{ + const char *name_before_subclass = rwsem_X1.dep_map.name; + const char *name_after_subclass; + + WARN_ON(!rwsem_X1.dep_map.name); + lockdep_set_subclass(&rwsem_X1, 1); + WARN_ON(name_before_subclass != name_after_subclass); +} + static void local_lock_tests(void) { printk(" --------------------------------------------------------------------------\n"); printk(" | local_lock tests |\n"); printk(" ---------------------\n"); + init_class_X(&lock_X1, &rwlock_X1, &mutex_X1, &rwsem_X1); + print_testname("local_lock inversion 2"); dotest(local_lock_2, SUCCESS, LOCKTYPE_LL); pr_cont("\n"); @@ -2727,6 +2739,10 @@ static void local_lock_tests(void) print_testname("local_lock inversion 3B"); dotest(local_lock_3B, FAILURE, LOCKTYPE_LL); pr_cont("\n"); + + print_testname("Class and subclass"); + dotest(class_subclass_X1_name, SUCCESS, LOCKTYPE_RWSEM); + pr_cont("\n"); } static void hardirq_deadlock_softirq_not_deadlock(void) -- 2.45.2

1 year, 5 months

1
0
0 0

[PATCH v2] media: ov5675: Fix power on/off delay timings

by Bryan O'Donoghue

The ov5675 specification says that the gap between XSHUTDN deassert and the first I2C transaction should be a minimum of 8192 XVCLK cycles. Right now we use a usleep_rage() that gives a sleep time of between about 430 and 860 microseconds. On the Lenovo X13s we have observed that in about 1/20 cases the current timing is too tight and we start transacting before the ov5675's reset cycle completes, leading to I2C bus transaction failures. The reset racing is sometimes triggered at initial chip probe but, more usually on a subsequent power-off/power-on cycle e.g. [ 71.451662] ov5675 24-0010: failed to write reg 0x0103. error = -5 [ 71.451686] ov5675 24-0010: failed to set plls The current quiescence period we have is too tight. Instead of expressing the post reset delay in terms of the current XVCLK this patch converts the power-on and power-off delays to the maximum theoretical delay @ 6 MHz with an additional buffer. 1.365 milliseconds on the power-on path is 1.5 milliseconds with grace. 853 microseconds on the power-off path is 900 microseconds with grace. Fixes: 49d9ad719e89 ("media: ov5675: add device-tree support and support runtime PM") Cc: stable(a)vger.kernel.org Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org> --- v2: - Drop patch to read and act on reported XVCLK - Use worst-case timings + a reasonable grace period in-lieu of previous xvclk calculations on power-on and power-off. - Link to v1: https://lore.kernel.org/r/20240711-linux-next-ov5675-v1-0-69e9b6c62c16@lina… v1: One long running saga for me on the Lenovo X13s is the occasional failure to either probe or subsequently bring-up the ov5675 main RGB sensor on the laptop. Initially I suspected the PMIC for this part as the PMIC is using a new interface on an I2C bus instead of an SPMI bus. In particular I thought perhaps the I2C write to PMIC had completed but the regulator output hadn't become stable from the perspective of the SoC. This however doesn't appear to be the case - I can introduce a delay of milliseconds on the PMIC path without resolving the sensor reset problem. Secondly I thought about reset pin polarity or drive-strength but, again playing about with both didn't yield decent results. I also played with the duration of reset to no avail. The error manifested as an I2C write timeout to the sensor which indicated that the chip likely hadn't come out reset. An intermittent fault appearing in perhaps 1/10 or 1/20 reset cycles. Looking at the expression of the reset we see that there is a minimum time expressed in XVCLK cycles between reset completion and first I2C transaction to the sensor. The specification calls out the minimum delay @ 8192 XVCLK cycles and the ov5675 driver meets that timing almost exactly. A little too exactly - testing finally showed that we were too racy with respect to the minimum quiescence between reset completion and first command to the chip. Fixing this error I choose to base the fix again on the number of clocks but to also support any clock rate the chip could support by moving away from a define to reading and using the XVCLK. True enough only 19.2 MHz is currently supported but for the hypothetical case where some other frequency is supported in the future, I wanted the fix introduced in this series to still hold. Hence this series: 1. Allows for any clock rate to be used in the valid range for the reset. 2. Elongates the post-reset period based on clock cycles which can now vary. Patch #2 can still be backported to stable irrespective of patch #1. --- drivers/media/i2c/ov5675.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/media/i2c/ov5675.c b/drivers/media/i2c/ov5675.c index 3641911bc73f..547d6fab816a 100644 --- a/drivers/media/i2c/ov5675.c +++ b/drivers/media/i2c/ov5675.c @@ -972,12 +972,10 @@ static int ov5675_set_stream(struct v4l2_subdev *sd, int enable) static int ov5675_power_off(struct device *dev) { - /* 512 xvclk cycles after the last SCCB transation or MIPI frame end */ - u32 delay_us = DIV_ROUND_UP(512, OV5675_XVCLK_19_2 / 1000 / 1000); struct v4l2_subdev *sd = dev_get_drvdata(dev); struct ov5675 *ov5675 = to_ov5675(sd); - usleep_range(delay_us, delay_us * 2); + usleep_range(900, 1000); clk_disable_unprepare(ov5675->xvclk); gpiod_set_value_cansleep(ov5675->reset_gpio, 1); @@ -988,7 +986,6 @@ static int ov5675_power_off(struct device *dev) static int ov5675_power_on(struct device *dev) { - u32 delay_us = DIV_ROUND_UP(8192, OV5675_XVCLK_19_2 / 1000 / 1000); struct v4l2_subdev *sd = dev_get_drvdata(dev); struct ov5675 *ov5675 = to_ov5675(sd); int ret; @@ -1014,8 +1011,11 @@ static int ov5675_power_on(struct device *dev) gpiod_set_value_cansleep(ov5675->reset_gpio, 0); - /* 8192 xvclk cycles prior to the first SCCB transation */ - usleep_range(delay_us, delay_us * 2); + /* Worst case quiesence gap is 1.365 milliseconds @ 6MHz XVCLK + * Add an additional threshold grace period to ensure reset + * completion before initiating our first I2C transaction. + */ + usleep_range(1500, 1600); return 0; } --- base-commit: 523b23f0bee3014a7a752c9bb9f5c54f0eddae88 change-id: 20240710-linux-next-ov5675-60b0e83c73f1 Best regards, -- Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>

1 year, 5 months

2
2
0 0

[PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()

by Mateusz Jończyk

Linux 6.9+ is unable to start a degraded RAID1 array with one drive, when that drive has a write-mostly flag set. During such an attempt, the following assertion in bio_split() is hit: BUG_ON(sectors <= 0); Call Trace: ? bio_split+0x96/0xb0 ? exc_invalid_op+0x53/0x70 ? bio_split+0x96/0xb0 ? asm_exc_invalid_op+0x1b/0x20 ? bio_split+0x96/0xb0 ? raid1_read_request+0x890/0xd20 ? __call_rcu_common.constprop.0+0x97/0x260 raid1_make_request+0x81/0xce0 ? __get_random_u32_below+0x17/0x70 ? new_slab+0x2b3/0x580 md_handle_request+0x77/0x210 md_submit_bio+0x62/0xa0 __submit_bio+0x17b/0x230 submit_bio_noacct_nocheck+0x18e/0x3c0 submit_bio_noacct+0x244/0x670 After investigation, it turned out that choose_slow_rdev() does not set the value of max_sectors in some cases and because of it, raid1_read_request calls bio_split with sectors == 0. Fix it by filling in this variable. This bug was introduced in commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") but apparently hidden until commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()") shortly thereafter. Cc: stable(a)vger.kernel.org # 6.9.x+ Signed-off-by: Mateusz Jończyk <mat.jonczyk(a)o2.pl> Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") Cc: Song Liu <song(a)kernel.org> Cc: Yu Kuai <yukuai3(a)huawei.com> Cc: Paul Luse <paul.e.luse(a)linux.intel.com> Cc: Xiao Ni <xni(a)redhat.com> Cc: Mariusz Tkaczyk <mariusz.tkaczyk(a)linux.intel.com> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/ -- Tested on both Linux 6.10 and 6.9.8. Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems: ./test --dev=loop --no-error --raidtype=raid1 (on 6.9.8 there was one failure, caused by external bitmap support not compiled in). Notes: - I was reliably getting deadlocks when adding / removing devices on such an array - while the array was loaded with fsstress with 20 concurrent processes. When the array was idle or loaded with fsstress with 8 processes, no such deadlocks happened in my tests. This occurred also on unpatched Linux 6.8.0 though, but not on 6.1.97-rc1, so this is likely an independent regression (to be investigated). - I was also getting deadlocks when adding / removing the bitmap on the array in similar conditions - this happened on Linux 6.1.97-rc1 also though. fsstress with 8 concurrent processes did cause it only once during many tests. - in my testing, there was once a problem with hot adding an internal bitmap to the array: mdadm: Cannot add bitmap while array is resyncing or reshaping etc. mdadm: failed to set internal bitmap. even though no such reshaping was happening according to /proc/mdstat. This seems unrelated, though. --- drivers/md/raid1.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 7b8a71ca66dd..82f70a4ce6ed 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio, len = r1_bio->sectors; read_len = raid1_check_read_range(rdev, this_sector, &len); if (read_len == r1_bio->sectors) { + *max_sectors = read_len; update_read_sectors(conf, disk, this_sector, read_len); return disk; } base-commit: 256abd8e550ce977b728be79a74e1729438b4948 -- 2.25.1

1 year, 5 months

4
4
0 0

[to-be-updated] mm-huge_memory-avoid-pmd-size-page-cache-if-needed.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm/huge_memory: avoid PMD-size page cache if needed has been removed from the -mm tree. Its filename was mm-huge_memory-avoid-pmd-size-page-cache-if-needed.patch This patch was dropped because an updated version will be issued ------------------------------------------------------ From: Gavin Shan <gshan(a)redhat.com> Subject: mm/huge_memory: avoid PMD-size page cache if needed Date: Thu, 11 Jul 2024 20:48:40 +1000 Currently, xarray can't support arbitrary page cache size and the largest and supported page cache size is defined as MAX_PAGECACHE_ORDER in commit 099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have 512MB page cache in the huge memory collapsing path on ARM64 system whose base page size is 64KB. A warning is raised when the huge page cache is split as shown in the following example. [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize KernelPageSize: 64 kB [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c : int main(int argc, char **argv) { const char *filename = TEST_XFS_FILENAME; int fd = 0; void *buf = (void *)-1, *p; int pgsize = getpagesize(); int ret = 0; if (pgsize != 0x10000) { fprintf(stdout, "System with 64KB base page size is required!\n"); return -EPERM; } system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb"); system("echo 1 > /proc/sys/vm/drop_caches"); /* Open xfs or shmem file */ fd = open(filename, O_RDONLY); assert(fd > 0); /* Create VMA */ buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0); assert(buf != (void *)-1); fprintf(stdout, "mapped buffer at 0x%p\n", buf); /* Populate VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ); assert(ret == 0); /* Collapse VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE); if (ret) { fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno); goto out; } /* Split xarray. The file needs to reopened with write permission */ munmap(buf, TEST_MEM_SIZE); buf = (void *)-1; close(fd); fd = open(filename, O_RDWR); assert(fd > 0); fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); out: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd); return ret; } [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test [root@dhcp-10-26-1-207 ~]# /tmp/test ------------[ cut here ]------------ WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \ xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \ sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x780 sp : ffff8000ac32f660 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x780 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2f0 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by avoiding PMD-sized page cache in the huge memory collapsing path. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and the madvise() system call to collapse the page caches. Link: https://lkml.kernel.org/r/20240711104840.200573-1-gshan@redhat.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan <gshan(a)redhat.com> Cc: David Hildenbrand <david(a)redhat.com> Cc: Matthew Wilcox <willy(a)infradead.org> Cc: Ryan Roberts <ryan.roberts(a)arm.com> Cc: William Kucharski <william.kucharski(a)oracle.com> Cc: <stable(a)vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/huge_memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- a/mm/huge_memory.c~mm-huge_memory-avoid-pmd-size-page-cache-if-needed +++ a/mm/huge_memory.c @@ -136,7 +136,8 @@ unsigned long __thp_vma_allowable_orders while (orders) { addr = vma->vm_end - (PAGE_SIZE << order); - if (thp_vma_suitable_order(vma, addr, order)) + if (!(vma->vm_file && order > MAX_PAGECACHE_ORDER) && + thp_vma_suitable_order(vma, addr, order)) break; order = next_order(&orders, order); } _ Patches currently in -mm which might be from gshan(a)redhat.com are

1 year, 5 months

1
0
0 0

[PATCH mm-unstable v1] mm/mglru: fix ineffective protection calculation

by Yu Zhao

mem_cgroup_calculate_protection() is not stateless and should only be used as part of a top-down tree traversal. shrink_one() traverses the per-node memcg LRU instead of the root_mem_cgroup tree, and therefore it should not call mem_cgroup_calculate_protection(). The existing misuse in shrink_one() can cause ineffective protection of sub-trees that are grandchildren of root_mem_cgroup. Fix it by reusing lru_gen_age_node(), which already traverses the root_mem_cgroup tree, to calculate the protection. Previously lru_gen_age_node() opportunistically skips the first pass, i.e., when scan_control->priority is DEF_PRIORITY. On the second pass, lruvec_is_sizable() uses appropriate scan_control->priority, set by set_initial_priority() from lru_gen_shrink_node(), to decide whether a memcg is too small to reclaim from. Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup tree. So it should call set_initial_priority() upfront, to make sure lruvec_is_sizable() uses appropriate scan_control->priority on the first pass. Otherwise, lruvec_is_reclaimable() can return false negatives and result in premature OOM kills when min_ttl_ms is used. Reported-by: T.J. Mercier <tjmercier(a)google.com> Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Cc: stable(a)vger.kernel.org Signed-off-by: Yu Zhao <yuzhao(a)google.com> --- mm/vmscan.c | 86 +++++++++++++++++++++++++---------------------------- 1 file changed, 40 insertions(+), 46 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 6216d79edb7f..525d3ffa8451 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3915,6 +3915,32 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, * working set protection ******************************************************************************/ +static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc) +{ + int priority; + unsigned long reclaimable; + + if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH) + return; + /* + * Determine the initial priority based on + * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, + * where reclaimed_to_scanned_ratio = inactive / total. + */ + reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE); + if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) + reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON); + + /* round down reclaimable and round up sc->nr_to_reclaim */ + priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1); + + /* + * The estimation is based on LRU pages only, so cap it to prevent + * overshoots of shrinker objects by large margins. + */ + sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY); +} + static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc) { int gen, type, zone; @@ -3948,19 +3974,17 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc struct mem_cgroup *memcg = lruvec_memcg(lruvec); DEFINE_MIN_SEQ(lruvec); + if (mem_cgroup_below_min(NULL, memcg)) + return false; + + if (!lruvec_is_sizable(lruvec, sc)) + return false; + /* see the comment on lru_gen_folio */ gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]); birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); - if (time_is_after_jiffies(birth + min_ttl)) - return false; - - if (!lruvec_is_sizable(lruvec, sc)) - return false; - - mem_cgroup_calculate_protection(NULL, memcg); - - return !mem_cgroup_below_min(NULL, memcg); + return time_is_before_jiffies(birth + min_ttl); } /* to protect the working set of the last N jiffies */ @@ -3970,23 +3994,20 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) { struct mem_cgroup *memcg; unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl); + bool reclaimable = !min_ttl; VM_WARN_ON_ONCE(!current_is_kswapd()); - /* check the order to exclude compaction-induced reclaim */ - if (!min_ttl || sc->order || sc->priority == DEF_PRIORITY) - return; + set_initial_priority(pgdat, sc); memcg = mem_cgroup_iter(NULL, NULL, NULL); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); - if (lruvec_is_reclaimable(lruvec, sc, min_ttl)) { - mem_cgroup_iter_break(NULL, memcg); - return; - } + mem_cgroup_calculate_protection(NULL, memcg); - cond_resched(); + if (!reclaimable) + reclaimable = lruvec_is_reclaimable(lruvec, sc, min_ttl); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL))); /* @@ -3994,7 +4015,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) * younger than min_ttl. However, another possibility is all memcgs are * either too small or below min. */ - if (mutex_trylock(&oom_lock)) { + if (!reclaimable && mutex_trylock(&oom_lock)) { struct oom_control oc = { .gfp_mask = sc->gfp_mask, }; @@ -4786,8 +4807,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc) struct mem_cgroup *memcg = lruvec_memcg(lruvec); struct pglist_data *pgdat = lruvec_pgdat(lruvec); - mem_cgroup_calculate_protection(NULL, memcg); - + /* lru_gen_age_node() called mem_cgroup_calculate_protection() */ if (mem_cgroup_below_min(NULL, memcg)) return MEMCG_LRU_YOUNG; @@ -4911,32 +4931,6 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc blk_finish_plug(&plug); } -static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc) -{ - int priority; - unsigned long reclaimable; - - if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH) - return; - /* - * Determine the initial priority based on - * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim, - * where reclaimed_to_scanned_ratio = inactive / total. - */ - reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE); - if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) - reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON); - - /* round down reclaimable and round up sc->nr_to_reclaim */ - priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1); - - /* - * The estimation is based on LRU pages only, so cap it to prevent - * overshoots of shrinker objects by large margins. - */ - sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY); -} - static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc) { struct blk_plug plug; -- 2.45.2.993.g49e7a77208-goog

1 year, 5 months

1
0
0 0

[merged mm-stable] mm-mglru-fix-overshooting-shrinker-memory.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: mm/mglru: fix overshooting shrinker memory has been removed from the -mm tree. Its filename was mm-mglru-fix-overshooting-shrinker-memory.patch This patch was dropped because it was merged into the mm-stable branch of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm ------------------------------------------------------ From: Yu Zhao <yuzhao(a)google.com> Subject: mm/mglru: fix overshooting shrinker memory Date: Thu, 11 Jul 2024 13:19:57 -0600 set_initial_priority() tries to jump-start global reclaim by estimating the priority based on cold/hot LRU pages. The estimation does not account for shrinker objects, and it cannot do so because their sizes can be in different units other than page. If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where ZFS ARC can use almost all system memory, set_initial_priority() can vastly underestimate how much memory ARC shrinker can evict and assign extreme low values to scan_control->priority, resulting in overshoots of shrinker objects. To reproduce the problem, using TrueNAS SCALE 24.04.0 with 32GB DRAM, a test ZFS pool and the following commands: fio --name=mglru.file --numjobs=36 --ioengine=io_uring \ --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \ --rw=randread --random_distribution=random \ --time_based --runtime=1h & for ((i = 0; i < 20; i++)) do sleep 120 fio --name=mglru.anon --numjobs=16 --ioengine=mmap \ --filename=/dev/zero --size=1024m --fadvise_hint=0 \ --rw=randrw --random_distribution=random \ --time_based --runtime=1m done To fix the problem: 1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent the jump-start from being overly aggressive. 2. Account for the progress from mm_account_reclaimed_pages(), to prevent kswapd_shrink_node() from raising the priority unnecessarily. Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao(a)google.com> Reported-by: Alexander Motin <mav(a)ixsystems.com> Cc: Wei Xu <weixugc(a)google.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/vmscan.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) --- a/mm/vmscan.c~mm-mglru-fix-overshooting-shrinker-memory +++ a/mm/vmscan.c @@ -4930,7 +4930,11 @@ static void set_initial_priority(struct /* round down reclaimable and round up sc->nr_to_reclaim */ priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1); - sc->priority = clamp(priority, 0, DEF_PRIORITY); + /* + * The estimation is based on LRU pages only, so cap it to prevent + * overshoots of shrinker objects by large margins. + */ + sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY); } static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc) @@ -6754,6 +6758,7 @@ static bool kswapd_shrink_node(pg_data_t { struct zone *zone; int z; + unsigned long nr_reclaimed = sc->nr_reclaimed; /* Reclaim a number of pages proportional to the number of zones */ sc->nr_to_reclaim = 0; @@ -6781,7 +6786,8 @@ static bool kswapd_shrink_node(pg_data_t if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) sc->order = 0; - return sc->nr_scanned >= sc->nr_to_reclaim; + /* account for progress from mm_account_reclaimed_pages() */ + return max(sc->nr_scanned, sc->nr_reclaimed - nr_reclaimed) >= sc->nr_to_reclaim; } /* Page allocator PCP high watermark is lowered if reclaim is active. */ _ Patches currently in -mm which might be from yuzhao(a)google.com are

1 year, 5 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror July 2024