July 2024 - Linux-stable-mirror

[PATCH] mm: Fix endless reclaim on machines with unaccepted memory.

by Kirill A. Shutemov

Unaccepted memory is considered unusable free memory, which is not counted as free on the zone watermark check. This causes get_page_from_freelist() to accept more memory to hit the high watermark, but it creates problems in the reclaim path. The reclaim path encounters a failed zone watermark check and attempts to reclaim memory. This is usually successful, but if there is little or no reclaimable memory, it can result in endless reclaim with little to no progress. This can occur early in the boot process, just after start of the init process when the only reclaimable memory is the page cache of the init executable and its libraries. To address this issue, teach shrink_node() and shrink_zones() to accept memory before attempting to reclaim. Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com> Reported-by: Jianxiong Gao <jxgao(a)google.com> Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Cc: stable(a)vger.kernel.org # v6.5+ --- mm/internal.h | 9 +++++++++ mm/page_alloc.c | 8 +------- mm/vmscan.c | 36 ++++++++++++++++++++++++++++++++++++ 3 files changed, 46 insertions(+), 7 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index cc2c5e07fad3..ea55cbad061f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1515,4 +1515,13 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry, void workingset_update_node(struct xa_node *node); extern struct list_lru shadow_nodes; +#ifdef CONFIG_UNACCEPTED_MEMORY +bool try_to_accept_memory(struct zone *zone, unsigned int order); +#else +static inline bool try_to_accept_memory(struct zone *zone, unsigned int order) +{ + return false; +} +#endif /* CONFIG_UNACCEPTED_MEMORY */ + #endif /* __MM_INTERNAL_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9ecf99190ea2..9a108c92245f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -287,7 +287,6 @@ EXPORT_SYMBOL(nr_online_nodes); static bool page_contains_unaccepted(struct page *page, unsigned int order); static void accept_page(struct page *page, unsigned int order); -static bool try_to_accept_memory(struct zone *zone, unsigned int order); static inline bool has_unaccepted_memory(void); static bool __free_unaccepted(struct page *page); @@ -6940,7 +6939,7 @@ static bool try_to_accept_memory_one(struct zone *zone) return true; } -static bool try_to_accept_memory(struct zone *zone, unsigned int order) +bool try_to_accept_memory(struct zone *zone, unsigned int order) { long to_accept; int ret = false; @@ -6999,11 +6998,6 @@ static void accept_page(struct page *page, unsigned int order) { } -static bool try_to_accept_memory(struct zone *zone, unsigned int order) -{ - return false; -} - static inline bool has_unaccepted_memory(void) { return false; diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e34de9cd0d4..b2af1263b1bc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -5900,12 +5900,44 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); } +#ifdef CONFIG_UNACCEPTED_MEMORY +static bool node_try_to_accept_memory(pg_data_t *pgdat, struct scan_control *sc) +{ + bool progress = false; + struct zone *zone; + int z; + + for (z = 0; z <= sc->reclaim_idx; z++) { + zone = pgdat->node_zones + z; + if (!managed_zone(zone)) + continue; + + if (try_to_accept_memory(zone, sc->order)) + progress = true; + } + + return progress; +} +#else +static inline bool node_try_to_accept_memory(pg_data_t *pgdat, + struct scan_control *sc) +{ + return false; +} +#endif /* CONFIG_UNACCEPTED_MEMORY */ + static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) { unsigned long nr_reclaimed, nr_scanned, nr_node_reclaimed; struct lruvec *target_lruvec; bool reclaimable = false; + /* Try to accept memory before going for reclaim */ + if (node_try_to_accept_memory(pgdat, sc)) { + if (!should_continue_reclaim(pgdat, 0, sc)) + return; + } + if (lru_gen_enabled() && root_reclaim(sc)) { lru_gen_shrink_node(pgdat, sc); return; @@ -6118,6 +6150,10 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) GFP_KERNEL | __GFP_HARDWALL)) continue; + /* Try to accept memory before going for reclaim */ + if (try_to_accept_memory(zone, sc->order)) + continue; + /* * If we already have plenty of memory free for * compaction in this zone, don't free any more. -- 2.43.0

1 year, 5 months

4
9
0 0

[PATCH v5.10] ext4: fix error code saved on super block during file system abort

by Ajay Kaher

From: Gabriel Krisman Bertazi <krisman(a)collabora.com> [ Upstream commit 124e7c61deb27d758df5ec0521c36cf08d417f7a ] ext4_abort will eventually call ext4_errno_to_code, which translates the errno to an EXT4_ERR specific error. This means that ext4_abort expects an errno. By using EXT4_ERR_ here, it gets misinterpreted (as an errno), and ends up saving EXT4_ERR_EBUSY on the superblock during an abort, which makes no sense. ESHUTDOWN will get properly translated to EXT4_ERR_SHUTDOWN, so use that instead. Signed-off-by: Gabriel Krisman Bertazi <krisman(a)collabora.com> Link: https://lore.kernel.org/r/20211026173302.84000-1-krisman@collabora.com Signed-off-by: Theodore Ts'o <tytso(a)mit.edu> Signed-off-by: Ajay Kaher <ajay.kaher(a)broadcom.com> --- fs/ext4/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 160e5824948270..0e8406f5bf0aa0 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5820,7 +5820,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) } if (ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED)) - ext4_abort(sb, EXT4_ERR_ESHUTDOWN, "Abort forced by user"); + ext4_abort(sb, ESHUTDOWN, "Abort forced by user"); sb->s_flags = (sb->s_flags & ~SB_POSIXACL) | (test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0); -- cgit 1.2.3-korg

1 year, 5 months

2
2
0 0

[PATCH 5.10] scsi: core: Fix a use-after-free

by Maximilian Heyne

From: Bart Van Assche <bvanassche(a)acm.org> [ Upstream commit 8fe4ce5836e932f5766317cb651c1ff2a4cd0506 ] There are two .exit_cmd_priv implementations. Both implementations use resources associated with the SCSI host. Make sure that these resources are still available when .exit_cmd_priv is called by waiting inside scsi_remove_host() until the tag set has been freed. This commit fixes the following use-after-free: ================================================================== BUG: KASAN: use-after-free in srp_exit_cmd_priv+0x27/0xd0 [ib_srp] Read of size 8 at addr ffff888100337000 by task multipathd/16727 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x5e/0x5db kasan_report+0xab/0x120 srp_exit_cmd_priv+0x27/0xd0 [ib_srp] scsi_mq_exit_request+0x4d/0x70 blk_mq_free_rqs+0x143/0x410 __blk_mq_free_map_and_rqs+0x6e/0x100 blk_mq_free_tag_set+0x2b/0x160 scsi_host_dev_release+0xf3/0x1a0 device_release+0x54/0xe0 kobject_put+0xa5/0x120 device_release+0x54/0xe0 kobject_put+0xa5/0x120 scsi_device_dev_release_usercontext+0x4c1/0x4e0 execute_in_process_context+0x23/0x90 device_release+0x54/0xe0 kobject_put+0xa5/0x120 scsi_disk_release+0x3f/0x50 device_release+0x54/0xe0 kobject_put+0xa5/0x120 disk_release+0x17f/0x1b0 device_release+0x54/0xe0 kobject_put+0xa5/0x120 dm_put_table_device+0xa3/0x160 [dm_mod] dm_put_device+0xd0/0x140 [dm_mod] free_priority_group+0xd8/0x110 [dm_multipath] free_multipath+0x94/0xe0 [dm_multipath] dm_table_destroy+0xa2/0x1e0 [dm_mod] __dm_destroy+0x196/0x350 [dm_mod] dev_remove+0x10c/0x160 [dm_mod] ctl_ioctl+0x2c2/0x590 [dm_mod] dm_ctl_ioctl+0x5/0x10 [dm_mod] __x64_sys_ioctl+0xb4/0xf0 dm_ctl_ioctl+0x5/0x10 [dm_mod] __x64_sys_ioctl+0xb4/0xf0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Link: https://lore.kernel.org/r/20220826002635.919423-1-bvanassche@acm.org Fixes: 65ca846a5314 ("scsi: core: Introduce {init,exit}_cmd_priv()") Cc: Ming Lei <ming.lei(a)redhat.com> Cc: Christoph Hellwig <hch(a)lst.de> Cc: Mike Christie <michael.christie(a)oracle.com> Cc: Hannes Reinecke <hare(a)suse.de> Cc: John Garry <john.garry(a)huawei.com> Cc: Li Zhijian <lizhijian(a)fujitsu.com> Reported-by: Li Zhijian <lizhijian(a)fujitsu.com> Tested-by: Li Zhijian <lizhijian(a)fujitsu.com> Signed-off-by: Bart Van Assche <bvanassche(a)acm.org> Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com> [mheyne: fixed contextual conflicts: - drivers/scsi/hosts.c: due to missing commit 973dac8a8a14 ("scsi: core: Refine how we set tag_set NUMA node") - drivers/scsi/scsi_sysfs.c: due to missing commit 6f8191fdf41d ("block: simplify disk shutdown") - drivers/scsi/scsi_scan.c: due to missing commit 59506abe5e34 ("scsi: core: Inline scsi_mq_alloc_queue()")] Signed-off-by: Maximilian Heyne <mheyne(a)amazon.de> Cc: stable(a)vger.kernel.org # v5.10 --- drivers/scsi/hosts.c | 16 +++++++++++++--- drivers/scsi/scsi_lib.c | 6 +++++- drivers/scsi/scsi_priv.h | 2 +- drivers/scsi/scsi_scan.c | 1 + drivers/scsi/scsi_sysfs.c | 1 + include/scsi/scsi_host.h | 2 ++ 6 files changed, 23 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index 297f2b412d07..17fa1cd91da6 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -182,6 +182,15 @@ void scsi_remove_host(struct Scsi_Host *shost) scsi_proc_host_rm(shost); scsi_proc_hostdir_rm(shost->hostt); + /* + * New SCSI devices cannot be attached anymore because of the SCSI host + * state so drop the tag set refcnt. Wait until the tag set refcnt drops + * to zero because .exit_cmd_priv implementations may need the host + * pointer. + */ + kref_put(&shost->tagset_refcnt, scsi_mq_free_tags); + wait_for_completion(&shost->tagset_freed); + spin_lock_irqsave(shost->host_lock, flags); if (scsi_host_set_state(shost, SHOST_DEL)) BUG_ON(scsi_host_set_state(shost, SHOST_DEL_RECOVERY)); @@ -240,6 +249,9 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev, shost->dma_dev = dma_dev; + kref_init(&shost->tagset_refcnt); + init_completion(&shost->tagset_freed); + /* * Increase usage count temporarily here so that calling * scsi_autopm_put_host() will trigger runtime idle if there is @@ -312,6 +324,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev, pm_runtime_disable(&shost->shost_gendev); pm_runtime_set_suspended(&shost->shost_gendev); pm_runtime_put_noidle(&shost->shost_gendev); + kref_put(&shost->tagset_refcnt, scsi_mq_free_tags); fail: return error; } @@ -344,9 +357,6 @@ static void scsi_host_dev_release(struct device *dev) kfree(dev_name(&shost->shost_dev)); } - if (shost->tag_set.tags) - scsi_mq_destroy_tags(shost); - kfree(shost->shost_data); ida_simple_remove(&host_index_ida, shost->host_no); diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 14dec86ff749..64ae7bc2de60 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1923,9 +1923,13 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost) return blk_mq_alloc_tag_set(tag_set); } -void scsi_mq_destroy_tags(struct Scsi_Host *shost) +void scsi_mq_free_tags(struct kref *kref) { + struct Scsi_Host *shost = container_of(kref, typeof(*shost), + tagset_refcnt); + blk_mq_free_tag_set(&shost->tag_set); + complete(&shost->tagset_freed); } /** diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h index 1183dbed687c..3fee9af05925 100644 --- a/drivers/scsi/scsi_priv.h +++ b/drivers/scsi/scsi_priv.h @@ -93,7 +93,7 @@ extern void scsi_requeue_run_queue(struct work_struct *work); extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev); extern void scsi_start_queue(struct scsi_device *sdev); extern int scsi_mq_setup_tags(struct Scsi_Host *shost); -extern void scsi_mq_destroy_tags(struct Scsi_Host *shost); +extern void scsi_mq_free_tags(struct kref *kref); extern void scsi_exit_queue(void); extern void scsi_evt_thread(struct work_struct *work); struct request_queue; diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 6f7c4d41c51d..e8703b043805 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -273,6 +273,7 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget, kfree(sdev); goto out; } + kref_get(&sdev->host->tagset_refcnt); WARN_ON_ONCE(!blk_get_queue(sdev->request_queue)); sdev->request_queue->queuedata = sdev; diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 6cc4d0792e3d..bb37d698da1a 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -1481,6 +1481,7 @@ void __scsi_remove_device(struct scsi_device *sdev) mutex_unlock(&sdev->state_mutex); blk_cleanup_queue(sdev->request_queue); + kref_put(&sdev->host->tagset_refcnt, scsi_mq_free_tags); cancel_work_sync(&sdev->requeue_work); if (sdev->host->hostt->slave_destroy) diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h index 4a9f1e6e3aac..3979e946c3fd 100644 --- a/include/scsi/scsi_host.h +++ b/include/scsi/scsi_host.h @@ -548,6 +548,8 @@ struct Scsi_Host { struct scsi_host_template *hostt; struct scsi_transport_template *transportt; + struct kref tagset_refcnt; + struct completion tagset_freed; /* Area to keep a shared tag map */ struct blk_mq_tag_set tag_set; -- 2.40.1 Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597

1 year, 5 months

2
1
0 0

[PATCH v5.10] bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue

by Ashwin Kamat

From: Jason Xing <kernelxing(a)tencent.com> [ Upstream commit 6648e613226e18897231ab5e42ffc29e63fa3365 ] Fix NULL pointer data-races in sk_psock_skb_ingress_enqueue() which syzbot reported [1]. [1] BUG: KCSAN: data-race in sk_psock_drop / sk_psock_skb_ingress_enqueue write to 0xffff88814b3278b8 of 8 bytes by task 10724 on cpu 1: sk_psock_stop_verdict net/core/skmsg.c:1257 [inline] sk_psock_drop+0x13e/0x1f0 net/core/skmsg.c:843 sk_psock_put include/linux/skmsg.h:459 [inline] sock_map_close+0x1a7/0x260 net/core/sock_map.c:1648 unix_release+0x4b/0x80 net/unix/af_unix.c:1048 __sock_release net/socket.c:659 [inline] sock_close+0x68/0x150 net/socket.c:1421 __fput+0x2c1/0x660 fs/file_table.c:422 __fput_sync+0x44/0x60 fs/file_table.c:507 __do_sys_close fs/open.c:1556 [inline] __se_sys_close+0x101/0x1b0 fs/open.c:1541 __x64_sys_close+0x1f/0x30 fs/open.c:1541 do_syscall_64+0xd3/0x1d0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 read to 0xffff88814b3278b8 of 8 bytes by task 10713 on cpu 0: sk_psock_data_ready include/linux/skmsg.h:464 [inline] sk_psock_skb_ingress_enqueue+0x32d/0x390 net/core/skmsg.c:555 sk_psock_skb_ingress_self+0x185/0x1e0 net/core/skmsg.c:606 sk_psock_verdict_apply net/core/skmsg.c:1008 [inline] sk_psock_verdict_recv+0x3e4/0x4a0 net/core/skmsg.c:1202 unix_read_skb net/unix/af_unix.c:2546 [inline] unix_stream_read_skb+0x9e/0xf0 net/unix/af_unix.c:2682 sk_psock_verdict_data_ready+0x77/0x220 net/core/skmsg.c:1223 unix_stream_sendmsg+0x527/0x860 net/unix/af_unix.c:2339 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x140/0x180 net/socket.c:745 ____sys_sendmsg+0x312/0x410 net/socket.c:2584 ___sys_sendmsg net/socket.c:2638 [inline] __sys_sendmsg+0x1e9/0x280 net/socket.c:2667 __do_sys_sendmsg net/socket.c:2676 [inline] __se_sys_sendmsg net/socket.c:2674 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2674 do_syscall_64+0xd3/0x1d0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 value changed: 0xffffffff83d7feb0 -> 0x0000000000000000 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10713 Comm: syz-executor.4 Tainted: G W 6.8.0-syzkaller-08951-gfe46a7dd189e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024 Prior to this, commit 4cd12c6065df ("bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready()") fixed one NULL pointer similarly due to no protection of saved_data_ready. Here is another different caller causing the same issue because of the same reason. So we should protect it with sk_callback_lock read lock because the writer side in the sk_psock_drop() uses "write_lock_bh(&sk->sk_callback_lock);". To avoid errors that could happen in future, I move those two pairs of lock into the sk_psock_data_ready(), which is suggested by John Fastabend. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: syzbot+aa8c8ec2538929f18f2d(a)syzkaller.appspotmail.com Signed-off-by: Jason Xing <kernelxing(a)tencent.com> Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net> Reviewed-by: John Fastabend <john.fastabend(a)gmail.com> Closes: https://syzkaller.appspot.com/bug?extid=aa8c8ec2538929f18f2d Link: https://lore.kernel.org/all/20240329134037.92124-1-kerneljasonxing@gmail.com Link: https://lore.kernel.org/bpf/20240404021001.94815-1-kerneljasonxing@gmail.com Signed-off-by: Sasha Levin <sashal(a)kernel.org> [Ashwin: Regenerated the patch for v5.10] Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat(a)broadcom.com> --- include/linux/skmsg.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 1138dd3071db..a197c9a49e97 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -406,10 +406,12 @@ static inline void sk_psock_put(struct sock *sk, struct sk_psock *psock) static inline void sk_psock_data_ready(struct sock *sk, struct sk_psock *psock) { + read_lock_bh(&sk->sk_callback_lock); if (psock->parser.enabled) psock->parser.saved_data_ready(sk); else sk->sk_data_ready(sk); + read_unlock_bh(&sk->sk_callback_lock); } static inline void psock_set_prog(struct bpf_prog **pprog, -- 2.45.1

1 year, 5 months

2
1
0 0

[PATCH 5.10] bpf: Fix overrunning reservations in ringbuf

by Dominique Martinet

From: Daniel Borkmann <daniel(a)iogearbox.net> [ Upstream commit cfa1a2329a691ffd991fcf7248a57d752e712881 ] The BPF ring buffer internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters: consumer_pos is the consumer counter to show which logical position the consumer consumed the data, and producer_pos which is the producer counter denoting the amount of data reserved by all producers. Each time a record is reserved, the producer that "owns" the record will successfully advance producer counter. In user space each time a record is read, the consumer of the data advanced the consumer counter once it finished processing. Both counters are stored in separate pages so that from user space, the producer counter is read-only and the consumer counter is read-write. One aspect that simplifies and thus speeds up the implementation of both producers and consumers is how the data area is mapped twice contiguously back-to-back in the virtual memory, allowing to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for book-keeping the length and offset, and is inaccessible to the BPF program. Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ` for the BPF program to use. Bing-Jhong and Muhammad reported that it is however possible to make a second allocated memory chunk overlapping with the first chunk and as a result, the BPF program is now able to edit first chunk's header. For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets allocate a chunk B with size 0x3000. This will succeed because consumer_pos was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask` check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data pages. This means that chunk B at [0x4000,0x4008] is chunk A's header. bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong page and could cause a crash. Fix it by calculating the oldest pending_pos and check whether the range from the oldest outstanding record to the newest would span beyond the ring buffer size. If that is the case, then reject the request. We've tested with the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh) before/after the fix and while it seems a bit slower on some benchmarks, it is still not significantly enough to matter. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Bing-Jhong Billy Jheng <billy(a)starlabs.sg> Reported-by: Muhammad Ramdhan <ramdhan(a)starlabs.sg> Co-developed-by: Bing-Jhong Billy Jheng <billy(a)starlabs.sg> Co-developed-by: Andrii Nakryiko <andrii(a)kernel.org> Signed-off-by: Bing-Jhong Billy Jheng <billy(a)starlabs.sg> Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org> Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org> Link: https://lore.kernel.org/bpf/20240621140828.18238-1-daniel@iogearbox.net Signed-off-by: Dominique Martinet <dominique.martinet(a)atmark-techno.com> --- The only conflict with the patch was in the comment at top of the patch (the commit that had changed this comment, 583c1f420173 ("bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map type"), has nothing to do with this fix), so I went ahead with it. I'm not familiar with the ringbuf code but it doesn't look too wrong to me at first glance; and with this all stable branches are covered. kernel/bpf/ringbuf.c | 30 +++++++++++++++++++++++++----- 1 file changed, 25 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 1e4bf23528a3..eac0026e2fa6 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -41,9 +41,12 @@ struct bpf_ringbuf { * mapping consumer page as r/w, but restrict producer page to r/o. * This protects producer position from being modified by user-space * application and ruining in-kernel position tracking. + * Note that the pending counter is placed in the same + * page as the producer, so that it shares the same cache line. */ unsigned long consumer_pos __aligned(PAGE_SIZE); unsigned long producer_pos __aligned(PAGE_SIZE); + unsigned long pending_pos; char data[] __aligned(PAGE_SIZE); }; @@ -145,6 +148,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node) rb->mask = data_sz - 1; rb->consumer_pos = 0; rb->producer_pos = 0; + rb->pending_pos = 0; return rb; } @@ -323,9 +327,9 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr) static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) { - unsigned long cons_pos, prod_pos, new_prod_pos, flags; - u32 len, pg_off; + unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags; struct bpf_ringbuf_hdr *hdr; + u32 len, pg_off, tmp_size, hdr_len; if (unlikely(size > RINGBUF_MAX_RECORD_SZ)) return NULL; @@ -343,13 +347,29 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size) spin_lock_irqsave(&rb->spinlock, flags); } + pend_pos = rb->pending_pos; prod_pos = rb->producer_pos; new_prod_pos = prod_pos + len; - /* check for out of ringbuf space by ensuring producer position - * doesn't advance more than (ringbuf_size - 1) ahead + while (pend_pos < prod_pos) { + hdr = (void *)rb->data + (pend_pos & rb->mask); + hdr_len = READ_ONCE(hdr->len); + if (hdr_len & BPF_RINGBUF_BUSY_BIT) + break; + tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT; + tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8); + pend_pos += tmp_size; + } + rb->pending_pos = pend_pos; + + /* check for out of ringbuf space: + * - by ensuring producer position doesn't advance more than + * (ringbuf_size - 1) ahead + * - by ensuring oldest not yet committed record until newest + * record does not span more than (ringbuf_size - 1) */ - if (new_prod_pos - cons_pos > rb->mask) { + if (new_prod_pos - cons_pos > rb->mask || + new_prod_pos - pend_pos > rb->mask) { spin_unlock_irqrestore(&rb->spinlock, flags); return NULL; } -- 2.39.2

1 year, 5 months

2
2
0 0

[v6.9-stable PATCH] mm: page_ref: remove folio_try_get_rcu()

by Yang Shi

commit fa2690af573dfefb47ba6eef888797a64b6b5f3c upstream The below bug was reported on a non-SMP kernel: [ 275.267158][ T4335] ------------[ cut here ]------------ [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275! [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1 [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202 [ 275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000 [ 275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034 [ 275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006 [ 275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136 [ 275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008 [ 275.276790][ T4335] FS: 00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000 [ 275.277570][ T4335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0 [ 275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 275.280201][ T4335] Call Trace: [ 275.280499][ T4335] <TASK> [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153) [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174) [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212) [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264) [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148) [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1)) [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188) [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825) [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1)) [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209) [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204) [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722) [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106) [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350) [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350) [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1)) [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573) [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360 [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10 [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573) [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0 [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228) [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10 [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284) [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259) [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291) [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112) [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1)) [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in folio_ref_try_add_rcu() for non-SMP kernel. The process_vm_readv() calls GUP to pin the THP. An optimization for pinning THP instroduced by commit 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") calls try_grab_folio() to pin the THP, but try_grab_folio() is supposed to be called in atomic context for non-SMP kernel, for example, irq disabled or preemption disabled, due to the optimization introduced by commit e286781d5f2e ("mm: speculative page references"). The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") is not actually the root cause although it was bisected to. It just makes the problem exposed more likely. The follow up discussion suggested the optimization for non-SMP kernel may be out-dated and not worth it anymore [1]. So removing the optimization to silence the BUG. However calling try_grab_folio() in GUP slow path actually is unnecessary, so the following patch will clean this up. [1] https://lore.kernel.org/linux-mm/821cf1d6-92b9-4ac4-bacc-d8f2364ac14f@paulm… Link: https://lkml.kernel.org/r/20240625205350.1777481-1-yang@os.amperecomputing.… Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <yang(a)os.amperecomputing.com> Reported-by: kernel test robot <oliver.sang(a)intel.com> Tested-by: Oliver Sang <oliver.sang(a)intel.com> Acked-by: Peter Xu <peterx(a)redhat.com> Acked-by: David Hildenbrand <david(a)redhat.com> Cc: Christoph Lameter <cl(a)linux.com> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: Paul E. McKenney <paulmck(a)kernel.org> Cc: Rik van Riel <riel(a)surriel.com> Cc: Vivek Kasireddy <vivek.kasireddy(a)intel.com> Cc: <stable(a)vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- fs/netfs/buffered_write.c | 4 ++-- fs/smb/client/file.c | 4 ++-- include/linux/page_ref.h | 49 ++------------------------------------- mm/filemap.c | 10 ++++---- mm/gup.c | 2 +- 5 files changed, 12 insertions(+), 57 deletions(-) diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c index d2ce0849bb53..e6066ddbb3ac 100644 --- a/fs/netfs/buffered_write.c +++ b/fs/netfs/buffered_write.c @@ -811,7 +811,7 @@ static void netfs_extend_writeback(struct address_space *mapping, break; } - if (!folio_try_get_rcu(folio)) { + if (!folio_try_get(folio)) { xas_reset(xas); continue; } @@ -1028,7 +1028,7 @@ static ssize_t netfs_writepages_begin(struct address_space *mapping, if (!folio) break; - if (!folio_try_get_rcu(folio)) { + if (!folio_try_get(folio)) { xas_reset(xas); continue; } diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index 9be37d0fe724..4784fece4d99 100644 --- a/fs/smb/client/file.c +++ b/fs/smb/client/file.c @@ -2753,7 +2753,7 @@ static void cifs_extend_writeback(struct address_space *mapping, break; } - if (!folio_try_get_rcu(folio)) { + if (!folio_try_get(folio)) { xas_reset(xas); continue; } @@ -2989,7 +2989,7 @@ static ssize_t cifs_writepages_begin(struct address_space *mapping, if (!folio) break; - if (!folio_try_get_rcu(folio)) { + if (!folio_try_get(folio)) { xas_reset(xas); continue; } diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index d7c2d33baa7f..fdd2a75adb03 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -263,54 +263,9 @@ static inline bool folio_try_get(struct folio *folio) return folio_ref_add_unless(folio, 1, 0); } -static inline bool folio_ref_try_add_rcu(struct folio *folio, int count) -{ -#ifdef CONFIG_TINY_RCU - /* - * The caller guarantees the folio will not be freed from interrupt - * context, so (on !SMP) we only need preemption to be disabled - * and TINY_RCU does that for us. - */ -# ifdef CONFIG_PREEMPT_COUNT - VM_BUG_ON(!in_atomic() && !irqs_disabled()); -# endif - VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio); - folio_ref_add(folio, count); -#else - if (unlikely(!folio_ref_add_unless(folio, count, 0))) { - /* Either the folio has been freed, or will be freed. */ - return false; - } -#endif - return true; -} - -/** - * folio_try_get_rcu - Attempt to increase the refcount on a folio. - * @folio: The folio. - * - * This is a version of folio_try_get() optimised for non-SMP kernels. - * If you are still holding the rcu_read_lock() after looking up the - * page and know that the page cannot have its refcount decreased to - * zero in interrupt context, you can use this instead of folio_try_get(). - * - * Example users include get_user_pages_fast() (as pages are not unmapped - * from interrupt context) and the page cache lookups (as pages are not - * truncated from interrupt context). We also know that pages are not - * frozen in interrupt context for the purposes of splitting or migration. - * - * You can also use this function if you're holding a lock that prevents - * pages being frozen & removed; eg the i_pages lock for the page cache - * or the mmap_lock or page table lock for page tables. In this case, - * it will always succeed, and you could have used a plain folio_get(), - * but it's sometimes more convenient to have a common function called - * from both locked and RCU-protected contexts. - * - * Return: True if the reference count was successfully incremented. - */ -static inline bool folio_try_get_rcu(struct folio *folio) +static inline bool folio_ref_try_add(struct folio *folio, int count) { - return folio_ref_try_add_rcu(folio, 1); + return folio_ref_add_unless(folio, count, 0); } static inline int page_ref_freeze(struct page *page, int count) diff --git a/mm/filemap.c b/mm/filemap.c index 30de18c4fd28..367ea201468d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1823,7 +1823,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index) if (!folio || xa_is_value(folio)) goto out; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get(folio)) goto repeat; if (unlikely(folio != xas_reload(&xas))) { @@ -1977,7 +1977,7 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, if (!folio || xa_is_value(folio)) return folio; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get(folio)) goto reset; if (unlikely(folio != xas_reload(xas))) { @@ -2157,7 +2157,7 @@ unsigned filemap_get_folios_contig(struct address_space *mapping, if (xa_is_value(folio)) goto update_start; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get(folio)) goto retry; if (unlikely(folio != xas_reload(&xas))) @@ -2289,7 +2289,7 @@ static void filemap_get_read_batch(struct address_space *mapping, break; if (xa_is_sibling(folio)) break; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get(folio)) goto retry; if (unlikely(folio != xas_reload(&xas))) @@ -3448,7 +3448,7 @@ static struct folio *next_uptodate_folio(struct xa_state *xas, continue; if (folio_test_locked(folio)) continue; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get(folio)) continue; /* Has the page moved or been split? */ if (unlikely(folio != xas_reload(xas))) diff --git a/mm/gup.c b/mm/gup.c index 1611e73b1121..ec8570d25a6c 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -76,7 +76,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs) folio = page_folio(page); if (WARN_ON_ONCE(folio_ref_count(folio) < 0)) return NULL; - if (unlikely(!folio_ref_try_add_rcu(folio, refs))) + if (unlikely(!folio_ref_try_add(folio, refs))) return NULL; /* -- 2.41.0

1 year, 5 months

2
1
0 0

FAILED: patch "[PATCH] ACPI: processor_idle: Fix invalid comparison with insertion" failed to apply to 5.15-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 5.15-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>. To reproduce the conflict and resubmit, you may use the following commands: git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y git checkout FETCH_HEAD git cherry-pick -x 233323f9b9f828cd7cd5145ad811c1990b692542 # <resolve conflicts, build, test, etc.> git commit -s git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024071528-foothill-overdraft-d69a@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^.. Possible dependencies: 233323f9b9f8 ("ACPI: processor_idle: Fix invalid comparison with insertion sort for latency") 0e6078c3c673 ("ACPI: processor idle: Use swap() instead of open coding it") thanks, greg k-h ------------------ original commit in Linus's tree ------------------ From 233323f9b9f828cd7cd5145ad811c1990b692542 Mon Sep 17 00:00:00 2001 From: Kuan-Wei Chiu <visitorckw(a)gmail.com> Date: Tue, 2 Jul 2024 04:56:39 +0800 Subject: [PATCH] ACPI: processor_idle: Fix invalid comparison with insertion sort for latency The acpi_cst_latency_cmp() comparison function currently used for sorting C-state latencies does not satisfy transitivity, causing incorrect sorting results. Specifically, if there are two valid acpi_processor_cx elements A and B and one invalid element C, it may occur that A < B, A = C, and B = C. Sorting algorithms assume that if A < B and A = C, then C < B, leading to incorrect ordering. Given the small size of the array (<=8), we replace the library sort function with a simple insertion sort that properly ignores invalid elements and sorts valid ones based on latency. This change ensures correct ordering of the C-state latencies. Fixes: 65ea8f2c6e23 ("ACPI: processor idle: Fix up C-state latency if not ordered") Reported-by: Julian Sikorski <belegdol(a)gmail.com> Closes: https://lore.kernel.org/lkml/70674dc7-5586-4183-8953-8095567e73df@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com> Tested-by: Julian Sikorski <belegdol(a)gmail.com> Cc: All applicable <stable(a)vger.kernel.org> Link: https://patch.msgid.link/20240701205639.117194-1-visitorckw@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index bd6a7857ce05..831fa4a12159 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -16,7 +16,6 @@ #include <linux/acpi.h> #include <linux/dmi.h> #include <linux/sched.h> /* need_resched() */ -#include <linux/sort.h> #include <linux/tick.h> #include <linux/cpuidle.h> #include <linux/cpu.h> @@ -386,25 +385,24 @@ static void acpi_processor_power_verify_c3(struct acpi_processor *pr, acpi_write_bit_register(ACPI_BITREG_BUS_MASTER_RLD, 1); } -static int acpi_cst_latency_cmp(const void *a, const void *b) +static void acpi_cst_latency_sort(struct acpi_processor_cx *states, size_t length) { - const struct acpi_processor_cx *x = a, *y = b; + int i, j, k; - if (!(x->valid && y->valid)) - return 0; - if (x->latency > y->latency) - return 1; - if (x->latency < y->latency) - return -1; - return 0; -} -static void acpi_cst_latency_swap(void *a, void *b, int n) -{ - struct acpi_processor_cx *x = a, *y = b; + for (i = 1; i < length; i++) { + if (!states[i].valid) + continue; - if (!(x->valid && y->valid)) - return; - swap(x->latency, y->latency); + for (j = i - 1, k = i; j >= 0; j--) { + if (!states[j].valid) + continue; + + if (states[j].latency > states[k].latency) + swap(states[j].latency, states[k].latency); + + k = j; + } + } } static int acpi_processor_power_verify(struct acpi_processor *pr) @@ -449,10 +447,7 @@ static int acpi_processor_power_verify(struct acpi_processor *pr) if (buggy_latency) { pr_notice("FW issue: working around C-state latencies out of order\n"); - sort(&pr->power.states[1], max_cstate, - sizeof(struct acpi_processor_cx), - acpi_cst_latency_cmp, - acpi_cst_latency_swap); + acpi_cst_latency_sort(&pr->power.states[1], max_cstate); } lapic_timer_propagate_broadcast(pr);

1 year, 5 months

3
2
0 0

Missing writer side af_unix annotations in 4.19.317

by syphyr

Several commits were made in v4.19.317 for af_unix to fix race conditions. Most of the READ_ONCE commits were added to v4.19.317, but the WRITE_ONCE on the writer side did not get added. For example, https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=…

1 year, 5 months

2
1
0 0

Re: [PATCH] ARM: fix get_user() broken with veneer

by John Stultz

On Tue, Sep 26, 2023 at 9:09 AM Masahiro Yamada <masahiroy(a)kernel.org> wrote: > > The 32-bit ARM kernel stops working if the kernel grows to the point > where veneers for __get_user_* are created. > > AAPCS32 [1] states, "Register r12 (IP) may be used by a linker as a > scratch register between a routine and any subroutine it calls. It > can also be used within a routine to hold intermediate values between > subroutine calls." > > However, bl instructions buried within the inline asm are unpredictable > for compilers; hence, "ip" must be added to the clobber list. > > This becomes critical when veneers for __get_user_* are created because > veneers use the ip register since commit 02e541db0540 ("ARM: 8323/1: > force linker to use PIC veneers"). > > [1]: https://github.com/ARM-software/abi-aa/blob/2023Q1/aapcs32/aapcs32.rst > > Signed-off-by: Masahiro Yamada <masahiroy(a)kernel.org> > Reviewed-by: Ard Biesheuvel <ardb(a)kernel.org> + stable(a)vger.kernel.org It seems like this (commit 24d3ba0a7b44c1617c27f5045eecc4f34752ab03 upstream) would be a good candidate for -stable? The issue it fixes can manifest in lots of very strange ways, so it would be good to avoid others getting tripped up by it on -stable branches. (Apologies for being a bit verbose in the following, I've included a lot of details and breadcrumbs so others might find this if they run into the same issues.) I was recently looking into an arm32 issue, and found getting a custom built kernel consistently working in qemu-system-arm to bisect issues in the range of 5.15-6.6 was a bit difficult, as I would hit a couple different odd errors. For 5.15 I was seeing systemd fail to start in a fairly opaque way: starting systemd-udevd.service - Rule-based Manager for Device Events and Files. systemd-udevd.service: Main process exited, code=exited, status=1/FAILURE systemd-udevd.service: Failed with result 'exit-code'. Failed to start systemd-udevd.service - Rule-based Manager for Device Events and Files. But further looking through the logs I found: systemd[1]: Failed to open netlink: Operation not permitted Despite lots of digging to try to understand what was going wrong, the one thing that worked was switching to CONFIG_CC_OPTIMIZE_FOR_SIZE (which I only tried as I came across this old thread: https://lists.yoctoproject.org/g/linux-yocto/message/8035 ), this seemed very suspicious, but I didn't have a lot of time to dig further. That resolved things until ~6.1, where I started seeing crashes at init: [ 16.982562] Run /init as init process [ 16.989311] Failed to execute /init (error -22) [ 16.990017] Run /sbin/init as init process [ 16.994737] Starting init: /sbin/init exists but couldn't execute it (error -22) That I bisected that failure down to being supposedly caused by commit 5750121ae738 ("kbuild: list sub-directories in ./Kbuild") https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?… And searching around that commit luckily led me to this change, which finally seems to resolve the different issues I saw for 6.6, 6.1 and 5.15! Now, In my rush to get something booting with qemu, I started with the debian config but disabled modules, and didn't put much time into getting rid of config options or drivers I wouldn't need. So the kernel is pretty large. So maybe not super common, but I definitely wouldn't want others to have to go down this debugging rabbit hole. thanks -john

1 year, 5 months

3
2
0 0

[PATCH 2/2] serial: sc16is7xx: fix invalid FIFO access with special register set

by Hugo Villeneuve

From: Hugo Villeneuve <hvilleneuve(a)dimonoff.com> When enabling access to the special register set, Receiver time-out and RHR interrupts can happen. In this case, the IRQ handler will try to read from the FIFO thru the RHR register at address 0x00, but address 0x00 is mapped to DLL register, resulting in erroneous FIFO reading. Call graph example: sc16is7xx_startup(): entry sc16is7xx_ms_proc(): entry sc16is7xx_set_termios(): entry sc16is7xx_set_baud(): DLH/DLL = $009C --> access special register set sc16is7xx_port_irq() entry --> IIR is 0x0C sc16is7xx_handle_rx() entry sc16is7xx_fifo_read(): --> unable to access FIFO (RHR) because it is mapped to DLL (LCR=LCR_CONF_MODE_A) sc16is7xx_set_baud(): exit --> Restore access to general register set Fix the problem by claiming the efr_lock mutex when accessing the Special register set. Fixes: dfeae619d781 ("serial: sc16is7xx") Cc: <stable(a)vger.kernel.org> Signed-off-by: Hugo Villeneuve <hvilleneuve(a)dimonoff.com> --- drivers/tty/serial/sc16is7xx.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/tty/serial/sc16is7xx.c b/drivers/tty/serial/sc16is7xx.c index 58696e05492c..b4c1798a1df2 100644 --- a/drivers/tty/serial/sc16is7xx.c +++ b/drivers/tty/serial/sc16is7xx.c @@ -592,6 +592,8 @@ static int sc16is7xx_set_baud(struct uart_port *port, int baud) SC16IS7XX_MCR_CLKSEL_BIT, prescaler == 1 ? 0 : SC16IS7XX_MCR_CLKSEL_BIT); + mutex_lock(&one->efr_lock); + /* Backup LCR and access special register set (DLL/DLH) */ lcr = sc16is7xx_port_read(port, SC16IS7XX_LCR_REG); sc16is7xx_port_write(port, SC16IS7XX_LCR_REG, @@ -606,6 +608,8 @@ static int sc16is7xx_set_baud(struct uart_port *port, int baud) /* Restore LCR and access to general register set */ sc16is7xx_port_write(port, SC16IS7XX_LCR_REG, lcr); + mutex_unlock(&one->efr_lock); + return DIV_ROUND_CLOSEST((clk / prescaler) / 16, div); } -- 2.39.2

1 year, 5 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror July 2024