From: Azeem Shaikh azeemshaikh38@gmail.com
[ Upstream commit c642256b91770e201519d037a91f255a617a4602 ]
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh azeemshaikh38@gmail.com Reviewed-by: Kees Cook keescook@chromium.org Message-Id: 20230510221119.3508930-1-azeemshaikh38@gmail.com Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/char_dev.c | 2 +- fs/super.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/char_dev.c b/fs/char_dev.c index 13deb45f1ec65..950b6919fb872 100644 --- a/fs/char_dev.c +++ b/fs/char_dev.c @@ -150,7 +150,7 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor, cd->major = major; cd->baseminor = baseminor; cd->minorct = minorct; - strlcpy(cd->name, name, sizeof(cd->name)); + strscpy(cd->name, name, sizeof(cd->name));
if (!prev) { cd->next = curr; diff --git a/fs/super.c b/fs/super.c index 04bc62ab7dfea..09668ddfbbd55 100644 --- a/fs/super.c +++ b/fs/super.c @@ -595,7 +595,7 @@ struct super_block *sget_fc(struct fs_context *fc, fc->s_fs_info = NULL; s->s_type = fc->fs_type; s->s_iflags |= fc->s_iflags; - strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id)); + strscpy(s->s_id, s->s_type->name, sizeof(s->s_id)); list_add_tail(&s->s_list, &super_blocks); hlist_add_head(&s->s_instances, &s->s_type->fs_supers); spin_unlock(&sb_lock); @@ -674,7 +674,7 @@ struct super_block *sget(struct file_system_type *type, return ERR_PTR(err); } s->s_type = type; - strlcpy(s->s_id, type->name, sizeof(s->s_id)); + strscpy(s->s_id, type->name, sizeof(s->s_id)); list_add_tail(&s->s_list, &super_blocks); hlist_add_head(&s->s_instances, &type->fs_supers); spin_unlock(&sb_lock);
From: Vladimir Sementsov-Ogievskiy vsementsov@yandex-team.ru
[ Upstream commit 88e4607034ee49e09e32d91d083dced5c2f4f127 ]
The motivation for this patch has been to enable using a stricter apparmor profile to prevent programs from reading any coredump in the system.
However, this became something else. The following details are based on Christian's and Linus' archeology into the history of the number "2" in the coredump handling code.
To make sure we're not accidently introducing some subtle behavioral change into the coredump code we set out on a voyage into the depths of history.git to figure out why this was O_RDWR in the first place.
Coredump handling was introduced over 30 years ago in commit ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)"). The original code used O_WRONLY:
open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)
However, this changed in 1993 and starting with commit 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code suddenly used the constant "2":
open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)
This was curious as in the same commit the kernel switched from constants to proper defines in other places such as KERNEL_DS and USER_DS and O_RDWR did already exist.
So why was "2" used? It turns out that open_namei() - an early version of what later turned into filp_open() - didn't accept O_RDWR.
A semantic quirk of the open() uapi is the definition of the O_RDONLY flag. It would seem natural to define:
#define O_RDWR (O_RDONLY | O_WRONLY)
but that isn't possible because:
#define O_RDONLY 0
This makes O_RDONLY effectively meaningless when passed to the kernel. In other words, there has never been a way - until O_PATH at least - to open a file without any permission; O_RDONLY was always implied on the uapi side while the kernel does in fact allow opening files without permissions.
The trouble comes when trying to map the uapi flags onto the corresponding file mode flags FMODE_{READ,WRITE}. This mapping still happens today and is causing issues to this day (We ran into this during additions for openat2() for example.).
So the special value "3" was used to indicate that the file was opened for special access:
f->f_flags = flag = flags; f->f_mode = (flag+1) & O_ACCMODE; if (f->f_mode) flag++;
This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The special access then required read-write permissions and 0 was used to access symlinks.
But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an argument. So the coredump handling introduced in ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on accident.
So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which was correct.
The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2" value was passed.
Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code was changed to:
- dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600); ... + file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
At this point the raw "2" should have become O_WRONLY again as filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.
Another 17 years later, the code was changed again cementing the mistake and making it almost impossible to detect when commit 378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories") replaced the raw "2" with O_RDWR.
And now, here we are with this patch that sent us on a quest to answer the big questions in life such as "Why are coredump files opened with O_RDWR?" and "Is it safe to just use O_WRONLY?".
So with this commit we're reintroducing O_WRONLY again and bringing this code back to its original state when it was first introduced in commit ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.
Signed-off-by: Vladimir Sementsov-Ogievskiy vsementsov@yandex-team.ru Message-Id: 20230420120409.602576-1-vsementsov@yandex-team.ru [brauner@kernel.org: completely rewritten commit message] Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/coredump.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/coredump.c b/fs/coredump.c index 88740c51b9420..9d235fa14ab98 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -648,7 +648,7 @@ void do_coredump(const kernel_siginfo_t *siginfo) } else { struct mnt_idmap *idmap; struct inode *inode; - int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW | + int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW | O_LARGEFILE | O_EXCL;
if (cprm.limit < binfmt->min_coredump)
From: David Sterba dsterba@suse.com
[ Upstream commit b7a9a503c38d665c05a789132b632d81ec0b2703 ]
The following warning pops up with enabled UBSAN in tests fstests/generic/303:
[23127.529395] UBSAN: Undefined behaviour in fs/read_write.c:1725:7 [23127.529400] signed integer overflow: [23127.529403] 4611686018427322368 + 9223372036854775807 cannot be represented in type 'long long int' [23127.529412] CPU: 4 PID: 26180 Comm: xfs_io Not tainted 5.2.0-rc2-1.ge195904-vanilla+ #450 [23127.556999] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [23127.557001] Call Trace: [23127.557060] dump_stack+0x67/0x9b [23127.557070] ubsan_epilogue+0x9/0x40 [23127.573496] handle_overflow+0xb3/0xc0 [23127.573514] do_clone_file_range+0x28f/0x2a0 [23127.573547] vfs_clone_file_range+0x35/0xb0 [23127.573564] ioctl_file_clone+0x8d/0xc0 [23127.590144] do_vfs_ioctl+0x300/0x700 [23127.590160] ksys_ioctl+0x70/0x80 [23127.590203] ? trace_hardirqs_off_thunk+0x1a/0x1c [23127.590210] __x64_sys_ioctl+0x16/0x20 [23127.590215] do_syscall_64+0x5c/0x1d0 [23127.590224] entry_SYSCALL_64_after_hwframe+0x49/0xbe [23127.590231] RIP: 0033:0x7ff6d7250327 [23127.590241] RSP: 002b:00007ffe3a38f1d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [23127.590246] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ff6d7250327 [23127.590249] RDX: 00007ffe3a38f220 RSI: 000000004020940d RDI: 0000000000000003 [23127.590252] RBP: 0000000000000000 R08: 00007ffe3a3c80a0 R09: 00007ffe3a3c8080 [23127.590255] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 0000000000000000 [23127.590260] R13: 0000000000000000 R14: 3fffffffffff0000 R15: 00007ff6d750a20c
As loff_t is a signed type, we should use the safe overflow checks instead of relying on compiler implementation.
The bogus values are intentional and the test is supposed to verify the boundary conditions.
Signed-off-by: David Sterba dsterba@suse.com Message-Id: 20230523162628.17071-1-dsterba@suse.com Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/remap_range.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/remap_range.c b/fs/remap_range.c index 1331a890f2f29..87ae4f0dc3aa0 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -15,6 +15,7 @@ #include <linux/mount.h> #include <linux/fs.h> #include <linux/dax.h> +#include <linux/overflow.h> #include "internal.h"
#include <linux/uaccess.h> @@ -101,10 +102,12 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in, static int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write) { + loff_t tmp; + if (unlikely(pos < 0 || len < 0)) return -EINVAL;
- if (unlikely((loff_t) (pos + len) < 0)) + if (unlikely(check_add_overflow(pos, len, &tmp))) return -EINVAL;
return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
From: Danila Chernetsov listdansp@mail.ru
[ Upstream commit aa4b92c5234878d55da96d387ea4d3695ca5e4ab ]
In ntfs_mft_data_extend_allocation_nolock(), if an error condition occurs prior to 'ctx' being set to a non-NULL value, avoid dereferencing the NULL 'ctx' pointer in error handling.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Danila Chernetsov listdansp@mail.ru Reviewed-by: Namjae Jeon linkinjeon@kernel.org Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ntfs/mft.c | 36 +++++++++++++++++++----------------- 1 file changed, 19 insertions(+), 17 deletions(-)
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c index 48030899dc6ec..0155f106ec344 100644 --- a/fs/ntfs/mft.c +++ b/fs/ntfs/mft.c @@ -1955,36 +1955,38 @@ static int ntfs_mft_data_extend_allocation_nolock(ntfs_volume *vol) "attribute.%s", es); NVolSetErrors(vol); } - a = ctx->attr; + if (ntfs_rl_truncate_nolock(vol, &mft_ni->runlist, old_last_vcn)) { ntfs_error(vol->sb, "Failed to truncate mft data attribute " "runlist.%s", es); NVolSetErrors(vol); } - if (mp_rebuilt && !IS_ERR(ctx->mrec)) { - if (ntfs_mapping_pairs_build(vol, (u8*)a + le16_to_cpu( + if (ctx) { + a = ctx->attr; + if (mp_rebuilt && !IS_ERR(ctx->mrec)) { + if (ntfs_mapping_pairs_build(vol, (u8 *)a + le16_to_cpu( a->data.non_resident.mapping_pairs_offset), old_alen - le16_to_cpu( - a->data.non_resident.mapping_pairs_offset), + a->data.non_resident.mapping_pairs_offset), rl2, ll, -1, NULL)) { - ntfs_error(vol->sb, "Failed to restore mapping pairs " + ntfs_error(vol->sb, "Failed to restore mapping pairs " "array.%s", es); - NVolSetErrors(vol); - } - if (ntfs_attr_record_resize(ctx->mrec, a, old_alen)) { - ntfs_error(vol->sb, "Failed to restore attribute " + NVolSetErrors(vol); + } + if (ntfs_attr_record_resize(ctx->mrec, a, old_alen)) { + ntfs_error(vol->sb, "Failed to restore attribute " "record.%s", es); + NVolSetErrors(vol); + } + flush_dcache_mft_record_page(ctx->ntfs_ino); + mark_mft_record_dirty(ctx->ntfs_ino); + } else if (IS_ERR(ctx->mrec)) { + ntfs_error(vol->sb, "Failed to restore attribute search " + "context.%s", es); NVolSetErrors(vol); } - flush_dcache_mft_record_page(ctx->ntfs_ino); - mark_mft_record_dirty(ctx->ntfs_ino); - } else if (IS_ERR(ctx->mrec)) { - ntfs_error(vol->sb, "Failed to restore attribute search " - "context.%s", es); - NVolSetErrors(vol); - } - if (ctx) ntfs_attr_put_search_ctx(ctx); + } if (!IS_ERR(mrec)) unmap_mft_record(mft_ni); up_write(&mft_ni->runlist.lock);
From: Prince Kumar Maurya princekumarmaurya06@gmail.com
[ Upstream commit ea2b62f305893992156a798f665847e0663c9f41 ]
sb_getblk(inode->i_sb, parent) return a null ptr and taking lock on that leads to the null-ptr-deref bug.
Reported-by: syzbot+aad58150cbc64ba41bdc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=aad58150cbc64ba41bdc Signed-off-by: Prince Kumar Maurya princekumarmaurya06@gmail.com Message-Id: 20230531013141.19487-1-princekumarmaurya06@gmail.com Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/sysv/itree.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c index b22764fe669c8..58d7f43a13712 100644 --- a/fs/sysv/itree.c +++ b/fs/sysv/itree.c @@ -145,6 +145,10 @@ static int alloc_branch(struct inode *inode, */ parent = block_to_cpu(SYSV_SB(inode->i_sb), branch[n-1].key); bh = sb_getblk(inode->i_sb, parent); + if (!bh) { + sysv_free_block(inode->i_sb, branch[n].key); + break; + } lock_buffer(bh); memset(bh->b_data, 0, blocksize); branch[n].bh = bh;
From: Kees Cook keescook@chromium.org
[ Upstream commit 820eb59da8c7ca7e705a02f37dda2be316807847 ]
To avoid confusing the compiler about possible negative sizes, switch "ssize" which can never be negative from int to u32. Seen with GCC 13:
../fs/jfs/namei.c: In function 'jfs_symlink': ../include/linux/fortify-string.h:57:33: warning: '__builtin_memcpy' pointer overflow between offset 0 and size [-2147483648, -1] [-Warray-bounds=] 57 | #define __underlying_memcpy __builtin_memcpy | ^ ... ../fs/jfs/namei.c:950:17: note: in expansion of macro 'memcpy' 950 | memcpy(ip->i_link, name, ssize); | ^~~~~~
Cc: Dave Kleikamp shaggy@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Dave Chinner dchinner@redhat.com Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Kees Cook keescook@chromium.org Acked-by: Jeff Xu jeffxu@chromium.org Message-Id: 20230204183355.never.877-kees@kernel.org Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/jfs/namei.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index b29d68b5eec53..494b9f4043cf6 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -876,7 +876,7 @@ static int jfs_symlink(struct mnt_idmap *idmap, struct inode *dip, tid_t tid; ino_t ino = 0; struct component_name dname; - int ssize; /* source pathname size */ + u32 ssize; /* source pathname size */ struct btstack btstack; struct inode *ip = d_inode(dentry); s64 xlen = 0; @@ -957,7 +957,7 @@ static int jfs_symlink(struct mnt_idmap *idmap, struct inode *dip, if (ssize > sizeof (JFS_IP(ip)->i_inline)) JFS_IP(ip)->mode2 &= ~INLINEEA;
- jfs_info("jfs_symlink: fast symlink added ssize:%d name:%s ", + jfs_info("jfs_symlink: fast symlink added ssize:%u name:%s ", ssize, name); } /* @@ -987,7 +987,7 @@ static int jfs_symlink(struct mnt_idmap *idmap, struct inode *dip, ip->i_size = ssize - 1; while (ssize) { /* This is kind of silly since PATH_MAX == 4K */ - int copy_size = min(ssize, PSIZE); + u32 copy_size = min_t(u32, ssize, PSIZE);
mp = get_metapage(ip, xaddr, PSIZE, 1);
From: Jan Kara jack@suse.cz
[ Upstream commit c541dce86c537714b6761a79a969c1623dfa222b ]
The reconfigure / remount code takes a lot of effort to protect filesystem's reconfiguration code from racing writes on remounting read-only. However during remounting read-only filesystem to read-write mode userspace writes can start immediately once we clear SB_RDONLY flag. This is inconvenient for example for ext4 because we need to do some writes to the filesystem (such as preparation of quota files) before we can take userspace writes so we are clearing SB_RDONLY flag before we are fully ready to accept userpace writes and syzbot has found a way to exploit this [1]. Also as far as I'm reading the code the filesystem remount code was protected from racing writes in the legacy mount path by the mount's MNT_READONLY flag so this is relatively new problem. It is actually fairly easy to protect remount read-write from racing writes using sb->s_readonly_remount flag so let's just do that instead of having to workaround these races in the filesystem code.
[1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/
Signed-off-by: Jan Kara jack@suse.cz Message-Id: 20230615113848.8439-1-jack@suse.cz Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/super.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/super.c b/fs/super.c index 09668ddfbbd55..860d7a4b14c7c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -903,6 +903,7 @@ int reconfigure_super(struct fs_context *fc) struct super_block *sb = fc->root->d_sb; int retval; bool remount_ro = false; + bool remount_rw = false; bool force = fc->sb_flags & SB_FORCE;
if (fc->sb_flags_mask & ~MS_RMT_MASK) @@ -920,7 +921,7 @@ int reconfigure_super(struct fs_context *fc) bdev_read_only(sb->s_bdev)) return -EACCES; #endif - + remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb); remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb); }
@@ -950,6 +951,14 @@ int reconfigure_super(struct fs_context *fc) if (retval) return retval; } + } else if (remount_rw) { + /* + * We set s_readonly_remount here to protect filesystem's + * reconfigure code from writes from userspace until + * reconfigure finishes. + */ + sb->s_readonly_remount = 1; + smp_wmb(); }
if (fc->ops->reconfigure) {
From: Alexander Gordeev agordeev@linux.ibm.com
[ Upstream commit 3e8261003bd28208986d3c42004510083c086e24 ]
Kernel Address Sanitizer uses 3 bits per byte to encode memory. That is the number of bits the start and end address of a memory range is shifted right when the corresponding shadow memory is created for that memory range.
The used memory mapping routine expects page-aligned addresses, while the above described 3-bit shift might turn the shadow memory range start and end boundaries into non-page-aligned in case the size of the original memory range is less than (PAGE_SIZE << 3). As result, the resulting shadow memory range could be short on one page.
Align on page boundary the start and end addresses when mapping a shadow memory range and avoid the described issue in the future.
Note, that does not fix a real problem, since currently no virtual regions of size less than (PAGE_SIZE << 3) exist.
Reviewed-by: Vasily Gorbik gor@linux.ibm.com Signed-off-by: Alexander Gordeev agordeev@linux.ibm.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/s390/boot/vmem.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/s390/boot/vmem.c b/arch/s390/boot/vmem.c index acb1f8b53105b..c67f59db7a512 100644 --- a/arch/s390/boot/vmem.c +++ b/arch/s390/boot/vmem.c @@ -45,6 +45,13 @@ static void pgtable_populate(unsigned long addr, unsigned long end, enum populat
static pte_t pte_z;
+static inline void kasan_populate(unsigned long start, unsigned long end, enum populate_mode mode) +{ + start = PAGE_ALIGN_DOWN(__sha(start)); + end = PAGE_ALIGN(__sha(end)); + pgtable_populate(start, end, mode); +} + static void kasan_populate_shadow(void) { pmd_t pmd_z = __pmd(__pa(kasan_early_shadow_pte) | _SEGMENT_ENTRY); @@ -95,17 +102,17 @@ static void kasan_populate_shadow(void) */
for_each_physmem_usable_range(i, &start, &end) - pgtable_populate(__sha(start), __sha(end), POPULATE_KASAN_MAP_SHADOW); + kasan_populate(start, end, POPULATE_KASAN_MAP_SHADOW); if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) { untracked_end = VMALLOC_START; /* shallowly populate kasan shadow for vmalloc and modules */ - pgtable_populate(__sha(VMALLOC_START), __sha(MODULES_END), POPULATE_KASAN_SHALLOW); + kasan_populate(VMALLOC_START, MODULES_END, POPULATE_KASAN_SHALLOW); } else { untracked_end = MODULES_VADDR; } /* populate kasan shadow for untracked memory */ - pgtable_populate(__sha(ident_map_size), __sha(untracked_end), POPULATE_KASAN_ZERO_SHADOW); - pgtable_populate(__sha(MODULES_END), __sha(_REGION1_SIZE), POPULATE_KASAN_ZERO_SHADOW); + kasan_populate(ident_map_size, untracked_end, POPULATE_KASAN_ZERO_SHADOW); + kasan_populate(MODULES_END, _REGION1_SIZE, POPULATE_KASAN_ZERO_SHADOW); }
static bool kasan_pgd_populate_zero_shadow(pgd_t *pgd, unsigned long addr,
From: Christophe JAILLET christophe.jaillet@wanadoo.fr
[ Upstream commit f471c6585c7f2f168b8eb2d19e2d6e5f22a6645f ]
get_elfcorehdr_size() returns a size_t, so there is no real point to store it in a u32.
Turn 'alloc_size' into a size_t.
Signed-off-by: Christophe JAILLET christophe.jaillet@wanadoo.fr Link: https://lore.kernel.org/r/0756118c9058338f3040edb91971d0bfd100027b.168668821... Signed-off-by: Alexander Gordeev agordeev@linux.ibm.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/s390/kernel/crash_dump.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c index 8a617be28bb48..7af69948b290a 100644 --- a/arch/s390/kernel/crash_dump.c +++ b/arch/s390/kernel/crash_dump.c @@ -568,9 +568,9 @@ static size_t get_elfcorehdr_size(int mem_chunk_cnt) int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size) { Elf64_Phdr *phdr_notes, *phdr_loads; + size_t alloc_size; int mem_chunk_cnt; void *ptr, *hdr; - u32 alloc_size; u64 hdr_off;
/* If we are not in kdump or zfcp/nvme dump mode return */
From: Alexander Gordeev agordeev@linux.ibm.com
[ Upstream commit c70505434c8defb111326a143e2dd2be09778593 ]
Fix virtual vs physical address confusion (which currently are the same).
Reviewed-by: Vasily Gorbik gor@linux.ibm.com Signed-off-by: Alexander Gordeev agordeev@linux.ibm.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/s390/include/asm/physmem_info.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/s390/include/asm/physmem_info.h b/arch/s390/include/asm/physmem_info.h index 8e9c582592b3f..9e41a74fce9ac 100644 --- a/arch/s390/include/asm/physmem_info.h +++ b/arch/s390/include/asm/physmem_info.h @@ -3,6 +3,7 @@ #define _ASM_S390_MEM_DETECT_H
#include <linux/types.h> +#include <asm/page.h>
enum physmem_info_source { MEM_DETECT_NONE = 0, @@ -133,7 +134,7 @@ static inline const char *get_rr_type_name(enum reserved_range_type t)
#define for_each_physmem_reserved_type_range(t, range, p_start, p_end) \ for (range = &physmem_info.reserved[t], *p_start = range->start, *p_end = range->end; \ - range && range->end; range = range->chain, \ + range && range->end; range = range->chain ? __va(range->chain) : NULL, \ *p_start = range ? range->start : 0, *p_end = range ? range->end : 0)
static inline struct reserved_range *__physmem_reserved_next(enum reserved_range_type *t, @@ -145,7 +146,7 @@ static inline struct reserved_range *__physmem_reserved_next(enum reserved_range return range; } if (range->chain) - return range->chain; + return __va(range->chain); while (++*t < RR_MAX) { range = &physmem_info.reserved[*t]; if (range->end)
From: Min-Hua Chen minhuadotchen@gmail.com
[ Upstream commit 38f1755a3e59a3f88e33030f8e4ee0421de2f05a ]
Fix the following sparse warnings by using __poll_t instead of unsigned type.
fs/eventpoll.c:541:9: sparse: warning: restricted __poll_t degrades to integer fs/eventfd.c:67:17: sparse: warning: restricted __poll_t degrades to integer
Signed-off-by: Min-Hua Chen minhuadotchen@gmail.com Message-Id: 20230511164628.336586-1-minhuadotchen@gmail.com Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/eventfd.c | 2 +- fs/eventpoll.c | 2 +- include/linux/eventfd.h | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/eventfd.c b/fs/eventfd.c index 95850a13ce8d0..6c06a527747f2 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -43,7 +43,7 @@ struct eventfd_ctx { int id; };
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask) +__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask) { unsigned long flags;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 266d45c7685b4..4b1b3362f697b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -536,7 +536,7 @@ static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi, #else
static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi, - unsigned pollflags) + __poll_t pollflags) { wake_up_poll(&ep->poll_wait, EPOLLIN | pollflags); } diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index 36a486505b081..98d31cdaca400 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -40,7 +40,7 @@ struct file *eventfd_fget(int fd); struct eventfd_ctx *eventfd_ctx_fdget(int fd); struct eventfd_ctx *eventfd_ctx_fileget(struct file *file); __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n); -__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, unsigned mask); +__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask); int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait, __u64 *cnt); void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
From: chenzhiyin zhiyin.chen@intel.com
[ Upstream commit a7bc2e8ddf3c8e1f5bfeb401f7ee112956cea259 ]
In the syscall test of UnixBench, performance regression occurred due to false sharing.
The lock and atomic members, including file::f_lock, file::f_count and file::f_pos_lock are highly contended and frequently updated in the high-concurrency test scenarios. perf c2c indentified one affected read access, file::f_op. To prevent false sharing, the layout of file struct is changed as following (A) f_lock, f_count and f_pos_lock are put together to share the same cache line. (B) The read mostly members, including f_path, f_inode, f_op are put into a separate cache line. (C) f_mode is put together with f_count, since they are used frequently at the same time. Due to '__randomize_layout' attribute of file struct, the updated layout only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'.
The optimization has been validated in the syscall test of UnixBench. performance gain is 30~50%. Furthermore, to confirm the optimization effectiveness on the other codes path, the results of fsdisk, fsbuffer and fstime are also shown.
Here are the detailed test results of unixbench.
Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk
Without Patch ------------------------------------------------------------------------ File Copy 1024 bufsize 2000 maxblocks 875052.1 KBps (30.0 s, 2 samples) File Copy 256 bufsize 500 maxblocks 235484.0 KBps (30.0 s, 2 samples) File Copy 4096 bufsize 8000 maxblocks 2815153.5 KBps (30.0 s, 2 samples) System Call Overhead 5772268.3 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX File Copy 1024 bufsize 2000 maxblocks 3960.0 875052.1 2209.7 File Copy 256 bufsize 500 maxblocks 1655.0 235484.0 1422.9 File Copy 4096 bufsize 8000 maxblocks 5800.0 2815153.5 4853.7 System Call Overhead 15000.0 5772268.3 3848.2 ======== System Benchmarks Index Score (Partial Only) 2768.3
With Patch ------------------------------------------------------------------------ File Copy 1024 bufsize 2000 maxblocks 1009977.2 KBps (30.0 s, 2 samples) File Copy 256 bufsize 500 maxblocks 264765.9 KBps (30.0 s, 2 samples) File Copy 4096 bufsize 8000 maxblocks 3052236.0 KBps (30.0 s, 2 samples) System Call Overhead 8237404.4 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX File Copy 1024 bufsize 2000 maxblocks 3960.0 1009977.2 2550.4 File Copy 256 bufsize 500 maxblocks 1655.0 264765.9 1599.8 File Copy 4096 bufsize 8000 maxblocks 5800.0 3052236.0 5262.5 System Call Overhead 15000.0 8237404.4 5491.6 ======== System Benchmarks Index Score (Partial Only) 3295.3
Signed-off-by: chenzhiyin zhiyin.chen@intel.com Message-Id: 20230601092400.27162-1-zhiyin.chen@intel.com Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/fs.h | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h index 133f0640fb241..6f96f99ab9511 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -956,29 +956,35 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) index < ra->start + ra->size); }
+/* + * f_{lock,count,pos_lock} members can be highly contended and share + * the same cacheline. f_{lock,mode} are very frequently used together + * and so share the same cacheline as well. The read-mostly + * f_{path,inode,op} are kept on a separate cacheline. + */ struct file { union { struct llist_node f_llist; struct rcu_head f_rcuhead; unsigned int f_iocb_flags; }; - struct path f_path; - struct inode *f_inode; /* cached value */ - const struct file_operations *f_op;
/* * Protects f_ep, f_flags. * Must not be taken from IRQ context. */ spinlock_t f_lock; - atomic_long_t f_count; - unsigned int f_flags; fmode_t f_mode; + atomic_long_t f_count; struct mutex f_pos_lock; loff_t f_pos; + unsigned int f_flags; struct fown_struct f_owner; const struct cred *f_cred; struct file_ra_state f_ra; + struct path f_path; + struct inode *f_inode; /* cached value */ + const struct file_operations *f_op;
u64 f_version; #ifdef CONFIG_SECURITY
From: Christoph Hellwig hch@lst.de
[ Upstream commit 4bb218a65a43782b1e75f5510744cb44795a1105 ]
buffer_check_dirty_writeback is only used by the block device aops, remove the export.
Signed-off-by: Christoph Hellwig hch@lst.de Message-Id: 20230608122958.276954-1-hch@lst.de Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/buffer.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/buffer.c b/fs/buffer.c index a7fc561758b1a..fe64356e89b8a 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -111,7 +111,6 @@ void buffer_check_dirty_writeback(struct folio *folio, bh = bh->b_this_page; } while (bh != head); } -EXPORT_SYMBOL(buffer_check_dirty_writeback);
/* * Block until a buffer comes unlocked. This doesn't stop it
From: Wen Yang wenyang.linux@foxmail.com
[ Upstream commit 33d8b5d7824c7175ed968b8e89e6db3566e9c177 ]
The EFD_SEMAPHORE flag should be displayed in fdinfo, as different value could affect the behavior of eventfd.
Suggested-by: Christian Brauner brauner@kernel.org Signed-off-by: Wen Yang wenyang.linux@foxmail.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Jens Axboe axboe@kernel.dk Cc: Christian Brauner brauner@kernel.org Cc: Christoph Hellwig hch@lst.de Cc: Dylan Yudaken dylany@fb.com Cc: David Woodhouse dwmw@amazon.co.uk Cc: Matthew Wilcox willy@infradead.org Cc: Eric Biggers ebiggers@google.com Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: tencent_05B9CFEFE6B9BC2A9B3A27886A122A7D9205@qq.com Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/eventfd.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/eventfd.c b/fs/eventfd.c index 6c06a527747f2..8aa36cd373516 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -33,10 +33,10 @@ struct eventfd_ctx { /* * Every time that a write(2) is performed on an eventfd, the * value of the __u64 being written is added to "count" and a - * wakeup is performed on "wqh". A read(2) will return the "count" - * value to userspace, and will reset "count" to zero. The kernel - * side eventfd_signal() also, adds to the "count" counter and - * issue a wakeup. + * wakeup is performed on "wqh". If EFD_SEMAPHORE flag was not + * specified, a read(2) will return the "count" value to userspace, + * and will reset "count" to zero. The kernel side eventfd_signal() + * also, adds to the "count" counter and issue a wakeup. */ __u64 count; unsigned int flags; @@ -301,6 +301,8 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f) (unsigned long long)ctx->count); spin_unlock_irq(&ctx->wqh.lock); seq_printf(m, "eventfd-id: %d\n", ctx->id); + seq_printf(m, "eventfd-semaphore: %d\n", + !!(ctx->flags & EFD_SEMAPHORE)); } #endif
From: Jeff Layton jlayton@kernel.org
[ Upstream commit 797a1d894d7b7586f422cabf8d7807cd39d0b5aa ]
When adding entries to a directory, POSIX generally requires that the ctime also be updated alongside the mtime.
Signed-off-by: Jeff Layton jlayton@kernel.org Acked-by: Ian Kent raven@themaw.net Message-Id: 20230612104524.17058-4-jlayton@kernel.org Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/autofs/root.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/autofs/root.c b/fs/autofs/root.c index 6baf90b08e0ef..93046c9dc4610 100644 --- a/fs/autofs/root.c +++ b/fs/autofs/root.c @@ -600,7 +600,7 @@ static int autofs_dir_symlink(struct mnt_idmap *idmap, p_ino = autofs_dentry_ino(dentry->d_parent); p_ino->count++;
- dir->i_mtime = current_time(dir); + dir->i_mtime = dir->i_ctime = current_time(dir);
return 0; } @@ -633,7 +633,7 @@ static int autofs_dir_unlink(struct inode *dir, struct dentry *dentry) d_inode(dentry)->i_size = 0; clear_nlink(d_inode(dentry));
- dir->i_mtime = current_time(dir); + dir->i_mtime = dir->i_ctime = current_time(dir);
spin_lock(&sbi->lookup_lock); __autofs_add_expiring(dentry); @@ -749,7 +749,7 @@ static int autofs_dir_mkdir(struct mnt_idmap *idmap, p_ino = autofs_dentry_ino(dentry->d_parent); p_ino->count++; inc_nlink(dir); - dir->i_mtime = current_time(dir); + dir->i_mtime = dir->i_ctime = current_time(dir);
return 0; }
From: Jan Kara jack@suse.cz
[ Upstream commit d7439fb1f4338fffd0bc68bb62d78f7712725f26 ]
Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what the barriers pair with and why they are needed.
Suggested-by: Dave Chinner david@fromorbit.com Signed-off-by: Jan Kara jack@suse.cz Reviewed-by: Dave Chinner dchinner@redhat.com Message-Id: 20230620112832.5158-1-jack@suse.cz Signed-off-by: Christian Brauner brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/internal.h | 41 +++++++++++++++++++++++++++++++++++++++++ fs/namespace.c | 25 ++++++++++++++++--------- fs/super.c | 17 ++++++----------- include/linux/fs.h | 2 +- 4 files changed, 64 insertions(+), 21 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h index bd3b2810a36b6..b916b84809f36 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -120,6 +120,47 @@ void put_super(struct super_block *sb); extern bool mount_capable(struct fs_context *); int sb_init_dio_done_wq(struct super_block *sb);
+/* + * Prepare superblock for changing its read-only state (i.e., either remount + * read-write superblock read-only or vice versa). After this function returns + * mnt_is_readonly() will return true for any mount of the superblock if its + * caller is able to observe any changes done by the remount. This holds until + * sb_end_ro_state_change() is called. + */ +static inline void sb_start_ro_state_change(struct super_block *sb) +{ + WRITE_ONCE(sb->s_readonly_remount, 1); + /* + * For RO->RW transition, the barrier pairs with the barrier in + * mnt_is_readonly() making sure if mnt_is_readonly() sees SB_RDONLY + * cleared, it will see s_readonly_remount set. + * For RW->RO transition, the barrier pairs with the barrier in + * __mnt_want_write() before the mnt_is_readonly() check. The barrier + * makes sure if __mnt_want_write() sees MNT_WRITE_HOLD already + * cleared, it will see s_readonly_remount set. + */ + smp_wmb(); +} + +/* + * Ends section changing read-only state of the superblock. After this function + * returns if mnt_is_readonly() returns false, the caller will be able to + * observe all the changes remount did to the superblock. + */ +static inline void sb_end_ro_state_change(struct super_block *sb) +{ + /* + * This barrier provides release semantics that pairs with + * the smp_rmb() acquire semantics in mnt_is_readonly(). + * This barrier pair ensure that when mnt_is_readonly() sees + * 0 for sb->s_readonly_remount, it will also see all the + * preceding flag changes that were made during the RO state + * change. + */ + smp_wmb(); + WRITE_ONCE(sb->s_readonly_remount, 0); +} + /* * open.c */ diff --git a/fs/namespace.c b/fs/namespace.c index 54847db5b8195..5ba1eca6f7208 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -309,9 +309,16 @@ static unsigned int mnt_get_writers(struct mount *mnt)
static int mnt_is_readonly(struct vfsmount *mnt) { - if (mnt->mnt_sb->s_readonly_remount) + if (READ_ONCE(mnt->mnt_sb->s_readonly_remount)) return 1; - /* Order wrt setting s_flags/s_readonly_remount in do_remount() */ + /* + * The barrier pairs with the barrier in sb_start_ro_state_change() + * making sure if we don't see s_readonly_remount set yet, we also will + * not see any superblock / mount flag changes done by remount. + * It also pairs with the barrier in sb_end_ro_state_change() + * assuring that if we see s_readonly_remount already cleared, we will + * see the values of superblock / mount flags updated by remount. + */ smp_rmb(); return __mnt_is_readonly(mnt); } @@ -364,9 +371,11 @@ int __mnt_want_write(struct vfsmount *m) } } /* - * After the slowpath clears MNT_WRITE_HOLD, mnt_is_readonly will - * be set to match its requirements. So we must not load that until - * MNT_WRITE_HOLD is cleared. + * The barrier pairs with the barrier sb_start_ro_state_change() making + * sure that if we see MNT_WRITE_HOLD cleared, we will also see + * s_readonly_remount set (or even SB_RDONLY / MNT_READONLY flags) in + * mnt_is_readonly() and bail in case we are racing with remount + * read-only. */ smp_rmb(); if (mnt_is_readonly(m)) { @@ -588,10 +597,8 @@ int sb_prepare_remount_readonly(struct super_block *sb) if (!err && atomic_long_read(&sb->s_remove_count)) err = -EBUSY;
- if (!err) { - sb->s_readonly_remount = 1; - smp_wmb(); - } + if (!err) + sb_start_ro_state_change(sb); list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) { if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD) mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD; diff --git a/fs/super.c b/fs/super.c index 860d7a4b14c7c..48c29954d4875 100644 --- a/fs/super.c +++ b/fs/super.c @@ -944,8 +944,7 @@ int reconfigure_super(struct fs_context *fc) */ if (remount_ro) { if (force) { - sb->s_readonly_remount = 1; - smp_wmb(); + sb_start_ro_state_change(sb); } else { retval = sb_prepare_remount_readonly(sb); if (retval) @@ -953,12 +952,10 @@ int reconfigure_super(struct fs_context *fc) } } else if (remount_rw) { /* - * We set s_readonly_remount here to protect filesystem's - * reconfigure code from writes from userspace until - * reconfigure finishes. + * Protect filesystem's reconfigure code from writes from + * userspace until reconfigure finishes. */ - sb->s_readonly_remount = 1; - smp_wmb(); + sb_start_ro_state_change(sb); }
if (fc->ops->reconfigure) { @@ -974,9 +971,7 @@ int reconfigure_super(struct fs_context *fc)
WRITE_ONCE(sb->s_flags, ((sb->s_flags & ~fc->sb_flags_mask) | (fc->sb_flags & fc->sb_flags_mask))); - /* Needs to be ordered wrt mnt_is_readonly() */ - smp_wmb(); - sb->s_readonly_remount = 0; + sb_end_ro_state_change(sb);
/* * Some filesystems modify their metadata via some other path than the @@ -991,7 +986,7 @@ int reconfigure_super(struct fs_context *fc) return 0;
cancel_readonly: - sb->s_readonly_remount = 0; + sb_end_ro_state_change(sb); return retval; }
diff --git a/include/linux/fs.h b/include/linux/fs.h index 6f96f99ab9511..879c000eec397 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1248,7 +1248,7 @@ struct super_block { */ atomic_long_t s_fsnotify_connectors;
- /* Being remounted read-only */ + /* Read-only state of the superblock is being changed */ int s_readonly_remount;
/* per-sb errseq_t for reporting writeback errors via syncfs */
linux-stable-mirror@lists.linaro.org