The quilt patch titled
Subject: ocfs2: reserve space for inline xattr before attaching reflink tree
has been removed from the -mm tree. Its filename was
ocfs2-reserve-space-for-inline-xattr-before-attaching-reflink-tree.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Gautham Ananthakrishna <gautham.ananthakrishna(a)oracle.com>
Subject: ocfs2: reserve space for inline xattr before attaching reflink tree
Date: Wed, 18 Sep 2024 06:38:44 +0000
One of our customers reported a crash and a corrupted ocfs2 filesystem.
The crash was due to the detection of corruption. Upon troubleshooting,
the fsck -fn output showed the below corruption
[EXTENT_LIST_FREE] Extent list in owner 33080590 claims 230 as the next free chain record,
but fsck believes the largest valid value is 227. Clamp the next record value? n
The stat output from the debugfs.ocfs2 showed the following corruption
where the "Next Free Rec:" had overshot the "Count:" in the root metadata
block.
Inode: 33080590 Mode: 0640 Generation: 2619713622 (0x9c25a856)
FS Generation: 904309833 (0x35e6ac49)
CRC32: 00000000 ECC: 0000
Type: Regular Attr: 0x0 Flags: Valid
Dynamic Features: (0x16) HasXattr InlineXattr Refcounted
Extended Attributes Block: 0 Extended Attributes Inline Size: 256
User: 0 (root) Group: 0 (root) Size: 281320357888
Links: 1 Clusters: 141738
ctime: 0x66911b56 0x316edcb8 -- Fri Jul 12 06:02:30.829349048 2024
atime: 0x66911d6b 0x7f7a28d -- Fri Jul 12 06:11:23.133669517 2024
mtime: 0x66911b56 0x12ed75d7 -- Fri Jul 12 06:02:30.317552087 2024
dtime: 0x0 -- Wed Dec 31 17:00:00 1969
Refcount Block: 2777346
Last Extblk: 2886943 Orphan Slot: 0
Sub Alloc Slot: 0 Sub Alloc Bit: 14
Tree Depth: 1 Count: 227 Next Free Rec: 230
## Offset Clusters Block#
0 0 2310 2776351
1 2310 2139 2777375
2 4449 1221 2778399
3 5670 731 2779423
4 6401 566 2780447
....... .... .......
....... .... .......
The issue was in the reflink workfow while reserving space for inline
xattr. The problematic function is ocfs2_reflink_xattr_inline(). By the
time this function is called the reflink tree is already recreated at the
destination inode from the source inode. At this point, this function
reserves space for inline xattrs at the destination inode without even
checking if there is space at the root metadata block. It simply reduces
the l_count from 243 to 227 thereby making space of 256 bytes for inline
xattr whereas the inode already has extents beyond this index (in this
case up to 230), thereby causing corruption.
The fix for this is to reserve space for inline metadata at the destination
inode before the reflink tree gets recreated. The customer has verified the
fix.
Link: https://lkml.kernel.org/r/20240918063844.1830332-1-gautham.ananthakrishna@o…
Fixes: ef962df057aa ("ocfs2: xattr: fix inlined xattr reflink")
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna(a)oracle.com>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Gang He <ghe(a)suse.com>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/refcounttree.c | 26 ++++++++++++++++++++++++--
fs/ocfs2/xattr.c | 11 +----------
2 files changed, 25 insertions(+), 12 deletions(-)
--- a/fs/ocfs2/refcounttree.c~ocfs2-reserve-space-for-inline-xattr-before-attaching-reflink-tree
+++ a/fs/ocfs2/refcounttree.c
@@ -25,6 +25,7 @@
#include "namei.h"
#include "ocfs2_trace.h"
#include "file.h"
+#include "symlink.h"
#include <linux/bio.h>
#include <linux/blkdev.h>
@@ -4148,8 +4149,9 @@ static int __ocfs2_reflink(struct dentry
int ret;
struct inode *inode = d_inode(old_dentry);
struct buffer_head *new_bh = NULL;
+ struct ocfs2_inode_info *oi = OCFS2_I(inode);
- if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_SYSTEM_FILE) {
+ if (oi->ip_flags & OCFS2_INODE_SYSTEM_FILE) {
ret = -EINVAL;
mlog_errno(ret);
goto out;
@@ -4175,6 +4177,26 @@ static int __ocfs2_reflink(struct dentry
goto out_unlock;
}
+ if ((oi->ip_dyn_features & OCFS2_HAS_XATTR_FL) &&
+ (oi->ip_dyn_features & OCFS2_INLINE_XATTR_FL)) {
+ /*
+ * Adjust extent record count to reserve space for extended attribute.
+ * Inline data count had been adjusted in ocfs2_duplicate_inline_data().
+ */
+ struct ocfs2_inode_info *new_oi = OCFS2_I(new_inode);
+
+ if (!(new_oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
+ !(ocfs2_inode_is_fast_symlink(new_inode))) {
+ struct ocfs2_dinode *new_di = (struct ocfs2_dinode *)new_bh->b_data;
+ struct ocfs2_dinode *old_di = (struct ocfs2_dinode *)old_bh->b_data;
+ struct ocfs2_extent_list *el = &new_di->id2.i_list;
+ int inline_size = le16_to_cpu(old_di->i_xattr_inline_size);
+
+ le16_add_cpu(&el->l_count, -(inline_size /
+ sizeof(struct ocfs2_extent_rec)));
+ }
+ }
+
ret = ocfs2_create_reflink_node(inode, old_bh,
new_inode, new_bh, preserve);
if (ret) {
@@ -4182,7 +4204,7 @@ static int __ocfs2_reflink(struct dentry
goto inode_unlock;
}
- if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_XATTR_FL) {
+ if (oi->ip_dyn_features & OCFS2_HAS_XATTR_FL) {
ret = ocfs2_reflink_xattrs(inode, old_bh,
new_inode, new_bh,
preserve);
--- a/fs/ocfs2/xattr.c~ocfs2-reserve-space-for-inline-xattr-before-attaching-reflink-tree
+++ a/fs/ocfs2/xattr.c
@@ -6511,16 +6511,7 @@ static int ocfs2_reflink_xattr_inline(st
}
new_oi = OCFS2_I(args->new_inode);
- /*
- * Adjust extent record count to reserve space for extended attribute.
- * Inline data count had been adjusted in ocfs2_duplicate_inline_data().
- */
- if (!(new_oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
- !(ocfs2_inode_is_fast_symlink(args->new_inode))) {
- struct ocfs2_extent_list *el = &new_di->id2.i_list;
- le16_add_cpu(&el->l_count, -(inline_size /
- sizeof(struct ocfs2_extent_rec)));
- }
+
spin_lock(&new_oi->ip_lock);
new_oi->ip_dyn_features |= OCFS2_HAS_XATTR_FL | OCFS2_INLINE_XATTR_FL;
new_di->i_dyn_features = cpu_to_le16(new_oi->ip_dyn_features);
_
Patches currently in -mm which might be from gautham.ananthakrishna(a)oracle.com are
The quilt patch titled
Subject: mm: migrate: annotate data-race in migrate_folio_unmap()
has been removed from the -mm tree. Its filename was
mm-migrate-annotate-data-race-in-migrate_folio_unmap.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Jeongjun Park <aha310510(a)gmail.com>
Subject: mm: migrate: annotate data-race in migrate_folio_unmap()
Date: Tue, 24 Sep 2024 22:00:53 +0900
I found a report from syzbot [1]
This report shows that the value can be changed, but in reality, the
value of __folio_set_movable() cannot be changed because it holds the
folio refcount.
Therefore, it is appropriate to add an annotate to make KCSAN
ignore that data-race.
[1]
==================================================================
BUG: KCSAN: data-race in __filemap_remove_folio / migrate_pages_batch
write to 0xffffea0004b81dd8 of 8 bytes by task 6348 on cpu 0:
page_cache_delete mm/filemap.c:153 [inline]
__filemap_remove_folio+0x1ac/0x2c0 mm/filemap.c:233
filemap_remove_folio+0x6b/0x1f0 mm/filemap.c:265
truncate_inode_folio+0x42/0x50 mm/truncate.c:178
shmem_undo_range+0x25b/0xa70 mm/shmem.c:1028
shmem_truncate_range mm/shmem.c:1144 [inline]
shmem_evict_inode+0x14d/0x530 mm/shmem.c:1272
evict+0x2f0/0x580 fs/inode.c:731
iput_final fs/inode.c:1883 [inline]
iput+0x42a/0x5b0 fs/inode.c:1909
dentry_unlink_inode+0x24f/0x260 fs/dcache.c:412
__dentry_kill+0x18b/0x4c0 fs/dcache.c:615
dput+0x5c/0xd0 fs/dcache.c:857
__fput+0x3fb/0x6d0 fs/file_table.c:439
____fput+0x1c/0x30 fs/file_table.c:459
task_work_run+0x13a/0x1a0 kernel/task_work.c:228
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0xbe/0x130 kernel/entry/common.c:218
do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffffea0004b81dd8 of 8 bytes by task 6342 on cpu 1:
__folio_test_movable include/linux/page-flags.h:699 [inline]
migrate_folio_unmap mm/migrate.c:1199 [inline]
migrate_pages_batch+0x24c/0x1940 mm/migrate.c:1797
migrate_pages_sync mm/migrate.c:1963 [inline]
migrate_pages+0xff1/0x1820 mm/migrate.c:2072
do_mbind mm/mempolicy.c:1390 [inline]
kernel_mbind mm/mempolicy.c:1533 [inline]
__do_sys_mbind mm/mempolicy.c:1607 [inline]
__se_sys_mbind+0xf76/0x1160 mm/mempolicy.c:1603
__x64_sys_mbind+0x78/0x90 mm/mempolicy.c:1603
x64_sys_call+0x2b4d/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:238
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0xffff888127601078 -> 0x0000000000000000
Link: https://lkml.kernel.org/r/20240924130053.107490-1-aha310510@gmail.com
Fixes: 7e2a5e5ab217 ("mm: migrate: use __folio_test_movable()")
Signed-off-by: Jeongjun Park <aha310510(a)gmail.com>
Reported-by: syzbot <syzkaller(a)googlegroups.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/migrate.c~mm-migrate-annotate-data-race-in-migrate_folio_unmap
+++ a/mm/migrate.c
@@ -1196,7 +1196,7 @@ static int migrate_folio_unmap(new_folio
int rc = -EAGAIN;
int old_page_state = 0;
struct anon_vma *anon_vma = NULL;
- bool is_lru = !__folio_test_movable(src);
+ bool is_lru = data_race(!__folio_test_movable(src));
bool locked = false;
bool dst_locked = false;
_
Patches currently in -mm which might be from aha310510(a)gmail.com are
mm-percpu-fix-typo-to-pcpu_alloc_noprof-description.patch
mm-shmem-fix-data-race-in-shmem_getattr.patch
The quilt patch titled
Subject: mm/hugetlb: simplify refs in memfd_alloc_folio
has been removed from the -mm tree. Its filename was
mm-hugetlb-simplify-refs-in-memfd_alloc_folio.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Steve Sistare <steven.sistare(a)oracle.com>
Subject: mm/hugetlb: simplify refs in memfd_alloc_folio
Date: Wed, 4 Sep 2024 12:41:08 -0700
The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and
delete the matching folio_put in memfd_pin_folios. This also avoids
leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache
fails. That error path is also broken in a second way -- when its
folio_put causes the ref to become 0, it will implicitly call
free_huge_folio, but then the path *explicitly* calls free_huge_folio.
Delete the latter.
This is a continuation of the fix
"mm/hugetlb: fix memfd_pin_folios free_huge_pages leak"
[steven.sistare(a)oracle.com: remove explicit call to free_huge_folio(), per Matthew]
Link: https://lkml.kernel.org/r/Zti-7nPVMcGgpcbi@casper.infradead.org
Link: https://lkml.kernel.org/r/1725481920-82506-1-git-send-email-steven.sistare@…
Link: https://lkml.kernel.org/r/1725478868-61732-1-git-send-email-steven.sistare@…
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare(a)oracle.com>
Suggested-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 4 +---
mm/memfd.c | 3 +--
2 files changed, 2 insertions(+), 5 deletions(-)
--- a/mm/gup.c~mm-hugetlb-simplify-refs-in-memfd_alloc_folio
+++ a/mm/gup.c
@@ -3615,7 +3615,7 @@ long memfd_pin_folios(struct file *memfd
pgoff_t start_idx, end_idx, next_idx;
struct folio *folio = NULL;
struct folio_batch fbatch;
- struct hstate *h = NULL;
+ struct hstate *h;
long ret = -EINVAL;
if (start < 0 || start > end || !max_folios)
@@ -3659,8 +3659,6 @@ long memfd_pin_folios(struct file *memfd
&fbatch);
if (folio) {
folio_put(folio);
- if (h)
- folio_put(folio);
folio = NULL;
}
--- a/mm/memfd.c~mm-hugetlb-simplify-refs-in-memfd_alloc_folio
+++ a/mm/memfd.c
@@ -89,13 +89,12 @@ struct folio *memfd_alloc_folio(struct f
numa_node_id(),
NULL,
gfp_mask);
- if (folio && folio_try_get(folio)) {
+ if (folio) {
err = hugetlb_add_to_page_cache(folio,
memfd->f_mapping,
idx);
if (err) {
folio_put(folio);
- free_huge_folio(folio);
return ERR_PTR(err);
}
folio_unlock(folio);
_
Patches currently in -mm which might be from steven.sistare(a)oracle.com are
The quilt patch titled
Subject: mm/gup: fix memfd_pin_folios alloc race panic
has been removed from the -mm tree. Its filename was
mm-gup-fix-memfd_pin_folios-alloc-race-panic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Steve Sistare <steven.sistare(a)oracle.com>
Subject: mm/gup: fix memfd_pin_folios alloc race panic
Date: Tue, 3 Sep 2024 07:25:21 -0700
If memfd_pin_folios tries to create a hugetlb page, but someone else
already did, then folio gets the value -EEXIST here:
folio = memfd_alloc_folio(memfd, start_idx);
if (IS_ERR(folio)) {
ret = PTR_ERR(folio);
if (ret != -EEXIST)
goto err;
then on the next trip through the "while start_idx" loop we panic here:
if (folio) {
folio_put(folio);
To fix, set the folio to NULL on error.
Link: https://lkml.kernel.org/r/1725373521-451395-6-git-send-email-steven.sistare…
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare(a)oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/gup.c~mm-gup-fix-memfd_pin_folios-alloc-race-panic
+++ a/mm/gup.c
@@ -3702,6 +3702,7 @@ long memfd_pin_folios(struct file *memfd
ret = PTR_ERR(folio);
if (ret != -EEXIST)
goto err;
+ folio = NULL;
}
}
}
_
Patches currently in -mm which might be from steven.sistare(a)oracle.com are
The quilt patch titled
Subject: mm/gup: fix memfd_pin_folios hugetlb page allocation
has been removed from the -mm tree. Its filename was
mm-gup-fix-memfd_pin_folios-hugetlb-page-allocation.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Steve Sistare <steven.sistare(a)oracle.com>
Subject: mm/gup: fix memfd_pin_folios hugetlb page allocation
Date: Tue, 3 Sep 2024 07:25:20 -0700
When memfd_pin_folios -> memfd_alloc_folio creates a hugetlb page, the
index is wrong. The subsequent call to filemap_get_folios_contig thus
cannot find it, and fails, and memfd_pin_folios loops forever. To fix,
adjust the index for the huge_page_order.
memfd_alloc_folio also forgets to unlock the folio, so the next touch of
the page calls hugetlb_fault which blocks forever trying to take the lock.
Unlock it.
Link: https://lkml.kernel.org/r/1725373521-451395-5-git-send-email-steven.sistare…
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare(a)oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memfd.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
--- a/mm/memfd.c~mm-gup-fix-memfd_pin_folios-hugetlb-page-allocation
+++ a/mm/memfd.c
@@ -79,10 +79,13 @@ struct folio *memfd_alloc_folio(struct f
* alloc from. Also, the folio will be pinned for an indefinite
* amount of time, so it is not expected to be migrated away.
*/
- gfp_mask = htlb_alloc_mask(hstate_file(memfd));
+ struct hstate *h = hstate_file(memfd);
+
+ gfp_mask = htlb_alloc_mask(h);
gfp_mask &= ~(__GFP_HIGHMEM | __GFP_MOVABLE);
+ idx >>= huge_page_order(h);
- folio = alloc_hugetlb_folio_reserve(hstate_file(memfd),
+ folio = alloc_hugetlb_folio_reserve(h,
numa_node_id(),
NULL,
gfp_mask);
@@ -95,6 +98,7 @@ struct folio *memfd_alloc_folio(struct f
free_huge_folio(folio);
return ERR_PTR(err);
}
+ folio_unlock(folio);
return folio;
}
return ERR_PTR(-ENOMEM);
_
Patches currently in -mm which might be from steven.sistare(a)oracle.com are
The quilt patch titled
Subject: mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
has been removed from the -mm tree. Its filename was
mm-hugetlb-fix-memfd_pin_folios-free_huge_pages-leak.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Steve Sistare <steven.sistare(a)oracle.com>
Subject: mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
Date: Tue, 3 Sep 2024 07:25:18 -0700
memfd_pin_folios followed by unpin_folios fails to restore free_huge_pages
if the pages were not already faulted in, because the folio refcount for
pages created by memfd_alloc_folio never goes to 0. memfd_pin_folios
needs another folio_put to undo the folio_try_get below:
memfd_alloc_folio()
alloc_hugetlb_folio_nodemask()
dequeue_hugetlb_folio_nodemask()
dequeue_hugetlb_folio_node_exact()
folio_ref_unfreeze(folio, 1); ; adds 1 refcount
folio_try_get() ; adds 1 refcount
hugetlb_add_to_page_cache() ; adds 512 refcount (on x86)
With the fix, after memfd_pin_folios + unpin_folios, the refcount for the
(unfaulted) page is 512, which is correct, as the refcount for a faulted
unpinned page is 513.
Link: https://lkml.kernel.org/r/1725373521-451395-3-git-send-email-steven.sistare…
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare(a)oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/mm/gup.c~mm-hugetlb-fix-memfd_pin_folios-free_huge_pages-leak
+++ a/mm/gup.c
@@ -3615,7 +3615,7 @@ long memfd_pin_folios(struct file *memfd
pgoff_t start_idx, end_idx, next_idx;
struct folio *folio = NULL;
struct folio_batch fbatch;
- struct hstate *h;
+ struct hstate *h = NULL;
long ret = -EINVAL;
if (start < 0 || start > end || !max_folios)
@@ -3659,6 +3659,8 @@ long memfd_pin_folios(struct file *memfd
&fbatch);
if (folio) {
folio_put(folio);
+ if (h)
+ folio_put(folio);
folio = NULL;
}
_
Patches currently in -mm which might be from steven.sistare(a)oracle.com are
The quilt patch titled
Subject: mm/filemap: fix filemap_get_folios_contig THP panic
has been removed from the -mm tree. Its filename was
mm-filemap-fix-filemap_get_folios_contig-thp-panic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Steve Sistare <steven.sistare(a)oracle.com>
Subject: mm/filemap: fix filemap_get_folios_contig THP panic
Date: Tue, 3 Sep 2024 07:25:17 -0700
Patch series "memfd-pin huge page fixes".
Fix multiple bugs that occur when using memfd_pin_folios with hugetlb
pages and THP. The hugetlb bugs only bite when the page is not yet
faulted in when memfd_pin_folios is called. The THP bug bites when the
starting offset passed to memfd_pin_folios is not huge page aligned. See
the commit messages for details.
This patch (of 5):
memfd_pin_folios on memory backed by THP panics if the requested start
offset is not huge page aligned:
BUG: kernel NULL pointer dereference, address: 0000000000000036
RIP: 0010:filemap_get_folios_contig+0xdf/0x290
RSP: 0018:ffffc9002092fbe8 EFLAGS: 00010202
RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000002
The fault occurs here, because xas_load returns a folio with value 2:
filemap_get_folios_contig()
for (folio = xas_load(&xas); folio && xas.xa_index <= end;
folio = xas_next(&xas)) {
...
if (!folio_try_get(folio)) <-- BOOM
"2" is an xarray sibling entry. We get it because memfd_pin_folios does
not round the indices passed to filemap_get_folios_contig to huge page
boundaries for THP, so we load from the middle of a huge page range see a
sibling. (It does round for hugetlbfs, at the is_file_hugepages test).
To fix, if the folio is a sibling, then return the next index as the
starting point for the next call to filemap_get_folios_contig.
Link: https://lkml.kernel.org/r/1725373521-451395-1-git-send-email-steven.sistare…
Link: https://lkml.kernel.org/r/1725373521-451395-2-git-send-email-steven.sistare…
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Vivek Kasireddy <vivek.kasireddy(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/filemap.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/mm/filemap.c~mm-filemap-fix-filemap_get_folios_contig-thp-panic
+++ a/mm/filemap.c
@@ -2196,6 +2196,10 @@ unsigned filemap_get_folios_contig(struc
if (xa_is_value(folio))
goto update_start;
+ /* If we landed in the middle of a THP, continue at its end. */
+ if (xa_is_sibling(folio))
+ goto update_start;
+
if (!folio_try_get(folio))
goto retry;
_
Patches currently in -mm which might be from steven.sistare(a)oracle.com are
nfs41_init_clientid does not signal a failure condition from
nfs4_proc_exchange_id and nfs4_proc_create_session to a client which may
lead to mount syscall indefinitely blocked in the following stack trace:
nfs_wait_client_init_complete
nfs41_discover_server_trunking
nfs4_discover_server_trunking
nfs4_init_client
nfs4_set_client
nfs4_create_server
nfs4_try_get_tree
vfs_get_tree
do_new_mount
__se_sys_mount
and the client stuck in uninitialized state.
In addition to this all subsequent mount calls would also get blocked in
nfs_match_client waiting for the uninitialized client to finish
initialization:
nfs_wait_client_init_complete
nfs_match_client
nfs_get_client
nfs4_set_client
nfs4_create_server
nfs4_try_get_tree
vfs_get_tree
do_new_mount
__se_sys_mount
To avoid this situation propagate error condition to the mount thread
and let mount syscall fail properly.
Signed-off-by: Oleksandr Tymoshenko <ovt(a)google.com>
---
fs/nfs/nfs4state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 877f682b45f2..54ad3440ad2b 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -335,8 +335,8 @@ int nfs41_init_clientid(struct nfs_client *clp, const struct cred *cred)
if (!(clp->cl_exchange_flags & EXCHGID4_FLAG_CONFIRMED_R))
nfs4_state_start_reclaim_reboot(clp);
nfs41_finish_session_reset(clp);
- nfs_mark_client_ready(clp, NFS_CS_READY);
out:
+ nfs_mark_client_ready(clp, status == 0 ? NFS_CS_READY : status);
return status;
}
---
base-commit: ad618736883b8970f66af799e34007475fe33a68
change-id: 20240906-nfs-mount-deadlock-fix-55c14b38e088
Best regards,
--
Oleksandr Tymoshenko <ovt(a)google.com>