Linux-stable-mirror May 2025

linux-stable-mirror@lists.linaro.org

568 participants
1328 discussions

Re: [PATCH] um: work around sched_yield not yielding in time-travel mode

by Sasha Levin

[ Sasha's backport helper bot ]

Hi,

Summary of potential issues:
⚠️ Found matching upstream commit but patch is missing proper reference to it

Found matching upstream commit: 887c5c12e80c8424bd471122d2e8b6b462e12874

WARNING: Author mismatch between patch and found commit:
Backport author: Benjamin Berg<benjamin(a)sipsolutions.net>
Commit author: Benjamin Berg<benjamin.berg(a)intel.com>

Note: The patch differs from the upstream commit:
---
1:  887c5c12e80c8 < -:  ------------- um: work around sched_yield not yielding in time-travel mode
-:  ------------- > 1:  aeaee199900ee Linux 6.14.5
---

Results of testing on various branches:

| Branch                    | Patch Apply | Build Test |
|---------------------------|-------------|------------|
| stable/linux-6.14.y       | Success     | Success    |
| stable/linux-6.12.y       | Success     | Success    |
| stable/linux-6.6.y        | Success     | Success    |
| stable/linux-6.1.y        | Success     | Success    |
| stable/linux-5.15.y       | Success     | Success    |
| stable/linux-5.10.y       | Success     | Success    |
| stable/linux-5.4.y        | Success     | Success    |

8 months

[PATCH 6.6.y] btrfs: always fallback to buffered write if the inode requires checksum

by Qu Wenruo

commit 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5 upstream.

[BUG]
It is a long known bug that a VM image on btrfs can lead to data csum
mismatch if qemu is using direct-io for the image (this is commonly
known as cache mode 'none').

[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have the
AS_STABLE_WRITES flag inherited from the block device).

This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.

But the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).

So EXT4/XFS/NTFS assume they can modify the folio under writeback, but
btrfs requires no modification, and this leads to the false csum
mismatch.

This is only a controlled example; there are even cases where
multi-threaded programs can submit a direct IO write, then another
thread modifies the direct IO buffer for whatever reason.

For such cases, btrfs has no sane way to detect the modification, which
leads to false data csum mismatch.

[FIX]
I have considered the following ideas to solve the problem:

- Make direct IO always skip data checksum
  This not only requires a new incompatible flag, as it breaks the
  current per-inode NODATASUM flag, but also requires extra handling
  for no csum found cases.
  And this also reduces our checksum protection.

- Let hardware handle all the checksum
  AKA, just the nodatasum mount option.
  That requires trust in the hardware (which is not that trustworthy in
  a lot of cases), and it's not generic at all.

- Always fallback to buffered write if the inode requires checksum
  This was suggested by Christoph, and is the solution utilized by this
  patch.

  The cost is obvious: the extra buffer copying into page cache, which
  reduces the performance.

  But at least it's still user configurable: if the end user still
  wants the zero-copy performance, just set the NODATASUM flag for the
  inode (which is a common practice for VM images on btrfs).

Since we cannot trust user space programs to keep the buffer consistent
during direct IO, we have no choice but to always fall back to buffered
IO. At least by this, we avoid the more deadly false data checksum
mismatch error.

CC: stable(a)vger.kernel.org # 6.6
Suggested-by: Christoph Hellwig <hch(a)infradead.org>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
[ Fix a conflict due to the movement of the function. ]
---
 fs/btrfs/file.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e794606e7c78..f1456c745c6d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1515,6 +1515,23 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
+	/*
+	 * We can't control the folios being passed in, applications can write
+	 * to them while a direct IO write is in progress.  This means the
+	 * content might change after we calculated the data checksum.
+	 * Therefore we can end up storing a checksum that doesn't match the
+	 * persisted data.
+	 *
+	 * To be extra safe and avoid false data checksum mismatch, if the
+	 * inode requires data checksum, just fallback to buffered IO.
+	 * For buffered IO we have full control of page cache and can ensure
+	 * no one is modifying the content during writeback.
+	 */
+	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
+		btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
+		goto buffered;
+	}
+
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
 	 * If that's the case, then we will deadlock in the iomap code, because
-- 
2.49.0
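The race described in the [CAUSE] section can be illustrated in miniature in user space. The sketch below is not btrfs code: `toy_csum()` is a hypothetical stand-in for the real data checksum, and `TOY_INODE_NODATASUM` and `toy_choose_write_path()` are illustrative names that only mirror the shape of the check added by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the real data checksum; not a btrfs algorithm. */
static uint32_t toy_csum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + buf[i];
	return sum;
}

/* Illustrative flag bit, modeled on BTRFS_INODE_NODATASUM. */
#define TOY_INODE_NODATASUM 0x1u

enum toy_write_path { TOY_WRITE_DIRECT, TOY_WRITE_BUFFERED };

/* Same shape as the added check: inodes that need checksums go buffered. */
static enum toy_write_path toy_choose_write_path(unsigned int inode_flags)
{
	if (!(inode_flags & TOY_INODE_NODATASUM))
		return TOY_WRITE_BUFFERED;
	return TOY_WRITE_DIRECT;
}

/*
 * Simulate the race: the checksum is computed when the write is submitted,
 * then the caller modifies the buffer before the data is persisted.
 * Returns 1 if the stored checksum still matches the final data.
 */
static int toy_csum_matches_after_late_write(void)
{
	unsigned char buf[16];
	uint32_t csum_at_submit;

	memset(buf, 0xaa, sizeof(buf));
	csum_at_submit = toy_csum(buf, sizeof(buf));
	buf[3] = 0x55;	/* writer touches the buffer while the IO is in flight */
	return csum_at_submit == toy_csum(buf, sizeof(buf));
}
```

In this model the late write always invalidates the stored checksum, which is why the patch routes checksummed inodes through the page cache, where the kernel controls the buffer during writeback.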

8 months

[PATCH] bcachefs: Change btree_insert_node() assertion to error

by Kent Overstreet

Debug for https://github.com/koverstreet/bcachefs/issues/843

Print useful debug info and go emergency read-only.

(cherry picked from commit 63c3b8f616cc95bb1fcc6101c92485d41c535d7c)
Signed-off-by: Kent Overstreet <kent.overstreet(a)linux.dev>
---
 fs/bcachefs/btree_update_interior.c | 17 ++++++++++++++++-
 fs/bcachefs/error.c                 |  8 ++++++++
 fs/bcachefs/error.h                 |  2 ++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c
index e4e7c804625e..e9be8b5571a4 100644
--- a/fs/bcachefs/btree_update_interior.c
+++ b/fs/bcachefs/btree_update_interior.c
@@ -35,6 +35,8 @@ static const char * const bch2_btree_update_modes[] = {
 	NULL
 };
 
+static void bch2_btree_update_to_text(struct printbuf *, struct btree_update *);
+
 static int bch2_btree_insert_node(struct btree_update *, struct btree_trans *,
 				  btree_path_idx_t, struct btree *, struct keylist *);
 static void bch2_btree_update_add_new_node(struct btree_update *, struct btree *);
@@ -1782,11 +1784,24 @@ static int bch2_btree_insert_node(struct btree_update *as, struct btree_trans *t
 	int ret;
 
 	lockdep_assert_held(&c->gc_lock);
-	BUG_ON(!btree_node_intent_locked(path, b->c.level));
 	BUG_ON(!b->c.level);
 	BUG_ON(!as || as->b);
 	bch2_verify_keylist_sorted(keys);
 
+	if (!btree_node_intent_locked(path, b->c.level)) {
+		struct printbuf buf = PRINTBUF;
+		bch2_log_msg_start(c, &buf);
+		prt_printf(&buf, "%s(): node not locked at level %u\n",
+			   __func__, b->c.level);
+		bch2_btree_update_to_text(&buf, as);
+		bch2_btree_path_to_text(&buf, trans, path_idx);
+
+		bch2_print_string_as_lines(KERN_ERR, buf.buf);
+		printbuf_exit(&buf);
+		bch2_fs_emergency_read_only(c);
+		return -EIO;
+	}
+
 	ret = bch2_btree_node_lock_write(trans, path, &b->c);
 	if (ret)
 		return ret;
diff --git a/fs/bcachefs/error.c b/fs/bcachefs/error.c
index 038da6a61f6b..6cbf4819e923 100644
--- a/fs/bcachefs/error.c
+++ b/fs/bcachefs/error.c
@@ -11,6 +11,14 @@
 
 #define FSCK_ERR_RATELIMIT_NR	10
 
+void bch2_log_msg_start(struct bch_fs *c, struct printbuf *out)
+{
+#ifdef BCACHEFS_LOG_PREFIX
+	prt_printf(out, bch2_log_msg(c, ""));
+#endif
+	printbuf_indent_add(out, 2);
+}
+
 bool bch2_inconsistent_error(struct bch_fs *c)
 {
 	set_bit(BCH_FS_error, &c->flags);
diff --git a/fs/bcachefs/error.h b/fs/bcachefs/error.h
index 7acf2a27ca28..5730eb6b2f38 100644
--- a/fs/bcachefs/error.h
+++ b/fs/bcachefs/error.h
@@ -18,6 +18,8 @@ struct work_struct;
 
 /* Error messages: */
 
+void bch2_log_msg_start(struct bch_fs *, struct printbuf *);
+
 /*
  * Inconsistency errors: The on disk data is inconsistent. If these occur during
  * initial recovery, they don't indicate a bug in the running code - we walk all
-- 
2.49.0
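The general pattern in this patch, replacing a fatal assertion with a logged error plus a degraded mode, can be sketched in user space as follows. Everything here is hypothetical: `struct toy_fs`, `toy_insert_node_assert()` and `toy_insert_node()` are illustrative names, not bcachefs interfaces, and setting a `read_only` flag only stands in for what `bch2_fs_emergency_read_only()` does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical filesystem state; only tracks the emergency read-only flag. */
struct toy_fs {
	bool read_only;
};

/* Before: a violated precondition aborts the whole program. */
static int toy_insert_node_assert(bool intent_locked)
{
	assert(intent_locked);
	return 0;
}

/*
 * After: report what went wrong, stop further writes, and let the caller
 * handle an ordinary error code instead of crashing.
 */
static int toy_insert_node(struct toy_fs *fs, bool intent_locked,
			   unsigned int level)
{
	if (!intent_locked) {
		fprintf(stderr, "%s(): node not locked at level %u\n",
			__func__, level);
		fs->read_only = true;
		return -EIO;
	}
	return 0;
}
```

The trade-off is that callers must now propagate `-EIO`, but the system stays up long enough to capture the debug output the commit message asks for.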

8 months

[PATCH 5.10.y] of: module: add buffer overflow check in of_modalias()

by Uwe Kleine-König

From: Sergey Shtylyov <s.shtylyov(a)omp.ru>

commit cf7385cb26ac4f0ee6c7385960525ad534323252 upstream.

In of_modalias(), if the buffer happens to be too small even for the 1st
snprintf() call, the len parameter will become negative and str parameter
(if not NULL initially) will point beyond the buffer's end. Add the buffer
overflow check after the 1st snprintf() call and fix such check after the
strlen() call (accounting for the terminating NUL char).

Fixes: bc575064d688 ("of/device: use of_property_for_each_string to parse compatible strings")
Signed-off-by: Sergey Shtylyov <s.shtylyov(a)omp.ru>
Link: https://lore.kernel.org/r/bbfc6be0-c687-62b6-d015-5141b93f313e@omp.ru
Signed-off-by: Rob Herring <robh(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Uwe Kleine-König <ukleinek(a)debian.org>
---
 drivers/of/device.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index 3a547793135c..93f08f18f6b3 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -231,14 +231,15 @@ static ssize_t of_device_get_modalias(struct device *dev, char *str, ssize_t len
 	csize = snprintf(str, len, "of:N%pOFn%c%s", dev->of_node, 'T',
 			 of_node_get_device_type(dev->of_node));
 	tsize = csize;
+	if (csize >= len)
+		csize = len > 0 ? len - 1 : 0;
 	len -= csize;
-	if (str)
-		str += csize;
+	str += csize;
 
 	of_property_for_each_string(dev->of_node, "compatible", p, compat) {
 		csize = strlen(compat) + 1;
 		tsize += csize;
-		if (csize > len)
+		if (csize >= len)
 			continue;
 
 		csize = snprintf(str, len, "C%s", compat);

base-commit: 024a4a45fdf87218e3c0925475b05a27bcea103f
-- 
2.47.2
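The clamping logic in the patch relies on snprintf() returning the length it would have written, which can exceed the buffer size. A user-space sketch of the same bookkeeping follows. `toy_modalias()` is a hypothetical helper, not the kernel function: it takes a ready-made prefix string instead of a device node, and it assumes `str` is non-NULL and `len` is non-negative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical user-space analogue of the patched function. Writes the
 * prefix followed by one "C<compat>" entry per string, never past
 * str + len, and returns the total size the full alias would need.
 */
static ssize_t toy_modalias(char *str, ssize_t len, const char *prefix,
			    const char *const *compat, size_t ncompat)
{
	ssize_t csize, tsize;

	assert(str != NULL && len >= 0);	/* assumptions of this sketch */

	csize = snprintf(str, (size_t)len, "%s", prefix);
	tsize = csize;
	/* snprintf() reports the untruncated length; clamp to what was stored. */
	if (csize >= len)
		csize = len > 0 ? len - 1 : 0;
	len -= csize;
	str += csize;

	for (size_t i = 0; i < ncompat; i++) {
		csize = (ssize_t)strlen(compat[i]) + 1;	/* 'C' + string */
		tsize += csize;
		/* ">=" leaves room for the terminating NUL as well. */
		if (csize >= len)
			continue;

		csize = snprintf(str, (size_t)len, "C%s", compat[i]);
		len -= csize;
		str += csize;
	}
	return tsize;
}
```

With an 8-byte buffer and an 8-character prefix, the prefix is truncated to 7 characters plus NUL, the remaining length is 1, and every compatible string is skipped, while the return value still reports the full size needed.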

8 months

[PATCH 5.15.y] of: module: add buffer overflow check in of_modalias()

by Uwe Kleine-König

From: Sergey Shtylyov <s.shtylyov(a)omp.ru>

commit cf7385cb26ac4f0ee6c7385960525ad534323252 upstream.

In of_modalias(), if the buffer happens to be too small even for the 1st
snprintf() call, the len parameter will become negative and str parameter
(if not NULL initially) will point beyond the buffer's end. Add the buffer
overflow check after the 1st snprintf() call and fix such check after the
strlen() call (accounting for the terminating NUL char).

Fixes: bc575064d688 ("of/device: use of_property_for_each_string to parse compatible strings")
Signed-off-by: Sergey Shtylyov <s.shtylyov(a)omp.ru>
Link: https://lore.kernel.org/r/bbfc6be0-c687-62b6-d015-5141b93f313e@omp.ru
Signed-off-by: Rob Herring <robh(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Uwe Kleine-König <ukleinek(a)debian.org>
---
 drivers/of/device.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index 19c42a9dcba9..f503bb10b10b 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -257,14 +257,15 @@ static ssize_t of_device_get_modalias(struct device *dev, char *str, ssize_t len
 	csize = snprintf(str, len, "of:N%pOFn%c%s", dev->of_node, 'T',
 			 of_node_get_device_type(dev->of_node));
 	tsize = csize;
+	if (csize >= len)
+		csize = len > 0 ? len - 1 : 0;
 	len -= csize;
-	if (str)
-		str += csize;
+	str += csize;
 
 	of_property_for_each_string(dev->of_node, "compatible", p, compat) {
 		csize = strlen(compat) + 1;
 		tsize += csize;
-		if (csize > len)
+		if (csize >= len)
 			continue;
 
 		csize = snprintf(str, len, "C%s", compat);

base-commit: 16fdf2c7111bd6927f16c3e811f5086fecebbf00
-- 
2.47.2

8 months

[PATCH 5.15.y] btrfs: do not clean up repair bio if submit fails

by bin.lan.cn＠windriver.com

From: Josef Bacik <josef(a)toxicpanda.com>

[ Upstream commit 8cbc3001a3264d998d6b6db3e23f935c158abd4d ]

The submit helper will always run bio_endio() on the bio if it fails to
submit, so cleaning up the bio just leads to a variety of use-after-free
and NULL pointer dereference bugs because we race with the endio
function that is cleaning up the bio. Instead just return BLK_STS_OK as
the repair function has to continue to process the rest of the pages,
and the endio for the repair bio will do the appropriate cleanup for the
page that it was given.

Reviewed-by: Boris Burkov <boris(a)bur.io>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
[Minor context change fixed.]
Signed-off-by: Bin Lan <bin.lan.cn(a)windriver.com>
Signed-off-by: He Zhe <zhe.he(a)windriver.com>
---
Build test passed.
---
 fs/btrfs/extent_io.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 346fc46d019b..a1946d62911c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2624,7 +2624,6 @@ int btrfs_repair_one_sector(struct inode *inode,
 	const int icsum = bio_offset >> fs_info->sectorsize_bits;
 	struct bio *repair_bio;
 	struct btrfs_io_bio *repair_io_bio;
-	blk_status_t status;
 
 	btrfs_debug(fs_info,
 		   "repair read error: read error at %llu", start);
@@ -2664,13 +2663,13 @@ int btrfs_repair_one_sector(struct inode *inode,
 		    "repair read error: submitting new read to mirror %d",
 		    failrec->this_mirror);
 
-	status = submit_bio_hook(inode, repair_bio, failrec->this_mirror,
-				 failrec->bio_flags);
-	if (status) {
-		free_io_failure(failure_tree, tree, failrec);
-		bio_put(repair_bio);
-	}
-	return blk_status_to_errno(status);
+	/*
+	 * At this point we have a bio, so any errors from submit_bio_hook()
+	 * will be handled by the endio on the repair_bio, so we can't return an
+	 * error here.
+	 */
+	submit_bio_hook(inode, repair_bio, failrec->this_mirror, failrec->bio_flags);
+	return BLK_STS_OK;
 }
 
 static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len)
-- 
2.34.1
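The ownership rule behind this fix is that once a bio has been handed to the submit helper, its completion callback is the only place that cleans it up, even when submission fails. The sketch below models that rule with a counter instead of real frees. All names (`struct toy_bio`, `toy_submit()`, `toy_repair_old()`, `toy_repair_new()`) are hypothetical, and completion runs synchronously here, whereas in the kernel it normally runs asynchronously.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request object; cleanups counts how often it was torn down. */
struct toy_bio {
	int status;
	int cleanups;
	void (*end_io)(struct toy_bio *bio);
};

/* Completion callback: the single owner of cleanup. */
static void toy_repair_endio(struct toy_bio *bio)
{
	bio->cleanups++;
}

/* Like the helper described above: completes the bio even on failure. */
static int toy_submit(struct toy_bio *bio, bool fail)
{
	assert(bio != NULL && bio->end_io != NULL);
	bio->status = fail ? -EIO : 0;
	bio->end_io(bio);
	return bio->status;
}

/* Old shape: cleans up again on failure, duplicating the callback's work. */
static int toy_repair_old(struct toy_bio *bio, bool fail)
{
	int ret = toy_submit(bio, fail);

	if (ret)
		bio->cleanups++;	/* stands in for the removed cleanup calls */
	return ret;
}

/* New shape: submission errors are left entirely to the callback. */
static int toy_repair_new(struct toy_bio *bio, bool fail)
{
	toy_submit(bio, fail);
	return 0;
}
```

In the old shape a failed submission produces two cleanups of the same object, which in real code would be a double free or use-after-free; the new shape produces exactly one.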

8 months

[PATCH 5.4.y] of: module: add buffer overflow check in of_modalias()

by Uwe Kleine-König

From: Sergey Shtylyov <s.shtylyov(a)omp.ru>

commit cf7385cb26ac4f0ee6c7385960525ad534323252 upstream.

In of_modalias(), if the buffer happens to be too small even for the 1st
snprintf() call, the len parameter will become negative and str parameter
(if not NULL initially) will point beyond the buffer's end. Add the buffer
overflow check after the 1st snprintf() call and fix such check after the
strlen() call (accounting for the terminating NUL char).

Fixes: bc575064d688 ("of/device: use of_property_for_each_string to parse compatible strings")
Signed-off-by: Sergey Shtylyov <s.shtylyov(a)omp.ru>
Link: https://lore.kernel.org/r/bbfc6be0-c687-62b6-d015-5141b93f313e@omp.ru
Signed-off-by: Rob Herring <robh(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Uwe Kleine-König <ukleinek(a)debian.org>
---
 drivers/of/device.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index 7fb870097a84..ee3467730dac 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -213,14 +213,15 @@ static ssize_t of_device_get_modalias(struct device *dev, char *str, ssize_t len
 	csize = snprintf(str, len, "of:N%pOFn%c%s", dev->of_node, 'T',
 			 of_node_get_device_type(dev->of_node));
 	tsize = csize;
+	if (csize >= len)
+		csize = len > 0 ? len - 1 : 0;
 	len -= csize;
-	if (str)
-		str += csize;
+	str += csize;
 
 	of_property_for_each_string(dev->of_node, "compatible", p, compat) {
 		csize = strlen(compat) + 1;
 		tsize += csize;
-		if (csize > len)
+		if (csize >= len)
 			continue;
 
 		csize = snprintf(str, len, "C%s", compat);

base-commit: 2c8115e4757809ffd537ed9108da115026d3581f
-- 
2.47.2

8 months

+ mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled
     Subject: mm: userfaultfd: correct dirty flags set for both present and swap pte
has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
     mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-hotfixes-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Barry Song <v-songbaohua(a)oppo.com>
Subject: mm: userfaultfd: correct dirty flags set for both present and swap pte
Date: Fri, 9 May 2025 10:09:12 +1200

As David pointed out, what truly matters for mremap and userfaultfd move
operations is the soft dirty bit.  The current comment and
implementation, which always sets the dirty bit for present PTEs and
fails to set the soft dirty bit for swap PTEs, are incorrect.  This could
break features like Checkpoint-Restore in Userspace (CRIU).

This patch updates the behavior to correctly set the soft dirty bit for
both present and swap PTEs in accordance with mremap.

Link: https://lkml.kernel.org/r/20250508220912.7275-1-21cnbao@gmail.com
Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
Reported-by: David Hildenbrand <david(a)redhat.com>
Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redha…
Acked-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Lokesh Gidra <lokeshgidra(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/userfaultfd.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/mm/userfaultfd.c~mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte
+++ a/mm/userfaultfd.c
@@ -1064,8 +1064,13 @@ static int move_present_pte(struct mm_st
 
 	src_folio->index = linear_page_index(dst_vma, dst_addr);
 	orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
-	/* Follow mremap() behavior and treat the entry dirty after the move */
-	orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
+	/* Set soft dirty bit so userspace can notice the pte was moved */
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
+#endif
+	if (pte_dirty(orig_src_pte))
+		orig_dst_pte = pte_mkdirty(orig_dst_pte);
+	orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
 
 	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
 out:
@@ -1100,6 +1105,9 @@ static int move_swap_pte(struct mm_struc
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
+#endif
 	set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
 	double_pt_unlock(dst_ptl, src_ptl);
 
_

Patches currently in -mm which might be from v-songbaohua(a)oppo.com are

mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte.patch

8 months

[PATCH v3] block, scsi: sd_zbc: Respect bio vector limits for report zones buffer

by Steve Siwinski

The report zones buffer size is currently limited by the HBA's maximum
segment count to ensure the buffer can be mapped. However, the block
layer further limits the number of iovec entries to 1024 when allocating
a bio.

To avoid allocation of buffers too large to be mapped, further restrict
the maximum buffer size to BIO_MAX_INLINE_VECS.

Replace the UIO_MAXIOV symbolic name with the more contextually
appropriate BIO_MAX_INLINE_VECS.

Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation")
Cc: stable(a)vger.kernel.org
Signed-off-by: Steve Siwinski <ssiwinski(a)atto.com>
---
 block/bio.c           | 2 +-
 drivers/scsi/sd_zbc.c | 6 +++++-
 include/linux/bio.h   | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4e6c85a33d74..4be592d37fb6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -611,7 +611,7 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
 {
 	struct bio *bio;
 
-	if (nr_vecs > UIO_MAXIOV)
+	if (nr_vecs > BIO_MAX_INLINE_VECS)
 		return NULL;
 	return kmalloc(struct_size(bio, bi_inline_vecs, nr_vecs), gfp_mask);
 }
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 7a447ff600d2..a8db66428f80 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -169,6 +169,7 @@ static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp,
 					unsigned int nr_zones, size_t *buflen)
 {
 	struct request_queue *q = sdkp->disk->queue;
+	unsigned int max_segments;
 	size_t bufsize;
 	void *buf;
 
@@ -180,12 +181,15 @@ static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp,
 	 * Furthermore, since the report zone command cannot be split, make
 	 * sure that the allocated buffer can always be mapped by limiting the
 	 * number of pages allocated to the HBA max segments limit.
+	 * Since max segments can be larger than the max inline bio vectors,
+	 * further limit the allocated buffer to BIO_MAX_INLINE_VECS.
 	 */
 	nr_zones = min(nr_zones, sdkp->zone_info.nr_zones);
 	bufsize = roundup((nr_zones + 1) * 64, SECTOR_SIZE);
 	bufsize = min_t(size_t, bufsize,
 			queue_max_hw_sectors(q) << SECTOR_SHIFT);
-	bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
+	max_segments = min(BIO_MAX_INLINE_VECS, queue_max_segments(q));
+	bufsize = min_t(size_t, bufsize, max_segments << PAGE_SHIFT);
 
 	while (bufsize >= SECTOR_SIZE) {
 		buf = kvzalloc(bufsize, GFP_KERNEL | __GFP_NORETRY);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index cafc7c215de8..b786ec5bcc81 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -11,6 +11,7 @@
 #include <linux/uio.h>
 
 #define BIO_MAX_VECS		256U
+#define BIO_MAX_INLINE_VECS	UIO_MAXIOV
 
 struct queue_limits;
 
-- 
2.43.5
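The effect of the extra clamp is easiest to see with numbers. The following user-space sketch reproduces the buffer size arithmetic from the patched function under stated assumptions: 512-byte sectors, 4 KiB pages, and an inline vector limit of 1024 as described in the commit message. `toy_report_bufsize()` and the `TOY_*` constants are hypothetical names, and the queue limits are passed in as plain integers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SECTOR_SIZE		512u
#define TOY_SECTOR_SHIFT	9
#define TOY_PAGE_SHIFT		12	/* assumes 4 KiB pages */
#define TOY_MAX_INLINE_VECS	1024u	/* limit quoted in the commit message */

static size_t toy_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the report zones buffer sizing: a 64-byte header
 * plus 64 bytes per zone, rounded up to a sector, then clamped by the
 * maximum transfer size and by the number of mappable segments.
 */
static size_t toy_report_bufsize(unsigned int nr_zones,
				 unsigned int disk_zones,
				 unsigned int max_hw_sectors,
				 unsigned int max_segments)
{
	size_t bufsize;

	nr_zones = nr_zones < disk_zones ? nr_zones : disk_zones;
	bufsize = ((size_t)nr_zones + 1) * 64;
	bufsize = (bufsize + TOY_SECTOR_SIZE - 1) / TOY_SECTOR_SIZE
		  * TOY_SECTOR_SIZE;
	bufsize = toy_min(bufsize, (size_t)max_hw_sectors << TOY_SECTOR_SHIFT);

	/* The new step: never plan for more segments than a bio can hold. */
	if (max_segments > TOY_MAX_INLINE_VECS)
		max_segments = TOY_MAX_INLINE_VECS;
	bufsize = toy_min(bufsize, (size_t)max_segments << TOY_PAGE_SHIFT);

	assert(bufsize % TOY_SECTOR_SIZE == 0);
	return bufsize;
}
```

For a disk with 100000 zones and an HBA advertising 2048 segments, the unclamped size would be 6400512 bytes, which needs more than 1024 pages; with the clamp the buffer is limited to 1024 pages, or 4194304 bytes under these assumptions.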

8 months

[PATCH v2] mm: userfaultfd: correct dirty flags set for both present and swap pte

by Barry Song

From: Barry Song <v-songbaohua(a)oppo.com>

As David pointed out, what truly matters for mremap and userfaultfd
move operations is the soft dirty bit. The current comment and
implementation, which always sets the dirty bit for present PTEs
and fails to set the soft dirty bit for swap PTEs, are incorrect.
This could break features like Checkpoint-Restore in Userspace
(CRIU).

This patch updates the behavior to correctly set the soft dirty bit
for both present and swap PTEs in accordance with mremap.

Reported-by: David Hildenbrand <david(a)redhat.com>
Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redha…
Acked-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Lokesh Gidra <lokeshgidra(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
---
 mm/userfaultfd.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e8ce92dc105f..bc473ad21202 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1064,8 +1064,13 @@ static int move_present_pte(struct mm_struct *mm,
 
 	src_folio->index = linear_page_index(dst_vma, dst_addr);
 	orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
-	/* Follow mremap() behavior and treat the entry dirty after the move */
-	orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
+	/* Set soft dirty bit so userspace can notice the pte was moved */
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
+#endif
+	if (pte_dirty(orig_src_pte))
+		orig_dst_pte = pte_mkdirty(orig_dst_pte);
+	orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
 
 	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
 out:
@@ -1100,6 +1105,9 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
+#endif
 	set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
 	double_pt_unlock(dst_ptl, src_ptl);
 
-- 
2.39.3 (Apple Git-146)
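For context on why the soft dirty bit matters to tools such as CRIU: user space observes it through /proc/PID/pagemap, where each page has a 64-bit entry and bit 55 reports soft-dirty (bits 62 and 63 report swapped and present). The sketch below decodes such an entry and reads one for the calling process. The helper names are hypothetical, the read may fail where pagemap is unavailable, and the bit is only meaningful on kernels built with CONFIG_MEM_SOFT_DIRTY.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions documented for /proc/PID/pagemap entries. */
#define PM_SOFT_DIRTY	(1ULL << 55)
#define PM_SWAP		(1ULL << 62)
#define PM_PRESENT	(1ULL << 63)

/* Nonzero if the entry has the soft-dirty bit set. */
static int pm_soft_dirty(uint64_t entry)
{
	return (entry & PM_SOFT_DIRTY) != 0;
}

/* Nonzero if the page is either present in RAM or swapped out. */
static int pm_mapped(uint64_t entry)
{
	return (entry & (PM_PRESENT | PM_SWAP)) != 0;
}

/*
 * Read the pagemap entry covering addr in the calling process.
 * Returns 0 on success and -1 on any error.
 */
static int read_pagemap_entry(const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	/* One 8-byte entry per virtual page, indexed by page number. */
	offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size)
		 * (off_t)sizeof(*entry);
	n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

A checkpointing tool typically clears the bits by writing 4 to /proc/PID/clear_refs and later rescans pagemap to find pages that changed. A moved page whose soft dirty bit is not set would be missed by such a scan, which is the breakage the commit message warns about.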

8 months
