Hello,
This patch removes dead code in lib/inflate.c; it follows from a discussion
in Xen.
The dead code is tracked as Coverity-ID 1055253 in Xen and was triggered by a
file taken unmodified from Linux.
Thank you,
Link: https://lore.kernel.org/all/7587b503-b2ca-4476-8dc9-e9683d4ca5f0@suse.com/
--
v2:
* Cc stable
Ariel Otilibili (1):
lib: Remove dead code
lib/inflate.c | 2 --
1 file changed, 2 deletions(-)
--
2.47.1
Hi all,
This is the latest revision of a patchset that adds to XFS kernel
support for reverse mapping for the realtime device. This time around
I've fixed some of the bitrot that I've noticed over the past few
months, and most notably have converted rtrmapbt to use the metadata
inode directory feature instead of burning more space in the superblock.
At the beginning of the set are patches to implement storing B+tree
leaves in an inode root, since the realtime rmapbt is rooted in an
inode, unlike the regular rmapbt which is rooted in an AG block.
Prior to this, the only btree that could be rooted in the inode fork
was the block mapping btree; if all the extent records fit in the
inode, the format would be switched from 'btree' to 'extents'.
The next few patches enhance the reverse mapping routines to handle
the parts that are specific to rtgroups -- adding the new btree type,
adding a new log intent item type, and wiring up the metadata directory
tree entries.
Finally, implement GETFSMAP support for the rtrmapbt, along with scrub
and online fsck functionality for the rtrmapbt and rtbitmap.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=re…
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h…
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h…
xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/l…
---
Commits in this patchset:
* xfs: add some rtgroup inode helpers
* xfs: prepare rmap btree cursor tracepoints for realtime
* xfs: simplify the xfs_rmap_{alloc,free}_extent calling conventions
* xfs: introduce realtime rmap btree ondisk definitions
* xfs: realtime rmap btree transaction reservations
* xfs: add realtime rmap btree operations
* xfs: prepare rmap functions to deal with rtrmapbt
* xfs: add a realtime flag to the rmap update log redo items
* xfs: support recovering rmap intent items targeting realtime extents
* xfs: pretty print metadata file types in error messages
* xfs: support file data forks containing metadata btrees
* xfs: add realtime reverse map inode to metadata directory
* xfs: add metadata reservations for realtime rmap btrees
* xfs: wire up a new metafile type for the realtime rmap
* xfs: wire up rmap map and unmap to the realtime rmapbt
* xfs: create routine to allocate and initialize a realtime rmap btree inode
* xfs: wire up getfsmap to the realtime reverse mapping btree
* xfs: check that the rtrmapbt maxlevels doesn't increase when growing fs
* xfs: report realtime rmap btree corruption errors to the health system
* xfs: allow queued realtime intents to drain before scrubbing
* xfs: scrub the realtime rmapbt
* xfs: cross-reference realtime bitmap to realtime rmapbt scrubber
* xfs: cross-reference the realtime rmapbt
* xfs: scan rt rmap when we're doing an intense rmap check of bmbt mappings
* xfs: scrub the metadir path of rt rmap btree files
* xfs: walk the rt reverse mapping tree when rebuilding rmap
* xfs: online repair of realtime file bmaps
* xfs: repair inodes that have realtime extents
* xfs: repair rmap btree inodes
* xfs: online repair of realtime bitmaps for a realtime group
* xfs: support repairing metadata btrees rooted in metadir inodes
* xfs: online repair of the realtime rmap btree
* xfs: create a shadow rmap btree during realtime rmap repair
* xfs: hook live realtime rmap operations during a repair operation
* xfs: don't shut down the filesystem for media failures beyond end of log
* xfs: react to fsdax failure notifications on the rt device
* xfs: enable realtime rmap btree
---
fs/xfs/Makefile | 3
fs/xfs/libxfs/xfs_btree.c | 73 +++
fs/xfs/libxfs/xfs_btree.h | 8
fs/xfs/libxfs/xfs_btree_mem.c | 1
fs/xfs/libxfs/xfs_btree_staging.c | 1
fs/xfs/libxfs/xfs_defer.h | 1
fs/xfs/libxfs/xfs_exchmaps.c | 4
fs/xfs/libxfs/xfs_format.h | 28 +
fs/xfs/libxfs/xfs_fs.h | 7
fs/xfs/libxfs/xfs_health.h | 4
fs/xfs/libxfs/xfs_inode_buf.c | 32 +
fs/xfs/libxfs/xfs_inode_fork.c | 25 +
fs/xfs/libxfs/xfs_log_format.h | 6
fs/xfs/libxfs/xfs_log_recover.h | 2
fs/xfs/libxfs/xfs_metafile.c | 18 +
fs/xfs/libxfs/xfs_metafile.h | 2
fs/xfs/libxfs/xfs_ondisk.h | 2
fs/xfs/libxfs/xfs_refcount.c | 6
fs/xfs/libxfs/xfs_rmap.c | 171 +++++-
fs/xfs/libxfs/xfs_rmap.h | 12
fs/xfs/libxfs/xfs_rtbitmap.c | 2
fs/xfs/libxfs/xfs_rtbitmap.h | 9
fs/xfs/libxfs/xfs_rtgroup.c | 53 +-
fs/xfs/libxfs/xfs_rtgroup.h | 49 ++
fs/xfs/libxfs/xfs_rtrmap_btree.c | 1011 +++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtrmap_btree.h | 210 ++++++++
fs/xfs/libxfs/xfs_sb.c | 6
fs/xfs/libxfs/xfs_shared.h | 14 +
fs/xfs/libxfs/xfs_trans_resv.c | 12
fs/xfs/libxfs/xfs_trans_space.h | 13
fs/xfs/scrub/alloc_repair.c | 5
fs/xfs/scrub/bmap.c | 108 +++-
fs/xfs/scrub/bmap_repair.c | 129 +++++
fs/xfs/scrub/common.c | 160 ++++++
fs/xfs/scrub/common.h | 23 +
fs/xfs/scrub/health.c | 1
fs/xfs/scrub/inode.c | 10
fs/xfs/scrub/inode_repair.c | 136 +++++
fs/xfs/scrub/metapath.c | 3
fs/xfs/scrub/newbt.c | 42 ++
fs/xfs/scrub/newbt.h | 1
fs/xfs/scrub/reap.c | 41 ++
fs/xfs/scrub/reap.h | 2
fs/xfs/scrub/repair.c | 191 +++++++
fs/xfs/scrub/repair.h | 17 +
fs/xfs/scrub/rgsuper.c | 6
fs/xfs/scrub/rmap_repair.c | 84 +++
fs/xfs/scrub/rtbitmap.c | 75 ++-
fs/xfs/scrub/rtbitmap.h | 55 ++
fs/xfs/scrub/rtbitmap_repair.c | 429 +++++++++++++++-
fs/xfs/scrub/rtrmap.c | 271 ++++++++++
fs/xfs/scrub/rtrmap_repair.c | 903 +++++++++++++++++++++++++++++++++
fs/xfs/scrub/rtsummary.c | 17 -
fs/xfs/scrub/rtsummary_repair.c | 3
fs/xfs/scrub/scrub.c | 11
fs/xfs/scrub/scrub.h | 14 +
fs/xfs/scrub/stats.c | 1
fs/xfs/scrub/tempexch.h | 2
fs/xfs/scrub/tempfile.c | 20 -
fs/xfs/scrub/trace.c | 1
fs/xfs/scrub/trace.h | 228 ++++++++
fs/xfs/xfs_buf.c | 1
fs/xfs/xfs_buf_item_recover.c | 4
fs/xfs/xfs_drain.c | 20 -
fs/xfs/xfs_drain.h | 7
fs/xfs/xfs_fsmap.c | 174 ++++++
fs/xfs/xfs_fsops.c | 11
fs/xfs/xfs_health.c | 1
fs/xfs/xfs_inode.c | 19 +
fs/xfs/xfs_inode_item.c | 2
fs/xfs/xfs_inode_item_recover.c | 44 +-
fs/xfs/xfs_log_recover.c | 2
fs/xfs/xfs_mount.c | 5
fs/xfs/xfs_mount.h | 9
fs/xfs/xfs_notify_failure.c | 230 +++++---
fs/xfs/xfs_notify_failure.h | 11
fs/xfs/xfs_qm.c | 8
fs/xfs/xfs_rmap_item.c | 216 +++++++-
fs/xfs/xfs_rtalloc.c | 82 ++-
fs/xfs/xfs_rtalloc.h | 10
fs/xfs/xfs_stats.c | 4
fs/xfs/xfs_stats.h | 2
fs/xfs/xfs_super.c | 6
fs/xfs/xfs_super.h | 1
fs/xfs/xfs_trace.h | 104 ++--
85 files changed, 5381 insertions(+), 366 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_rtrmap_btree.c
create mode 100644 fs/xfs/libxfs/xfs_rtrmap_btree.h
create mode 100644 fs/xfs/scrub/rtrmap.c
create mode 100644 fs/xfs/scrub/rtrmap_repair.c
create mode 100644 fs/xfs/xfs_notify_failure.h
Hi all,
Bug fixes for 6.13.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xf…
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h…
---
Commits in this patchset:
* xfs: don't over-report free space or inodes in statvfs
* xfs: release the dquot buf outside of qli_lock
---
fs/xfs/xfs_dquot.c | 12 ++++++++----
fs/xfs/xfs_qm_bhv.c | 27 +++++++++++++++++----------
2 files changed, 25 insertions(+), 14 deletions(-)
This partially reverts the commit that made hfi_session_destroy()
the first step of vdec/venc close(). The reason is a regression
report: supposedly the encoder/decoder is closed while streaming is
still active (no ->stop_streaming() call before close()) and with
packets pending, so the ISR thread cannot find the instance and
fails to process those pending packets. This was the idea behind
the original patch - make it impossible to use an instance under
destruction, because that is racy - but apparently there are use
cases that depend on that unsafe pattern. Return to the old
(unsafe) behaviour for the time being (until a better fix is
found).
Fixes: 45b1a1b348ec ("media: venus: sync with threaded IRQ during inst destruction")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
---
drivers/media/platform/qcom/venus/core.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c
index 2d27c5167246..807487a1f536 100644
--- a/drivers/media/platform/qcom/venus/core.c
+++ b/drivers/media/platform/qcom/venus/core.c
@@ -506,18 +506,14 @@ static __maybe_unused int venus_runtime_suspend(struct device *dev)
void venus_close_common(struct venus_inst *inst)
{
/*
- * First, remove the inst from the ->instances list, so that
- * to_instance() will return NULL.
- */
- hfi_session_destroy(inst);
- /*
- * Second, make sure we don't have IRQ/IRQ-thread currently running
+ * Make sure we don't have IRQ/IRQ-thread currently running
* or pending execution, which would race with the inst destruction.
*/
synchronize_irq(inst->core->irq);
v4l2_m2m_ctx_release(inst->m2m_ctx);
v4l2_m2m_release(inst->m2m_dev);
+ hfi_session_destroy(inst);
v4l2_fh_del(&inst->fh);
v4l2_fh_exit(&inst->fh);
v4l2_ctrl_handler_free(&inst->ctrl_handler);
--
2.47.1.613.gc27f4b7a9f-goog
On Friday, December 20, 2024 00:13 CET, Andrew Morton <akpm(a)linux-foundation.org> wrote:
>
> The patch titled
> Subject: lib/inflate.c: remove dead code
> has been added to the -mm mm-nonmm-unstable branch. Its filename is
> lib-remove-dead-code.patch
>
> This patch will shortly appear at
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
>
> This patch will later appear in the mm-nonmm-unstable branch at
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>
> Before you just go and hit "reply", please:
> a) Consider who else should be cc'ed
Hello Andrew,
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Cc: xen-devel(a)lists.xenproject.org
https://lore.kernel.org/lkml/20241219224645.749233-1-ariel.otilibili-anieli…
Thank you,
Ariel
> b) Prefer to cc a suitable mailing list as well
> c) Ideally: find the original patch on the mailing list and do a
> reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
>
> The -mm tree is included into linux-next via the mm-everything
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> and is updated there every 2-3 working days
>
> ------------------------------------------------------
> From: Ariel Otilibili <ariel.otilibili-anieli(a)eurecom.fr>
> Subject: lib/inflate.c: remove dead code
> Date: Thu, 19 Dec 2024 10:21:12 +0100
>
> This is a follow up from a discussion in Xen:
>
> The if-statement tests that `res` is non-zero; meaning the case zero is
> never reached.
>
> Link: https://lore.kernel.org/all/7587b503-b2ca-4476-8dc9-e9683d4ca5f0@suse.com/
> Link: https://lkml.kernel.org/r/20241219092615.644642-2-ariel.otilibili-anieli@eu…
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Ariel Otilibili <ariel.otilibili-anieli(a)eurecom.fr>
> Suggested-by: Jan Beulich <jbeulich(a)suse.com>
> Cc: Andrew Cooper <andrew.cooper3(a)citrix.com>
> Cc: Anthony PERARD <anthony.perard(a)vates.tech>
> Cc: Michal Orzel <michal.orzel(a)amd.com>
> Cc: Julien Grall <julien(a)xen.org>
> Cc: Roger Pau Monné <roger.pau(a)citrix.com>
> Cc: Stefano Stabellini <sstabellini(a)kernel.org>
> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
> ---
>
> lib/inflate.c | 2 --
> 1 file changed, 2 deletions(-)
>
> --- a/lib/inflate.c~lib-remove-dead-code
> +++ a/lib/inflate.c
> @@ -1257,8 +1257,6 @@ static int INIT gunzip(void)
> /* Decompress */
> if ((res = inflate())) {
> switch (res) {
> - case 0:
> - break;
> case 1:
> error("invalid compressed format (err=1)");
> break;
> _
>
> Patches currently in -mm which might be from ariel.otilibili-anieli(a)eurecom.fr are
>
> lib-remove-dead-code.patch
>
The patch titled
Subject: mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-shmem-fix-the-update-of-shmem_falloc-nr_unswapped.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Subject: mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
Date: Thu, 19 Dec 2024 15:30:09 +0800
The 'shmem_falloc->nr_unswapped' field records how many pages writepage
refused to swap out because fallocate() was allocating. However, since
shmem gained support for swapping out large folios, the update of
'shmem_falloc->nr_unswapped' does not use the correct number of pages in
the large folio, which may lead to fallocate() not exiting as soon as it
should.
Anyway, this was found through code inspection, and I am not sure whether
it would actually cause serious issues.
Link: https://lkml.kernel.org/r/f66a0119d0564c2c37c84f045835b870d1b2196f.17345931…
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/shmem.c~mm-shmem-fix-the-update-of-shmem_falloc-nr_unswapped
+++ a/mm/shmem.c
@@ -1535,7 +1535,7 @@ try_split:
!shmem_falloc->waitq &&
index >= shmem_falloc->start &&
index < shmem_falloc->next)
- shmem_falloc->nr_unswapped++;
+ shmem_falloc->nr_unswapped += nr_pages;
else
shmem_falloc = NULL;
spin_unlock(&inode->i_lock);
_
Patches currently in -mm which might be from baolin.wang(a)linux.alibaba.com are
docs-mm-fix-the-incorrect-filehugemapped-field.patch
mm-shmem-fix-incorrect-index-alignment-for-within_size-policy.patch
mm-shmem-fix-the-update-of-shmem_falloc-nr_unswapped.patch
mm-factor-out-the-order-calculation-into-a-new-helper.patch
mm-shmem-change-shmem_huge_global_enabled-to-return-huge-order-bitmap.patch
mm-shmem-add-large-folio-support-for-tmpfs.patch
mm-shmem-add-a-kernel-command-line-to-change-the-default-huge-policy-for-tmpfs.patch
docs-tmpfs-drop-fadvise-from-the-documentation.patch
The patch titled
Subject: mm: shmem: fix incorrect index alignment for within_size policy
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-shmem-fix-incorrect-index-alignment-for-within_size-policy.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Subject: mm: shmem: fix incorrect index alignment for within_size policy
Date: Thu, 19 Dec 2024 15:30:08 +0800
With the shmem per-size within_size policy enabled, using an incorrect
'order' value to round_up() the index can lead to incorrect i_size checks,
resulting in inappropriately large orders being returned.
Change to using '1 << order' to round_up() the index to fix this issue.
Additionally, add an 'aligned_index' variable to avoid affecting the
index checks.
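A minimal worked example of the arithmetic (a standalone sketch with
assumed values, not part of the patch; the simplified round_up() below
mirrors the kernel's power-of-two macro):

#include <stdio.h>

/* simplified power-of-two round_up(), equivalent to the kernel macro */
#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	unsigned long index = 0, order = 4;

	/* old code aligned to 'order' itself (4), i.e. the exponent */
	printf("old: %lu\n", round_up(index + 1, order));		/* 4 */
	/* fixed code aligns to a full order-4 folio: 1 << 4 = 16 pages */
	printf("new: %lu\n", round_up(index + 1, 1UL << order));	/* 16 */
	return 0;
}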
Link: https://lkml.kernel.org/r/77d8ef76a7d3d646e9225e9af88a76549a68aab1.17345931…
Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/mm/shmem.c~mm-shmem-fix-incorrect-index-alignment-for-within_size-policy
+++ a/mm/shmem.c
@@ -1689,6 +1689,7 @@ unsigned long shmem_allowable_huge_order
unsigned long mask = READ_ONCE(huge_shmem_orders_always);
unsigned long within_size_orders = READ_ONCE(huge_shmem_orders_within_size);
unsigned long vm_flags = vma ? vma->vm_flags : 0;
+ pgoff_t aligned_index;
bool global_huge;
loff_t i_size;
int order;
@@ -1723,9 +1724,9 @@ unsigned long shmem_allowable_huge_order
/* Allow mTHP that will be fully within i_size. */
order = highest_order(within_size_orders);
while (within_size_orders) {
- index = round_up(index + 1, order);
+ aligned_index = round_up(index + 1, 1 << order);
i_size = round_up(i_size_read(inode), PAGE_SIZE);
- if (i_size >> PAGE_SHIFT >= index) {
+ if (i_size >> PAGE_SHIFT >= aligned_index) {
mask |= within_size_orders;
break;
}
_
Patches currently in -mm which might be from baolin.wang(a)linux.alibaba.com are
docs-mm-fix-the-incorrect-filehugemapped-field.patch
mm-shmem-fix-incorrect-index-alignment-for-within_size-policy.patch
mm-shmem-fix-the-update-of-shmem_falloc-nr_unswapped.patch
mm-factor-out-the-order-calculation-into-a-new-helper.patch
mm-shmem-change-shmem_huge_global_enabled-to-return-huge-order-bitmap.patch
mm-shmem-add-large-folio-support-for-tmpfs.patch
mm-shmem-add-a-kernel-command-line-to-change-the-default-huge-policy-for-tmpfs.patch
docs-tmpfs-drop-fadvise-from-the-documentation.patch
The patch titled
Subject: mm: zswap: fix race between [de]compression and CPU hotunplug
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-zswap-fix-race-between-compression-and-cpu-hotunplug.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yosry Ahmed <yosryahmed(a)google.com>
Subject: mm: zswap: fix race between [de]compression and CPU hotunplug
Date: Thu, 19 Dec 2024 21:24:37 +0000
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
current CPU at the beginning of the operation is retrieved and used
throughout. However, since neither preemption nor migration is disabled,
it is possible that the operation continues on a different CPU.
If the original CPU is hotunplugged while the acomp_ctx is still in use,
we run into a UAF bug as the resources attached to the acomp_ctx are freed
during hotunplug in zswap_cpu_comp_dead().
The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
crypto_acomp API for hardware acceleration") when the switch to the
crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
retrieved using get_cpu_ptr() which disables preemption and makes sure the
CPU cannot go away from under us. Preemption cannot be disabled with the
crypto_acomp API as a sleepable context is needed.
Commit 8ba2f844f050 ("mm/zswap: change per-cpu mutex and buffer to
per-acomp_ctx") increased the UAF surface area by making the per-CPU
buffers dynamic, adding yet another resource that can be freed from under
zswap compression/decompression by CPU hotunplug.
There are a few ways to fix this:
(a) Add a refcount for acomp_ctx.
(b) Disable migration while using the per-CPU acomp_ctx.
(c) Disable CPU hotunplug while using the per-CPU acomp_ctx by holding
the CPUs read lock.
Implement (c) since it's simpler than (a), and (b) involves using
migrate_disable() which is apparently undesired (see huge comment in
include/linux/preempt.h).
Link: https://lkml.kernel.org/r/20241219212437.2714151-1-yosryahmed@google.com
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
Signed-off-by: Yosry Ahmed <yosryahmed(a)google.com>
Reported-by: Johannes Weiner <hannes(a)cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
Reported-by: Sam Sun <samsun1006219(a)gmail.com>
Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tP…
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chengming Zhou <chengming.zhou(a)linux.dev>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Vitaly Wool <vitalywool(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/zswap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
--- a/mm/zswap.c~mm-zswap-fix-race-between-compression-and-cpu-hotunplug
+++ a/mm/zswap.c
@@ -880,6 +880,18 @@ static int zswap_cpu_comp_dead(unsigned
return 0;
}
+/* Prevent CPU hotplug from freeing up the per-CPU acomp_ctx resources */
+static struct crypto_acomp_ctx *acomp_ctx_get_cpu(struct crypto_acomp_ctx __percpu *acomp_ctx)
+{
+ cpus_read_lock();
+ return raw_cpu_ptr(acomp_ctx);
+}
+
+static void acomp_ctx_put_cpu(void)
+{
+ cpus_read_unlock();
+}
+
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -893,8 +905,7 @@ static bool zswap_compress(struct page *
gfp_t gfp;
u8 *dst;
- acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
-
+ acomp_ctx = acomp_ctx_get_cpu(pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
dst = acomp_ctx->buffer;
@@ -950,6 +961,7 @@ unlock:
zswap_reject_alloc_fail++;
mutex_unlock(&acomp_ctx->mutex);
+ acomp_ctx_put_cpu();
return comp_ret == 0 && alloc_ret == 0;
}
@@ -960,7 +972,7 @@ static void zswap_decompress(struct zswa
struct crypto_acomp_ctx *acomp_ctx;
u8 *src;
- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+ acomp_ctx = acomp_ctx_get_cpu(entry->pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
@@ -990,6 +1002,7 @@ static void zswap_decompress(struct zswa
if (src != acomp_ctx->buffer)
zpool_unmap_handle(zpool, entry->handle);
+ acomp_ctx_put_cpu();
}
/*********************************
_
Patches currently in -mm which might be from yosryahmed(a)google.com are
mm-zswap-fix-race-between-compression-and-cpu-hotunplug.patch
Comparing against the ARM list [1], it appears that several ARM cores
were missing from the lists in spectre_bhb_loop_affected(). Add them.
NOTE: for some of these cores it may not matter since other ways of
clearing the BHB may be used (like the CLRBHB instruction or ECBHB),
but it still seems good to have all the info from ARM's whitepaper
included.
[1] https://developer.arm.com/Arm%20Security%20Center/Spectre-BHB
Fixes: 558c303c9734 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
---
Changes in v3:
- New
arch/arm64/kernel/proton-pack.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index 06e04c9e6480..86d67f5a5a72 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -872,6 +872,14 @@ static u8 spectre_bhb_loop_affected(void)
{
u8 k = 0;
+ static const struct midr_range spectre_bhb_k132_list[] = {
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_X3),
+ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V2),
+ };
+ static const struct midr_range spectre_bhb_k38_list[] = {
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A715),
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A720),
+ };
static const struct midr_range spectre_bhb_k32_list[] = {
MIDR_ALL_VERSIONS(MIDR_CORTEX_A78),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A78AE),
@@ -885,6 +893,7 @@ static u8 spectre_bhb_loop_affected(void)
};
static const struct midr_range spectre_bhb_k24_list[] = {
MIDR_ALL_VERSIONS(MIDR_CORTEX_A76),
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A76AE),
MIDR_ALL_VERSIONS(MIDR_CORTEX_A77),
MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
{},
@@ -899,7 +908,11 @@ static u8 spectre_bhb_loop_affected(void)
{},
};
- if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_k32_list))
+ if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_k132_list))
+ k = 132;
+ else if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_k38_list))
+ k = 38;
+ else if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_k32_list))
k = 32;
else if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_k24_list))
k = 24;
--
2.47.1.613.gc27f4b7a9f-goog
The code for detecting CPUs that are vulnerable to Spectre BHB was
based on a hardcoded list of CPU IDs that were known to be affected.
Unfortunately, the list mostly only contained the IDs of standard ARM
cores. The IDs for many cores that are minor variants of the standard
ARM cores (like many Qualcomm Kryo CPUs) weren't listed. This led the
code to assume that those variants were not affected.
Flip the code on its head and instead list CPU IDs for cores that are
known to be _not_ affected. Now CPUs will be assumed vulnerable until
added to the list saying that they're safe.
As of right now, the only CPU IDs added to the "unaffected" list are
ARM Cortex A35, A53, and A55. This list was created by looking at
older cores listed in cputype.h that weren't listed in the "affected"
list previously.
Unfortunately, while this solution is better than what we had before,
it's still an imperfect solution. Specifically there are two ways to
mitigate Spectre BHB and one of those ways is parameterized with a "k"
value indicating how many loops are needed to mitigate. If we have an
unknown CPU ID then we've got to guess about how to mitigate it. Since
more cores seem to be mitigated by looping (and because it's unlikely
that the needed FW code will be in place for FW mitigation for unknown
cores), we'll choose looping for unknown CPUs and choose the highest
"k" value of 32.
The downside of our guessing is that some CPUs may now report as
"mitigated" when in reality they need a firmware mitigation. We'll put a
WARN_ON splat in the logs any time we had to make a guess, since guessing
the right mitigation is pretty awful. Hopefully this will encourage CPU
vendors to add their CPU IDs to the list.
Fixes: 558c303c9734 ("arm64: Mitigate spectre style branch history side channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Douglas Anderson <dianders(a)chromium.org>
---
Changes in v2:
- New
arch/arm64/kernel/proton-pack.c | 46 +++++++++++++++++++++++++++------
1 file changed, 38 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/kernel/proton-pack.c b/arch/arm64/kernel/proton-pack.c
index da53722f95d4..39c5573c7527 100644
--- a/arch/arm64/kernel/proton-pack.c
+++ b/arch/arm64/kernel/proton-pack.c
@@ -841,13 +841,31 @@ enum bhb_mitigation_bits {
};
static unsigned long system_bhb_mitigations;
+static const struct midr_range spectre_bhb_firmware_mitigated_list[] = {
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A73),
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A75),
+ {},
+};
+
+static const struct midr_range spectre_bhb_safe_list[] = {
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A35),
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A53),
+ MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ {},
+};
+
/*
* This must be called with SCOPE_LOCAL_CPU for each type of CPU, before any
* SCOPE_SYSTEM call will give the right answer.
+ *
+ * NOTE: Unknown CPUs are reported as affected. In order to make this work
+ * and still keep the list short, only handle CPUs where:
+ * - supports_csv2p3() returned false
+ * - supports_clearbhb() returned false.
*/
u8 spectre_bhb_loop_affected(int scope)
{
- u8 k = 0;
+ u8 k;
static u8 max_bhb_k;
if (scope == SCOPE_LOCAL_CPU) {
@@ -886,6 +904,16 @@ u8 spectre_bhb_loop_affected(int scope)
k = 11;
else if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_k8_list))
k = 8;
+ else if (is_midr_in_range_list(read_cpuid_id(), spectre_bhb_safe_list) ||
+ is_midr_in_range_list(read_cpuid_id(), spectre_bhb_firmware_mitigated_list))
+ k = 0;
+ else {
+ WARN_ONCE(true,
+ "Unrecognized CPU %#010x, assuming Spectre BHB vulnerable\n",
+ read_cpuid_id());
+ /* Hopefully k = 32 handles the worst case for unknown CPUs */
+ k = 32;
+ }
max_bhb_k = max(max_bhb_k, k);
} else {
@@ -916,24 +944,26 @@ static enum mitigation_state spectre_bhb_get_cpu_fw_mitigation_state(void)
}
}
+/*
+ * NOTE: Unknown CPUs are reported as affected. In order to make this work
+ * and still keep the list short, only handle CPUs where:
+ * - supports_csv2p3() returned false
+ * - supports_clearbhb() returned false.
+ * - spectre_bhb_loop_affected() returned 0.
+ */
static bool is_spectre_bhb_fw_affected(int scope)
{
static bool system_affected;
enum mitigation_state fw_state;
bool has_smccc = arm_smccc_1_1_get_conduit() != SMCCC_CONDUIT_NONE;
- static const struct midr_range spectre_bhb_firmware_mitigated_list[] = {
- MIDR_ALL_VERSIONS(MIDR_CORTEX_A73),
- MIDR_ALL_VERSIONS(MIDR_CORTEX_A75),
- {},
- };
bool cpu_in_list = is_midr_in_range_list(read_cpuid_id(),
- spectre_bhb_firmware_mitigated_list);
+ spectre_bhb_safe_list);
if (scope != SCOPE_LOCAL_CPU)
return system_affected;
fw_state = spectre_bhb_get_cpu_fw_mitigation_state();
- if (cpu_in_list || (has_smccc && fw_state == SPECTRE_MITIGATED)) {
+ if (!cpu_in_list || (has_smccc && fw_state == SPECTRE_MITIGATED)) {
system_affected = true;
return true;
}
--
2.47.1.613.gc27f4b7a9f-goog
Like other Asus Vivobook models, the X1504VAP has its keyboard IRQ (1)
described as ActiveLow in the DSDT, which the kernel overrides to EdgeHigh;
this breaks the keyboard.
Add the X1504VAP to the irq1_level_low_skip_override[] quirk table to fix
this.
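For reference, a minimal sketch of the kind of table entry such a change
typically adds; the exact DMI match strings below are assumptions, not
taken from the patch:

#include <linux/dmi.h>

static const struct dmi_system_id irq1_level_low_skip_override[] = {
	{
		/* Asus Vivobook X1504VAP (DMI strings assumed) */
		.matches = {
			DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."),
			DMI_MATCH(DMI_BOARD_NAME, "X1504VAP"),
		},
	},
	{ }
};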
Thank You.
I'm announcing the release of the 5.4.288 kernel.
All users of the 5.4 kernel series must upgrade.
The updated 5.4.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.4.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
block/blk-iocost.c | 24 ++++++++++++-
drivers/acpi/acpica/evxfregn.c | 2 -
drivers/acpi/resource.c | 6 +--
drivers/ata/sata_highbank.c | 1
drivers/net/ethernet/qualcomm/qca_spi.c | 26 ++++++--------
drivers/net/ethernet/qualcomm/qca_spi.h | 1
drivers/net/xen-netfront.c | 5 ++
drivers/usb/dwc2/hcd.c | 16 +++-----
drivers/usb/gadget/function/u_serial.c | 9 +++-
drivers/usb/host/ehci-sh.c | 9 +++-
drivers/usb/host/max3421-hcd.c | 16 ++++++--
fs/xfs/xfs_file.c | 8 ++++
include/net/lapb.h | 2 -
kernel/trace/trace_kprobe.c | 2 -
net/batman-adv/translation-table.c | 58 ++++++++++++++++++++++----------
net/core/sock_map.c | 1
net/sched/sch_netem.c | 22 ++++++++----
net/tipc/udp_media.c | 7 +++
sound/usb/quirks.c | 31 +++++++++++------
virt/kvm/arm/pmu.c | 1
21 files changed, 166 insertions(+), 83 deletions(-)
Dan Carpenter (1):
ALSA: usb-audio: Fix a DMA to stack memory bug
Daniil Tatianin (1):
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
Darrick J. Wong (1):
xfs: don't drop errno values when we fail to ficlone the entire range
Eric Dumazet (2):
tipc: fix NULL deref in cleanup_bearer()
net: lapb: increase LAPB_HEADER_LEN
Greg Kroah-Hartman (1):
Linux 5.4.288
Ilpo Järvinen (1):
ACPI: resource: Fix memory resource type union access
Joe Hattori (1):
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
Juergen Gross (1):
xen/netfront: fix crash when removing device
Lianqin Hu (1):
usb: gadget: u_serial: Fix the issue that gs_start_io crashed due to accessing null pointer
Mark Tomlinson (1):
usb: host: max3421-hcd: Correctly abort a USB request.
Martin Ottens (1):
net/sched: netem: account for backlog updates from child qdisc
Michal Luczaj (1):
bpf, sockmap: Fix update element with same
Nathan Chancellor (1):
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
Nikolay Kuratov (1):
tracing/kprobes: Skip symbol counting logic for module symbols in create_local_trace_kprobe()
Raghavendra Rao Ananta (1):
KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
Remi Pommarel (3):
batman-adv: Do not send uninitialized TT changes
batman-adv: Remove uninitialized data in full table TT response
batman-adv: Do not let TT changes list grows indefinitely
Stefan Wahren (3):
usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature
qca_spi: Fix clock speed for multiple QCA7000
qca_spi: Make driver probing reliable
Tejun Heo (2):
blk-iocost: clamp inuse and skip noops in __propagate_weights()
blk-iocost: fix weight updates of inner active iocgs
Vitalii Mordan (1):
usb: ehci-hcd: fix call balance of clocks handling routines
During the suspend/resume process all connectors are explicitly disabled
and then re-enabled. However, resume fails because of the connector_status
check:
[dpu error]connector not connected 3
[drm:drm_mode_config_helper_resume [drm_kms_helper]] *ERROR* Failed to resume (-22)
It doesn't make sense to check the writeback connector's connected status
(and other drivers don't perform such a check), so drop the check.
It wasn't a problem before commit 71174f362d67 ("drm/msm/dpu: move
writeback's atomic_check to dpu_writeback.c"), since the encoder's
atomic_check() is called under different conditions than the connector's
atomic_check() (e.g. it is not called if there is no connected CRTC or if
the corresponding connector is not a part of the new state).
Fixes: 71174f362d67 ("drm/msm/dpu: move writeback's atomic_check to dpu_writeback.c")
Cc: stable(a)vger.kernel.org
Reported-by: Leonard Lausen <leonard(a)lausen.nl>
Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/57
Tested-by: Leonard Lausen <leonard(a)lausen.nl> # on sc7180 lazor
Reported-by: György Kurucz <me(a)kuruczgy.com>
Link: https://lore.kernel.org/all/b70a4d1d-f98f-4169-942c-cb9006a42b40@kuruczgy.c…
Reported-by: Johan Hovold <johan+linaro(a)kernel.org>
Link: https://lore.kernel.org/all/ZzyYI8KkWK36FfXf@hovoldconsulting.com/
Tested-by: György Kurucz <me(a)kuruczgy.com>
Reviewed-by: Johan Hovold <johan+linaro(a)kernel.org>
Tested-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
---
Leonard Lausen reported an issue with suspend/resume of the sc7180
devices. Fix the WB atomic check, which caused the issue.
---
Changes in v4:
- Expanded commit message (Johan)
- Link to v3: https://lore.kernel.org/r/20241208-dpu-fix-wb-v3-1-a1de69ce4a1b@linaro.org
Changes in v3:
- Rebased on top of msm-fixes
- Link to v2: https://lore.kernel.org/r/20240802-dpu-fix-wb-v2-0-7eac9eb8e895@linaro.org
Changes in v2:
- Reworked the writeback to just drop the connector->status check.
- Expanded commit message for the debugging patch.
- Link to v1: https://lore.kernel.org/r/20240709-dpu-fix-wb-v1-0-448348bfd4cb@linaro.org
---
drivers/gpu/drm/msm/disp/dpu1/dpu_writeback.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_writeback.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_writeback.c
index 16f144cbc0c986ee266412223d9e605b01f9fb8c..8ff496082902b1ee713e806140f39b4730ed256a 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_writeback.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_writeback.c
@@ -42,9 +42,6 @@ static int dpu_wb_conn_atomic_check(struct drm_connector *connector,
if (!conn_state || !conn_state->connector) {
DPU_ERROR("invalid connector state\n");
return -EINVAL;
- } else if (conn_state->connector->status != connector_status_connected) {
- DPU_ERROR("connector not connected %d\n", conn_state->connector->status);
- return -EINVAL;
}
crtc = conn_state->crtc;
---
base-commit: 86313a9cd152330c634b25d826a281c6a002eb77
change-id: 20240709-dpu-fix-wb-6cd57e3eb182
Best regards,
--
Dmitry Baryshkov <dmitry.baryshkov(a)linaro.org>
From: Joshua Washington <joshwash(a)google.com>
This patch predicates the enabling and disabling of XSK pools on the
existence of queues. As it stands, if the interface is down, disabling
or enabling XSK pools would result in a crash, as the RX queue pointer
would be NULL. XSK pool registration will occur as part of the next
interface up.
Similarly, xsk_wakeup needs to be guarded against queues disappearing
while the function is executing, so a check against the
GVE_PRIV_FLAGS_NAPI_ENABLED flag is added to synchronize with the
disabling of the bit and the synchronize_net() in gve_turndown.
Fixes: fd8e40321a12 ("gve: Add AF_XDP zero-copy support for GQI-QPL format")
Cc: stable(a)vger.kernel.org
Signed-off-by: Joshua Washington <joshwash(a)google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Shailend Chand <shailend(a)google.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
drivers/net/ethernet/google/gve/gve_main.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 5d7b0cc59959..e4e8ff4f9f80 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -1623,8 +1623,8 @@ static int gve_xsk_pool_enable(struct net_device *dev,
if (err)
return err;
- /* If XDP prog is not installed, return */
- if (!priv->xdp_prog)
+ /* If XDP prog is not installed or interface is down, return. */
+ if (!priv->xdp_prog || !netif_running(dev))
return 0;
rx = &priv->rx[qid];
@@ -1669,21 +1669,16 @@ static int gve_xsk_pool_disable(struct net_device *dev,
if (qid >= priv->rx_cfg.num_queues)
return -EINVAL;
- /* If XDP prog is not installed, unmap DMA and return */
- if (!priv->xdp_prog)
+ /* If XDP prog is not installed or interface is down, unmap DMA and
+ * return.
+ */
+ if (!priv->xdp_prog || !netif_running(dev))
goto done;
- tx_qid = gve_xdp_tx_queue_id(priv, qid);
- if (!netif_running(dev)) {
- priv->rx[qid].xsk_pool = NULL;
- xdp_rxq_info_unreg(&priv->rx[qid].xsk_rxq);
- priv->tx[tx_qid].xsk_pool = NULL;
- goto done;
- }
-
napi_rx = &priv->ntfy_blocks[priv->rx[qid].ntfy_id].napi;
napi_disable(napi_rx); /* make sure current rx poll is done */
+ tx_qid = gve_xdp_tx_queue_id(priv, qid);
napi_tx = &priv->ntfy_blocks[priv->tx[tx_qid].ntfy_id].napi;
napi_disable(napi_tx); /* make sure current tx poll is done */
@@ -1711,6 +1706,9 @@ static int gve_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
struct gve_priv *priv = netdev_priv(dev);
int tx_queue_id = gve_xdp_tx_queue_id(priv, queue_id);
+ if (!gve_get_napi_enabled(priv))
+ return -ENETDOWN;
+
if (queue_id >= priv->rx_cfg.num_queues || !priv->xdp_prog)
return -EINVAL;
--
2.47.1.613.gc27f4b7a9f-goog
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
After commit
746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
amdgpu started seeing the following warning:
[ ] workqueue: WQ_MEM_RECLAIM sdma0:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu]
...
[ ] Workqueue: sdma0 drm_sched_run_job_work [gpu_sched]
...
[ ] Call Trace:
[ ] <TASK>
...
[ ] ? check_flush_dependency+0xf5/0x110
...
[ ] cancel_delayed_work_sync+0x6e/0x80
[ ] amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu]
[ ] amdgpu_ring_alloc+0x40/0x50 [amdgpu]
[ ] amdgpu_ib_schedule+0xf4/0x810 [amdgpu]
[ ] ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched]
[ ] amdgpu_job_run+0xaa/0x1f0 [amdgpu]
[ ] drm_sched_run_job_work+0x257/0x430 [gpu_sched]
[ ] process_one_work+0x217/0x720
...
[ ] </TASK>
The intent of the verification done in check_flush_dependency() is to ensure
forward progress during memory reclaim, by flagging cases when either a
memory reclaim process, or a memory reclaim work item is flushed from a
context not marked as memory reclaim safe.
This is correct when flushing, but when called from the
cancel(_delayed)_work_sync() paths it is a false positive because work is
either already running, or will not be running at all. Therefore
cancelling it is safe and we can relax the warning criteria by letting the
helper know of the calling context.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: fca839c00a12 ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
References: 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM")
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Lai Jiangshan <jiangshanlai(a)gmail.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Christian König <christian.koenig(a)amd.com
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v4.5+
---
kernel/workqueue.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9949ffad8df0..7abba81296cd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3680,23 +3680,27 @@ void workqueue_softirq_dead(unsigned int cpu)
* check_flush_dependency - check for flush dependency sanity
* @target_wq: workqueue being flushed
* @target_work: work item being flushed (NULL for workqueue flushes)
+ * @from_cancel: are we called from the work cancel path
*
* %current is trying to flush the whole @target_wq or @target_work on it.
- * If @target_wq doesn't have %WQ_MEM_RECLAIM, verify that %current is not
- * reclaiming memory or running on a workqueue which doesn't have
- * %WQ_MEM_RECLAIM as that can break forward-progress guarantee leading to
- * a deadlock.
+ * If this is not the cancel path (which implies work being flushed is either
+ * already running, or will not be at all), check if @target_wq doesn't have
+ * %WQ_MEM_RECLAIM and verify that %current is not reclaiming memory or running
+ * on a workqueue which doesn't have %WQ_MEM_RECLAIM as that can break forward-
+ * progress guarantee leading to a deadlock.
*/
static void check_flush_dependency(struct workqueue_struct *target_wq,
- struct work_struct *target_work)
+ struct work_struct *target_work,
+ bool from_cancel)
{
- work_func_t target_func = target_work ? target_work->func : NULL;
+ work_func_t target_func;
struct worker *worker;
- if (target_wq->flags & WQ_MEM_RECLAIM)
+ if (from_cancel || target_wq->flags & WQ_MEM_RECLAIM)
return;
worker = current_wq_worker();
+ target_func = target_work ? target_work->func : NULL;
WARN_ONCE(current->flags & PF_MEMALLOC,
"workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%ps",
@@ -3966,7 +3970,7 @@ void __flush_workqueue(struct workqueue_struct *wq)
list_add_tail(&this_flusher.list, &wq->flusher_overflow);
}
- check_flush_dependency(wq, NULL);
+ check_flush_dependency(wq, NULL, false);
mutex_unlock(&wq->mutex);
@@ -4141,7 +4145,7 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
}
wq = pwq->wq;
- check_flush_dependency(wq, work);
+ check_flush_dependency(wq, work, from_cancel);
insert_wq_barrier(pwq, barr, work, worker);
raw_spin_unlock_irq(&pool->lock);
--
2.47.1
Encoding file handles is usually performed by a filesystem's ->encode_fh()
method that may fail for various reasons.
The legacy users of exportfs_encode_fh(), namely, nfsd and
name_to_handle_at(2) syscall are ready to cope with the possibility
of failure to encode a file handle.
There are a few other users of exportfs_encode_{fh,fid}() that
currently have a WARN_ON() assertion when ->encode_fh() fails.
Relax those assertions because they are wrong.
The second linked bug report states commit 16aac5ad1fa9 ("ovl: support
encoding non-decodable file handles") in v6.6 as the regressing commit,
but this is not accurate.
The aforementioned commit only increases the chances of hitting the
assertion and allows triggering it with the reproducer using overlayfs,
inotify and drop_caches.
Triggering this assertion was always possible with other filesystems and
other reasons of ->encode_fh() failures and more particularly, it was
also possible with the exact same reproducer using overlayfs that is
mounted with options index=on,nfs_export=on also on kernels < v6.6.
Therefore, I am not listing the aforementioned commit as a Fixes commit.
Backport hint: this patch will have a trivial conflict applying to
v6.6.y, and other trivial conflicts applying to stable kernels < v6.6.
Reported-by: syzbot+ec07f6f5ce62b858579f(a)syzkaller.appspotmail.com
Tested-by: syzbot+ec07f6f5ce62b858579f(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-unionfs/671fd40c.050a0220.4735a.024f.GAE@goog…
Reported-by: Dmitry Safonov <dima(a)arista.com>
Closes: https://lore.kernel.org/linux-fsdevel/CAGrbwDTLt6drB9eaUagnQVgdPBmhLfqqxAf3…
Cc: stable(a)vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il(a)gmail.com>
---
Christian,
I could have submitted two independent patches to relax the assertion
in fsnotify and overlayfs via the fsnotify and overlayfs trees, but the
nature of the problem is the same and, in both cases, the problem became
worse with the introduction of non-decodable file handle support,
so I decided to fix them together and ask you to take the fix via the
vfs tree.
Please let me know if you think it should be done differently.
Thanks,
Amir.
fs/notify/fdinfo.c | 4 +---
fs/overlayfs/copy_up.c | 5 ++---
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index dec553034027e..e933f9c65d904 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -47,10 +47,8 @@ static void show_mark_fhandle(struct seq_file *m, struct inode *inode)
size = f->handle_bytes >> 2;
ret = exportfs_encode_fid(inode, (struct fid *)f->f_handle, &size);
- if ((ret == FILEID_INVALID) || (ret < 0)) {
- WARN_ONCE(1, "Can't encode file handler for inotify: %d\n", ret);
+ if ((ret == FILEID_INVALID) || (ret < 0))
return;
- }
f->handle_type = ret;
f->handle_bytes = size * sizeof(u32);
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 3601ddfeddc2e..56eee9f23ea9a 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -442,9 +442,8 @@ struct ovl_fh *ovl_encode_real_fh(struct ovl_fs *ofs, struct dentry *real,
buflen = (dwords << 2);
err = -EIO;
- if (WARN_ON(fh_type < 0) ||
- WARN_ON(buflen > MAX_HANDLE_SZ) ||
- WARN_ON(fh_type == FILEID_INVALID))
+ if (fh_type < 0 || fh_type == FILEID_INVALID ||
+ WARN_ON(buflen > MAX_HANDLE_SZ))
goto out_err;
fh->fb.version = OVL_FH_VERSION;
--
2.34.1
Hi,
This series fixes several suspend issues on Qcom platforms. Patch 1 fixes
the resume failure with spm_lvl=5 suspend on most Qcom platforms. For this
patch, I couldn't figure out the exact commit that caused the issue, so I
used the commit that introduced reinit support as a placeholder.
Patch 3 fixes the suspend issue on SM8550 and SM8650 platforms where UFS
PHY retention is not supported. Hence the default spm_lvl=3 suspend fails. So
this patch configures spm_lvl=5 as the default suspend level to force UFSHC/
device powerdown during suspend. This supersedes the previous series [1] that
tried to fix the issue in clock drivers.
This series is tested on Qcom SM8550 MTP and Qcom RB5 boards.
[1] https://lore.kernel.org/linux-arm-msm/20241107-ufs-clk-fix-v1-0-6032ff22a05…
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
---
Changes in v2:
- Changed 'ufs_qcom_drvdata::quirks' type to 'enum ufshcd_quirks'
- Collected tags
- Link to v1: https://lore.kernel.org/r/20241211-ufs-qcom-suspend-fix-v1-0-83ebbde76b1c@l…
---
Manivannan Sadhasivam (3):
scsi: ufs: qcom: Power off the PHY if it was already powered on in ufs_qcom_power_up_sequence()
scsi: ufs: qcom: Allow passing platform specific OF data
scsi: ufs: qcom: Power down the controller/device during system suspend for SM8550/SM8650 SoCs
drivers/ufs/core/ufshcd-priv.h | 6 ------
drivers/ufs/core/ufshcd.c | 1 -
drivers/ufs/host/ufs-qcom.c | 31 +++++++++++++++++++------------
drivers/ufs/host/ufs-qcom.h | 5 +++++
include/ufs/ufshcd.h | 2 --
5 files changed, 24 insertions(+), 21 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241211-ufs-qcom-suspend-fix-5618e9c56d93
Best regards,
--
Manivannan Sadhasivam <manivannan.sadhasivam(a)linaro.org>
From: Claudiu Beznea <claudiu.beznea.uj(a)bp.renesas.com>
In the sh-sci driver, sci_ports[0] is used by earlycon. If the earlycon is
still active when sci_probe() is called and the new serial port is supposed
to map to sci_ports[0], return -EBUSY to prevent breaking the earlycon.
This situation should only occur in debug scenarios, and users should be
aware of the potential conflict.
Fixes: 0b0cced19ab1 ("serial: sh-sci: Add CONFIG_SERIAL_EARLYCON support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj(a)bp.renesas.com>
---
drivers/tty/serial/sh-sci.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index 373195995d3b..e12fbc71082a 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -158,6 +158,7 @@ struct sci_port {
bool has_rtscts;
bool autorts;
bool tx_occurred;
+ bool earlycon;
};
#define SCI_NPORTS CONFIG_SERIAL_SH_SCI_NR_UARTS
@@ -3443,6 +3444,7 @@ static int sci_probe_single(struct platform_device *dev,
static int sci_probe(struct platform_device *dev)
{
struct plat_sci_port *p;
+ struct resource *res;
struct sci_port *sp;
unsigned int dev_id;
int ret;
@@ -3472,6 +3474,26 @@ static int sci_probe(struct platform_device *dev)
}
sp = &sci_ports[dev_id];
+
+ /*
+ * In case:
+ * - the probed port alias is zero (as the one used by earlycon), and
+ * - the earlycon is still active (e.g., "earlycon keep_bootcon" in
+ * bootargs)
+ *
+ * defer the probe of this serial. This is a debug scenario and the user
+ * must be aware of it.
+ *
+ * Except when the probed port is the same as the earlycon port.
+ */
+
+ res = platform_get_resource(dev, IORESOURCE_MEM, 0);
+ if (!res)
+ return -ENODEV;
+
+ if (sp->earlycon && res->start != sp->port.mapbase)
+ return dev_err_probe(&dev->dev, -EBUSY, "sci_port[0] is used by earlycon!\n");
+
platform_set_drvdata(dev, sp);
ret = sci_probe_single(dev, dev_id, p, sp);
@@ -3568,6 +3590,7 @@ static int __init early_console_setup(struct earlycon_device *device,
port_cfg.type = type;
sci_ports[0].cfg = &port_cfg;
sci_ports[0].params = sci_probe_regmap(&port_cfg);
+ sci_ports[0].earlycon = true;
port_cfg.scscr = sci_serial_in(&sci_ports[0].port, SCSCR);
sci_serial_out(&sci_ports[0].port, SCSCR,
SCSCR_RE | SCSCR_TE | port_cfg.scscr);
--
2.39.2
The reference count of the device incremented in device_initialize() is
not handled correctly when device_add() fails. Call only put_device()
before returning from the function to drop the reference count for
cleanup; otherwise it could cause a memory leak.
As the comment of device_add() says, if device_add() succeeds, you should
call device_del() when you want to get rid of the device. If device_add()
has not succeeded, use only put_device() to drop the reference count.
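A minimal sketch of that rule (a generic driver-core pattern for
illustration, not the ib_srp code itself):

#include <linux/device.h>

static int example_add(struct device *dev)
{
	int ret;

	device_initialize(dev);		/* takes the initial reference */
	ret = device_add(dev);
	if (ret)
		put_device(dev);	/* not device_del(): add never succeeded */
	return ret;
}

static void example_remove(struct device *dev)
{
	device_del(dev);		/* only valid after a successful device_add() */
	put_device(dev);
}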
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: c8e4c2397655 ("RDMA/srp: Rework the srp_add_port() error path")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
drivers/infiniband/ulp/srp/ib_srp.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 2916e77f589b..7289ae0b83ac 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3978,7 +3978,6 @@ static struct srp_host *srp_add_port(struct srp_device *device, u32 port)
return host;
put_host:
- device_del(&host->dev);
put_device(&host->dev);
return NULL;
}
--
2.25.1
From: James Morse <james.morse(a)arm.com>
commit 6685f5d572c22e1003e7c0d089afe1c64340ab1f upstream.
commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling. A previous patch supplied the missing trap
handling.
Existing VMs that have the MPAM field of ID_AA64PFR0_EL1 set need to
be migratable, but there is little point enabling the MPAM CPU
interface on new VMs until there is something a guest can do with it.
Clear the MPAM field from the guest's ID_AA64PFR0_EL1 and on hardware
that supports MPAM, politely ignore the VMMs attempts to set this bit.
Guests exposed to this bug have the sanitised value of the MPAM field,
so only the correct value needs to be ignored. This means the field
can continue to be used to block migration to incompatible hardware
(between MPAM=1 and MPAM=5), and the VMM can't rely on the field
being ignored.
Signed-off-by: James Morse <james.morse(a)arm.com>
Co-developed-by: Joey Gouly <joey.gouly(a)arm.com>
Signed-off-by: Joey Gouly <joey.gouly(a)arm.com>
Reviewed-by: Gavin Shan <gshan(a)redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi(a)huawei.com>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-7-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
[ joey: fixed up merge conflict, no ID_FILTERED macro in 6.6 ]
Signed-off-by: Joey Gouly <joey.gouly(a)arm.com>
Cc: stable(a)vger.kernel.org # 6.6.x
Cc: Vitaly Chikunov <vt(a)altlinux.org>
Link: https://lore.kernel.org/linux-arm-kernel/20241202045830.e4yy3nkvxtzaybxk@al…
---
This fixes an issue seen when using KVM with a 6.6 host kernel, and
newer (6.13+) kernels in the guest.
Tested with a stripped down version of set_id_regs from the original
patch series.
arch/arm64/kvm/sys_regs.c | 52 +++++++++++++++++++++++++++++++++++++--
1 file changed, 50 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 370a1a7bd369..2031703424ea 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1330,6 +1330,7 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTE);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_SME);
+ val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MPAM_frac);
break;
case SYS_ID_AA64ISAR1_EL1:
if (!vcpu_has_ptrauth(vcpu))
@@ -1472,6 +1473,13 @@ static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
val &= ~ID_AA64PFR0_EL1_AMU_MASK;
+ /*
+ * MPAM is disabled by default as KVM also needs a set of PARTID to
+ * program the MPAMVPMx_EL2 PARTID remapping registers with. But some
+ * older kernels let the guest see the ID bit.
+ */
+ val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
+
return val;
}
@@ -1560,6 +1568,42 @@ static int set_id_dfr0_el1(struct kvm_vcpu *vcpu,
return set_id_reg(vcpu, rd, val);
}
+static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd, u64 user_val)
+{
+ u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
+ u64 mpam_mask = ID_AA64PFR0_EL1_MPAM_MASK;
+
+ /*
+ * Commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits
+ * in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to
+ * guests, but didn't add trap handling. KVM doesn't support MPAM and
+ * always returns an UNDEF for these registers. The guest must see 0
+ * for this field.
+ *
+ * But KVM must also accept values from user-space that were provided
+ * by KVM. On CPUs that support MPAM, permit user-space to write
+ * the sanitizied value to ID_AA64PFR0_EL1.MPAM, but ignore this field.
+ */
+ if ((hw_val & mpam_mask) == (user_val & mpam_mask))
+ user_val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
+
+ return set_id_reg(vcpu, rd, user_val);
+}
+
+static int set_id_aa64pfr1_el1(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd, u64 user_val)
+{
+ u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
+ u64 mpam_mask = ID_AA64PFR1_EL1_MPAM_frac_MASK;
+
+ /* See set_id_aa64pfr0_el1 for comment about MPAM */
+ if ((hw_val & mpam_mask) == (user_val & mpam_mask))
+ user_val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK;
+
+ return set_id_reg(vcpu, rd, user_val);
+}
+
/*
* cpufeature ID register user accessors
*
@@ -2018,10 +2062,14 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_ID_AA64PFR0_EL1),
.access = access_id_reg,
.get_user = get_id_reg,
- .set_user = set_id_reg,
+ .set_user = set_id_aa64pfr0_el1,
.reset = read_sanitised_id_aa64pfr0_el1,
.val = ID_AA64PFR0_EL1_CSV2_MASK | ID_AA64PFR0_EL1_CSV3_MASK, },
- ID_SANITISED(ID_AA64PFR1_EL1),
+ { SYS_DESC(SYS_ID_AA64PFR1_EL1),
+ .access = access_id_reg,
+ .get_user = get_id_reg,
+ .set_user = set_id_aa64pfr1_el1,
+ .reset = kvm_read_sanitised_id_reg, },
ID_UNALLOCATED(4,2),
ID_UNALLOCATED(4,3),
ID_SANITISED(ID_AA64ZFR0_EL1),
--
2.25.1
Allow the guest to both clear and set the LFENCE_SERIALIZE bit as long as
it is set in the host. It is absolutely okay for the guest to set it if
LFENCE_RDTSC is supported but userspace left it cleared; and it is also
acceptable that the guest clears the bit even if this will actually have
no effect.
This fixes booting Windows in some configurations where it tries to set
the bit and hangs if it does not succeed.
Suggested-by: Sean Christopherson <seanjc(a)google.com>
Fixes: 74a0e79df68a ("KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value")
Cc: stable(a)vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky(a)amd.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
arch/x86/kvm/svm/svm.c | 9 ---------
1 file changed, 9 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd15cc635655..21dacd312779 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3201,15 +3201,6 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
if (data & ~supported_de_cfg)
return 1;
- /*
- * Don't let the guest change the host-programmed value. The
- * MSR is very model specific, i.e. contains multiple bits that
- * are completely unknown to KVM, and the one bit known to KVM
- * is simply a reflection of hardware capabilities.
- */
- if (!msr->host_initiated && data != svm->msr_decfg)
- return 1;
-
svm->msr_decfg = data;
break;
}
--
2.43.5
From: Claudiu Beznea <claudiu.beznea.uj(a)bp.renesas.com>
The port_cfg object is used by serial_console_write(), which serves as
the write function for the earlycon device. Marking port_cfg as __initdata
causes it to be freed after kernel initialization, resulting in earlycon
becoming unavailable thereafter. Remove the __initdata macro from port_cfg
to resolve this issue.
Fixes: dd076cffb8cd ("serial: sh-sci: Fix init data attribute for struct 'port_cfg'")
Cc: stable(a)vger.kernel.org
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj(a)bp.renesas.com>
---
drivers/tty/serial/sh-sci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/tty/serial/sh-sci.c b/drivers/tty/serial/sh-sci.c
index 924b803af440..4f5da3254420 100644
--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -3562,7 +3562,7 @@ sh_early_platform_init_buffer("earlyprintk", &sci_driver,
early_serial_buf, ARRAY_SIZE(early_serial_buf));
#endif
#ifdef CONFIG_SERIAL_SH_SCI_EARLYCON
-static struct plat_sci_port port_cfg __initdata;
+static struct plat_sci_port port_cfg;
static int __init early_console_setup(struct earlycon_device *device,
int type)
--
2.39.2
This series introduces the camera pipeline support for the
STM32MP25 SoC. The STM32MP25 has 3 pipelines, fed from a
single camera input which can be either parallel or CSI.
This series adds the basic support for the 1st pipe (dump),
which, in terms of features, is the same as the one featured on
the STM32MP13 SoC. It focuses on the introduction of the
CSI input stage for the DCMIPP, and the new CSI-specific
control code for the DCMIPP.
One of the DCMIPP subdevs, dcmipp_parallel, is now
renamed dcmipp_input since it allows controlling not only
the parallel interface but also the CSI interface.
Signed-off-by: Alain Volmat <alain.volmat(a)foss.st.com>
---
Changes in v4:
* stm32-dcmipp: correct patch 13/15 with clk error handling in
dcmipp_runtime_resume function
- Link to v3: https://lore.kernel.org/r/20241118-csi_dcmipp_mp25-v3-0-c1914afb0a0f@foss.s…
Changes in v3:
* stm32-csi: use clk_bulk api
* stm32-csi: perform reset control within the probe
- Link to v2: https://lore.kernel.org/r/20241105-csi_dcmipp_mp25-v2-0-b9fc8a7273c2@foss.s…
---
Alain Volmat (15):
media: stm32: dcmipp: correct dma_set_mask_and_coherent mask value
dt-bindings: media: add description of stm32 csi
media: stm32: csi: addition of the STM32 CSI driver
media: stm32: dcmipp: use v4l2_subdev_is_streaming
media: stm32: dcmipp: replace s_stream with enable/disable_streams
media: stm32: dcmipp: rename dcmipp_parallel into dcmipp_input
media: stm32: dcmipp: add support for csi input into dcmipp-input
media: stm32: dcmipp: add bayer 10~14 bits formats
media: stm32: dcmipp: add 1X16 RGB / YUV formats support
media: stm32: dcmipp: avoid duplicated format on enum in bytecap
media: stm32: dcmipp: fill media ctl hw_revision field
dt-bindings: media: add the stm32mp25 compatible of DCMIPP
media: stm32: dcmipp: add core support for the stm32mp25
arm64: dts: st: add csi & dcmipp node in stm32mp25
arm64: dts: st: enable imx335/csi/dcmipp pipeline on stm32mp257f-ev1
.../devicetree/bindings/media/st,stm32-dcmipp.yaml | 53 +-
.../bindings/media/st,stm32mp25-csi.yaml | 125 +++
MAINTAINERS | 8 +
arch/arm64/boot/dts/st/stm32mp251.dtsi | 23 +
arch/arm64/boot/dts/st/stm32mp257f-ev1.dts | 85 ++
drivers/media/platform/st/stm32/Kconfig | 14 +
drivers/media/platform/st/stm32/Makefile | 1 +
drivers/media/platform/st/stm32/stm32-csi.c | 1137 ++++++++++++++++++++
.../media/platform/st/stm32/stm32-dcmipp/Makefile | 2 +-
.../st/stm32/stm32-dcmipp/dcmipp-bytecap.c | 128 ++-
.../st/stm32/stm32-dcmipp/dcmipp-byteproc.c | 119 +-
.../platform/st/stm32/stm32-dcmipp/dcmipp-common.h | 4 +-
.../platform/st/stm32/stm32-dcmipp/dcmipp-core.c | 122 ++-
.../platform/st/stm32/stm32-dcmipp/dcmipp-input.c | 540 ++++++++++
.../st/stm32/stm32-dcmipp/dcmipp-parallel.c | 440 --------
15 files changed, 2224 insertions(+), 577 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241007-csi_dcmipp_mp25-7779601f57da
Best regards,
--
Alain Volmat <alain.volmat(a)foss.st.com>
This patchset fixes two (+1) bugs with the async controls for the uvc driver.
They were found while implementing the granular PM, but I am sending
them as separate patches so they can be reviewed sooner. They fix
real issues in the driver that need to be taken care of.
Signed-off-by: Ricardo Ribalda <ribalda(a)chromium.org>
---
Changes in v6:
- Swap order of patches
- Use uvc_ctrl_set_handle again
- Move loaded=0 to uvc_ctrl_status_event()
- Link to v5: https://lore.kernel.org/r/20241202-uvc-fix-async-v5-0-6658c1fe312b@chromium…
Changes in v5:
- Move set handle to the entity_commit
- Replace uvc_ctrl_set_handle with get/put_handle.
- Add a patch to flush the cache of async controls.
- Link to v4: https://lore.kernel.org/r/20241129-uvc-fix-async-v4-0-f23784dba80f@chromium…
Changes in v4:
- Fix implementation of uvc_ctrl_set_handle.
- Link to v3: https://lore.kernel.org/r/20241129-uvc-fix-async-v3-0-ab675ce66db7@chromium…
Changes in v3:
- change again! order of patches.
- Introduce uvc_ctrl_set_handle.
- Do not change ctrl->handle if it is not NULL.
Changes in v2:
- Annotate lockdep
- ctrl->handle != handle
- Change order of patches
- Move documentation of mutex
- Link to v1: https://lore.kernel.org/r/20241127-uvc-fix-async-v1-0-eb8722531b8c@chromium…
---
Ricardo Ribalda (5):
media: uvcvideo: Only save async fh if success
media: uvcvideo: Remove redundant NULL assignment
media: uvcvideo: Remove dangling pointers
media: uvcvideo: Annotate lock requirements for uvc_ctrl_set
media: uvcvideo: Flush the control cache when we get an event
drivers/media/usb/uvc/uvc_ctrl.c | 83 ++++++++++++++++++++++++++++++++++------
drivers/media/usb/uvc/uvc_v4l2.c | 2 +
drivers/media/usb/uvc/uvcvideo.h | 9 ++++-
3 files changed, 82 insertions(+), 12 deletions(-)
---
base-commit: 291a8d98186f0a704cb954855d2ae3233971f07d
change-id: 20241127-uvc-fix-async-2c9d40413ad8
Best regards,
--
Ricardo Ribalda <ribalda(a)chromium.org>
This is the start of the stable review cycle for the 5.4.288 release.
There are 24 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 19 Dec 2024 17:05:03 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.288-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.4.288-rc1
Dan Carpenter <dan.carpenter(a)linaro.org>
ALSA: usb-audio: Fix a DMA to stack memory bug
Juergen Gross <jgross(a)suse.com>
xen/netfront: fix crash when removing device
Nikolay Kuratov <kniv(a)yandex-team.ru>
tracing/kprobes: Skip symbol counting logic for module symbols in create_local_trace_kprobe()
Raghavendra Rao Ananta <rananta(a)google.com>
KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
Nathan Chancellor <nathan(a)kernel.org>
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
Tejun Heo <tj(a)kernel.org>
blk-iocost: fix weight updates of inner active iocgs
Tejun Heo <tj(a)kernel.org>
blk-iocost: clamp inuse and skip noops in __propagate_weights()
Daniil Tatianin <d-tatianin(a)yandex-team.ru>
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
Martin Ottens <martin.ottens(a)fau.de>
net/sched: netem: account for backlog updates from child qdisc
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Make driver probing reliable
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Fix clock speed for multiple QCA7000
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
ACPI: resource: Fix memory resource type union access
Eric Dumazet <edumazet(a)google.com>
net: lapb: increase LAPB_HEADER_LEN
Eric Dumazet <edumazet(a)google.com>
tipc: fix NULL deref in cleanup_bearer()
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not let TT changes list grows indefinitely
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Remove uninitialized data in full table TT response
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not send uninitialized TT changes
Michal Luczaj <mhal(a)rbox.co>
bpf, sockmap: Fix update element with same
Darrick J. Wong <djwong(a)kernel.org>
xfs: don't drop errno values when we fail to ficlone the entire range
Lianqin Hu <hulianqin(a)vivo.com>
usb: gadget: u_serial: Fix the issue that gs_start_io crashed due to accessing null pointer
Vitalii Mordan <mordan(a)ispras.ru>
usb: ehci-hcd: fix call balance of clocks handling routines
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
usb: host: max3421-hcd: Correctly abort a USB request.
-------------
Diffstat:
Makefile | 4 +--
block/blk-iocost.c | 24 ++++++++++++--
drivers/acpi/acpica/evxfregn.c | 2 --
drivers/acpi/resource.c | 6 ++--
drivers/ata/sata_highbank.c | 1 +
drivers/net/ethernet/qualcomm/qca_spi.c | 26 +++++++--------
drivers/net/ethernet/qualcomm/qca_spi.h | 1 -
drivers/net/xen-netfront.c | 5 ++-
drivers/usb/dwc2/hcd.c | 16 ++++-----
drivers/usb/gadget/function/u_serial.c | 9 +++--
drivers/usb/host/ehci-sh.c | 9 +++--
drivers/usb/host/max3421-hcd.c | 16 ++++++---
fs/xfs/xfs_file.c | 8 +++++
include/net/lapb.h | 2 +-
kernel/trace/trace_kprobe.c | 2 +-
net/batman-adv/translation-table.c | 58 +++++++++++++++++++++++----------
net/core/sock_map.c | 1 +
net/sched/sch_netem.c | 22 +++++++++----
net/tipc/udp_media.c | 7 +++-
sound/usb/quirks.c | 31 ++++++++++++------
virt/kvm/arm/pmu.c | 1 -
21 files changed, 167 insertions(+), 84 deletions(-)
The QSPI peripheral control and status registers are
accessible via the SoC's APB bus, whereas MMIO transactions'
data travels on the AHB bus.
Microchip documentation and even sample code from Atmel
emphasise the need for a memory barrier before the first
MMIO transaction to the AHB-connected QSPI, and before the
last write to its registers via APB. This is achieved by
the following lines in `atmel_qspi_transfer()`:
/* Dummy read of QSPI_IFR to synchronize APB and AHB accesses */
(void)atmel_qspi_read(aq, QSPI_IFR);
However, the current documentation makes no mention of
synchronization requirements in the other direction, i.e.
after the last data is written via AHB and before the first
register access on APB.
---
In our case, we were facing an issue where the QSPI peripheral
would cease to send any new CSR (nCS Rise) interrupts,
leading to a timeout in `atmel_qspi_wait_for_completion()`
and ultimately this panic in higher levels:
ubi0 error: ubi_io_write: error -110 while writing 63108 bytes to PEB 491:128, written 63104 bytes
After months of extensive research into the codebase, fiddling
around with the kgdb debugger, and back-and-forth with
Microchip, we came to the conclusion that the issue is
probably that the peripheral is still busy receiving on AHB
when the LASTXFER bit is written to its Control Register
on APB, therefore this write gets lost, and the peripheral
still thinks there is more data to come in the MMIO transfer.
This was first formulated when we noticed that doubling the
write() of QSPI_CR_LASTXFER seemed to solve the problem.
Ultimately, the solution is to introduce memory barriers
after the AHB-mapped MMIO transfers, to ensure ordering.
Fixes: d5433def3153 ("mtd: spi-nor: atmel-quadspi: Add spi-mem support to atmel-quadspi")
Cc: Hari.PrasathGE(a)microchip.com
Cc: Mahesh.Abotula(a)microchip.com
Cc: Marco.Cardellini(a)microchip.com
Cc: <stable(a)vger.kernel.org> # c0a0203cf579: ("spi: atmel-quadspi: Create `atmel_qspi_ops`"...)
Cc: <stable(a)vger.kernel.org> # 6.x.y
Signed-off-by: Bence Csókás <csokas.bence(a)prolan.hu>
---
drivers/spi/atmel-quadspi.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/atmel-quadspi.c b/drivers/spi/atmel-quadspi.c
index 73cf0c3f1477..96fc1c56a221 100644
--- a/drivers/spi/atmel-quadspi.c
+++ b/drivers/spi/atmel-quadspi.c
@@ -625,13 +625,20 @@ static int atmel_qspi_transfer(struct spi_mem *mem,
(void)atmel_qspi_read(aq, QSPI_IFR);
/* Send/Receive data */
- if (op->data.dir == SPI_MEM_DATA_IN)
+ if (op->data.dir == SPI_MEM_DATA_IN) {
memcpy_fromio(op->data.buf.in, aq->mem + offset,
op->data.nbytes);
- else
+
+ /* Synchronize AHB and APB accesses again */
+ rmb();
+ } else {
memcpy_toio(aq->mem + offset, op->data.buf.out,
op->data.nbytes);
+ /* Synchronize AHB and APB accesses again */
+ wmb();
+ }
+
/* Release the chip-select */
atmel_qspi_write(QSPI_CR_LASTXFER, aq, QSPI_CR);
--
2.34.1
The quilt patch titled
Subject: alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG
has been removed from the -mm tree. Its filename was
alloc_tag-fix-set_codetag_empty-when-config_mem_alloc_profiling_debug.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG
Date: Fri, 29 Nov 2024 16:14:23 -0800
It was recently noticed that set_codetag_empty() might be used not only to
mark NULL alloctag references as empty to avoid warnings but also to reset
valid tags (in clear_page_tag_ref()). Since set_codetag_empty() is
defined as NOOP for CONFIG_MEM_ALLOC_PROFILING_DEBUG=n, such use of
set_codetag_empty() leads to subtle bugs. Fix set_codetag_empty() for
CONFIG_MEM_ALLOC_PROFILING_DEBUG=n to reset the tag reference.
Link: https://lkml.kernel.org/r/20241130001423.1114965-2-surenb@google.com
Fixes: a8fc28dad6d5 ("alloc_tag: introduce clear_page_tag_ref() helper function")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Reported-by: David Wang <00107082(a)163.com>
Closes: https://lore.kernel.org/lkml/20241124074318.399027-1-00107082@163.com/
Cc: David Wang <00107082(a)163.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Sourav Panda <souravpanda(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/alloc_tag.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- a/include/linux/alloc_tag.h~alloc_tag-fix-set_codetag_empty-when-config_mem_alloc_profiling_debug
+++ a/include/linux/alloc_tag.h
@@ -63,7 +63,12 @@ static inline void set_codetag_empty(uni
#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
-static inline void set_codetag_empty(union codetag_ref *ref) {}
+
+static inline void set_codetag_empty(union codetag_ref *ref)
+{
+ if (ref)
+ ref->ct = NULL;
+}
#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
_
Patches currently in -mm which might be from surenb(a)google.com are
seqlock-add-raw_seqcount_try_begin.patch
mm-convert-mm_lock_seq-to-a-proper-seqcount.patch
mm-introduce-mmap_lock_speculate_try_beginretry.patch
The quilt patch titled
Subject: alloc_tag: fix module allocation tags populated area calculation
has been removed from the -mm tree. Its filename was
alloc_tag-fix-module-allocation-tags-populated-area-calculation.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: alloc_tag: fix module allocation tags populated area calculation
Date: Fri, 29 Nov 2024 16:14:22 -0800
vm_module_tags_populate()'s calculation of the populated area assumes that
the area starts at a page boundary, and therefore that when new pages are
allocated, the end of the area is page-aligned as well. If the start of the
area is not page-aligned, then allocating a page and incrementing the end of
the area by PAGE_SIZE leaves a region at the end, within the area boundary,
which is not populated. Accessing this area will lead to a kernel panic.
Fix the calculation by down-aligning the start of the area and using that
as the location allocated pages are mapped to.
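To make the off-by-misalignment concrete, here is a small user-space sketch
with made-up numbers and a simplified, power-of-two-only ALIGN_DOWN() (not
the kernel macro): the old size-based check sees nothing left to populate,
while the new end-based check maps the unpopulated tail.

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	unsigned long start_addr = 0x1800;	/* area starts mid-page */
	unsigned long size       = 0x1000;	/* area now ends at 0x2800 */
	unsigned long nr_pages   = 1;		/* one page already mapped, at 0x1000 */

	/* old check: populated size counted from start_addr */
	unsigned long phys_size = nr_pages * PAGE_SIZE;
	printf("old check populates more pages: %s\n",
	       phys_size < size ? "yes" : "no");	/* "no": 0x2000..0x2800 stays unmapped */

	/* new check: populated end counted from the aligned base of the area */
	unsigned long phys_end = ALIGN_DOWN(start_addr, PAGE_SIZE) + nr_pages * PAGE_SIZE;
	unsigned long new_end  = start_addr + size;
	printf("new check populates more pages: %s\n",
	       phys_end < new_end ? "yes" : "no");	/* "yes": the tail gets mapped */
	return 0;
}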
[gehao(a)kylinos.cn: fix vm_module_tags_populate's KASAN poisoning logic]
Link: https://lkml.kernel.org/r/20241205170528.81000-1-hao.ge@linux.dev
[gehao(a)kylinos.cn: fix panic when CONFIG_KASAN enabled and CONFIG_KASAN_VMALLOC not enabled]
Link: https://lkml.kernel.org/r/20241212072126.134572-1-hao.ge@linux.dev
Link: https://lkml.kernel.org/r/20241130001423.1114965-1-surenb@google.com
Fixes: 0f9b685626da ("alloc_tag: populate memory for module tags as needed")
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202411132111.6a221562-lkp@intel.com
Acked-by: Yu Zhao <yuzhao(a)google.com>
Tested-by: Adrian Huang <ahuang12(a)lenovo.com>
Cc: David Wang <00107082(a)163.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com>
Cc: Sourav Panda <souravpanda(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/alloc_tag.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)
--- a/lib/alloc_tag.c~alloc_tag-fix-module-allocation-tags-populated-area-calculation
+++ a/lib/alloc_tag.c
@@ -408,28 +408,52 @@ repeat:
static int vm_module_tags_populate(void)
{
- unsigned long phys_size = vm_module_tags->nr_pages << PAGE_SHIFT;
+ unsigned long phys_end = ALIGN_DOWN(module_tags.start_addr, PAGE_SIZE) +
+ (vm_module_tags->nr_pages << PAGE_SHIFT);
+ unsigned long new_end = module_tags.start_addr + module_tags.size;
- if (phys_size < module_tags.size) {
+ if (phys_end < new_end) {
struct page **next_page = vm_module_tags->pages + vm_module_tags->nr_pages;
- unsigned long addr = module_tags.start_addr + phys_size;
+ unsigned long old_shadow_end = ALIGN(phys_end, MODULE_ALIGN);
+ unsigned long new_shadow_end = ALIGN(new_end, MODULE_ALIGN);
unsigned long more_pages;
unsigned long nr;
- more_pages = ALIGN(module_tags.size - phys_size, PAGE_SIZE) >> PAGE_SHIFT;
+ more_pages = ALIGN(new_end - phys_end, PAGE_SIZE) >> PAGE_SHIFT;
nr = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_NOWARN,
NUMA_NO_NODE, more_pages, next_page);
if (nr < more_pages ||
- vmap_pages_range(addr, addr + (nr << PAGE_SHIFT), PAGE_KERNEL,
+ vmap_pages_range(phys_end, phys_end + (nr << PAGE_SHIFT), PAGE_KERNEL,
next_page, PAGE_SHIFT) < 0) {
/* Clean up and error out */
for (int i = 0; i < nr; i++)
__free_page(next_page[i]);
return -ENOMEM;
}
+
vm_module_tags->nr_pages += nr;
+
+ /*
+ * Kasan allocates 1 byte of shadow for every 8 bytes of data.
+ * When kasan_alloc_module_shadow allocates shadow memory,
+ * its unit of allocation is a page.
+ * Therefore, here we need to align to MODULE_ALIGN.
+ */
+ if (old_shadow_end < new_shadow_end)
+ kasan_alloc_module_shadow((void *)old_shadow_end,
+ new_shadow_end - old_shadow_end,
+ GFP_KERNEL);
}
+ /*
+ * Mark the pages as accessible, now that they are mapped.
+ * With hardware tag-based KASAN, marking is skipped for
+ * non-VM_ALLOC mappings, see __kasan_unpoison_vmalloc().
+ */
+ kasan_unpoison_vmalloc((void *)module_tags.start_addr,
+ new_end - module_tags.start_addr,
+ KASAN_VMALLOC_PROT_NORMAL);
+
return 0;
}
_
Patches currently in -mm which might be from surenb(a)google.com are
seqlock-add-raw_seqcount_try_begin.patch
mm-convert-mm_lock_seq-to-a-proper-seqcount.patch
mm-introduce-mmap_lock_speculate_try_beginretry.patch
The quilt patch titled
Subject: mm/codetag: clear tags before swap
has been removed from the -mm tree. Its filename was
mm-codetag-clear-tags-before-swap.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Wang <00107082(a)163.com>
Subject: mm/codetag: clear tags before swap
Date: Fri, 13 Dec 2024 09:33:32 +0800
When CONFIG_MEM_ALLOC_PROFILING_DEBUG is set, kernel WARN would be
triggered when calling __alloc_tag_ref_set() during swap:
alloc_tag was not cleared (got tag for mm/filemap.c:1951)
WARNING: CPU: 0 PID: 816 at ./include/linux/alloc_tag.h...
Clearing code tags before the swap fixes the warning. This patch also fixes
a potential invalid address dereference in alloc_tag_add_check() when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is set and ref->ct is CODETAG_EMPTY,
which is defined as ((void *)1).
Link: https://lkml.kernel.org/r/20241213013332.89910-1-00107082@163.com
Fixes: 51f43d5d82ed ("mm/codetag: swap tags when migrate pages")
Signed-off-by: David Wang <00107082(a)163.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412112227.df61ebb-lkp@intel.com
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kent Overstreet <kent.overstreet(a)linux.dev>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/alloc_tag.h | 2 +-
lib/alloc_tag.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)
--- a/include/linux/alloc_tag.h~mm-codetag-clear-tags-before-swap
+++ a/include/linux/alloc_tag.h
@@ -135,7 +135,7 @@ static inline struct alloc_tag_counters
#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag)
{
- WARN_ONCE(ref && ref->ct,
+ WARN_ONCE(ref && ref->ct && !is_codetag_empty(ref),
"alloc_tag was not cleared (got tag for %s:%u)\n",
ref->ct->filename, ref->ct->lineno);
--- a/lib/alloc_tag.c~mm-codetag-clear-tags-before-swap
+++ a/lib/alloc_tag.c
@@ -209,6 +209,13 @@ void pgalloc_tag_swap(struct folio *new,
return;
}
+ /*
+ * Clear tag references to avoid debug warning when using
+ * __alloc_tag_ref_set() with non-empty reference.
+ */
+ set_codetag_empty(&ref_old);
+ set_codetag_empty(&ref_new);
+
/* swap tags */
__alloc_tag_ref_set(&ref_old, tag_new);
update_page_tag_ref(handle_old, &ref_old);
_
Patches currently in -mm which might be from 00107082(a)163.com are
The quilt patch titled
Subject: mm: convert partially_mapped set/clear operations to be atomic
has been removed from the -mm tree. Its filename was
mm-convert-partially_mapped-set-clear-operations-to-be-atomic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Usama Arif <usamaarif642(a)gmail.com>
Subject: mm: convert partially_mapped set/clear operations to be atomic
Date: Thu, 12 Dec 2024 18:33:51 +0000
Other page flags in the 2nd page, like PG_hwpoison and PG_anon_exclusive,
can get modified concurrently. Changes to other page flags might be lost
if they happen at the same time as non-atomic partially_mapped
operations. Hence, make the partially_mapped operations atomic.
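As a purely illustrative, single-threaded step-through of the lost-update
interleaving that non-atomic set/clear allows (the flag names below are
stand-ins, not the kernel's pageflag machinery):

#include <stdio.h>

#define PARTIALLY_MAPPED	(1UL << 0)
#define HWPOISON		(1UL << 1)

int main(void)
{
	unsigned long flags = 0;

	/* CPU A, doing a non-atomic set (load/modify/store), loads the word... */
	unsigned long tmp = flags;

	/* ...meanwhile CPU B sets another flag in the same word atomically... */
	__atomic_fetch_or(&flags, HWPOISON, __ATOMIC_RELAXED);

	/* ...then CPU A ORs in its bit and stores the stale value back. */
	tmp |= PARTIALLY_MAPPED;
	flags = tmp;

	/* HWPOISON has been wiped out; only PARTIALLY_MAPPED survives. */
	printf("flags = %#lx\n", flags);
	return 0;
}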
Link: https://lkml.kernel.org/r/20241212183351.1345389-1-usamaarif642@gmail.com
Fixes: 8422acdc97ed ("mm: introduce a pageflag for partially mapped folios")
Reported-by: David Hildenbrand <david(a)redhat.com>
Link: https://lore.kernel.org/all/e53b04ad-1827-43a2-a1ab-864c7efecf6e@redhat.com/
Signed-off-by: Usama Arif <usamaarif642(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Domenico Cerasuolo <cerasuolodomenico(a)gmail.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Cc: Nico Pache <npache(a)redhat.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/page-flags.h | 12 ++----------
mm/huge_memory.c | 8 ++++----
2 files changed, 6 insertions(+), 14 deletions(-)
--- a/include/linux/page-flags.h~mm-convert-partially_mapped-set-clear-operations-to-be-atomic
+++ a/include/linux/page-flags.h
@@ -862,18 +862,10 @@ static inline void ClearPageCompound(str
ClearPageHead(page);
}
FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE)
-FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
-/*
- * PG_partially_mapped is protected by deferred_split split_queue_lock,
- * so its safe to use non-atomic set/clear.
- */
-__FOLIO_SET_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
-__FOLIO_CLEAR_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
+FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE)
#else
FOLIO_FLAG_FALSE(large_rmappable)
-FOLIO_TEST_FLAG_FALSE(partially_mapped)
-__FOLIO_SET_FLAG_NOOP(partially_mapped)
-__FOLIO_CLEAR_FLAG_NOOP(partially_mapped)
+FOLIO_FLAG_FALSE(partially_mapped)
#endif
#define PG_head_mask ((1UL << PG_head))
--- a/mm/huge_memory.c~mm-convert-partially_mapped-set-clear-operations-to-be-atomic
+++ a/mm/huge_memory.c
@@ -3577,7 +3577,7 @@ int split_huge_page_to_list_to_order(str
!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
if (folio_test_partially_mapped(folio)) {
- __folio_clear_partially_mapped(folio);
+ folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
@@ -3689,7 +3689,7 @@ bool __folio_unqueue_deferred_split(stru
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
if (folio_test_partially_mapped(folio)) {
- __folio_clear_partially_mapped(folio);
+ folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
@@ -3733,7 +3733,7 @@ void deferred_split_folio(struct folio *
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
if (partially_mapped) {
if (!folio_test_partially_mapped(folio)) {
- __folio_set_partially_mapped(folio);
+ folio_set_partially_mapped(folio);
if (folio_test_pmd_mappable(folio))
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
@@ -3826,7 +3826,7 @@ static unsigned long deferred_split_scan
} else {
/* We lost race with folio_put() */
if (folio_test_partially_mapped(folio)) {
- __folio_clear_partially_mapped(folio);
+ folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
_
Patches currently in -mm which might be from usamaarif642(a)gmail.com are
The quilt patch titled
Subject: nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
has been removed from the -mm tree. Its filename was
nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
Date: Fri, 13 Dec 2024 01:43:28 +0900
When block_invalidatepage was converted to block_invalidate_folio, the
fallback to block_invalidatepage in folio_invalidate() (used when the
address_space_operations method invalidatepage, currently
invalidate_folio, was not set) was removed.
Unfortunately, some pseudo-inodes in nilfs2 use empty_aops set by
inode_init_always_gfp() as is, or explicitly set it as their
address_space_operations. Therefore, with this change,
block_invalidatepage() is no longer called from folio_invalidate(), and as
a result, the buffer_head structures attached to these pages/folios are no
longer freed via try_to_free_buffers().
Thus, these buffer heads are now leaked by truncate_inode_pages(), which
cleans up the page cache from inode evict(), etc.
Three types of caches use empty_aops: gc inode caches and the DAT shadow
inode used by GC, and b-tree node caches. Of these, b-tree node caches
explicitly call invalidate_mapping_pages() during cleanup, which involves
calling try_to_free_buffers(), so the leak was not visible during normal
operation but worsened when GC was performed.
Fix this issue by using address_space_operations with invalidate_folio set
to block_invalidate_folio instead of empty_aops, which will ensure the
same behavior as before.
Link: https://lkml.kernel.org/r/20241212164556.21338-1-konishi.ryusuke@gmail.com
Fixes: 7ba13abbd31e ("fs: Turn block_invalidatepage into block_invalidate_folio")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/btnode.c | 1 +
fs/nilfs2/gcinode.c | 2 +-
fs/nilfs2/inode.c | 5 +++++
fs/nilfs2/nilfs.h | 1 +
4 files changed, 8 insertions(+), 1 deletion(-)
--- a/fs/nilfs2/btnode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/btnode.c
@@ -35,6 +35,7 @@ void nilfs_init_btnc_inode(struct inode
ii->i_flags = 0;
memset(&ii->i_bmap_data, 0, sizeof(struct nilfs_bmap));
mapping_set_gfp_mask(btnc_inode->i_mapping, GFP_NOFS);
+ btnc_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
}
void nilfs_btnode_cache_clear(struct address_space *btnc)
--- a/fs/nilfs2/gcinode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/gcinode.c
@@ -163,7 +163,7 @@ int nilfs_init_gcinode(struct inode *ino
inode->i_mode = S_IFREG;
mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
- inode->i_mapping->a_ops = &empty_aops;
+ inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
ii->i_flags = 0;
nilfs_bmap_init_gc(ii->i_bmap);
--- a/fs/nilfs2/inode.c~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/inode.c
@@ -276,6 +276,10 @@ const struct address_space_operations ni
.is_partially_uptodate = block_is_partially_uptodate,
};
+const struct address_space_operations nilfs_buffer_cache_aops = {
+ .invalidate_folio = block_invalidate_folio,
+};
+
static int nilfs_insert_inode_locked(struct inode *inode,
struct nilfs_root *root,
unsigned long ino)
@@ -681,6 +685,7 @@ struct inode *nilfs_iget_for_shadow(stru
NILFS_I(s_inode)->i_flags = 0;
memset(NILFS_I(s_inode)->i_bmap, 0, sizeof(struct nilfs_bmap));
mapping_set_gfp_mask(s_inode->i_mapping, GFP_NOFS);
+ s_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops;
err = nilfs_attach_btree_node_cache(s_inode);
if (unlikely(err)) {
--- a/fs/nilfs2/nilfs.h~nilfs2-fix-buffer-head-leaks-in-calls-to-truncate_inode_pages
+++ a/fs/nilfs2/nilfs.h
@@ -401,6 +401,7 @@ extern const struct file_operations nilf
extern const struct inode_operations nilfs_file_inode_operations;
extern const struct file_operations nilfs_file_operations;
extern const struct address_space_operations nilfs_aops;
+extern const struct address_space_operations nilfs_buffer_cache_aops;
extern const struct inode_operations nilfs_dir_inode_operations;
extern const struct inode_operations nilfs_special_inode_operations;
extern const struct inode_operations nilfs_symlink_inode_operations;
_
Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
The quilt patch titled
Subject: vmalloc: fix accounting with i915
has been removed from the -mm tree. Its filename was
vmalloc-fix-accounting-with-i915.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Subject: vmalloc: fix accounting with i915
Date: Wed, 11 Dec 2024 20:25:37 +0000
If the caller of vmap() specifies VM_MAP_PUT_PAGES (currently only the
i915 driver), we will decrement nr_vmalloc_pages and MEMCG_VMALLOC in
vfree(). These counters are incremented by vmalloc() but not by vmap() so
this will cause an underflow. Check the VM_MAP_PUT_PAGES flag before
decrementing either counter.
Link: https://lkml.kernel.org/r/20241211202538.168311-1-willy@infradead.org
Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Reviewed-by: Balbir Singh <balbirs(a)nvidia.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmalloc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/vmalloc.c~vmalloc-fix-accounting-with-i915
+++ a/mm/vmalloc.c
@@ -3374,7 +3374,8 @@ void vfree(const void *addr)
struct page *page = vm->pages[i];
BUG_ON(!page);
- mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+ if (!(vm->flags & VM_MAP_PUT_PAGES))
+ mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
/*
* High-order allocs for huge vmallocs are split, so
* can be freed as an array of order-0 allocations
@@ -3382,7 +3383,8 @@ void vfree(const void *addr)
__free_page(page);
cond_resched();
}
- atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
+ if (!(vm->flags & VM_MAP_PUT_PAGES))
+ atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
kvfree(vm->pages);
kfree(vm);
}
_
Patches currently in -mm which might be from willy(a)infradead.org are
mm-page_alloc-cache-page_zone-result-in-free_unref_page.patch
mm-make-alloc_pages_mpol-static.patch
mm-page_alloc-export-free_frozen_pages-instead-of-free_unref_page.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-post_alloc_hook.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-prep_new_page.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-get_page_from_freelist.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_cpuset_fallback.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_may_oom.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_direct_compact.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_direct_reclaim.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_slowpath.patch
mm-page_alloc-move-set_page_refcounted-to-end-of-__alloc_pages.patch
mm-page_alloc-add-__alloc_frozen_pages.patch
mm-mempolicy-add-alloc_frozen_pages.patch
slab-allocate-frozen-pages.patch
ocfs2-handle-a-symlink-read-error-correctly.patch
ocfs2-convert-ocfs2_page_mkwrite-to-use-a-folio.patch
ocfs2-pass-mmap_folio-around-instead-of-mmap_page.patch
ocfs2-convert-ocfs2_read_inline_data-to-take-a-folio.patch
ocfs2-use-a-folio-in-ocfs2_fast_symlink_read_folio.patch
ocfs2-remove-ocfs2_start_walk_page_trans-prototype.patch
iov_iter-remove-setting-of-page-index.patch
The quilt patch titled
Subject: mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
has been removed from the -mm tree. Its filename was
mm-page_alloc-dont-call-pfn_to_page-on-possibly-non-existent-pfn-in-split_large_buddy.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
Date: Tue, 10 Dec 2024 10:34:37 +0100
In split_large_buddy(), we might call pfn_to_page() on a PFN that might
not exist. In corner cases, such as when freeing the highest pageblock in
the last memory section, this could result, with CONFIG_SPARSEMEM &&
!CONFIG_SPARSEMEM_EXTREME, in __pfn_to_section() returning NULL and
__section_mem_map_addr() dereferencing that NULL pointer.
Let's fix it, and avoid doing a pfn_to_page() call for the first
iteration, where we already have the page.
So far this was found by code inspection, but let's just CC stable as the
fix is easy.
Link: https://lkml.kernel.org/r/20241210093437.174413-1-david@redhat.com
Fixes: fd919a85cd55 ("mm: page_isolation: prepare for hygienic freelists")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reported-by: Vlastimil Babka <vbabka(a)suse.cz>
Closes: https://lkml.kernel.org/r/e1a898ba-a717-4d20-9144-29df1a6c8813@suse.cz
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-dont-call-pfn_to_page-on-possibly-non-existent-pfn-in-split_large_buddy
+++ a/mm/page_alloc.c
@@ -1238,13 +1238,15 @@ static void split_large_buddy(struct zon
if (order > pageblock_order)
order = pageblock_order;
- while (pfn != end) {
+ do {
int mt = get_pfnblock_migratetype(page, pfn);
__free_one_page(page, pfn, zone, order, mt, fpi);
pfn += 1 << order;
+ if (pfn == end)
+ break;
page = pfn_to_page(pfn);
- }
+ } while (1);
}
static void free_one_page(struct zone *zone, struct page *page,
_
Patches currently in -mm which might be from david(a)redhat.com are
fs-proc-task_mmu-fix-pagemap-flags-with-pmd-thp-entries-on-32bit.patch
docs-tmpfs-update-the-large-folios-policy-for-tmpfs-and-shmem.patch
mm-memory_hotplug-move-debug_pagealloc_map_pages-into-online_pages_range.patch
mm-page_isolation-dont-pass-gfp-flags-to-isolate_single_pageblock.patch
mm-page_isolation-dont-pass-gfp-flags-to-start_isolate_page_range.patch
mm-page_alloc-make-__alloc_contig_migrate_range-static.patch
mm-page_alloc-sort-out-the-alloc_contig_range-gfp-flags-mess.patch
mm-page_alloc-forward-the-gfp-flags-from-alloc_contig_range-to-post_alloc_hook.patch
powernv-memtrace-use-__gfp_zero-with-alloc_contig_pages.patch
mm-hugetlb-dont-map-folios-writable-without-vm_write-when-copying-during-fork.patch
fs-proc-vmcore-convert-vmcore_cb_lock-into-vmcore_mutex.patch
fs-proc-vmcore-replace-vmcoredd_mutex-by-vmcore_mutex.patch
fs-proc-vmcore-disallow-vmcore-modifications-while-the-vmcore-is-open.patch
fs-proc-vmcore-prefix-all-pr_-with-vmcore.patch
fs-proc-vmcore-move-vmcore-definitions-out-of-kcoreh.patch
fs-proc-vmcore-factor-out-allocating-a-vmcore-range-and-adding-it-to-a-list.patch
fs-proc-vmcore-factor-out-freeing-a-list-of-vmcore-ranges.patch
fs-proc-vmcore-introduce-proc_vmcore_device_ram-to-detect-device-ram-ranges-in-2nd-kernel.patch
virtio-mem-mark-device-ready-before-registering-callbacks-in-kdump-mode.patch
virtio-mem-remember-usable-region-size.patch
virtio-mem-support-config_proc_vmcore_device_ram.patch
s390-kdump-virtio-mem-kdump-support-config_proc_vmcore_device_ram.patch
mm-page_alloc-dont-use-__gfp_hardwall-when-migrating-pages-via-alloc_contig.patch
mm-memory_hotplug-dont-use-__gfp_hardwall-when-migrating-pages-via-memory-offlining.patch
The quilt patch titled
Subject: nilfs2: prevent use of deleted inode
has been removed from the -mm tree. Its filename was
nilfs2-prevent-use-of-deleted-inode.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Edward Adam Davis <eadavis(a)qq.com>
Subject: nilfs2: prevent use of deleted inode
Date: Mon, 9 Dec 2024 15:56:52 +0900
syzbot reported a WARNING in nilfs_rmdir. [1]
Because the inode bitmap is corrupted, an inode with an inode number that
should exist as a ".nilfs" file was reassigned by nilfs_mkdir for "file0",
causing an inode duplication during execution. And this causes an
underflow of i_nlink in rmdir operations.
The inode is used twice by the same task to unmount and remove the
directories ".nilfs" and "file0", which triggers the warning in nilfs_rmdir.
To avoid this issue, check i_nlink in nilfs_iget(): if it is 0, it means
that this inode has been deleted, and iput() is executed to reclaim it.
[1]
WARNING: CPU: 1 PID: 5824 at fs/inode.c:407 drop_nlink+0xc4/0x110 fs/inode.c:407
...
Call Trace:
<TASK>
nilfs_rmdir+0x1b0/0x250 fs/nilfs2/namei.c:342
vfs_rmdir+0x3a3/0x510 fs/namei.c:4394
do_rmdir+0x3b5/0x580 fs/namei.c:4453
__do_sys_rmdir fs/namei.c:4472 [inline]
__se_sys_rmdir fs/namei.c:4470 [inline]
__x64_sys_rmdir+0x47/0x50 fs/namei.c:4470
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Link: https://lkml.kernel.org/r/20241209065759.6781-1-konishi.ryusuke@gmail.com
Fixes: d25006523d0b ("nilfs2: pathname operations")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: syzbot+9260555647a5132edd48(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9260555647a5132edd48
Tested-by: syzbot+9260555647a5132edd48(a)syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis(a)qq.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/nilfs2/inode.c | 8 +++++++-
fs/nilfs2/namei.c | 5 +++++
2 files changed, 12 insertions(+), 1 deletion(-)
--- a/fs/nilfs2/inode.c~nilfs2-prevent-use-of-deleted-inode
+++ a/fs/nilfs2/inode.c
@@ -544,8 +544,14 @@ struct inode *nilfs_iget(struct super_bl
inode = nilfs_iget_locked(sb, root, ino);
if (unlikely(!inode))
return ERR_PTR(-ENOMEM);
- if (!(inode->i_state & I_NEW))
+
+ if (!(inode->i_state & I_NEW)) {
+ if (!inode->i_nlink) {
+ iput(inode);
+ return ERR_PTR(-ESTALE);
+ }
return inode;
+ }
err = __nilfs_read_inode(sb, root, ino, inode);
if (unlikely(err)) {
--- a/fs/nilfs2/namei.c~nilfs2-prevent-use-of-deleted-inode
+++ a/fs/nilfs2/namei.c
@@ -67,6 +67,11 @@ nilfs_lookup(struct inode *dir, struct d
inode = NULL;
} else {
inode = nilfs_iget(dir->i_sb, NILFS_I(dir)->i_root, ino);
+ if (inode == ERR_PTR(-ESTALE)) {
+ nilfs_error(dir->i_sb,
+ "deleted inode referenced: %lu", ino);
+ return ERR_PTR(-EIO);
+ }
}
return d_splice_alias(inode, dentry);
_
Patches currently in -mm which might be from eadavis(a)qq.com are
The quilt patch titled
Subject: zram: fix uninitialized ZRAM not releasing backing device
has been removed from the -mm tree. Its filename was
zram-fix-uninitialized-zram-not-releasing-backing-device.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kairui Song <kasong(a)tencent.com>
Subject: zram: fix uninitialized ZRAM not releasing backing device
Date: Tue, 10 Dec 2024 00:57:16 +0800
Setting the backing device is done before ZRAM initialization. If we set
the backing device and then remove the ZRAM module without initializing
the device, the backing device reference will be leaked and the device
will be held forever.
Fix this by always resetting the ZRAM device fully on rmmod or reset store.
Link: https://lkml.kernel.org/r/20241209165717.94215-3-ryncsn@gmail.com
Fixes: 013bf95a83ec ("zram: add interface to specif backing device")
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reported-by: Desheng Wu <deshengwu(a)tencent.com>
Suggested-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
--- a/drivers/block/zram/zram_drv.c~zram-fix-uninitialized-zram-not-releasing-backing-device
+++ a/drivers/block/zram/zram_drv.c
@@ -1444,12 +1444,16 @@ static void zram_meta_free(struct zram *
size_t num_pages = disksize >> PAGE_SHIFT;
size_t index;
+ if (!zram->table)
+ return;
+
/* Free all pages that are still in this zram device */
for (index = 0; index < num_pages; index++)
zram_free_page(zram, index);
zs_destroy_pool(zram->mem_pool);
vfree(zram->table);
+ zram->table = NULL;
}
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
@@ -2326,11 +2330,6 @@ static void zram_reset_device(struct zra
zram->limit_pages = 0;
- if (!init_done(zram)) {
- up_write(&zram->init_lock);
- return;
- }
-
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
_
Patches currently in -mm which might be from kasong(a)tencent.com are
mm-memcontrol-avoid-duplicated-memcg-enable-check.patch
mm-swap_cgroup-remove-swap_cgroup_cmpxchg.patch
mm-swap_cgroup-remove-global-swap-cgroup-lock.patch
mm-swap_cgroup-decouple-swap-cgroup-recording-and-clearing.patch
The quilt patch titled
Subject: zram: refuse to use zero sized block device as backing device
has been removed from the -mm tree. Its filename was
zram-refuse-to-use-zero-sized-block-device-as-backing-device.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kairui Song <kasong(a)tencent.com>
Subject: zram: refuse to use zero sized block device as backing device
Date: Tue, 10 Dec 2024 00:57:15 +0800
Patch series "zram: fix backing device setup issue", v2.
This series fixes two bugs in the backing device setup:
- ZRAM should reject using a zero sized device (or the uninitialized ZRAM
device itself) as the backing device.
- Fix the backing device leak when removing an uninitialized ZRAM
device.
This patch (of 2):
Setting a zero sized block device as the backing device is pointless, and
one can easily create a recursive loop by setting the uninitialized ZRAM
device itself as its own backing device (zram0 is uninitialized):
echo /dev/zram0 > /sys/block/zram0/backing_dev
This is definitely a wrong config, and the module will pin itself; the
kernel should refuse to do so in the first place.
By refusing to use a zero sized device, we avoid misuse cases including
the one above.
Link: https://lkml.kernel.org/r/20241209165717.94215-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20241209165717.94215-2-ryncsn@gmail.com
Fixes: 013bf95a83ec ("zram: add interface to specif backing device")
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reported-by: Desheng Wu <deshengwu(a)tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/block/zram/zram_drv.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/drivers/block/zram/zram_drv.c~zram-refuse-to-use-zero-sized-block-device-as-backing-device
+++ a/drivers/block/zram/zram_drv.c
@@ -614,6 +614,12 @@ static ssize_t backing_dev_store(struct
}
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
+ /* Refuse to use zero sized device (also prevents self reference) */
+ if (!nr_pages) {
+ err = -EINVAL;
+ goto out;
+ }
+
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
if (!bitmap) {
_
Patches currently in -mm which might be from kasong(a)tencent.com are
mm-memcontrol-avoid-duplicated-memcg-enable-check.patch
mm-swap_cgroup-remove-swap_cgroup_cmpxchg.patch
mm-swap_cgroup-remove-global-swap-cgroup-lock.patch
mm-swap_cgroup-decouple-swap-cgroup-recording-and-clearing.patch
The quilt patch titled
Subject: mm: use aligned address in copy_user_gigantic_page()
has been removed from the -mm tree. Its filename was
mm-use-aligned-address-in-copy_user_gigantic_page.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Subject: mm: use aligned address in copy_user_gigantic_page()
Date: Mon, 28 Oct 2024 22:56:56 +0800
In the current kernel, hugetlb_wp() calls copy_user_large_folio() with the
fault address, which may not be aligned with the huge page size. Then,
copy_user_large_folio() may call copy_user_gigantic_page() with that
address, while copy_user_gigantic_page() requires the address to be huge
page size aligned. So, this may cause memory corruption or an information
leak. Additionally, use the more obvious name 'addr_hint' instead of
'addr' for copy_user_gigantic_page().
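For illustration only, a minimal user-space sketch (made-up addresses, and a
simplified power-of-two ALIGN_DOWN() rather than the kernel macro) of the
alignment that copy_user_gigantic_page() now derives from 'addr_hint':

#include <stdio.h>

#define ALIGN_DOWN(x, a)	((x) & ~((unsigned long)(a) - 1))

int main(void)
{
	unsigned long sz = 2UL << 20;			/* assume a 2 MiB huge page */
	unsigned long addr_hint = 0x7f1234701000UL;	/* fault address inside the folio */
	unsigned long addr = ALIGN_DOWN(addr_hint, sz);

	/* The copy loop must walk pages from the folio base, not from the fault. */
	printf("fault at %#lx, folio base %#lx\n", addr_hint, addr);
	return 0;
}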
Link: https://lkml.kernel.org/r/20241028145656.932941-2-wangkefeng.wang@huawei.com
Fixes: 530dd9926dc1 ("mm: memory: improve copy_user_large_folio()")
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 5 ++---
mm/memory.c | 5 +++--
2 files changed, 5 insertions(+), 5 deletions(-)
--- a/mm/hugetlb.c~mm-use-aligned-address-in-copy_user_gigantic_page
+++ a/mm/hugetlb.c
@@ -5340,7 +5340,7 @@ again:
break;
}
ret = copy_user_large_folio(new_folio, pte_folio,
- ALIGN_DOWN(addr, sz), dst_vma);
+ addr, dst_vma);
folio_put(pte_folio);
if (ret) {
folio_put(new_folio);
@@ -6643,8 +6643,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_
*foliop = NULL;
goto out;
}
- ret = copy_user_large_folio(folio, *foliop,
- ALIGN_DOWN(dst_addr, size), dst_vma);
+ ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma);
folio_put(*foliop);
*foliop = NULL;
if (ret) {
--- a/mm/memory.c~mm-use-aligned-address-in-copy_user_gigantic_page
+++ a/mm/memory.c
@@ -6852,13 +6852,14 @@ void folio_zero_user(struct folio *folio
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
- unsigned long addr,
+ unsigned long addr_hint,
struct vm_area_struct *vma,
unsigned int nr_pages)
{
- int i;
+ unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(dst));
struct page *dst_page;
struct page *src_page;
+ int i;
for (i = 0; i < nr_pages; i++) {
dst_page = folio_page(dst, i);
_
Patches currently in -mm which might be from wangkefeng.wang(a)huawei.com are
mm-dont-try-thp-align-for-fs-without-get_unmapped_area.patch
The quilt patch titled
Subject: mm: use aligned address in clear_gigantic_page()
has been removed from the -mm tree. Its filename was
mm-use-aligned-address-in-clear_gigantic_page.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Subject: mm: use aligned address in clear_gigantic_page()
Date: Mon, 28 Oct 2024 22:56:55 +0800
In the current kernel, hugetlb_no_page() calls folio_zero_user() with the
fault address, which may not be aligned with the huge page size. Then,
folio_zero_user() may call clear_gigantic_page() with that address, while
clear_gigantic_page() requires the address to be huge page size aligned.
So, this may cause memory corruption or an information leak. Additionally,
use the more obvious name 'addr_hint' instead of 'addr' for
clear_gigantic_page().
Link: https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com
Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()")
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang(a)intel.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hugetlbfs/inode.c | 2 +-
mm/memory.c | 3 ++-
2 files changed, 3 insertions(+), 2 deletions(-)
--- a/fs/hugetlbfs/inode.c~mm-use-aligned-address-in-clear_gigantic_page
+++ a/fs/hugetlbfs/inode.c
@@ -825,7 +825,7 @@ static long hugetlbfs_fallocate(struct f
error = PTR_ERR(folio);
goto out;
}
- folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size));
+ folio_zero_user(folio, addr);
__folio_mark_uptodate(folio);
error = hugetlb_add_to_page_cache(folio, mapping, index);
if (unlikely(error)) {
--- a/mm/memory.c~mm-use-aligned-address-in-clear_gigantic_page
+++ a/mm/memory.c
@@ -6815,9 +6815,10 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr,
+static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
unsigned int nr_pages)
{
+ unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
int i;
might_sleep();
_
Patches currently in -mm which might be from wangkefeng.wang(a)huawei.com are
mm-dont-try-thp-align-for-fs-without-get_unmapped_area.patch
The quilt patch titled
Subject: ocfs2: fix the space leak in LA when releasing LA
has been removed from the -mm tree. Its filename was
ocfs2-fix-the-space-leak-in-la-when-releasing-la.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Heming Zhao <heming.zhao(a)suse.com>
Subject: ocfs2: fix the space leak in LA when releasing LA
Date: Thu, 5 Dec 2024 18:48:33 +0800
Commit 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()")
introduced an issue: ocfs2_sync_local_to_main() ignores the last
contiguous run of free bits, which causes an OCFS2 volume to lose the
last free clusters of the LA window during the release routine.
Please note, because commit dfe6c5692fb5 ("ocfs2: fix the la space leak
when unmounting an ocfs2 volume") was reverted, this commit is a
replacement fix for commit dfe6c5692fb5.
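A user-space sketch of the problem (the loop shape follows 30dd3478c3cd, but
the flush logic is paraphrased and the bitmap helpers and contents are made
up): the trailing run of free bits is counted yet never handed back.

#include <stdio.h>

/* Trivial stand-in for ocfs2_find_next_zero_bit(). */
static int find_next_zero_bit(const int *bitmap, int size, int start)
{
	for (int i = start; i < size; i++)
		if (!bitmap[i])
			return i;
	return size;
}

int main(void)
{
	/* 1 = allocated, 0 = free; the last three bits are a free run. */
	int bitmap[] = { 1, 1, 0, 0, 1, 0, 0, 0 };
	int left = 8, start = 0, count = 0, bit_off;

	while ((bit_off = find_next_zero_bit(bitmap, left, start)) < left) {
		if (bit_off == start) {
			count++;
			start++;
			continue;
		}
		if (count)
			printf("freed run of %d bits ending at %d\n", count, start - 1);
		count = 1;
		start = bit_off + 1;
	}
	/* The run counted at bits 5..7 never reaches the "freed" path above. */
	printf("lost run of %d bits ending at %d\n", count, start - 1);
	return 0;
}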
Link: https://lkml.kernel.org/r/20241205104835.18223-3-heming.zhao@suse.com
Fixes: 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()")
Signed-off-by: Heming Zhao <heming.zhao(a)suse.com>
Suggested-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/localalloc.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--- a/fs/ocfs2/localalloc.c~ocfs2-fix-the-space-leak-in-la-when-releasing-la
+++ a/fs/ocfs2/localalloc.c
@@ -971,9 +971,9 @@ static int ocfs2_sync_local_to_main(stru
start = count = 0;
left = le32_to_cpu(alloc->id1.bitmap1.i_total);
- while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
- left) {
- if (bit_off == start) {
+ while (1) {
+ bit_off = ocfs2_find_next_zero_bit(bitmap, left, start);
+ if ((bit_off < left) && (bit_off == start)) {
count++;
start++;
continue;
@@ -998,6 +998,8 @@ static int ocfs2_sync_local_to_main(stru
}
}
+ if (bit_off >= left)
+ break;
count = 1;
start = bit_off + 1;
}
_
Patches currently in -mm which might be from heming.zhao(a)suse.com are
The quilt patch titled
Subject: ocfs2: revert "ocfs2: fix the la space leak when unmounting an ocfs2 volume"
has been removed from the -mm tree. Its filename was
ocfs2-revert-ocfs2-fix-the-la-space-leak-when-unmounting-an-ocfs2-volume.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Heming Zhao <heming.zhao(a)suse.com>
Subject: ocfs2: revert "ocfs2: fix the la space leak when unmounting an ocfs2 volume"
Date: Thu, 5 Dec 2024 18:48:32 +0800
Patch series "Revert ocfs2 commit dfe6c5692fb5 and provide a new fix".
SUSE QA team detected a mistake in my commit dfe6c5692fb5 ("ocfs2: fix the
la space leak when unmounting an ocfs2 volume"). I am very sorry for my
error. (If my eyes are correct) From the mailing list mails, this patch
shouldn't be applied to 4.19, 5.4, 5.10, 5.15, 6.1 and 6.6, and these
branches should perform a revert operation.
Reason for revert:
In commit dfe6c5692fb5, I mistakenly wrote: "This bug has existed since
the initial OCFS2 code.". The statement is wrong. The correct
introduction commit is 30dd3478c3cd. IOW, if the branch doesn't include
30dd3478c3cd, dfe6c5692fb5 should also not be included.
This reverts commit dfe6c5692fb5 ("ocfs2: fix the la space leak when
unmounting an ocfs2 volume").
In commit dfe6c5692fb5, the commit log "This bug has existed since the
initial OCFS2 code." is wrong. The correct introduction commit is
30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()").
Commit dfe6c5692fb5 does provide a correct fix for the latest kernel;
however, it shouldn't be pushed to stable branches. Let's use this commit
to revert dfe6c5692fb5 from all branches that include it, and use a new
fix for the issue introduced by commit 30dd3478c3cd.
Link: https://lkml.kernel.org/r/20241205104835.18223-1-heming.zhao@suse.com
Link: https://lkml.kernel.org/r/20241205104835.18223-2-heming.zhao@suse.com
Fixes: dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume")
Signed-off-by: Heming Zhao <heming.zhao(a)suse.com>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/localalloc.c | 19 -------------------
1 file changed, 19 deletions(-)
--- a/fs/ocfs2/localalloc.c~ocfs2-revert-ocfs2-fix-the-la-space-leak-when-unmounting-an-ocfs2-volume
+++ a/fs/ocfs2/localalloc.c
@@ -1002,25 +1002,6 @@ static int ocfs2_sync_local_to_main(stru
start = bit_off + 1;
}
- /* clear the contiguous bits until the end boundary */
- if (count) {
- blkno = la_start_blk +
- ocfs2_clusters_to_blocks(osb->sb,
- start - count);
-
- trace_ocfs2_sync_local_to_main_free(
- count, start - count,
- (unsigned long long)la_start_blk,
- (unsigned long long)blkno);
-
- status = ocfs2_release_clusters(handle,
- main_bm_inode,
- main_bm_bh, blkno,
- count);
- if (status < 0)
- mlog_errno(status);
- }
-
bail:
if (status)
mlog_errno(status);
_
Patches currently in -mm which might be from heming.zhao(a)suse.com are
The quilt patch titled
Subject: selftests/memfd: run sysctl tests when PID namespace support is enabled
has been removed from the -mm tree. Its filename was
selftests-memfd-run-sysctl-tests-when-pid-namespace-support-is-enabled.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: "Isaac J. Manjarres" <isaacmanjarres(a)google.com>
Subject: selftests/memfd: run sysctl tests when PID namespace support is enabled
Date: Thu, 5 Dec 2024 11:29:41 -0800
The sysctl tests for vm.memfd_noexec rely on the kernel to support PID
namespaces (i.e. the kernel is built with CONFIG_PID_NS=y). If the
kernel the test runs on does not support PID namespaces, the first sysctl
test fails when attempting to spawn a new thread in a new PID namespace
and aborts the test, preventing the remaining tests from being run.
This is not desirable, as kernels without PID namespace support can still
use the other features provided by memfd. Therefore, only run the
sysctl tests if the kernel supports PID namespaces. Otherwise, skip those
tests and emit an informative message to let the user know why the sysctl
tests are not being run.
Link: https://lkml.kernel.org/r/20241205192943.3228757-1-isaacmanjarres@google.com
Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC")
Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com>
Reviewed-by: Jeff Xu <jeffxu(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Kalesh Singh <kaleshsingh(a)google.com>
Cc: <stable(a)vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
tools/testing/selftests/memfd/memfd_test.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
--- a/tools/testing/selftests/memfd/memfd_test.c~selftests-memfd-run-sysctl-tests-when-pid-namespace-support-is-enabled
+++ a/tools/testing/selftests/memfd/memfd_test.c
@@ -9,6 +9,7 @@
#include <fcntl.h>
#include <linux/memfd.h>
#include <sched.h>
+#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
@@ -1557,6 +1558,11 @@ static void test_share_fork(char *banner
close(fd);
}
+static bool pid_ns_supported(void)
+{
+ return access("/proc/self/ns/pid", F_OK) == 0;
+}
+
int main(int argc, char **argv)
{
pid_t pid;
@@ -1591,8 +1597,12 @@ int main(int argc, char **argv)
test_seal_grow();
test_seal_resize();
- test_sysctl_simple();
- test_sysctl_nested();
+ if (pid_ns_supported()) {
+ test_sysctl_simple();
+ test_sysctl_nested();
+ } else {
+ printf("PID namespaces are not supported; skipping sysctl tests\n");
+ }
test_share_dup("SHARE-DUP", "");
test_share_mmap("SHARE-MMAP", "");
_
Patches currently in -mm which might be from isaacmanjarres(a)google.com are
The patch titled
Subject: ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
has been added to the -mm mm-nonmm-unstable branch. Its filename is
ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-nonmm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Dennis Lam <dennis.lamerice(a)gmail.com>
Subject: ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
Date: Tue, 17 Dec 2024 21:39:25 -0500
When mounting ocfs2 and then remounting it as read-only, a
slab-use-after-free occurs when the user then issues a
quota_getnextquota syscall. Specifically, sb_dqinfo(sb, type)->dqi_priv
is the dangling pointer.
During the remounting process, the pointer dqi_priv is freed but is never
set to NULL, leaving it to be accessed. Additionally, the read-only
option for remounting sets the DQUOT_SUSPENDED flag instead of setting the
DQUOT_USAGE_ENABLED flags. Moreover, later in the process of getting the
next quota, the function ocfs2_get_next_id is called and only checks the
quota usage flags and not the quota suspended flags.
To fix this, I set dqi_priv to null when it is freed after remounting with
read-only and put a check for DQUOT_SUSPENDED in ocfs2_get_next_id.
Link: https://lkml.kernel.org/r/20241218023924.22821-2-dennis.lamerice@gmail.com
Fixes: 8f9e8f5fcc05 ("ocfs2: Fix Q_GETNEXTQUOTA for filesystem without quotas")
Signed-off-by: Dennis Lam <dennis.lamerice(a)gmail.com>
Reported-by: syzbot+d173bf8a5a7faeede34c(a)syzkaller.appspotmail.com
Tested-by: syzbot+d173bf8a5a7faeede34c(a)syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6731d26f.050a0220.1fb99c.014b.GAE@google.com/T/
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/quota_global.c | 2 +-
fs/ocfs2/quota_local.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
--- a/fs/ocfs2/quota_global.c~ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv
+++ a/fs/ocfs2/quota_global.c
@@ -893,7 +893,7 @@ static int ocfs2_get_next_id(struct supe
int status = 0;
trace_ocfs2_get_next_id(from_kqid(&init_user_ns, *qid), type);
- if (!sb_has_quota_loaded(sb, type)) {
+ if (!sb_has_quota_active(sb, type)){
status = -ESRCH;
goto out;
}
--- a/fs/ocfs2/quota_local.c~ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv
+++ a/fs/ocfs2/quota_local.c
@@ -867,6 +867,7 @@ out:
brelse(oinfo->dqi_libh);
brelse(oinfo->dqi_lqi_bh);
kfree(oinfo);
+ info->dqi_priv = NULL;
return status;
}
_
Patches currently in -mm which might be from dennis.lamerice(a)gmail.com are
ocfs2-fix-slab-use-after-free-due-to-dangling-pointer-dqi_priv.patch
From: Steven Rostedt <rostedt(a)goodmis.org>
The TP_printk() of a TRACE_EVENT() is a generic printf format that any
developer can create for their event. It may include pointers to strings
and such. A boot mapped buffer may contain data from a previous kernel
where the string addresses are different.
One solution is to copy the event content and update the pointers by the
recorded delta, but a simpler solution (for now) is to just use the
print_fields() function to print these events. The print_fields() function
just iterates the fields and prints them according to what type they are,
and ignores the TP_printk() format from the event itself.
To understand the difference, when printing via TP_printk() the output
looks like this:
4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache
But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields)
the same event output looks like this:
4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0)
4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home
Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..6581cb2bc67f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4353,6 +4353,15 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
if (event) {
if (tr->trace_flags & TRACE_ITER_FIELDS)
return print_event_fields(iter, event);
+ /*
+ * For TRACE_EVENT() events, the print_fmt is not
+ * safe to use if the array has delta offsets
+ * Force printing via the fields.
+ */
+ if ((tr->text_delta || tr->data_delta) &&
+ event->type > __TRACE_LAST_TYPE)
+ return print_event_fields(iter, event);
+
return event->funcs->trace(iter, sym_flags, event);
}
--
2.45.2
Currently, io_uring_unreg_ringfd() (which cleans up registered rings) is
only called on exit, but __io_uring_free (which frees the tctx in which the
registered ring pointers are stored) is also called on execve (via
begin_new_exec -> io_uring_task_cancel -> __io_uring_cancel ->
io_uring_cancel_generic -> __io_uring_free).
This means: A process going through execve while having registered rings
will leak references to the rings' `struct file`.
Fix it by zapping registered rings on execve(). This is implemented by
moving the io_uring_unreg_ringfd() from io_uring_files_cancel() into its
callee __io_uring_cancel(), which is called from io_uring_task_cancel() on
execve.
This could probably be exploited *on 32-bit kernels* by leaking 2^32
references to the same ring, because the file refcount is stored in a
pointer-sized field and get_file() doesn't have protection against
refcount overflow, just a WARN_ONCE(); but on 64-bit it should have no
impact beyond a memory leak.
Cc: stable(a)vger.kernel.org
Fixes: e7a6c00dc77a ("io_uring: add support for registering ring file descriptors")
Signed-off-by: Jann Horn <jannh(a)google.com>
---
testcase:
```
#define _GNU_SOURCE
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>
#define SYSCHK(x) ({ \
typeof(x) __res = (x); \
if (__res == (typeof(x))-1) \
err(1, "SYSCHK(" #x ")"); \
__res; \
})
#define NUM_SQ_PAGES 4
static int uring_fd;
int main(void) {
int memfd_sq = SYSCHK(memfd_create("", 0));
int memfd_cq = SYSCHK(memfd_create("", 0));
SYSCHK(ftruncate(memfd_sq, NUM_SQ_PAGES * 0x1000));
SYSCHK(ftruncate(memfd_cq, NUM_SQ_PAGES * 0x1000));
// sq
void *sq_data = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd_sq, 0));
// cq
void *cq_data = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd_cq, 0));
*(volatile unsigned int *)(cq_data+4) = 64 * NUM_SQ_PAGES;
close(memfd_sq);
close(memfd_cq);
// initialize uring
struct io_uring_params params = {
.flags = IORING_SETUP_REGISTERED_FD_ONLY|IORING_SETUP_NO_MMAP|IORING_SETUP_NO_SQARRAY,
.sq_off = { .user_addr = (unsigned long)sq_data },
.cq_off = { .user_addr = (unsigned long)cq_data }
};
uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
// re-execute
execl("/proc/self/exe", NULL);
err(1, "execve");
}
```
Leave it running for a while and monitor `grep ^filp /proc/slabinfo` and
memory usage - without the patch, both will go up slowly but steadily.
---
include/linux/io_uring.h | 4 +---
io_uring/io_uring.c | 1 +
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index e123d5e17b526148054872ee513f665adea80eb6..85fe4e6b275c7de260ea9a8552b8e1c3e7f7e5ec 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -15,10 +15,8 @@ bool io_is_uring_fops(struct file *file);
static inline void io_uring_files_cancel(void)
{
- if (current->io_uring) {
- io_uring_unreg_ringfd();
+ if (current->io_uring)
__io_uring_cancel(false);
- }
}
static inline void io_uring_task_cancel(void)
{
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 06ff41484e29c6e7d8779bd7ff8317ebae003a8d..b1e888999cea137e043bf00372576861182857ac 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3214,6 +3214,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
void __io_uring_cancel(bool cancel_all)
{
+ io_uring_unreg_ringfd();
io_uring_cancel_generic(cancel_all, NULL);
}
---
base-commit: aef25be35d23ec768eed08bfcf7ca3cf9685bc28
change-id: 20241218-uring-reg-ring-cleanup-15dd7343194a
--
Jann Horn <jannh(a)google.com>
Commit b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
avoids checking number_of_same_symbols() for module symbols in
__trace_kprobe_create(), but create_local_trace_kprobe() should avoid this
check too. Doing this check leads to ENOENT for module_name:symbol_name
constructions passed via perf_event_open.
There is no bug in mainline or 6.12, as those contain the more general fix,
commit 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads").
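For context, a minimal standalone sketch (not part of the patch; the symbol names below are made up) of how a ':' in the func string identifies a module_name:symbol_name construction, which is what the one-line change below uses to skip the duplicate-symbol counting:
```c
#include <stdio.h>
#include <string.h>

/* Returns 1 when func names a module symbol ("module:symbol"),
 * in which case the duplicate-symbol counting should be skipped. */
static int is_module_symbol(const char *func)
{
	return strchr(func, ':') != NULL;
}

int main(void)
{
	/* Hypothetical symbol names, for illustration only. */
	printf("%d\n", is_module_symbol("vfs_read"));          /* 0: count duplicates */
	printf("%d\n", is_module_symbol("my_mod:my_handler")); /* 1: skip the check   */
	return 0;
}
```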
Link: https://lore.kernel.org/linux-trace-kernel/20240705161030.b3ddb33a8167013b9…
Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
v1 -> v2:
* Reword commit title and message
* Send for stable instead of mainline
v2 -> v3:
* Specify first good LTS version in commit message
* Remove explicit versions from the subject since 6.1 and 5.10 need fix too
kernel/trace/trace_kprobe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 12d997bb3e78..94cb09d44115 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1814,7 +1814,7 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
int ret;
char *event;
- if (func) {
+ if (func && !strchr(func, ':')) {
unsigned int count;
count = number_of_same_symbols(func);
--
2.34.1
As per the zynqmp-ipi bindings, a zynqmp IPI node can have multiple child
nodes. Currently the IPI setup function is registered only for the first
child node. If the IPI node has multiple child nodes in the device tree,
IPI setup therefore fails for every child node other than the first one,
and in that case the kernel will crash.
Fix this crash by registering the IPI setup function for each available
child node.
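To make the failure mode concrete, a small standalone sketch (generic C, not the driver code; the struct and function names are made up) of why assigning the setup callback once before the loop covers only the first mailbox:
```c
#include <stdio.h>

struct ipi_mbox {
	void (*setup_ipi_fn)(void);
};

static void ipi_fn(void) { }

int main(void)
{
	struct ipi_mbox mboxes[3] = {{ NULL }, { NULL }, { NULL }};
	struct ipi_mbox *mbox = mboxes;
	int i;

	/* Buggy pattern: assigning once before the loop initializes only
	 * mboxes[0]; the pointer then advances past it.
	 *
	 * Fixed pattern: assign the callback for every child node. */
	for (i = 0; i < 3; i++, mbox++)
		mbox->setup_ipi_fn = ipi_fn;

	for (i = 0; i < 3; i++)
		printf("mbox %d: %s\n", i,
		       mboxes[i].setup_ipi_fn ? "set" : "NULL (would crash)");
	return 0;
}
```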
Cc: stable(a)vger.kernel.org
Fixes: 41bcf30100c5 ("mailbox: zynqmp: Move buffered IPI setup")
Signed-off-by: Tanmay Shah <tanmay.shah(a)amd.com>
Reviewed-by: Michal Simek <michal.simek(a)amd.com>
---
Changes in v2:
- Add Fixes tag
drivers/mailbox/zynqmp-ipi-mailbox.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/mailbox/zynqmp-ipi-mailbox.c b/drivers/mailbox/zynqmp-ipi-mailbox.c
index 521d08b9ab47..815e0492f029 100644
--- a/drivers/mailbox/zynqmp-ipi-mailbox.c
+++ b/drivers/mailbox/zynqmp-ipi-mailbox.c
@@ -940,10 +940,10 @@ static int zynqmp_ipi_probe(struct platform_device *pdev)
pdata->num_mboxes = num_mboxes;
mbox = pdata->ipi_mboxes;
- mbox->setup_ipi_fn = ipi_fn;
-
for_each_available_child_of_node(np, nc) {
mbox->pdata = pdata;
+ mbox->setup_ipi_fn = ipi_fn;
+
ret = zynqmp_ipi_mbox_probe(mbox, nc);
if (ret) {
of_node_put(nc);
base-commit: 28955f4fa2823e39f1ecfb3a37a364563527afbc
--
2.34.1
From: Sumit Gupta <sumitg(a)nvidia.com>
Access to the safety cluster engine (SCE) fabric registers was blocked
by a firewall after the introduction of the Functional Safety Island in
Tegra234. Since then, any software access to the SCE registers correctly
results in an internal bus error. However, when the CPUs try accessing
the SCE-fabric registers to print the error info, another firewall error
occurs because the fabric registers are also firewall protected. This
results in a second, misleading error being printed.
Disable the SCE fabric node to avoid printing the misleading error. The
first error info will still be printed via the interrupt from the fabric
that saw the actual access.
Cc: stable(a)vger.kernel.org
Fixes: 302e154000ec ("arm64: tegra: Add node for CBB 2.0 on Tegra234")
Signed-off-by: Sumit Gupta <sumitg(a)nvidia.com>
Signed-off-by: Ivy Huang <yijuh(a)nvidia.com>
---
arch/arm64/boot/dts/nvidia/tegra234.dtsi | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/nvidia/tegra234.dtsi b/arch/arm64/boot/dts/nvidia/tegra234.dtsi
index d08faf6bb505..05a771ab1ed5 100644
--- a/arch/arm64/boot/dts/nvidia/tegra234.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra234.dtsi
@@ -3815,7 +3815,7 @@ sce-fabric@b600000 {
compatible = "nvidia,tegra234-sce-fabric";
reg = <0x0 0xb600000 0x0 0x40000>;
interrupts = <GIC_SPI 173 IRQ_TYPE_LEVEL_HIGH>;
- status = "okay";
+ status = "disabled";
};
rce-fabric@be00000 {
--
2.25.1
From: Sumit Gupta <sumitg(a)nvidia.com>
The compatible string for the Tegra DCE fabric is currently defined as
'nvidia,tegra234-sce-fabric', but this is incorrect because that is the
compatible string for the SCE fabric. Update the DCE fabric node to use
the correct compatible string.
The compatible string needs to be correct in order for the interconnect
to catch things such as improper data accesses.
Cc: stable(a)vger.kernel.org
Fixes: 302e154000ec ("arm64: tegra: Add node for CBB 2.0 on Tegra234")
Signed-off-by: Sumit Gupta <sumitg(a)nvidia.com>
Signed-off-by: Ivy Huang <yijuh(a)nvidia.com>
---
arch/arm64/boot/dts/nvidia/tegra234.dtsi | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/nvidia/tegra234.dtsi b/arch/arm64/boot/dts/nvidia/tegra234.dtsi
index 984c85eab41a..d08faf6bb505 100644
--- a/arch/arm64/boot/dts/nvidia/tegra234.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra234.dtsi
@@ -3995,7 +3995,7 @@ bpmp-fabric@d600000 {
};
dce-fabric@de00000 {
- compatible = "nvidia,tegra234-sce-fabric";
+ compatible = "nvidia,tegra234-dce-fabric";
reg = <0x0 0xde00000 0x0 0x40000>;
interrupts = <GIC_SPI 381 IRQ_TYPE_LEVEL_HIGH>;
status = "okay";
--
2.25.1
Hello,
This series contains backports for 6.6 from the 6.11 release. This patchset
has gone through xfs testing and review.
Chen Ni (1):
xfs: convert comma to semicolon
Christoph Hellwig (1):
xfs: fix the contact address for the sysfs ABI documentation
Darrick J. Wong (10):
xfs: verify buffer, inode, and dquot items every tx commit
xfs: use consistent uid/gid when grabbing dquots for inodes
xfs: declare xfs_file.c symbols in xfs_file.h
xfs: create a new helper to return a file's allocation unit
xfs: fix file_path handling in tracepoints
xfs: attr forks require attr, not attr2
xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set
xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code
xfs: take m_growlock when running growfsrt
xfs: reset rootdir extent size hint after growfsrt
John Garry (2):
xfs: Fix xfs_flush_unmap_range() range for RT
xfs: Fix xfs_prepare_shift() range for RT
Julian Sun (1):
xfs: remove unused parameter in macro XFS_DQUOT_LOGRES
Zizhi Wo (1):
xfs: Fix the owner setting issue for rmap query in xfs fsmap
lei lu (1):
xfs: don't walk off the end of a directory data block
Documentation/ABI/testing/sysfs-fs-xfs | 8 +--
fs/xfs/Kconfig | 12 ++++
fs/xfs/libxfs/xfs_dir2_data.c | 31 ++++++++--
fs/xfs/libxfs/xfs_dir2_priv.h | 7 +++
fs/xfs/libxfs/xfs_quota_defs.h | 2 +-
fs/xfs/libxfs/xfs_trans_resv.c | 28 ++++-----
fs/xfs/scrub/agheader_repair.c | 2 +-
fs/xfs/scrub/bmap.c | 8 ++-
fs/xfs/scrub/trace.h | 10 ++--
fs/xfs/xfs.h | 4 ++
fs/xfs/xfs_bmap_util.c | 22 +++++---
fs/xfs/xfs_buf_item.c | 32 +++++++++++
fs/xfs/xfs_dquot_item.c | 31 ++++++++++
fs/xfs/xfs_file.c | 33 +++++------
fs/xfs/xfs_file.h | 15 +++++
fs/xfs/xfs_fsmap.c | 6 +-
fs/xfs/xfs_inode.c | 29 ++++++++--
fs/xfs/xfs_inode.h | 2 +
fs/xfs/xfs_inode_item.c | 32 +++++++++++
fs/xfs/xfs_ioctl.c | 12 ++++
fs/xfs/xfs_iops.c | 1 +
fs/xfs/xfs_iops.h | 3 -
fs/xfs/xfs_rtalloc.c | 78 +++++++++++++++++++++-----
fs/xfs/xfs_symlink.c | 8 ++-
24 files changed, 328 insertions(+), 88 deletions(-)
create mode 100644 fs/xfs/xfs_file.h
--
2.39.3
Hi all,
Here's a bunch of bespoke hand-ported bug fixes for 6.12 LTS. These
fixes are the ones that weren't automatically picked up by Greg, either
because they didn't apply or because they weren't cc'd to stable. If
there are any problems please let me know; this is the first time I've
ever sent patches to stable.
With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.
--D
---
Commits in this patchset:
* xfs: sb_spino_align is not verified
* xfs: fix sparse inode limits on runt AG
* xfs: fix off-by-one error in fsmap's end_daddr usage
* xfs: fix sb_spino_align checks for large fsblock sizes
* xfs: fix zero byte checking in the superblock scrubber
---
fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
fs/xfs/libxfs/xfs_sb.c | 15 +++++++++++++++
fs/xfs/scrub/agheader.c | 29 +++++++++++++++++++++++++++--
fs/xfs/xfs_fsmap.c | 29 ++++++++++++++++++-----------
4 files changed, 69 insertions(+), 20 deletions(-)
From: Hugo Villeneuve <hvilleneuve(a)dimonoff.com>
Hello,
this patch series brings additional patches for the sc16is7xx driver to
fix some FIFO issues when backporting to linux-5.15.y.
Commit ("serial: sc16is7xx: add missing support for rs485 devicetree
properties") is required when using RS-485.
Commit ("serial: sc16is7xx: refactor FIFO access functions
to increase commonality") is a prerequisite for commit ("serial:
sc16is7xx: fix TX fifo corruption"). Although it is not strictly
necessary, it makes backporting easier.
I have tested the changes on a custom board with two SC16IS752 DUART over
a SPI interface using a Variscite IMX8MN NANO SOM. The four UARTs are
configured in RS-485 mode.
Thank you.
Hugo Villeneuve (4):
serial: sc16is7xx: add missing support for rs485 devicetree properties
serial: sc16is7xx: refactor FIFO access functions to increase
commonality
serial: sc16is7xx: fix TX fifo corruption
serial: sc16is7xx: fix invalid FIFO access with special register set
drivers/tty/serial/sc16is7xx.c | 38 ++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 16 deletions(-)
--
2.39.5
The Hexagon-specific constant extender optimization in LLVM may crash on
Linux kernel code [1], such as fs/bcachefs/btree_io.c after
commit 32ed4a620c54 ("bcachefs: Btree path tracepoints") in 6.12:
clang: llvm/lib/Target/Hexagon/HexagonConstExtenders.cpp:745: bool (anonymous namespace)::HexagonConstExtenders::ExtRoot::operator<(const HCE::ExtRoot &) const: Assertion `ThisB->getParent() == OtherB->getParent()' failed.
Stack dump:
0. Program arguments: clang --target=hexagon-linux-musl ... fs/bcachefs/btree_io.c
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'fs/bcachefs/btree_io.c'.
4. Running pass 'Hexagon constant-extender optimization' on function '@__btree_node_lock_nopath'
Without assertions enabled, there is just a hang during compilation.
This has been resolved in LLVM main (20.0.0) [2] and backported to LLVM
19.1.0 [3], but the kernel supports LLVM 13.0.1 and newer, so disable the
constant extender optimization using the '-mllvm' option when using a
toolchain that is not fixed.
Cc: stable(a)vger.kernel.org
Link: https://github.com/llvm/llvm-project/issues/99714 [1]
Link: https://github.com/llvm/llvm-project/commit/68df06a0b2998765cb0a41353fcf091… [2]
Link: https://github.com/llvm/llvm-project/commit/2ab8d93061581edad3501561722ebd5… [3]
Reviewed-by: Brian Cain <bcain(a)quicinc.com>
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
---
Andrew, can you please take this for 6.13? Our CI continues to hit this.
Changes in v2:
- Rebase on 6.12 to make sure it is still applicable
- Name exact bcachefs commit that introduces crash now that it is
merged
- Add 'Cc: stable' as this is now visible in a stable release
- Carry forward Brian's reviewed-by
- Link to v1: https://lore.kernel.org/r/20240819-hexagon-disable-constant-expander-pass-v…
---
arch/hexagon/Makefile | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/hexagon/Makefile b/arch/hexagon/Makefile
index 92d005958dfb232d48a4ca843b46262a84a08eb4..ff172cbe5881a074f9d9430c37071992a4c8beac 100644
--- a/arch/hexagon/Makefile
+++ b/arch/hexagon/Makefile
@@ -32,3 +32,9 @@ KBUILD_LDFLAGS += $(ldflags-y)
TIR_NAME := r19
KBUILD_CFLAGS += -ffixed-$(TIR_NAME) -DTHREADINFO_REG=$(TIR_NAME) -D__linux__
KBUILD_AFLAGS += -DTHREADINFO_REG=$(TIR_NAME)
+
+# Disable HexagonConstExtenders pass for LLVM versions prior to 19.1.0
+# https://github.com/llvm/llvm-project/issues/99714
+ifneq ($(call clang-min-version, 190100),y)
+KBUILD_CFLAGS += -mllvm -hexagon-cext=false
+endif
---
base-commit: adc218676eef25575469234709c2d87185ca223a
change-id: 20240802-hexagon-disable-constant-expander-pass-8b6b61db6afc
Best regards,
--
Nathan Chancellor <nathan(a)kernel.org>
This is the start of the stable review cycle for the 5.10.232 release.
There are 43 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 19 Dec 2024 17:05:03 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.232-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.232-rc1
Dan Carpenter <dan.carpenter(a)linaro.org>
ALSA: usb-audio: Fix a DMA to stack memory bug
Juergen Gross <jgross(a)suse.com>
x86/xen: remove hypercall page
Juergen Gross <jgross(a)suse.com>
x86/xen: use new hypercall functions instead of hypercall page
Juergen Gross <jgross(a)suse.com>
x86/xen: add central hypercall functions
Juergen Gross <jgross(a)suse.com>
x86/xen: don't do PV iret hypercall through hypercall page
Juergen Gross <jgross(a)suse.com>
x86/static-call: provide a way to do very early static-call updates
Juergen Gross <jgross(a)suse.com>
objtool/x86: allow syscall instruction
Juergen Gross <jgross(a)suse.com>
x86: make get_cpu_vendor() accessible from Xen code
Juergen Gross <jgross(a)suse.com>
xen/netfront: fix crash when removing device
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "clkdev: remove CONFIG_CLKDEV_LOOKUP"
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "clocksource/drivers:sp804: Make user selectable"
Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
drm/i915: Fix memory leak by correcting cache object name in error handler
Nikolay Kuratov <kniv(a)yandex-team.ru>
tracing/kprobes: Skip symbol counting logic for module symbols in create_local_trace_kprobe()
Eduard Zingerman <eddyz87(a)gmail.com>
bpf: sync_linked_regs() must preserve subreg_def
Nathan Chancellor <nathan(a)kernel.org>
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
Daniil Tatianin <d-tatianin(a)yandex-team.ru>
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
Daniel Borkmann <daniel(a)iogearbox.net>
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Daniel Borkmann <daniel(a)iogearbox.net>
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Alexander Lobakin <alobakin(a)pm.me>
net: bonding, dummy, ifb, team: advertise NETIF_F_GSO_SOFTWARE
Martin Ottens <martin.ottens(a)fau.de>
net/sched: netem: account for backlog updates from child qdisc
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Make driver probing reliable
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Fix clock speed for multiple QCA7000
Anumula Murali Mohan Reddy <anumula(a)chelsio.com>
cxgb4: use port number to set mac addr
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
ACPI: resource: Fix memory resource type union access
Eric Dumazet <edumazet(a)google.com>
net: lapb: increase LAPB_HEADER_LEN
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Remove duplicate test cases
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Remove h1 ingress test case
Eric Dumazet <edumazet(a)google.com>
tipc: fix NULL deref in cleanup_bearer()
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not let TT changes list grows indefinitely
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Remove uninitialized data in full table TT response
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not send uninitialized TT changes
Suraj Sonawane <surajsonawane0215(a)gmail.com>
acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl
Sungjong Seo <sj1557.seo(a)samsung.com>
exfat: fix potential deadlock on __exfat_get_dentry_set
Michal Luczaj <mhal(a)rbox.co>
virtio/vsock: Fix accept_queue memory leak
Michal Luczaj <mhal(a)rbox.co>
bpf, sockmap: Fix update element with same
Darrick J. Wong <djwong(a)kernel.org>
xfs: fix scrub tracepoints when inode-rooted btrees are involved
Darrick J. Wong <djwong(a)kernel.org>
xfs: don't drop errno values when we fail to ficlone the entire range
Lianqin Hu <hulianqin(a)vivo.com>
usb: gadget: u_serial: Fix the issue that gs_start_io crashed due to accessing null pointer
Vitalii Mordan <mordan(a)ispras.ru>
usb: ehci-hcd: fix call balance of clocks handling routines
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
usb: host: max3421-hcd: Correctly abort a USB request.
MoYuanhao <moyuanhao3676(a)163.com>
tcp: check space before adding MPTCP SYN options
-------------
Diffstat:
Makefile | 4 +-
arch/arm/Kconfig | 2 +
arch/mips/Kconfig | 3 +
arch/mips/pic32/Kconfig | 1 +
arch/sh/Kconfig | 1 +
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/static_call.h | 15 ++++
arch/x86/include/asm/sync_core.h | 6 +-
arch/x86/include/asm/xen/hypercall.h | 36 ++++----
arch/x86/kernel/cpu/common.c | 38 +++++----
arch/x86/kernel/static_call.c | 10 +++
arch/x86/xen/enlighten.c | 65 ++++++++++++++-
arch/x86/xen/enlighten_hvm.c | 13 ++-
arch/x86/xen/enlighten_pv.c | 4 +-
arch/x86/xen/enlighten_pvh.c | 7 --
arch/x86/xen/xen-asm.S | 49 +++++++++--
arch/x86/xen/xen-head.S | 97 ++++++++++++++++++----
arch/x86/xen/xen-ops.h | 9 ++
block/blk-iocost.c | 9 +-
drivers/acpi/acpica/evxfregn.c | 2 -
drivers/acpi/nfit/core.c | 7 +-
drivers/acpi/resource.c | 6 +-
drivers/ata/sata_highbank.c | 1 +
drivers/clk/Kconfig | 6 +-
drivers/clk/Makefile | 3 +-
drivers/clocksource/Kconfig | 9 +-
drivers/gpu/drm/i915/i915_scheduler.c | 2 +-
drivers/mmc/host/Kconfig | 4 +-
drivers/net/bonding/bond_main.c | 12 +--
drivers/net/dummy.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 2 +-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 5 +-
drivers/net/ethernet/qualcomm/qca_spi.c | 26 +++---
drivers/net/ethernet/qualcomm/qca_spi.h | 1 -
drivers/net/ifb.c | 3 +-
drivers/net/team/team.c | 12 +--
drivers/net/xen-netfront.c | 5 +-
drivers/staging/board/Kconfig | 2 +-
drivers/usb/dwc2/hcd.c | 16 ++--
drivers/usb/gadget/function/u_serial.c | 9 +-
drivers/usb/host/ehci-sh.c | 9 +-
drivers/usb/host/max3421-hcd.c | 16 ++--
fs/exfat/dir.c | 2 +-
fs/xfs/scrub/trace.h | 2 +-
fs/xfs/xfs_file.c | 8 ++
include/linux/compiler.h | 34 +++++---
include/linux/static_call.h | 1 +
include/net/lapb.h | 2 +-
kernel/bpf/verifier.c | 5 +-
kernel/static_call.c | 2 +-
kernel/trace/trace_kprobe.c | 2 +-
net/batman-adv/translation-table.c | 58 +++++++++----
net/core/sock_map.c | 1 +
net/ipv4/tcp_output.c | 6 +-
net/sched/sch_netem.c | 22 +++--
net/tipc/udp_media.c | 7 +-
net/vmw_vsock/virtio_transport_common.c | 8 ++
sound/soc/dwc/Kconfig | 2 +-
sound/soc/rockchip/Kconfig | 14 ++--
sound/usb/quirks.c | 31 ++++---
tools/objtool/check.c | 11 ++-
.../selftests/drivers/net/mlxsw/sharedbuffer.sh | 15 ----
63 files changed, 534 insertions(+), 232 deletions(-)
This is the start of the stable review cycle for the 5.15.175 release.
There are 51 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 19 Dec 2024 17:05:03 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.175-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.175-rc1
Dan Carpenter <dan.carpenter(a)linaro.org>
ALSA: usb-audio: Fix a DMA to stack memory bug
Juergen Gross <jgross(a)suse.com>
x86/xen: remove hypercall page
Juergen Gross <jgross(a)suse.com>
x86/xen: use new hypercall functions instead of hypercall page
Juergen Gross <jgross(a)suse.com>
x86/xen: add central hypercall functions
Juergen Gross <jgross(a)suse.com>
x86/xen: don't do PV iret hypercall through hypercall page
Juergen Gross <jgross(a)suse.com>
x86/static-call: provide a way to do very early static-call updates
Juergen Gross <jgross(a)suse.com>
objtool/x86: allow syscall instruction
Juergen Gross <jgross(a)suse.com>
x86: make get_cpu_vendor() accessible from Xen code
Juergen Gross <jgross(a)suse.com>
xen/netfront: fix crash when removing device
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Revert "parisc: fix a possible DMA corruption"
Nikolay Kuratov <kniv(a)yandex-team.ru>
tracing/kprobes: Skip symbol counting logic for module symbols in create_local_trace_kprobe()
Eduard Zingerman <eddyz87(a)gmail.com>
bpf: sync_linked_regs() must preserve subreg_def
Nathan Chancellor <nathan(a)kernel.org>
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
Daniil Tatianin <d-tatianin(a)yandex-team.ru>
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
Daniel Borkmann <daniel(a)iogearbox.net>
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Daniel Borkmann <daniel(a)iogearbox.net>
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Martin Ottens <martin.ottens(a)fau.de>
net/sched: netem: account for backlog updates from child qdisc
Paul Barker <paul.barker.ct(a)bp.renesas.com>
Documentation: PM: Clarify pm_runtime_resume_and_get() return value
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Make driver probing reliable
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Fix clock speed for multiple QCA7000
Anumula Murali Mohan Reddy <anumula(a)chelsio.com>
cxgb4: use port number to set mac addr
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
ACPI: resource: Fix memory resource type union access
Daniel Machon <daniel.machon(a)microchip.com>
net: sparx5: fix the maximum frame length register
Daniel Machon <daniel.machon(a)microchip.com>
net: sparx5: fix FDMA performance issue
Eric Dumazet <edumazet(a)google.com>
net: lapb: increase LAPB_HEADER_LEN
Thomas Weißschuh <linux(a)weissschuh.net>
ptp: kvm: x86: Return EOPNOTSUPP instead of ENODEV from kvm_arch_ptp_init()
Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
ptp: kvm: Use decrypted memory in confidential guest on x86
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Remove duplicate test cases
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Remove h1 ingress test case
Eric Dumazet <edumazet(a)google.com>
tipc: fix NULL deref in cleanup_bearer()
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not let TT changes list grows indefinitely
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Remove uninitialized data in full table TT response
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not send uninitialized TT changes
Suraj Sonawane <surajsonawane0215(a)gmail.com>
acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl
Sungjong Seo <sj1557.seo(a)samsung.com>
exfat: fix potential deadlock on __exfat_get_dentry_set
Michal Luczaj <mhal(a)rbox.co>
virtio/vsock: Fix accept_queue memory leak
Michal Luczaj <mhal(a)rbox.co>
bpf, sockmap: Fix update element with same
Darrick J. Wong <djwong(a)kernel.org>
xfs: fix scrub tracepoints when inode-rooted btrees are involved
Darrick J. Wong <djwong(a)kernel.org>
xfs: return from xfs_symlink_verify early on V4 filesystems
Darrick J. Wong <djwong(a)kernel.org>
xfs: don't drop errno values when we fail to ficlone the entire range
Darrick J. Wong <djwong(a)kernel.org>
xfs: update btree keys correctly when _insrec splits an inode root block
Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
drm/i915: Fix memory leak by correcting cache object name in error handler
Lianqin Hu <hulianqin(a)vivo.com>
usb: gadget: u_serial: Fix the issue that gs_start_io crashed due to accessing null pointer
Vitalii Mordan <mordan(a)ispras.ru>
usb: ehci-hcd: fix call balance of clocks handling routines
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: Fix HCD port connection race
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: Fix HCD resume
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
usb: host: max3421-hcd: Correctly abort a USB request.
Jaakko Salo <jaakkos(a)gmail.com>
ALSA: usb-audio: Add implicit feedback quirk for Yamaha THR5
MoYuanhao <moyuanhao3676(a)163.com>
tcp: check space before adding MPTCP SYN options
-------------
Diffstat:
Documentation/power/runtime_pm.rst | 4 +-
Makefile | 4 +-
arch/parisc/Kconfig | 1 -
arch/parisc/include/asm/cache.h | 11 +--
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/static_call.h | 15 ++++
arch/x86/include/asm/sync_core.h | 6 +-
arch/x86/include/asm/xen/hypercall.h | 36 ++++----
arch/x86/kernel/cpu/common.c | 38 +++++----
arch/x86/kernel/static_call.c | 10 +++
arch/x86/xen/enlighten.c | 65 ++++++++++++++-
arch/x86/xen/enlighten_hvm.c | 13 ++-
arch/x86/xen/enlighten_pv.c | 4 +-
arch/x86/xen/enlighten_pvh.c | 7 --
arch/x86/xen/xen-asm.S | 49 +++++++++--
arch/x86/xen/xen-head.S | 97 ++++++++++++++++++----
arch/x86/xen/xen-ops.h | 9 ++
block/blk-iocost.c | 9 +-
drivers/acpi/acpica/evxfregn.c | 2 -
drivers/acpi/nfit/core.c | 7 +-
drivers/acpi/resource.c | 6 +-
drivers/ata/sata_highbank.c | 1 +
drivers/gpu/drm/i915/i915_scheduler.c | 2 +-
drivers/net/bonding/bond_main.c | 1 +
drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 2 +-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 5 +-
.../net/ethernet/microchip/sparx5/sparx5_main.c | 11 ++-
.../net/ethernet/microchip/sparx5/sparx5_port.c | 2 +-
drivers/net/ethernet/qualcomm/qca_spi.c | 26 +++---
drivers/net/ethernet/qualcomm/qca_spi.h | 1 -
drivers/net/team/team.c | 3 +-
drivers/net/xen-netfront.c | 5 +-
drivers/ptp/ptp_kvm_arm.c | 4 +
drivers/ptp/ptp_kvm_common.c | 1 +
drivers/ptp/ptp_kvm_x86.c | 61 +++++++++++---
drivers/usb/dwc2/hcd.c | 19 ++---
drivers/usb/gadget/function/u_serial.c | 9 +-
drivers/usb/host/ehci-sh.c | 9 +-
drivers/usb/host/max3421-hcd.c | 16 ++--
fs/exfat/dir.c | 2 +-
fs/xfs/libxfs/xfs_btree.c | 29 +++++--
fs/xfs/libxfs/xfs_symlink_remote.c | 4 +-
fs/xfs/scrub/trace.h | 2 +-
fs/xfs/xfs_file.c | 8 ++
include/linux/compiler.h | 34 +++++---
include/linux/ptp_kvm.h | 1 +
include/linux/static_call.h | 1 +
include/net/lapb.h | 2 +-
kernel/bpf/verifier.c | 5 +-
kernel/static_call_inline.c | 2 +-
kernel/trace/trace_kprobe.c | 2 +-
net/batman-adv/translation-table.c | 58 +++++++++----
net/core/sock_map.c | 1 +
net/ipv4/tcp_output.c | 6 +-
net/sched/sch_netem.c | 22 +++--
net/tipc/udp_media.c | 7 +-
net/vmw_vsock/virtio_transport_common.c | 8 ++
sound/usb/quirks.c | 33 +++++---
tools/objtool/check.c | 11 ++-
.../selftests/drivers/net/mlxsw/sharedbuffer.sh | 15 ----
61 files changed, 589 insertions(+), 239 deletions(-)
From: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
I'm seeing underruns with these 64bpp YUV formats on TGL.
The weird details:
- only happens on pipe B/C/D SDR planes, pipe A SDR planes
seem fine, as do all HDR planes
- somehow CDCLK related, higher CDCLK allows for bigger plane
with these formats without underruns. With 300MHz CDCLK I
can only go up to 1200 pixels wide or so, with 650MHz even
a 3840 pixel wide plane was OK
- ICL and ADL so far appear unaffected
So not really sure what's the deal with this, but bspec does
state "64-bit formats supported only on the HDR planes" so
let's just drop these formats from the SDR planes. We already
disallow 64bpp RGB formats.
Cc: stable(a)vger.kernel.org
Signed-off-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
---
drivers/gpu/drm/i915/display/skl_universal_plane.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/display/skl_universal_plane.c b/drivers/gpu/drm/i915/display/skl_universal_plane.c
index ff9764cac1e7..80e558042d97 100644
--- a/drivers/gpu/drm/i915/display/skl_universal_plane.c
+++ b/drivers/gpu/drm/i915/display/skl_universal_plane.c
@@ -106,8 +106,6 @@ static const u32 icl_sdr_y_plane_formats[] = {
DRM_FORMAT_Y216,
DRM_FORMAT_XYUV8888,
DRM_FORMAT_XVYU2101010,
- DRM_FORMAT_XVYU12_16161616,
- DRM_FORMAT_XVYU16161616,
};
static const u32 icl_sdr_uv_plane_formats[] = {
@@ -134,8 +132,6 @@ static const u32 icl_sdr_uv_plane_formats[] = {
DRM_FORMAT_Y216,
DRM_FORMAT_XYUV8888,
DRM_FORMAT_XVYU2101010,
- DRM_FORMAT_XVYU12_16161616,
- DRM_FORMAT_XVYU16161616,
};
static const u32 icl_hdr_plane_formats[] = {
--
2.45.2
This is the start of the stable review cycle for the 6.1.121 release.
There are 76 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 19 Dec 2024 17:05:03 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.121-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.121-rc1
Dan Carpenter <dan.carpenter(a)linaro.org>
ALSA: usb-audio: Fix a DMA to stack memory bug
Juergen Gross <jgross(a)suse.com>
x86/xen: remove hypercall page
Juergen Gross <jgross(a)suse.com>
x86/xen: use new hypercall functions instead of hypercall page
Juergen Gross <jgross(a)suse.com>
x86/xen: add central hypercall functions
Juergen Gross <jgross(a)suse.com>
x86/xen: don't do PV iret hypercall through hypercall page
Juergen Gross <jgross(a)suse.com>
x86/static-call: provide a way to do very early static-call updates
Juergen Gross <jgross(a)suse.com>
objtool/x86: allow syscall instruction
Juergen Gross <jgross(a)suse.com>
x86: make get_cpu_vendor() accessible from Xen code
Juergen Gross <jgross(a)suse.com>
xen/netfront: fix crash when removing device
Nikolay Kuratov <kniv(a)yandex-team.ru>
tracing/kprobes: Skip symbol counting logic for module symbols in create_local_trace_kprobe()
Eduard Zingerman <eddyz87(a)gmail.com>
bpf: sync_linked_regs() must preserve subreg_def
Nathan Chancellor <nathan(a)kernel.org>
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
Frédéric Danis <frederic.danis(a)collabora.com>
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Iulia Tanasescu <iulia.tanasescu(a)nxp.com>
Bluetooth: iso: Fix recursive locking warning
Daniil Tatianin <d-tatianin(a)yandex-team.ru>
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
Daniel Borkmann <daniel(a)iogearbox.net>
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Daniel Borkmann <daniel(a)iogearbox.net>
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
Martin Ottens <martin.ottens(a)fau.de>
net/sched: netem: account for backlog updates from child qdisc
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
Paul Barker <paul.barker.ct(a)bp.renesas.com>
Documentation: PM: Clarify pm_runtime_resume_and_get() return value
Venkata Prasad Potturu <venkataprasad.potturu(a)amd.com>
ASoC: amd: yc: Fix the wrong return value
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Make driver probing reliable
Stefan Wahren <wahrenst(a)gmx.net>
qca_spi: Fix clock speed for multiple QCA7000
Anumula Murali Mohan Reddy <anumula(a)chelsio.com>
cxgb4: use port number to set mac addr
Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
ACPI: resource: Fix memory resource type union access
Daniel Machon <daniel.machon(a)microchip.com>
net: sparx5: fix the maximum frame length register
Daniel Machon <daniel.machon(a)microchip.com>
net: sparx5: fix FDMA performance issue
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
spi: aspeed: Fix an error handling path in aspeed_spi_[read|write]_user()
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: mscc: ocelot: perform error cleanup in ocelot_hwstamp_set()
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: mscc: ocelot: be resilient to loss of PTP packets during transmission
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: mscc: ocelot: ocelot->ts_id_lock and ocelot_port->tx_skbs.lock are IRQ-safe
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: mscc: ocelot: improve handling of TX timestamp for unknown skb
Vladimir Oltean <vladimir.oltean(a)nxp.com>
net: mscc: ocelot: fix memory leak on ocelot_port_add_txtstamp_skb()
Eric Dumazet <edumazet(a)google.com>
net: defer final 'struct net' free in netns dismantle
Eric Dumazet <edumazet(a)google.com>
net: lapb: increase LAPB_HEADER_LEN
Thomas Weißschuh <linux(a)weissschuh.net>
ptp: kvm: x86: Return EOPNOTSUPP instead of ENODEV from kvm_arch_ptp_init()
Jeremi Piotrowski <jpiotrowski(a)linux.microsoft.com>
ptp: kvm: Use decrypted memory in confidential guest on x86
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Ensure no extra packets are counted
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Remove duplicate test cases
Danielle Ratson <danieller(a)nvidia.com>
selftests: mlxsw: sharedbuffer: Remove h1 ingress test case
Dan Carpenter <dan.carpenter(a)linaro.org>
net/mlx5: DR, prevent potential error pointer dereference
Eric Dumazet <edumazet(a)google.com>
tipc: fix NULL deref in cleanup_bearer()
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not let TT changes list grows indefinitely
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Remove uninitialized data in full table TT response
Remi Pommarel <repk(a)triplefau.lt>
batman-adv: Do not send uninitialized TT changes
David (Ming Qiang) Wu <David.Wu3(a)amd.com>
amdgpu/uvd: get ring reference from rq scheduler
Suraj Sonawane <surajsonawane0215(a)gmail.com>
acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl
Benjamin Lin <benjamin-jw.lin(a)mediatek.com>
wifi: mac80211: fix station NSS capability initialization order
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: clean up 'ret' in sta_link_apply_parameters()
Lin Ma <linma(a)zju.edu.cn>
wifi: nl80211: fix NL80211_ATTR_MLO_LINK_ID off-by-one
Sungjong Seo <sj1557.seo(a)samsung.com>
exfat: fix potential deadlock on __exfat_get_dentry_set
Yuezhang Mo <Yuezhang.Mo(a)sony.com>
exfat: support dynamic allocate bh for exfat_entry_set_cache
Paulo Alcantara <pc(a)manguebit.com>
smb: client: fix UAF in smb2_reconnect_server()
Michal Luczaj <mhal(a)rbox.co>
bpf, sockmap: Fix update element with same
Jiri Olsa <jolsa(a)kernel.org>
bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
Darrick J. Wong <djwong(a)kernel.org>
xfs: only run precommits once per transaction object
Darrick J. Wong <djwong(a)kernel.org>
xfs: fix scrub tracepoints when inode-rooted btrees are involved
Darrick J. Wong <djwong(a)kernel.org>
xfs: return from xfs_symlink_verify early on V4 filesystems
Darrick J. Wong <djwong(a)kernel.org>
xfs: don't drop errno values when we fail to ficlone the entire range
Darrick J. Wong <djwong(a)kernel.org>
xfs: update btree keys correctly when _insrec splits an inode root block
Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
drm/i915: Fix memory leak by correcting cache object name in error handler
Neal Frager <neal.frager(a)amd.com>
usb: dwc3: xilinx: make sure pipe clock is deselected in usb2 only mode
Lianqin Hu <hulianqin(a)vivo.com>
usb: gadget: u_serial: Fix the issue that gs_start_io crashed due to accessing null pointer
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
usb: typec: anx7411: fix OF node reference leaks in anx7411_typec_switch_probe()
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
usb: typec: anx7411: fix fwnode_handle reference leak
Vitalii Mordan <mordan(a)ispras.ru>
usb: ehci-hcd: fix call balance of clocks handling routines
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: Fix HCD port connection race
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature
Stefan Wahren <wahrenst(a)gmx.net>
usb: dwc2: Fix HCD resume
Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
Mark Tomlinson <mark.tomlinson(a)alliedtelesis.co.nz>
usb: host: max3421-hcd: Correctly abort a USB request.
Jaakko Salo <jaakkos(a)gmail.com>
ALSA: usb-audio: Add implicit feedback quirk for Yamaha THR5
Tejun Heo <tj(a)kernel.org>
blk-cgroup: Fix UAF in blkcg_unpin_online()
MoYuanhao <moyuanhao3676(a)163.com>
tcp: check space before adding MPTCP SYN options
Namjae Jeon <linkinjeon(a)kernel.org>
ksmbd: fix racy issue from session lookup and expire
Jann Horn <jannh(a)google.com>
bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
-------------
Diffstat:
Documentation/power/runtime_pm.rst | 4 +-
Makefile | 4 +-
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/static_call.h | 15 ++
arch/x86/include/asm/sync_core.h | 6 +-
arch/x86/include/asm/xen/hypercall.h | 36 ++--
arch/x86/kernel/cpu/common.c | 38 ++--
arch/x86/kernel/static_call.c | 9 +
arch/x86/xen/enlighten.c | 65 ++++++-
arch/x86/xen/enlighten_hvm.c | 13 +-
arch/x86/xen/enlighten_pv.c | 4 +-
arch/x86/xen/enlighten_pvh.c | 7 -
arch/x86/xen/xen-asm.S | 50 ++++-
arch/x86/xen/xen-head.S | 106 ++++++++---
arch/x86/xen/xen-ops.h | 9 +
block/blk-cgroup.c | 6 +-
block/blk-iocost.c | 9 +-
drivers/acpi/acpica/evxfregn.c | 2 -
drivers/acpi/nfit/core.c | 7 +-
drivers/acpi/resource.c | 6 +-
drivers/ata/sata_highbank.c | 1 +
drivers/gpu/drm/amd/amdgpu/uvd_v7_0.c | 2 +-
drivers/gpu/drm/i915/i915_scheduler.c | 2 +-
drivers/net/bonding/bond_main.c | 1 +
drivers/net/dsa/ocelot/felix_vsc9959.c | 17 +-
drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 2 +-
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 5 +-
.../mellanox/mlx5/core/steering/dr_domain.c | 4 +-
.../net/ethernet/microchip/sparx5/sparx5_main.c | 11 +-
.../net/ethernet/microchip/sparx5/sparx5_port.c | 2 +-
drivers/net/ethernet/mscc/ocelot_ptp.c | 207 +++++++++++++--------
drivers/net/ethernet/qualcomm/qca_spi.c | 26 ++-
drivers/net/ethernet/qualcomm/qca_spi.h | 1 -
drivers/net/team/team.c | 3 +-
drivers/net/xen-netfront.c | 5 +-
drivers/ptp/ptp_kvm_arm.c | 4 +
drivers/ptp/ptp_kvm_common.c | 1 +
drivers/ptp/ptp_kvm_x86.c | 61 ++++--
drivers/spi/spi-aspeed-smc.c | 10 +-
drivers/usb/dwc2/hcd.c | 19 +-
drivers/usb/dwc3/dwc3-xilinx.c | 5 +-
drivers/usb/gadget/function/u_serial.c | 9 +-
drivers/usb/host/ehci-sh.c | 9 +-
drivers/usb/host/max3421-hcd.c | 16 +-
drivers/usb/typec/anx7411.c | 66 ++++---
fs/exfat/dir.c | 15 ++
fs/exfat/exfat_fs.h | 5 +-
fs/smb/client/connect.c | 78 ++++----
fs/smb/server/auth.c | 2 +
fs/smb/server/mgmt/user_session.c | 6 +-
fs/smb/server/server.c | 4 +-
fs/smb/server/smb2pdu.c | 27 +--
fs/xfs/libxfs/xfs_btree.c | 29 ++-
fs/xfs/libxfs/xfs_symlink_remote.c | 4 +-
fs/xfs/scrub/trace.h | 2 +-
fs/xfs/xfs_file.c | 8 +
fs/xfs/xfs_trans.c | 16 +-
include/linux/compiler.h | 39 ++--
include/linux/dsa/ocelot.h | 1 +
include/linux/ptp_kvm.h | 1 +
include/linux/static_call.h | 1 +
include/net/bluetooth/bluetooth.h | 1 +
include/net/lapb.h | 2 +-
include/net/net_namespace.h | 1 +
include/soc/mscc/ocelot.h | 2 -
kernel/bpf/verifier.c | 5 +-
kernel/static_call_inline.c | 2 +-
kernel/trace/bpf_trace.c | 11 ++
kernel/trace/trace_kprobe.c | 2 +-
net/batman-adv/translation-table.c | 58 ++++--
net/bluetooth/iso.c | 8 +-
net/bluetooth/sco.c | 29 +--
net/core/net_namespace.c | 21 ++-
net/core/sock_map.c | 1 +
net/ipv4/tcp_output.c | 6 +-
net/mac80211/cfg.c | 9 +-
net/sched/sch_netem.c | 22 ++-
net/tipc/udp_media.c | 7 +-
net/wireless/nl80211.c | 2 +-
sound/soc/amd/yc/acp6x-mach.c | 13 +-
sound/usb/quirks.c | 44 +++--
tools/objtool/check.c | 11 +-
.../selftests/drivers/net/mlxsw/sharedbuffer.sh | 55 ++++--
84 files changed, 980 insertions(+), 459 deletions(-)
Hey Greg, Sasha,
We are doing some work to further automate stable-rc testing, triage, validation and reporting of stable-rc branches in the new KernelCI system. As part of that, we want to start relying on the X-KernelTest-* mail header parameters; however, there is no parameter with the git commit hash of the branch head.
Today, there is only information about the tree and branch, but no tags or commits. Essentially, we want to parse the email headers and immediately be able to request results from the KernelCI Dashboard API, passing the head commit being tested.
Is it possible to add 'X-KernelTest-Commit'?
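For illustration only (the first header name below is an approximation of the existing X-KernelTest-* parameters, and both values are placeholders), the announcement mail could then carry something like:
X-KernelTest-Branch: linux-6.6.y
X-KernelTest-Commit: <sha1 of the branch head under test>
and the commit value could be passed straight to the Dashboard API query.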
Thank you.
- Gus
--
Gustavo Padovan
Kernel Lead
Collabora Ltd.
Platinum Building, St John's Innovation Park
Cambridge CB4 0DS, UK
Registered in England & Wales, no. 5513718
This patch series fixes bugs and improves code regarding various
driver core device iterating APIs.
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v4:
- Squash patches 3-5 into one based on Jonathan's and Fan's comments.
- Add one more patch
- Link to v3: https://lore.kernel.org/r/20241212-class_fix-v3-0-04e20c4f0971@quicinc.com
Changes in v3:
- Correct commit message, add fix tag, and correct pr_crit() message for 1st patch
- Add more patches regarding driver core device iterating APIs.
- Link to v2: https://lore.kernel.org/r/20241112-class_fix-v2-0-73d198d0a0d5@quicinc.com
Changes in v2:
- Remove both fix and stable tag for patch 1/3
- drop patch 3/3
- Link to v1: https://lore.kernel.org/r/20241105-class_fix-v1-0-80866f9994a5@quicinc.com
---
Zijun Hu (8):
driver core: class: Fix wild pointer dereferences in API class_dev_iter_next()
blk-cgroup: Fix class @block_class's subsystem refcount leakage
driver core: Move true expression out of if condition in 3 device finding APIs
driver core: Rename declaration parameter name for API device_find_child() cluster
driver core: Correct parameter check for API device_for_each_child_reverse_from()
driver core: Correct API device_for_each_child_reverse_from() prototype
driver core: Introduce device_iter_t for device iterating APIs
driver core: Move 2 one line device finding APIs to header
block/blk-cgroup.c | 1 +
drivers/base/bus.c | 9 +++++---
drivers/base/class.c | 11 ++++++++--
drivers/base/core.c | 49 +++++++++----------------------------------
drivers/base/driver.c | 9 +++++---
drivers/cxl/core/hdm.c | 2 +-
drivers/cxl/core/region.c | 2 +-
include/linux/device.h | 28 ++++++++++++++++---------
include/linux/device/bus.h | 7 +++++--
include/linux/device/class.h | 4 ++--
include/linux/device/driver.h | 2 +-
11 files changed, 60 insertions(+), 64 deletions(-)
---
base-commit: cdd30ebb1b9f36159d66f088b61aee264e649d7a
change-id: 20241104-class_fix-f176bd9eba22
prerequisite-change-id: 20241201-const_dfc_done-aaec71e3bbea:v4
prerequisite-patch-id: 536aa56c0d055f644a1f71ab5c88b7cac9510162
prerequisite-patch-id: 39b0cf088c72853d9ce60c9e633ad2070a0278a8
prerequisite-patch-id: 60b22c42b67ad56a3d2a7b80a30ad588cbe740ec
prerequisite-patch-id: 119a167d7248481987b5e015db0e4fdb0d6edab8
prerequisite-patch-id: 133248083f3d3c57beb16473c2a4c62b3abc5fd0
prerequisite-patch-id: 4cda541f55165650bfa69fb19cbe0524eff0cb85
prerequisite-patch-id: 2b4193c6ea6370c07e6b66de04be89fb09448f54
prerequisite-patch-id: 73c675db18330c89fd8ca4790914d1d486ce0db8
prerequisite-patch-id: 88c50fc851fd7077797fd4e63fb12966b1b601bd
prerequisite-patch-id: 47b93916c1b5fb809d7c99aeaa05c729b1af01c5
prerequisite-patch-id: 52ffb42b5aae69cae708332e0ddc7016139999f1
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
When device_add(&udev->dev) succeeds and a later call fails,
usb_new_device() does not properly call device_del(). As the comment for
device_add() says, 'if device_add() succeeds, you should call
device_del() when you want to get rid of it. If device_add() has not
succeeded, use only put_device() to drop the reference count'.
Found by code review.
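As a minimal sketch of that rule (illustrative only, not the hub.c change itself; the function and the later setup step are hypothetical):

static int example_register(struct device *dev)
{
	int err;

	err = device_add(dev);
	if (err) {
		/* device_add() did not succeed: only drop the reference */
		put_device(dev);
		return err;
	}

	err = example_more_setup(dev);	/* hypothetical later step */
	if (err) {
		/* device_add() succeeded, so undo it with device_del()
		 * before dropping the reference */
		device_del(dev);
		put_device(dev);
		return err;
	}

	return 0;
}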
Cc: stable(a)vger.kernel.org
Fixes: 9f8b17e643fe ("USB: make usbdevices export their device nodes instead of using a separate class")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v3:
- modified the bug description according to the changes in the patch;
- removed the redundant put_device() from patch v2 as suggested.
Changes in v2:
- modified the bug description to make it more clear;
- added the missed part of the patch.
---
drivers/usb/core/hub.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 4b93c0bd1d4b..21ac9b464696 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2663,13 +2663,13 @@ int usb_new_device(struct usb_device *udev)
err = sysfs_create_link(&udev->dev.kobj,
&port_dev->dev.kobj, "port");
if (err)
- goto fail;
+ goto out_del_dev;
err = sysfs_create_link(&port_dev->dev.kobj,
&udev->dev.kobj, "device");
if (err) {
sysfs_remove_link(&udev->dev.kobj, "port");
- goto fail;
+ goto out_del_dev;
}
if (!test_and_set_bit(port1, hub->child_usage_bits))
@@ -2683,6 +2683,8 @@ int usb_new_device(struct usb_device *udev)
pm_runtime_put_sync_autosuspend(&udev->dev);
return err;
+out_del_dev:
+ device_del(&udev->dev);
fail:
usb_set_device_state(udev, USB_STATE_NOTATTACHED);
pm_runtime_disable(&udev->dev);
--
2.25.1
From: Kan Liang <kan.liang(a)linux.intel.com>
The only difference between formats 5 and 6 is the new counters
snapshotting group; without the follow-on counters snapshotting enabling
patches, it's impossible to utilize the feature in a PEBS record. It's
safe to share the same code path with format 5.
Add format 6, so the end user can at least utilize the legacy PEBS
features.
Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support")
Signed-off-by: Kan Liang <kan.liang(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
No changes since V5
arch/x86/events/intel/ds.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8dcf90f6fb59..ba74e1198328 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2551,6 +2551,7 @@ void __init intel_ds_init(void)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
break;
+ case 6:
case 5:
x86_pmu.pebs_ept = 1;
fallthrough;
--
2.38.1
From: Joshua Washington <joshwash(a)google.com>
This patch fixes a number of consistency issues in the queue allocation
path related to XDP.
As it stands, the number of allocated XDP queues changes in three
different scenarios.
1) Adding an XDP program while the interface is up via
gve_add_xdp_queues
2) Removing an XDP program while the interface is up via
gve_remove_xdp_queues
3) After queues have been allocated and the old queue memory has been
removed in gve_queues_start.
However, the requirement for the interface to be up for
gve_(add|remove)_xdp_queues to be called, in conjunction with the fact
that the number of queues stored in priv isn't updated until _after_ XDP
queues have been allocated in the normal queue allocation path, means
that if an XDP program is added while the interface is down, XDP queues
won't be added until the _second_ if_up, not the first.
Given the expectation that the number of XDP queues is equal to the
number of RX queues, scenario (3) has another problematic implication.
When changing the number of queues while an XDP program is loaded, the
number of XDP queues must be updated as well, as there is logic in the
driver (gve_xdp_tx_queue_id()) which relies on every RX queue having a
corresponding XDP TX queue. However, the number of XDP queues stored in
priv would not be updated until _after_ a close/open leading to a
mismatch in the number of XDP queues reported vs the number of XDP
queues which actually exist after the queue count update completes.
This patch remedies these issues by doing the following (a condensed
sketch of the resulting rule follows the list):
1) The allocation config getter function is set up to retrieve the
_expected_ number of XDP queues to allocate instead of relying
on the value stored in `priv` which is only updated once the queues
have been allocated.
2) When adjusting queues, XDP queues are adjusted to match the number of
RX queues when XDP is enabled. This only works in the case when
queues are live, so part (1) of the fix must still be available in
the case that queues are adjusted when there is an XDP program and
the interface is down.
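Condensed, the rule both steps apply is the fragment below, taken from the hunks in this patch rather than anything new:

	/* dedicated XDP TX queues exist only while an XDP program is
	 * loaded, one per RX queue */
	num_xdp_queues = priv->xdp_prog ? priv->rx_cfg.num_queues : 0;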
Fixes: 5f08cd3d6423 ("gve: Alloc before freeing when adjusting queues")
Cc: stable(a)vger.kernel.org
Signed-off-by: Joshua Washington <joshwash(a)google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Shailend Chand <shailend(a)google.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
drivers/net/ethernet/google/gve/gve_main.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 5cab7b88610f..09fb7f16f73e 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -930,11 +930,13 @@ static void gve_init_sync_stats(struct gve_priv *priv)
static void gve_tx_get_curr_alloc_cfg(struct gve_priv *priv,
struct gve_tx_alloc_rings_cfg *cfg)
{
+ int num_xdp_queues = priv->xdp_prog ? priv->rx_cfg.num_queues : 0;
+
cfg->qcfg = &priv->tx_cfg;
cfg->raw_addressing = !gve_is_qpl(priv);
cfg->ring_size = priv->tx_desc_cnt;
cfg->start_idx = 0;
- cfg->num_rings = gve_num_tx_queues(priv);
+ cfg->num_rings = priv->tx_cfg.num_queues + num_xdp_queues;
cfg->tx = priv->tx;
}
@@ -1843,6 +1845,7 @@ int gve_adjust_queues(struct gve_priv *priv,
{
struct gve_tx_alloc_rings_cfg tx_alloc_cfg = {0};
struct gve_rx_alloc_rings_cfg rx_alloc_cfg = {0};
+ int num_xdp_queues;
int err;
gve_get_curr_alloc_cfgs(priv, &tx_alloc_cfg, &rx_alloc_cfg);
@@ -1853,6 +1856,10 @@ int gve_adjust_queues(struct gve_priv *priv,
rx_alloc_cfg.qcfg = &new_rx_config;
tx_alloc_cfg.num_rings = new_tx_config.num_queues;
+ /* Add dedicated XDP TX queues if enabled. */
+ num_xdp_queues = priv->xdp_prog ? new_rx_config.num_queues : 0;
+ tx_alloc_cfg.num_rings += num_xdp_queues;
+
if (netif_running(priv->dev)) {
err = gve_adjust_config(priv, &tx_alloc_cfg, &rx_alloc_cfg);
return err;
--
2.47.1.613.gc27f4b7a9f-goog
From: Joshua Washington <joshwash(a)google.com>
When busy polling is enabled, xsk_sendmsg for AF_XDP zero copy marks
the NAPI ID corresponding to the memory pool allocated for the socket.
In GVE, this NAPI ID will never correspond to a NAPI ID of one of the
dedicated XDP TX queues registered with the umem because XDP TX is not
set up to share a NAPI with a corresponding RX queue.
This patch moves XSK TX descriptor processing from the TX NAPI to the RX
NAPI, and the gve_xsk_wakeup callback is updated to use the RX NAPI
instead of the TX NAPI, accordingly. The branch on whether the wakeup is
for TX is removed, as the NAPI poll should be invoked whether the wakeup
is for TX or for RX.
Fixes: fd8e40321a12 ("gve: Add AF_XDP zero-copy support for GQI-QPL format")
Cc: stable(a)vger.kernel.org
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Signed-off-by: Joshua Washington <joshwash(a)google.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
drivers/net/ethernet/google/gve/gve.h | 1 +
drivers/net/ethernet/google/gve/gve_main.c | 8 +++++
drivers/net/ethernet/google/gve/gve_tx.c | 36 +++++++++++++---------
3 files changed, 31 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index dd92949bb214..8167cc5fb0df 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -1140,6 +1140,7 @@ int gve_xdp_xmit_one(struct gve_priv *priv, struct gve_tx_ring *tx,
void gve_xdp_tx_flush(struct gve_priv *priv, u32 xdp_qid);
bool gve_tx_poll(struct gve_notify_block *block, int budget);
bool gve_xdp_poll(struct gve_notify_block *block, int budget);
+int gve_xsk_tx_poll(struct gve_notify_block *block, int budget);
int gve_tx_alloc_rings_gqi(struct gve_priv *priv,
struct gve_tx_alloc_rings_cfg *cfg);
void gve_tx_free_rings_gqi(struct gve_priv *priv,
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index e4e8ff4f9f80..5cab7b88610f 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -333,6 +333,14 @@ int gve_napi_poll(struct napi_struct *napi, int budget)
if (block->rx) {
work_done = gve_rx_poll(block, budget);
+
+ /* Poll XSK TX as part of RX NAPI. Setup re-poll based on max of
+ * TX and RX work done.
+ */
+ if (priv->xdp_prog)
+ work_done = max_t(int, work_done,
+ gve_xsk_tx_poll(block, budget));
+
reschedule |= work_done == budget;
}
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
index 852f8c7e39d2..4350ebd9c2bd 100644
--- a/drivers/net/ethernet/google/gve/gve_tx.c
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -981,33 +981,41 @@ static int gve_xsk_tx(struct gve_priv *priv, struct gve_tx_ring *tx,
return sent;
}
+int gve_xsk_tx_poll(struct gve_notify_block *rx_block, int budget)
+{
+ struct gve_rx_ring *rx = rx_block->rx;
+ struct gve_priv *priv = rx->gve;
+ struct gve_tx_ring *tx;
+ int sent = 0;
+
+ tx = &priv->tx[gve_xdp_tx_queue_id(priv, rx->q_num)];
+ if (tx->xsk_pool) {
+ sent = gve_xsk_tx(priv, tx, budget);
+
+ u64_stats_update_begin(&tx->statss);
+ tx->xdp_xsk_sent += sent;
+ u64_stats_update_end(&tx->statss);
+ if (xsk_uses_need_wakeup(tx->xsk_pool))
+ xsk_set_tx_need_wakeup(tx->xsk_pool);
+ }
+
+ return sent;
+}
+
bool gve_xdp_poll(struct gve_notify_block *block, int budget)
{
struct gve_priv *priv = block->priv;
struct gve_tx_ring *tx = block->tx;
u32 nic_done;
- bool repoll;
u32 to_do;
/* Find out how much work there is to be done */
nic_done = gve_tx_load_event_counter(priv, tx);
to_do = min_t(u32, (nic_done - tx->done), budget);
gve_clean_xdp_done(priv, tx, to_do);
- repoll = nic_done != tx->done;
-
- if (tx->xsk_pool) {
- int sent = gve_xsk_tx(priv, tx, budget);
-
- u64_stats_update_begin(&tx->statss);
- tx->xdp_xsk_sent += sent;
- u64_stats_update_end(&tx->statss);
- repoll |= (sent == budget);
- if (xsk_uses_need_wakeup(tx->xsk_pool))
- xsk_set_tx_need_wakeup(tx->xsk_pool);
- }
/* If we still have work we want to repoll */
- return repoll;
+ return nic_done != tx->done;
}
bool gve_tx_poll(struct gve_notify_block *block, int budget)
--
2.47.1.613.gc27f4b7a9f-goog
From: Joshua Washington <joshwash(a)google.com>
When stopping XDP TX rings, the XDP clean function needs to be called to
clean out the entire queue, similar to what happens in the normal TX
queue case. Otherwise, the FIFO won't be cleared correctly, and
xsk_tx_completed won't be reported.
Fixes: 75eaae158b1b ("gve: Add XDP DROP and TX support for GQI-QPL format")
Cc: stable(a)vger.kernel.org
Signed-off-by: Joshua Washington <joshwash(a)google.com>
Signed-off-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Praveen Kaligineedi <pkaligineedi(a)google.com>
Reviewed-by: Willem de Bruijn <willemb(a)google.com>
---
drivers/net/ethernet/google/gve/gve_tx.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
index e7fb7d6d283d..83ad278ec91f 100644
--- a/drivers/net/ethernet/google/gve/gve_tx.c
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -206,7 +206,10 @@ void gve_tx_stop_ring_gqi(struct gve_priv *priv, int idx)
return;
gve_remove_napi(priv, ntfy_idx);
- gve_clean_tx_done(priv, tx, priv->tx_desc_cnt, false);
+ if (tx->q_num < priv->tx_cfg.num_queues)
+ gve_clean_tx_done(priv, tx, priv->tx_desc_cnt, false);
+ else
+ gve_clean_xdp_done(priv, tx, priv->tx_desc_cnt);
netdev_tx_reset_queue(tx->netdev_txq);
gve_tx_remove_from_block(priv, idx);
}
--
2.47.1.613.gc27f4b7a9f-goog
commit 1dd73601a1cba37a0ed5f89a8662c90191df5873 upstream.
As syzbot reported [1], the root cause is that the i_size field is a
signed type, and negative i_size is also less than EROFS_BLKSIZ.
As a consequence, it's handled as fast symlink unexpectedly.
Let's fall back to the generic path to deal with such unusual i_size.
[1] https://lore.kernel.org/r/000000000000ac8efa05e7feaa1f@google.com
Reported-by: syzbot+f966c13b1b4fc0403b19(a)syzkaller.appspotmail.com
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Reviewed-by: Yue Hu <huyue2(a)coolpad.com>
Link: https://lore.kernel.org/r/20220909023948.28925-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao(a)linux.alibaba.com>
---
fs/erofs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 0dbeaf68e1d6..ba981076d6f2 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -202,7 +202,7 @@ static int erofs_fill_symlink(struct inode *inode, void *data,
/* if it cannot be handled with fast symlink scheme */
if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
- inode->i_size >= PAGE_SIZE) {
+ inode->i_size >= PAGE_SIZE || inode->i_size < 0) {
inode->i_op = &erofs_symlink_iops;
return 0;
}
--
2.43.5
commit 9ed50b8231e37b1ae863f5dec8153b98d9f389b4 upstream.
Fast symlink can be used if the on-disk symlink data is stored
in the same block as the on-disk inode, so we don’t need to trigger
another I/O for symlink data. However, fs corruption can currently be
reported _incorrectly_ if inode xattrs are too large.
In fact, these should be valid images although they cannot be handled as
fast symlinks.
Many thanks to Colin for reporting this!
Reported-by: Colin Walters <walters(a)verbum.org>
Reported-by: https://honggfuzz.dev/
Link: https://lore.kernel.org/r/bb2dd430-7de0-47da-ae5b-82ab2dd4d945@app.fastmail…
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
[ Note that it's a runtime misbehavior instead of a security issue. ]
Link: https://lore.kernel.org/r/20240909031911.1174718-1-hsiangkao@linux.alibaba.…
Signed-off-by: Gao Xiang <hsiangkao(a)linux.alibaba.com>
---
fs/erofs/inode.c | 20 ++++++--------------
1 file changed, 6 insertions(+), 14 deletions(-)
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 638bb70d0d65..c68258ae70d3 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -219,11 +219,14 @@ static int erofs_fill_symlink(struct inode *inode, void *data,
unsigned int m_pofs)
{
struct erofs_inode *vi = EROFS_I(inode);
+ loff_t off;
char *lnk;
- /* if it cannot be handled with fast symlink scheme */
- if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
- inode->i_size >= PAGE_SIZE || inode->i_size < 0) {
+ m_pofs += vi->xattr_isize;
+ /* check if it cannot be handled with fast symlink scheme */
+ if (vi->datalayout != EROFS_INODE_FLAT_INLINE || inode->i_size < 0 ||
+ check_add_overflow(m_pofs, inode->i_size, &off) ||
+ off > i_blocksize(inode)) {
inode->i_op = &erofs_symlink_iops;
return 0;
}
@@ -232,17 +235,6 @@ static int erofs_fill_symlink(struct inode *inode, void *data,
if (!lnk)
return -ENOMEM;
- m_pofs += vi->xattr_isize;
- /* inline symlink data shouldn't cross page boundary as well */
- if (m_pofs + inode->i_size > PAGE_SIZE) {
- kfree(lnk);
- erofs_err(inode->i_sb,
- "inline data cross block boundary @ nid %llu",
- vi->nid);
- DBG_BUGON(1);
- return -EFSCORRUPTED;
- }
-
memcpy(lnk, data + m_pofs, inode->i_size);
lnk[inode->i_size] = '\0';
--
2.43.5
commit 1dd73601a1cba37a0ed5f89a8662c90191df5873 upstream.
As syzbot reported [1], the root cause is that the i_size field is a
signed type, and negative i_size is also less than EROFS_BLKSIZ.
As a consequence, it's handled as fast symlink unexpectedly.
Let's fall back to the generic path to deal with such unusual i_size.
[1] https://lore.kernel.org/r/000000000000ac8efa05e7feaa1f@google.com
Reported-by: syzbot+f966c13b1b4fc0403b19(a)syzkaller.appspotmail.com
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Reviewed-by: Yue Hu <huyue2(a)coolpad.com>
Link: https://lore.kernel.org/r/20220909023948.28925-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao(a)linux.alibaba.com>
---
fs/erofs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 0a94a52a119f..93a4ed665d93 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -202,7 +202,7 @@ static int erofs_fill_symlink(struct inode *inode, void *data,
/* if it cannot be handled with fast symlink scheme */
if (vi->datalayout != EROFS_INODE_FLAT_INLINE ||
- inode->i_size >= PAGE_SIZE) {
+ inode->i_size >= PAGE_SIZE || inode->i_size < 0) {
inode->i_op = &erofs_symlink_iops;
return 0;
}
--
2.43.5
From: Conor Dooley <conor.dooley(a)microchip.com>
Running i2c-detect currently produces an output akin to:
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
00:                         08 -- 0a -- 0c -- 0e --
10: 10 -- 12 -- 14 -- 16 -- UU 19 -- 1b -- 1d -- 1f
20: -- 21 -- 23 -- 25 -- 27 -- 29 -- 2b -- 2d -- 2f
30: -- -- -- -- -- -- -- -- 38 -- 3a -- 3c -- 3e --
40: 40 -- 42 -- 44 -- 46 -- 48 -- 4a -- 4c -- 4e --
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
60: 60 -- 62 -- 64 -- 66 -- 68 -- 6a -- 6c -- 6e --
70: 70 -- 72 -- 74 -- 76 --
This happens because, for an i2c_msg with a len of 0, the driver will
mark the transmission of the message as a success once the START has
been sent, without waiting for the devices on the bus to respond with an
ACK/NAK. Since i2cdetect seems to run in a tight loop over all addresses,
the NAK is treated as part of the next test for the next address.
Delete the fast path that marks a message as complete when idev->msg_len
is zero after sending a START/RESTART, since this isn't a valid scenario.
CC: stable(a)vger.kernel.org
Fixes: 64a6f1c4987e ("i2c: add support for microchip fpga i2c controllers")
Signed-off-by: Conor Dooley <conor.dooley(a)microchip.com>
---
My original tests with KASAN/UBSAN/PREEMPT_RT enabled saw far fewer of
these "ghost" detections, and the skip caused by the occupied address at
0x18 on this bus is part of my attribution of the problem. Unless I'm
mistaken, there's no scenario in which a message should be considered
complete after sending a START/RESTART without waiting for the ACK/NAK,
so the code path I deleted is useless? Looking out of tree, it predates
my involvement with the code, so I don't know where it came from; nor is
there anything like it in the bare-metal driver the Linux one was based on.
---
drivers/i2c/busses/i2c-microchip-corei2c.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/i2c/busses/i2c-microchip-corei2c.c b/drivers/i2c/busses/i2c-microchip-corei2c.c
index e5af38dfaa81..b0a51695138a 100644
--- a/drivers/i2c/busses/i2c-microchip-corei2c.c
+++ b/drivers/i2c/busses/i2c-microchip-corei2c.c
@@ -287,8 +287,6 @@ static irqreturn_t mchp_corei2c_handle_isr(struct mchp_corei2c_dev *idev)
ctrl &= ~CTRL_STA;
writeb(idev->addr, idev->base + CORE_I2C_DATA);
writeb(ctrl, idev->base + CORE_I2C_CTRL);
- if (idev->msg_len == 0)
- finished = true;
break;
case STATUS_M_ARB_LOST:
idev->msg_err = -EAGAIN;
--
2.45.2
Once device_add(&dev->dev) has failed, call put_device() to explicitly
release dev->dev; otherwise it could cause a double-free problem.
As the comment for device_add() says, 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: f2b44cde7e16 ("virtio: split device_register into device_initialize and device_add")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v2:
- modified the bug description to make it more clear;
- changed the Fixes tag.
---
drivers/virtio/virtio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index b9095751e43b..ac721b5597e8 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -503,6 +503,7 @@ int register_virtio_device(struct virtio_device *dev)
out_of_node_put:
of_node_put(dev->dev.of_node);
+ put_device(&dev->dev);
out_ida_remove:
ida_free(&virtio_index_ida, dev->index);
out:
--
2.25.1
fs/bcachefs/btree_trans_commit.o: warning: objtool: bch2_trans_commit_write_locked.isra.0() falls through to next function do_bch2_trans_commit.isra.0()
fs/bcachefs/btree_trans_commit.o: warning: objtool: .text: unexpected end of section
......
fs/bcachefs/btree_update.o: warning: objtool: bch2_trans_update_get_key_cache() falls through to next function flush_new_cached_update()
fs/bcachefs/btree_update.o: warning: objtool: flush_new_cached_update() falls through to next function bch2_trans_update_by_path()
Signed-off-by: chenchangcheng <ccc194101(a)163.com>
---
tools/objtool/noreturns.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
index f37614cc2c1b..88a0fa8807be 100644
--- a/tools/objtool/noreturns.h
+++ b/tools/objtool/noreturns.h
@@ -49,3 +49,4 @@ NORETURN(x86_64_start_kernel)
NORETURN(x86_64_start_reservations)
NORETURN(xen_cpu_bringup_again)
NORETURN(xen_start_kernel)
+NORETURN(bch2_trans_unlocked_error)
--
2.25.1
From: yangge <yangge1116(a)126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the free
pages required for compaction reside in CMA pageblocks, it's
possible that __compaction_suitable() always returns true, and in
some cases that's not acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long-term GUP cannot allocate memory from the CMA area, so at most
16 GB of non-CMA memory on a NUMA node can be used as virtual
machine memory. Since there is 16 GB of free CMA memory on the NUMA
node, the order-0 watermark for compaction is always met, so
__compaction_suitable() always returns true, even if the node is
unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
            goto nopage; // should exit __alloc_pages_slowpath() from here
Other unmovable allocations, like dma_buf, which can be large in a
Linux system, are also unable to allocate memory from CMA, and these
allocations suffer from the same problems described above. In order
to quickly fall back to a remote node, we should remove ALLOC_CMA both
in __compaction_suitable() and __isolate_free_page() for unmovable
allocations. After this fix, starting a 32GB virtual machine with
device passthrough takes only a few seconds.
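Condensed from the hunks below (this is the patched check itself, not new logic): the compaction watermark test only counts free CMA pages as migration targets when the caller's alloc_flags actually allow CMA:

	return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
				   alloc_flags & ALLOC_CMA, wmark_target);

Unmovable allocations do not pass ALLOC_CMA, so a node's free CMA memory no longer makes it look suitable for compaction.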
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
Reviewed-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
V7:
-- fix the changelog and code documentation
V6:
-- update cc->alloc_flags to keep the original logic
V5:
- add 'alloc_flags' parameter for __isolate_free_page()
- remove 'usa_cma' variable
V4:
- enrich the commit log description
V3:
- fix build errors
- add ALLOC_CMA both in should_continue_reclaim() and compaction_ready()
V2:
- using the 'cc->alloc_flags' to determine if 'ALLOC_CMA' is needed
- enrich the commit log description
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 26 +++++++++++++++-----------
mm/internal.h | 3 ++-
mm/page_alloc.c | 7 +++++--
mm/page_isolation.c | 3 ++-
mm/page_reporting.c | 2 +-
mm/vmscan.c | 4 ++--
7 files changed, 31 insertions(+), 20 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..b4c3ac3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..223f2da 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -655,7 +655,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, cc->alloc_flags);
if (!isolated)
break;
set_page_private(page, order);
@@ -1634,7 +1634,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, cc->alloc_flags)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -2381,6 +2381,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
@@ -2395,25 +2396,26 @@ static bool __compaction_suitable(struct zone *zone, int order,
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to unmovable allocations, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ alloc_flags & ALLOC_CMA, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2474,7 +2476,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2499,7 +2501,7 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
@@ -2893,6 +2895,7 @@ static int compact_node(pg_data_t *pgdat, bool proactive)
struct compact_control cc = {
.order = -1,
.mode = proactive ? MIGRATE_SYNC_LIGHT : MIGRATE_SYNC,
+ .alloc_flags = ALLOC_CMA,
.ignore_skip_hint = true,
.whole_zone = true,
.gfp_mask = GFP_KERNEL,
@@ -3037,7 +3040,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
pgdat->kcompactd_max_order,
- highest_zoneidx, ALLOC_WMARK_MIN);
+ highest_zoneidx, ALLOC_CMA | ALLOC_WMARK_MIN);
if (ret == COMPACT_CONTINUE)
return true;
}
@@ -3058,6 +3061,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.search_order = pgdat->kcompactd_max_order,
.highest_zoneidx = pgdat->kcompactd_highest_zoneidx,
.mode = MIGRATE_SYNC_LIGHT,
+ .alloc_flags = ALLOC_CMA | ALLOC_WMARK_MIN,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
};
@@ -3078,7 +3082,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
continue;
ret = compaction_suit_allocation_order(zone,
- cc.order, zoneid, ALLOC_WMARK_MIN);
+ cc.order, zoneid, cc.alloc_flags);
if (ret != COMPACT_CONTINUE)
continue;
diff --git a/mm/internal.h b/mm/internal.h
index 3922788..6d257c8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,7 +662,8 @@ static inline void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..1bfdca3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2809,7 +2809,8 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0,
+ alloc_flags & ALLOC_CMA))
return 0;
}
@@ -6454,6 +6456,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
.order = -1,
.zone = page_zone(pfn_to_page(start)),
.mode = MIGRATE_SYNC,
+ .alloc_flags = ALLOC_CMA,
.ignore_skip_hint = true,
.no_set_skip_hint = true,
.alloc_contig = true,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c608e9d..a1f2c79 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -229,7 +229,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order,
+ ALLOC_CMA);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e..fd3813b 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -198,7 +198,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, ALLOC_CMA)) {
next = page;
break;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..33f5b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6043,7 +6043,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
--
2.7.4
(Maybe a good first-timer bug if anyone wants to try contributing during
the holiday seasons)
The stable v6.6 kernel currently runs into a kernel panic when running the
test_progs tests from BPF selftests. Judging by the log it is failing in
one of the dummy_st_ops tests (which comes after deny_namespace tests if
you look at the output of `test_progs -l`). My guess is that it is
related to "check bpf_dummy_struct_ops program params for test runs"[1],
perhaps we're missing a commit or two.
Some notes for anyone tackling this for the first time:
1. You'll need to use the stable/linux-6.6.y branch from
https://github.com/shunghsiyu/bpf. The current v6.6.66 one fails to
compile the BPF selftests[2]
2. The easiest way to run the BPF selftests is to get the relevant
dependencies[3] installed and run
tools/testing/selftests/bpf/vmtest.sh (you need to give it `-i` to
download the root image first, and you might also need to point it at
clang and llvm-strip by setting the environment variables CLANG=clang-17
and LLVM_STRIP=llvm-strip-17, respectively; see the example invocation
after this list). For a more solid setup, see materials[4][5] from
Manu Bretelle
3. Patch(es) should be sent to stable(a)vger.kernel.org, following the
stable process[6]; see [2] as an example
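Putting note 2 together, a full invocation would look something like the following (illustrative; the clang/LLVM version suffix is whatever you have installed, and `-i` should only be needed the first time, to fetch the root image):
CLANG=clang-17 LLVM_STRIP=llvm-strip-17 tools/testing/selftests/bpf/vmtest.sh -i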
Below is the output from vmtest.sh:
#68/1 deny_namespace/unpriv_userns_create_no_bpf:OK
#68/2 deny_namespace/userns_create_bpf:OK
#68 deny_namespace:OK
[ 26.829153] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 26.831136] #PF: supervisor read access in kernel mode
[ 26.832635] #PF: error_code(0x0000) - not-present page
[ 26.833999] PGD 0 P4D 0
[ 26.834771] Oops: 0000 [#1] PREEMPT SMP PTI
[ 26.835997] CPU: 2 PID: 119 Comm: test_progs Tainted: G OE 6.6.66-00003-gd80551078e71 #3
[ 26.838774] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 26.841152] RIP: 0010:bpf_prog_8ee9cbe7c9b5a50f_test_1+0x17/0x24
[ 26.842877] Code: 00 00 00 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 0f 1e fa 48 8b 7f 00 <8b> 47 00 be 5a 00 00 00 89 77 00 c9 c3 cc cc cc cc cc cc cc cc c0
[ 26.847953] RSP: 0018:ffff9e6b803b7d88 EFLAGS: 00010202
[ 26.849425] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 2845e103d7dffb60
[ 26.851483] RDX: 0000000000000000 RSI: 0000000084d09025 RDI: 0000000000000000
[ 26.853508] RBP: ffff9e6b803b7d88 R08: 0000000000000001 R09: 0000000000000000
[ 26.855670] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9754c0b5f700
[ 26.857824] R13: ffff9754c09cc800 R14: ffff9754c0b5f680 R15: ffff9754c0b5f760
[ 26.859741] FS: 00007f77dee12740(0000) GS:ffff9754fbc80000(0000) knlGS:0000000000000000
[ 26.862087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 26.863705] CR2: 0000000000000000 CR3: 00000001020e6003 CR4: 0000000000170ee0
[ 26.865689] Call Trace:
[ 26.866407] <TASK>
[ 26.866982] ? __die+0x24/0x70
[ 26.867774] ? page_fault_oops+0x15b/0x450
[ 26.868882] ? search_bpf_extables+0xb0/0x160
[ 26.870076] ? fixup_exception+0x26/0x330
[ 26.871214] ? exc_page_fault+0x64/0x190
[ 26.872293] ? asm_exc_page_fault+0x26/0x30
[ 26.873352] ? bpf_prog_8ee9cbe7c9b5a50f_test_1+0x17/0x24
[ 26.874705] ? __bpf_prog_enter+0x3f/0xc0
[ 26.875718] ? bpf_struct_ops_test_run+0x1b8/0x2c0
[ 26.876942] ? __sys_bpf+0xc4e/0x2c30
[ 26.877898] ? __x64_sys_bpf+0x20/0x30
[ 26.878812] ? do_syscall_64+0x37/0x90
[ 26.879704] ? entry_SYSCALL_64_after_hwframe+0x78/0xe2
[ 26.880918] </TASK>
[ 26.881409] Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE)]
[ 26.883095] CR2: 0000000000000000
[ 26.883934] ---[ end trace 0000000000000000 ]---
[ 26.885099] RIP: 0010:bpf_prog_8ee9cbe7c9b5a50f_test_1+0x17/0x24
[ 26.886452] Code: 00 00 00 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 0f 1e fa 48 8b 7f 00 <8b> 47 00 be 5a 00 00 00 89 77 00 c9 c3 cc cc cc cc cc cc cc cc c0
[ 26.890379] RSP: 0018:ffff9e6b803b7d88 EFLAGS: 00010202
[ 26.891450] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 2845e103d7dffb60
[ 26.892779] RDX: 0000000000000000 RSI: 0000000084d09025 RDI: 0000000000000000
[ 26.894254] RBP: ffff9e6b803b7d88 R08: 0000000000000001 R09: 0000000000000000
[ 26.895630] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9754c0b5f700
[ 26.897008] R13: ffff9754c09cc800 R14: ffff9754c0b5f680 R15: ffff9754c0b5f760
[ 26.898337] FS: 00007f77dee12740(0000) GS:ffff9754fbc80000(0000) knlGS:0000000000000000
[ 26.899972] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 26.901076] CR2: 0000000000000000 CR3: 00000001020e6003 CR4: 0000000000170ee0
[ 26.902336] Kernel panic - not syncing: Fatal exception
[ 26.903639] Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 26.905693] ---[ end Kernel panic - not syncing: Fatal exception ]---
1: https://lore.kernel.org/all/20240424012821.595216-1-eddyz87@gmail.com/t/#u
2: https://lore.kernel.org/all/20241217080240.46699-1-shung-hsi.yu@suse.com/t/…
3: https://gist.github.com/shunghsiyu/1bd4189654cce5b3e55c2ab8da7dd33d#file-vm…
4: https://chantra.github.io/bpfcitools/bpf-local-development.html
5: http://oldvger.kernel.org/bpfconf2024_material/BPF-dev-hacks.pdf
6: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
If device_add() fails, call only put_device() to decrement the reference
count for cleanup. Do not call device_del() before put_device().
As the comment for device_add() says, 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: c8e4c2397655 ("RDMA/srp: Rework the srp_add_port() error path")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v3:
- modified the bug description as suggestions;
- added a blank line to separate the description and the tags.
Changes in v2:
- modified the bug description as suggestions.
---
drivers/infiniband/ulp/srp/ib_srp.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 2916e77f589b..7289ae0b83ac 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3978,7 +3978,6 @@ static struct srp_host *srp_add_port(struct srp_device *device, u32 port)
return host;
put_host:
- device_del(&host->dev);
put_device(&host->dev);
return NULL;
}
--
2.25.1
When device_add(&udev->dev) fails, call put_device() to explicitly
release udev->dev, since the routine which calls usb_new_device() does
not call put_device() when an error occurs. As the comment for device_add()
says, 'if device_add() succeeds, you should call device_del() when you
want to get rid of it. If device_add() has not succeeded, use only
put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 9f8b17e643fe ("USB: make usbdevices export their device nodes instead of using a separate class")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v2:
- modified the bug description to make it more clear;
- added the missed part of the patch.
---
drivers/usb/core/hub.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 4b93c0bd1d4b..ddd572312296 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2651,6 +2651,7 @@ int usb_new_device(struct usb_device *udev)
err = device_add(&udev->dev);
if (err) {
dev_err(&udev->dev, "can't device_add, error %d\n", err);
+ put_device(&udev->dev);
goto fail;
}
@@ -2663,13 +2664,13 @@ int usb_new_device(struct usb_device *udev)
err = sysfs_create_link(&udev->dev.kobj,
&port_dev->dev.kobj, "port");
if (err)
- goto fail;
+ goto out_del_dev;
err = sysfs_create_link(&port_dev->dev.kobj,
&udev->dev.kobj, "device");
if (err) {
sysfs_remove_link(&udev->dev.kobj, "port");
- goto fail;
+ goto out_del_dev;
}
if (!test_and_set_bit(port1, hub->child_usage_bits))
@@ -2683,6 +2684,9 @@ int usb_new_device(struct usb_device *udev)
pm_runtime_put_sync_autosuspend(&udev->dev);
return err;
+out_del_dev:
+ device_del(&udev->dev);
+ put_device(&udev->dev);
fail:
usb_set_device_state(udev, USB_STATE_NOTATTACHED);
pm_runtime_disable(&udev->dev);
--
2.25.1
From: Steven Rostedt <rostedt(a)goodmis.org>
The persistent ring buffer can live across boots. It is expected that the
content in the buffer can be translated to the current kernel with delta
offsets even with KASLR enabled. But this can only be guaranteed if the
content of the ring buffer came from the same kernel as the one that is
currently running.
Add uname into the meta data and if the uname in the meta data from the
previous boot does not match the uname of the current boot, then clear the
buffer and re-initialize it.
This only handles the case of kernel versions. It does not clear the
buffer for development kernels that keep the same release string. There
are several mechanisms to keep bad data from crashing the kernel. The
worst that can happen is that some corrupt data may be displayed.
Cc: stable(a)vger.kernel.org
Fixes: 8f3e6659656e6 ("ring-buffer: Save text and data locations in mapped meta data")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/ring_buffer.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7e257e855dd1..3c94c59d000c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -17,6 +17,7 @@
#include <linux/uaccess.h>
#include <linux/hardirq.h>
#include <linux/kthread.h> /* for self test */
+#include <linux/utsname.h>
#include <linux/module.h>
#include <linux/percpu.h>
#include <linux/mutex.h>
@@ -45,10 +46,13 @@
static void update_pages_handler(struct work_struct *work);
#define RING_BUFFER_META_MAGIC 0xBADFEED
+#define UNAME_SZ 64
struct ring_buffer_meta {
int magic;
int struct_size;
+ char uname[UNAME_SZ];
+
unsigned long text_addr;
unsigned long data_addr;
unsigned long first_buffer;
@@ -1687,6 +1691,11 @@ static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
return false;
}
+ if (strncmp(init_utsname()->release, meta->uname, UNAME_SZ - 1)) {
+ pr_info("Ring buffer boot meta[%d] mismatch of uname\n", cpu);
+ return false;
+ }
+
/* The subbuffer's size and number of subbuffers must match */
if (meta->subbuf_size != subbuf_size ||
meta->nr_subbufs != nr_pages + 1) {
@@ -1920,6 +1929,7 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
meta->magic = RING_BUFFER_META_MAGIC;
meta->struct_size = sizeof(*meta);
+ strscpy(meta->uname, init_utsname()->release, UNAME_SZ);
meta->nr_subbufs = nr_pages + 1;
meta->subbuf_size = PAGE_SIZE;
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The TP_printk() portion of a trace event is executed at the time an event
is read from the trace. This can happen seconds, minutes, hours, days,
months, years possibly later since the event was recorded. If the print
format contains a dereference to a string via "%s", and that string was
allocated, there's a chance that string could be freed before it is read
by the trace file.
To protect against such bugs, there are two functions that verify the
event. The first one is test_event_printk(), which is called when the
event is created. It reads the TP_printk() format as well as its arguments
to make sure nothing may be dereferencing a pointer that was not copied
into the ring buffer along with the event. If it is, it will trigger a
WARN_ON().
For strings that use "%s", it is not so easy. The string may not reside in
the ring buffer but may still be valid. Strings that are static and part
of the kernel proper, which will not be freed for the life of the running
system, are safe to dereference. But whether a pointer refers to a
static string or to something on the heap cannot be determined until the
event is triggered.
This brings us to the second function that tests for the bad dereferencing
of strings, trace_check_vprintf(). It would walk through the printf format
looking for "%s", and when it finds it, it would validate that the pointer
is safe to read. If not, it would produce a WARN_ON() as well and write
into the ring buffer "[UNSAFE-MEMORY]".
The problem with this is how it used va_list to have vsnprintf() handle
all the cases that it didn't need to check. Instead of re-implementing
vsnprintf(), it would make a copy of the format up to the %s part, and
call vsnprintf() with the current va_list ap variable, where the ap would
then be ready to point at the string in question.
For architectures that passed va_list by reference this was possible. For
architectures that passed it by copy it was not. A test_can_verify()
function was used to differentiate between the two, and if it wasn't
possible, it would disable it.
Even for architectures where this was feasible, it was a stretch to rely
on such an undocumented method, which could cause issues later on
with new compiler optimizations.
Instead, the first function test_event_printk() was updated to look at
"%s" as well. If the "%s" argument is a pointer outside the event in the
ring buffer, it would find the field type of the event that is the problem
and mark the structure with a new flag called "needs_test". The event
itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that
this event has a field that needs to be verified before the event can be
printed using the printf format.
When the event fields are created from the field type structure, the
fields would copy the field type's "needs_test" value.
Finally, before being printed, a new function ignore_event() is called
which will check if the event has the TEST_STR flag set (if not, it
returns false). If the flag is set, it then iterates through the event's
fields looking for the ones that have the "needs_test" flag set.
Then it uses the offset field from the field structure to find the pointer
in the ring buffer event. It runs the tests to make sure that pointer is
safe to print and if not, it triggers the WARN_ON() and also adds to the
trace output that the event in question has an unsafe memory access.
ignore_event() makes trace_check_vprintf() obsolete, so the latter is
removed.
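Condensed from the diff below, the new runtime check amounts to the following sketch (a summary of the flow, not a substitute for the full ignore_event() in the patch):

	if (!(event->flags & TRACE_EVENT_FL_TEST_STR))
		return false;			/* nothing to verify */

	list_for_each_entry(field, head, link) {
		if (!field->needs_test)
			continue;
		/* field->offset locates the pointer inside iter->ent;
		 * verify it is safe to dereference, otherwise warn and
		 * note the unsafe memory access in the output */
	}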
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9…
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Al Viro <viro(a)ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
include/linux/trace_events.h | 6 +-
kernel/trace/trace.c | 255 ++++++++---------------------------
kernel/trace/trace.h | 6 +-
kernel/trace/trace_events.c | 32 +++--
kernel/trace/trace_output.c | 6 +-
5 files changed, 88 insertions(+), 217 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2a5df5b62cfc..91b8ffbdfa8c 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -273,7 +273,8 @@ struct trace_event_fields {
const char *name;
const int size;
const int align;
- const int is_signed;
+ const unsigned int is_signed:1;
+ unsigned int needs_test:1;
const int filter_type;
const int len;
};
@@ -324,6 +325,7 @@ enum {
TRACE_EVENT_FL_EPROBE_BIT,
TRACE_EVENT_FL_FPROBE_BIT,
TRACE_EVENT_FL_CUSTOM_BIT,
+ TRACE_EVENT_FL_TEST_STR_BIT,
};
/*
@@ -340,6 +342,7 @@ enum {
* CUSTOM - Event is a custom event (to be attached to an exsiting tracepoint)
* This is set when the custom event has not been attached
* to a tracepoint yet, then it is cleared when it is.
+ * TEST_STR - The event has a "%s" that points to a string outside the event
*/
enum {
TRACE_EVENT_FL_CAP_ANY = (1 << TRACE_EVENT_FL_CAP_ANY_BIT),
@@ -352,6 +355,7 @@ enum {
TRACE_EVENT_FL_EPROBE = (1 << TRACE_EVENT_FL_EPROBE_BIT),
TRACE_EVENT_FL_FPROBE = (1 << TRACE_EVENT_FL_FPROBE_BIT),
TRACE_EVENT_FL_CUSTOM = (1 << TRACE_EVENT_FL_CUSTOM_BIT),
+ TRACE_EVENT_FL_TEST_STR = (1 << TRACE_EVENT_FL_TEST_STR_BIT),
};
#define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..7cc18b9bce27 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3611,17 +3611,12 @@ char *trace_iter_expand_format(struct trace_iterator *iter)
}
/* Returns true if the string is safe to dereference from an event */
-static bool trace_safe_str(struct trace_iterator *iter, const char *str,
- bool star, int len)
+static bool trace_safe_str(struct trace_iterator *iter, const char *str)
{
unsigned long addr = (unsigned long)str;
struct trace_event *trace_event;
struct trace_event_call *event;
- /* Ignore strings with no length */
- if (star && !len)
- return true;
-
/* OK if part of the event data */
if ((addr >= (unsigned long)iter->ent) &&
(addr < (unsigned long)iter->ent + iter->ent_size))
@@ -3661,181 +3656,69 @@ static bool trace_safe_str(struct trace_iterator *iter, const char *str,
return false;
}
-static DEFINE_STATIC_KEY_FALSE(trace_no_verify);
-
-static int test_can_verify_check(const char *fmt, ...)
-{
- char buf[16];
- va_list ap;
- int ret;
-
- /*
- * The verifier is dependent on vsnprintf() modifies the va_list
- * passed to it, where it is sent as a reference. Some architectures
- * (like x86_32) passes it by value, which means that vsnprintf()
- * does not modify the va_list passed to it, and the verifier
- * would then need to be able to understand all the values that
- * vsnprintf can use. If it is passed by value, then the verifier
- * is disabled.
- */
- va_start(ap, fmt);
- vsnprintf(buf, 16, "%d", ap);
- ret = va_arg(ap, int);
- va_end(ap);
-
- return ret;
-}
-
-static void test_can_verify(void)
-{
- if (!test_can_verify_check("%d %d", 0, 1)) {
- pr_info("trace event string verifier disabled\n");
- static_branch_inc(&trace_no_verify);
- }
-}
-
/**
- * trace_check_vprintf - Check dereferenced strings while writing to the seq buffer
+ * ignore_event - Check dereferenced fields while writing to the seq buffer
* @iter: The iterator that holds the seq buffer and the event being printed
- * @fmt: The format used to print the event
- * @ap: The va_list holding the data to print from @fmt.
*
- * This writes the data into the @iter->seq buffer using the data from
- * @fmt and @ap. If the format has a %s, then the source of the string
- * is examined to make sure it is safe to print, otherwise it will
- * warn and print "[UNSAFE MEMORY]" in place of the dereferenced string
- * pointer.
+ * At boot up, test_event_printk() will flag any event that dereferences
+ * a string with "%s" that does exist in the ring buffer. It may still
+ * be valid, as the string may point to a static string in the kernel
+ * rodata that never gets freed. But if the string pointer is pointing
+ * to something that was allocated, there's a chance that it can be freed
+ * by the time the user reads the trace. This would cause a bad memory
+ * access by the kernel and possibly crash the system.
+ *
+ * This function will check if the event has any fields flagged as needing
+ * to be checked at runtime and perform those checks.
+ *
+ * If it is found that a field is unsafe, it will write into the @iter->seq
+ * a message stating what was found to be unsafe.
+ *
+ * @return: true if the event is unsafe and should be ignored,
+ * false otherwise.
*/
-void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
- va_list ap)
+bool ignore_event(struct trace_iterator *iter)
{
- long text_delta = 0;
- long data_delta = 0;
- const char *p = fmt;
- const char *str;
- bool good;
- int i, j;
+ struct ftrace_event_field *field;
+ struct trace_event *trace_event;
+ struct trace_event_call *event;
+ struct list_head *head;
+ struct trace_seq *seq;
+ const void *ptr;
- if (WARN_ON_ONCE(!fmt))
- return;
+ trace_event = ftrace_find_event(iter->ent->type);
- if (static_branch_unlikely(&trace_no_verify))
- goto print;
+ seq = &iter->seq;
- /*
- * When the kernel is booted with the tp_printk command line
- * parameter, trace events go directly through to printk().
- * It also is checked by this function, but it does not
- * have an associated trace_array (tr) for it.
- */
- if (iter->tr) {
- text_delta = iter->tr->text_delta;
- data_delta = iter->tr->data_delta;
+ if (!trace_event) {
+ trace_seq_printf(seq, "EVENT ID %d NOT FOUND?\n", iter->ent->type);
+ return true;
}
- /* Don't bother checking when doing a ftrace_dump() */
- if (iter->fmt == static_fmt_buf)
- goto print;
-
- while (*p) {
- bool star = false;
- int len = 0;
-
- j = 0;
-
- /*
- * We only care about %s and variants
- * as well as %p[sS] if delta is non-zero
- */
- for (i = 0; p[i]; i++) {
- if (i + 1 >= iter->fmt_size) {
- /*
- * If we can't expand the copy buffer,
- * just print it.
- */
- if (!trace_iter_expand_format(iter))
- goto print;
- }
-
- if (p[i] == '\\' && p[i+1]) {
- i++;
- continue;
- }
- if (p[i] == '%') {
- /* Need to test cases like %08.*s */
- for (j = 1; p[i+j]; j++) {
- if (isdigit(p[i+j]) ||
- p[i+j] == '.')
- continue;
- if (p[i+j] == '*') {
- star = true;
- continue;
- }
- break;
- }
- if (p[i+j] == 's')
- break;
-
- if (text_delta && p[i+1] == 'p' &&
- ((p[i+2] == 's' || p[i+2] == 'S')))
- break;
-
- star = false;
- }
- j = 0;
- }
- /* If no %s found then just print normally */
- if (!p[i])
- break;
-
- /* Copy up to the %s, and print that */
- strncpy(iter->fmt, p, i);
- iter->fmt[i] = '\0';
- trace_seq_vprintf(&iter->seq, iter->fmt, ap);
+ event = container_of(trace_event, struct trace_event_call, event);
+ if (!(event->flags & TRACE_EVENT_FL_TEST_STR))
+ return false;
- /* Add delta to %pS pointers */
- if (p[i+1] == 'p') {
- unsigned long addr;
- char fmt[4];
+ head = trace_get_fields(event);
+ if (!head) {
+ trace_seq_printf(seq, "FIELDS FOR EVENT '%s' NOT FOUND?\n",
+ trace_event_name(event));
+ return true;
+ }
- fmt[0] = '%';
- fmt[1] = 'p';
- fmt[2] = p[i+2]; /* Either %ps or %pS */
- fmt[3] = '\0';
+ /* Offsets are from the iter->ent that points to the raw event */
+ ptr = iter->ent;
- addr = va_arg(ap, unsigned long);
- addr += text_delta;
- trace_seq_printf(&iter->seq, fmt, (void *)addr);
+ list_for_each_entry(field, head, link) {
+ const char *str;
+ bool good;
- p += i + 3;
+ if (!field->needs_test)
continue;
- }
- /*
- * If iter->seq is full, the above call no longer guarantees
- * that ap is in sync with fmt processing, and further calls
- * to va_arg() can return wrong positional arguments.
- *
- * Ensure that ap is no longer used in this case.
- */
- if (iter->seq.full) {
- p = "";
- break;
- }
-
- if (star)
- len = va_arg(ap, int);
-
- /* The ap now points to the string data of the %s */
- str = va_arg(ap, const char *);
+ str = *(const char **)(ptr + field->offset);
- good = trace_safe_str(iter, str, star, len);
-
- /* Could be from the last boot */
- if (data_delta && !good) {
- str += data_delta;
- good = trace_safe_str(iter, str, star, len);
- }
+ good = trace_safe_str(iter, str);
/*
* If you hit this warning, it is likely that the
@@ -3846,44 +3729,14 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
* instead. See samples/trace_events/trace-events-sample.h
* for reference.
*/
- if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
- fmt, seq_buf_str(&iter->seq.seq))) {
- int ret;
-
- /* Try to safely read the string */
- if (star) {
- if (len + 1 > iter->fmt_size)
- len = iter->fmt_size - 1;
- if (len < 0)
- len = 0;
- ret = copy_from_kernel_nofault(iter->fmt, str, len);
- iter->fmt[len] = 0;
- star = false;
- } else {
- ret = strncpy_from_kernel_nofault(iter->fmt, str,
- iter->fmt_size);
- }
- if (ret < 0)
- trace_seq_printf(&iter->seq, "(0x%px)", str);
- else
- trace_seq_printf(&iter->seq, "(0x%px:%s)",
- str, iter->fmt);
- str = "[UNSAFE-MEMORY]";
- strcpy(iter->fmt, "%s");
- } else {
- strncpy(iter->fmt, p + i, j + 1);
- iter->fmt[j+1] = '\0';
+ if (WARN_ONCE(!good, "event '%s' has unsafe pointer field '%s'",
+ trace_event_name(event), field->name)) {
+ trace_seq_printf(seq, "EVENT %s: HAS UNSAFE POINTER FIELD '%s'\n",
+ trace_event_name(event), field->name);
+ return true;
}
- if (star)
- trace_seq_printf(&iter->seq, iter->fmt, len, str);
- else
- trace_seq_printf(&iter->seq, iter->fmt, str);
-
- p += i + j + 1;
}
- print:
- if (*p)
- trace_seq_vprintf(&iter->seq, p, ap);
+ return false;
}
const char *trace_event_format(struct trace_iterator *iter, const char *fmt)
@@ -10777,8 +10630,6 @@ __init static int tracer_alloc_buffers(void)
register_snapshot_cmd();
- test_can_verify();
-
return 0;
out_free_pipe_cpumask:
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 266740b4e121..9691b47b5f3d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -667,9 +667,8 @@ void trace_buffer_unlock_commit_nostack(struct trace_buffer *buffer,
bool trace_is_tracepoint_string(const char *str);
const char *trace_event_format(struct trace_iterator *iter, const char *fmt);
-void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
- va_list ap) __printf(2, 0);
char *trace_iter_expand_format(struct trace_iterator *iter);
+bool ignore_event(struct trace_iterator *iter);
int trace_empty(struct trace_iterator *iter);
@@ -1413,7 +1412,8 @@ struct ftrace_event_field {
int filter_type;
int offset;
int size;
- int is_signed;
+ unsigned int is_signed:1;
+ unsigned int needs_test:1;
int len;
};
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 521ad2fd1fe7..1545cc8b49d0 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -82,7 +82,7 @@ static int system_refcount_dec(struct event_subsystem *system)
}
static struct ftrace_event_field *
-__find_event_field(struct list_head *head, char *name)
+__find_event_field(struct list_head *head, const char *name)
{
struct ftrace_event_field *field;
@@ -114,7 +114,8 @@ trace_find_event_field(struct trace_event_call *call, char *name)
static int __trace_define_field(struct list_head *head, const char *type,
const char *name, int offset, int size,
- int is_signed, int filter_type, int len)
+ int is_signed, int filter_type, int len,
+ int need_test)
{
struct ftrace_event_field *field;
@@ -133,6 +134,7 @@ static int __trace_define_field(struct list_head *head, const char *type,
field->offset = offset;
field->size = size;
field->is_signed = is_signed;
+ field->needs_test = need_test;
field->len = len;
list_add(&field->link, head);
@@ -151,13 +153,13 @@ int trace_define_field(struct trace_event_call *call, const char *type,
head = trace_get_fields(call);
return __trace_define_field(head, type, name, offset, size,
- is_signed, filter_type, 0);
+ is_signed, filter_type, 0, 0);
}
EXPORT_SYMBOL_GPL(trace_define_field);
static int trace_define_field_ext(struct trace_event_call *call, const char *type,
const char *name, int offset, int size, int is_signed,
- int filter_type, int len)
+ int filter_type, int len, int need_test)
{
struct list_head *head;
@@ -166,13 +168,13 @@ static int trace_define_field_ext(struct trace_event_call *call, const char *typ
head = trace_get_fields(call);
return __trace_define_field(head, type, name, offset, size,
- is_signed, filter_type, len);
+ is_signed, filter_type, len, need_test);
}
#define __generic_field(type, item, filter_type) \
ret = __trace_define_field(&ftrace_generic_fields, #type, \
#item, 0, 0, is_signed_type(type), \
- filter_type, 0); \
+ filter_type, 0, 0); \
if (ret) \
return ret;
@@ -181,7 +183,8 @@ static int trace_define_field_ext(struct trace_event_call *call, const char *typ
"common_" #item, \
offsetof(typeof(ent), item), \
sizeof(ent.item), \
- is_signed_type(type), FILTER_OTHER, 0); \
+ is_signed_type(type), FILTER_OTHER, \
+ 0, 0); \
if (ret) \
return ret;
@@ -332,6 +335,7 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
/* Return true if the string is safe */
static bool process_string(const char *fmt, int len, struct trace_event_call *call)
{
+ struct trace_event_fields *field;
const char *r, *e, *s;
e = fmt + len;
@@ -372,8 +376,16 @@ static bool process_string(const char *fmt, int len, struct trace_event_call *ca
if (process_pointer(fmt, len, call))
return true;
- /* Make sure the field is found, and consider it OK for now if it is */
- return find_event_field(fmt, call) != NULL;
+ /* Make sure the field is found */
+ field = find_event_field(fmt, call);
+ if (!field)
+ return false;
+
+ /* Test this field's string before printing the event */
+ call->flags |= TRACE_EVENT_FL_TEST_STR;
+ field->needs_test = 1;
+
+ return true;
}
/*
@@ -2586,7 +2598,7 @@ event_define_fields(struct trace_event_call *call)
ret = trace_define_field_ext(call, field->type, field->name,
offset, field->size,
field->is_signed, field->filter_type,
- field->len);
+ field->len, field->needs_test);
if (WARN_ON_ONCE(ret)) {
pr_err("error code is %d\n", ret);
break;
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index da748b7cbc4d..03d56f711ad1 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -317,10 +317,14 @@ EXPORT_SYMBOL(trace_raw_output_prep);
void trace_event_printf(struct trace_iterator *iter, const char *fmt, ...)
{
+ struct trace_seq *s = &iter->seq;
va_list ap;
+ if (ignore_event(iter))
+ return;
+
va_start(ap, fmt);
- trace_check_vprintf(iter, trace_event_format(iter, fmt), ap);
+ trace_seq_vprintf(s, trace_event_format(iter, fmt), ap);
va_end(ap);
}
EXPORT_SYMBOL(trace_event_printf);
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The test_event_printk() code makes sure that when a trace event is
registered, any pointers dereferenced from the event's TP_printk() point
to content in the ring buffer. But currently it does not handle "%s", as
there are cases where the string pointer saved in the ring buffer points
to a static string in the kernel that will never be freed. As that is a
valid case, the pointer needs to be checked at runtime.
Currently the runtime check is done via trace_check_vprintf(), but to not
have to replicate everything in vsnprintf() it does some logic with the
va_list that may not be reliable across architectures. In order to get rid
of that logic, more work needs to be done in test_event_printk(). Some of
the strings can be validated at boot, when it is obvious the string is
valid because its contents will be saved in the ring buffer.
Do all the validation of strings in the ring buffer at boot in
test_event_printk(), and make sure that the fields of the strings that
point into the kernel are accessible. This will allow adding checks at
runtime that validate the fields themselves and do not rely on parsing
the TP_printk() format at runtime.
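As an illustration, a hypothetical event (loosely modeled on
samples/trace_events/trace-events-sample.h; the names here are made up)
showing the two kinds of "%s" sources the check has to distinguish:
__string() data is copied into the ring buffer and can be validated at
registration time, while a plain pointer field can only be tested when
the event is printed.

TRACE_EVENT(foo_example,
	TP_PROTO(const char *name, const char *label),
	TP_ARGS(name, label),
	TP_STRUCT__entry(
		__string(name, name)		/* copied into the event */
		__field(const char *, label)	/* only the pointer is saved */
	),
	TP_fast_assign(
		__assign_str(name);
		__entry->label = label;
	),
	/* __get_str(name) is always safe; __entry->label (REC->label in
	 * the print_fmt) is what the runtime check has to look at.
	 */
	TP_printk("name=%s label=%s", __get_str(name), __entry->label)
);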
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Al Viro <viro(a)ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.685917008@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 104 ++++++++++++++++++++++++++++++------
1 file changed, 89 insertions(+), 15 deletions(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index df75c06bb23f..521ad2fd1fe7 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -244,19 +244,16 @@ int trace_event_get_offsets(struct trace_event_call *call)
return tail->offset + tail->size;
}
-/*
- * Check if the referenced field is an array and return true,
- * as arrays are OK to dereference.
- */
-static bool test_field(const char *fmt, struct trace_event_call *call)
+
+static struct trace_event_fields *find_event_field(const char *fmt,
+ struct trace_event_call *call)
{
struct trace_event_fields *field = call->class->fields_array;
- const char *array_descriptor;
const char *p = fmt;
int len;
if (!(len = str_has_prefix(fmt, "REC->")))
- return false;
+ return NULL;
fmt += len;
for (p = fmt; *p; p++) {
if (!isalnum(*p) && *p != '_')
@@ -267,11 +264,26 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
for (; field->type; field++) {
if (strncmp(field->name, fmt, len) || field->name[len])
continue;
- array_descriptor = strchr(field->type, '[');
- /* This is an array and is OK to dereference. */
- return array_descriptor != NULL;
+
+ return field;
}
- return false;
+ return NULL;
+}
+
+/*
+ * Check if the referenced field is an array and return true,
+ * as arrays are OK to dereference.
+ */
+static bool test_field(const char *fmt, struct trace_event_call *call)
+{
+ struct trace_event_fields *field;
+
+ field = find_event_field(fmt, call);
+ if (!field)
+ return false;
+
+ /* This is an array and is OK to dereference. */
+ return strchr(field->type, '[') != NULL;
}
/* Look for a string within an argument */
@@ -317,6 +329,53 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
return false;
}
+/* Return true if the string is safe */
+static bool process_string(const char *fmt, int len, struct trace_event_call *call)
+{
+ const char *r, *e, *s;
+
+ e = fmt + len;
+
+ /*
+ * There are several helper functions that return strings.
+ * If the argument contains a function, then assume its field is valid.
+ * It is considered that the argument has a function if it has:
+ * alphanumeric or '_' before a parenthesis.
+ */
+ s = fmt;
+ do {
+ r = strstr(s, "(");
+ if (!r || r >= e)
+ break;
+ for (int i = 1; r - i >= s; i++) {
+ char ch = *(r - i);
+ if (isspace(ch))
+ continue;
+ if (isalnum(ch) || ch == '_')
+ return true;
+ /* Anything else, this isn't a function */
+ break;
+ }
+ /* A function could be wrapped in parethesis, try the next one */
+ s = r + 1;
+ } while (s < e);
+
+ /*
+ * If there's any strings in the argument consider this arg OK as it
+ * could be: REC->field ? "foo" : "bar" and we don't want to get into
+ * verifying that logic here.
+ */
+ if (find_print_string(fmt, "\"", e))
+ return true;
+
+ /* Dereferenced strings are also valid like any other pointer */
+ if (process_pointer(fmt, len, call))
+ return true;
+
+ /* Make sure the field is found, and consider it OK for now if it is */
+ return find_event_field(fmt, call) != NULL;
+}
+
/*
* Examine the print fmt of the event looking for unsafe dereference
* pointers using %p* that could be recorded in the trace event and
@@ -326,6 +385,7 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
static void test_event_printk(struct trace_event_call *call)
{
u64 dereference_flags = 0;
+ u64 string_flags = 0;
bool first = true;
const char *fmt;
int parens = 0;
@@ -416,8 +476,16 @@ static void test_event_printk(struct trace_event_call *call)
star = true;
continue;
}
- if ((fmt[i + j] == 's') && star)
- arg++;
+ if ((fmt[i + j] == 's')) {
+ if (star)
+ arg++;
+ if (WARN_ONCE(arg == 63,
+ "Too many args for event: %s",
+ trace_event_name(call)))
+ return;
+ dereference_flags |= 1ULL << arg;
+ string_flags |= 1ULL << arg;
+ }
break;
}
break;
@@ -464,7 +532,10 @@ static void test_event_printk(struct trace_event_call *call)
}
if (dereference_flags & (1ULL << arg)) {
- if (process_pointer(fmt + start_arg, e - start_arg, call))
+ if (string_flags & (1ULL << arg)) {
+ if (process_string(fmt + start_arg, e - start_arg, call))
+ dereference_flags &= ~(1ULL << arg);
+ } else if (process_pointer(fmt + start_arg, e - start_arg, call))
dereference_flags &= ~(1ULL << arg);
}
@@ -476,7 +547,10 @@ static void test_event_printk(struct trace_event_call *call)
}
if (dereference_flags & (1ULL << arg)) {
- if (process_pointer(fmt + start_arg, i - start_arg, call))
+ if (string_flags & (1ULL << arg)) {
+ if (process_string(fmt + start_arg, i - start_arg, call))
+ dereference_flags &= ~(1ULL << arg);
+ } else if (process_pointer(fmt + start_arg, i - start_arg, call))
dereference_flags &= ~(1ULL << arg);
}
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The process_pointer() helper function looks to see if various trace event
macros are used. These macros are for storing data in the event. This
makes the dereference safe, as the pointer will then point into the event
on the ring buffer, where the data stays with the event itself.
A few helper functions were missing. Those were:
__get_rel_dynamic_array()
__get_dynamic_array_len()
__get_rel_dynamic_array_len()
__get_rel_sockaddr()
Also add a helper function find_print_string() so that a middle-man
variable is no longer needed to test whether the string exists.
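For reference, a hypothetical print_fmt argument of the shape these
helpers cover (the "buf" field is made up): both the length and the data
come from storage inside the event itself, so process_pointer() can mark
the dereference safe once it also recognizes __get_dynamic_array_len().

	TP_printk("data=%*ph",
		  (int)__get_dynamic_array_len(buf),
		  __get_dynamic_array(buf))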
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Al Viro <viro(a)ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.521836792@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 14e160a5b905..df75c06bb23f 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -274,6 +274,15 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
return false;
}
+/* Look for a string within an argument */
+static bool find_print_string(const char *arg, const char *str, const char *end)
+{
+ const char *r;
+
+ r = strstr(arg, str);
+ return r && r < end;
+}
+
/* Return true if the argument pointer is safe */
static bool process_pointer(const char *fmt, int len, struct trace_event_call *call)
{
@@ -292,9 +301,17 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
a = strchr(fmt, '&');
if ((a && (a < r)) || test_field(r, call))
return true;
- } else if ((r = strstr(fmt, "__get_dynamic_array(")) && r < e) {
+ } else if (find_print_string(fmt, "__get_dynamic_array(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_rel_dynamic_array(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_dynamic_array_len(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_rel_dynamic_array_len(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_sockaddr(", e)) {
return true;
- } else if ((r = strstr(fmt, "__get_sockaddr(")) && r < e) {
+ } else if (find_print_string(fmt, "__get_rel_sockaddr(", e)) {
return true;
}
return false;
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
test_event_printk() analyzes the print formats of trace events looking for
cases where they may dereference a pointer that is not in the ring buffer,
which can be a bug when the trace event is read from the ring buffer and
the content behind that pointer no longer exists.
The function needs to accurately go from one print format argument to the
next. It handles quotes and parentheses that may be included in an
argument. But when it finds the start of the next argument, it uses a
simple "c = strchr(fmt + i, ',')" to find the end of that argument.
In order to include "%s" dereferencing, it needs to process the entire
content of the print format argument and not stop at the first ',' it
finds, as there may be content like:
({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char
*access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux"
}; union kvm_mmu_page_role role; role.word = REC->role;
trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe
%sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level,
role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "",
access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? ""
: "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ?
"unsync" : "sync", 0); saved_ptr; })
Which is an example of a full argument of an existing event. As the code
already handles finding the next print format argument, process the
argument when its end is found rather than when its start is found. This
way it has both the start of the argument as well as the end of it.
Add a helper function "process_pointer()" that does the processing both
inside the loop and after it, for the final argument. It also makes the
code cleaner and easier to read.
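A smaller, hypothetical print_fmt argument (example_fmt() is made up)
shows the same problem in miniature:

	"data=%s", example_fmt(16, REC->buf)

The ',' after "16" sits inside parentheses and does not end the argument,
so a scan bounded by the first ',' stops at "example_fmt(16" and never
reaches REC->buf; delimiting the argument by its real end hands the whole
expression to process_pointer().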
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Al Viro <viro(a)ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.362271189@goodmis.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 82 ++++++++++++++++++++++++-------------
1 file changed, 53 insertions(+), 29 deletions(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 77e68efbd43e..14e160a5b905 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -265,8 +265,7 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
len = p - fmt;
for (; field->type; field++) {
- if (strncmp(field->name, fmt, len) ||
- field->name[len])
+ if (strncmp(field->name, fmt, len) || field->name[len])
continue;
array_descriptor = strchr(field->type, '[');
/* This is an array and is OK to dereference. */
@@ -275,6 +274,32 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
return false;
}
+/* Return true if the argument pointer is safe */
+static bool process_pointer(const char *fmt, int len, struct trace_event_call *call)
+{
+ const char *r, *e, *a;
+
+ e = fmt + len;
+
+ /* Find the REC-> in the argument */
+ r = strstr(fmt, "REC->");
+ if (r && r < e) {
+ /*
+ * Addresses of events on the buffer, or an array on the buffer is
+ * OK to dereference. There's ways to fool this, but
+ * this is to catch common mistakes, not malicious code.
+ */
+ a = strchr(fmt, '&');
+ if ((a && (a < r)) || test_field(r, call))
+ return true;
+ } else if ((r = strstr(fmt, "__get_dynamic_array(")) && r < e) {
+ return true;
+ } else if ((r = strstr(fmt, "__get_sockaddr(")) && r < e) {
+ return true;
+ }
+ return false;
+}
+
/*
* Examine the print fmt of the event looking for unsafe dereference
* pointers using %p* that could be recorded in the trace event and
@@ -285,12 +310,12 @@ static void test_event_printk(struct trace_event_call *call)
{
u64 dereference_flags = 0;
bool first = true;
- const char *fmt, *c, *r, *a;
+ const char *fmt;
int parens = 0;
char in_quote = 0;
int start_arg = 0;
int arg = 0;
- int i;
+ int i, e;
fmt = call->print_fmt;
@@ -403,42 +428,41 @@ static void test_event_printk(struct trace_event_call *call)
case ',':
if (in_quote || parens)
continue;
+ e = i;
i++;
while (isspace(fmt[i]))
i++;
- start_arg = i;
- if (!(dereference_flags & (1ULL << arg)))
- goto next_arg;
- /* Find the REC-> in the argument */
- c = strchr(fmt + i, ',');
- r = strstr(fmt + i, "REC->");
- if (r && (!c || r < c)) {
- /*
- * Addresses of events on the buffer,
- * or an array on the buffer is
- * OK to dereference.
- * There's ways to fool this, but
- * this is to catch common mistakes,
- * not malicious code.
- */
- a = strchr(fmt + i, '&');
- if ((a && (a < r)) || test_field(r, call))
+ /*
+ * If start_arg is zero, then this is the start of the
+ * first argument. The processing of the argument happens
+ * when the end of the argument is found, as it needs to
+ * handle paranthesis and such.
+ */
+ if (!start_arg) {
+ start_arg = i;
+ /* Balance out the i++ in the for loop */
+ i--;
+ continue;
+ }
+
+ if (dereference_flags & (1ULL << arg)) {
+ if (process_pointer(fmt + start_arg, e - start_arg, call))
dereference_flags &= ~(1ULL << arg);
- } else if ((r = strstr(fmt + i, "__get_dynamic_array(")) &&
- (!c || r < c)) {
- dereference_flags &= ~(1ULL << arg);
- } else if ((r = strstr(fmt + i, "__get_sockaddr(")) &&
- (!c || r < c)) {
- dereference_flags &= ~(1ULL << arg);
}
- next_arg:
- i--;
+ start_arg = i;
arg++;
+ /* Balance out the i++ in the for loop */
+ i--;
}
}
+ if (dereference_flags & (1ULL << arg)) {
+ if (process_pointer(fmt + start_arg, i - start_arg, call))
+ dereference_flags &= ~(1ULL << arg);
+ }
+
/*
* If you triggered the below warning, the trace event reported
* uses an unsafe dereference pointer %p*. As the data stored
--
2.45.2
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 2828e5808bcd5aae7fdcd169cac1efa2701fa2dd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121518-unspoken-ladle-1d6a@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2828e5808bcd5aae7fdcd169cac1efa2701fa2dd Mon Sep 17 00:00:00 2001
From: Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
Date: Wed, 27 Nov 2024 20:10:42 +0000
Subject: [PATCH] drm/i915: Fix memory leak by correcting cache object name in
error handler
Replace "slab_priorities" with "slab_dependencies" in the error handler
to avoid a memory leak.
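A minimal sketch of the unwind pattern involved (illustrative names, not
the driver's actual code): each failure label must destroy the caches
created before it, so when the second cache fails to allocate it is the
first cache that has to be destroyed.

static struct kmem_cache *example_dependencies;
static struct kmem_cache *example_priorities;

static int __init example_module_init(void)
{
	example_dependencies = kmem_cache_create("example_deps", 64, 0, 0, NULL);
	if (!example_dependencies)
		return -ENOMEM;

	example_priorities = kmem_cache_create("example_prios", 32, 0, 0, NULL);
	if (!example_priorities)
		goto err_priorities;

	return 0;

err_priorities:
	/* destroying the cache that failed to allocate would leak the other */
	kmem_cache_destroy(example_dependencies);
	return -ENOMEM;
}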
Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
Cc: <stable(a)vger.kernel.org> # v5.2+
Signed-off-by: Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
Reviewed-by: Nirmoy Das <nirmoy.das(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241127201042.29620-1-jiashe…
(cherry picked from commit 9bc5e7dc694d3112bbf0fa4c46ef0fa0f114937a)
Signed-off-by: Tvrtko Ursulin <tursulin(a)ursulin.net>
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 762127dd56c5..70a854557e6e 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -506,6 +506,6 @@ int __init i915_scheduler_module_init(void)
return 0;
err_priorities:
- kmem_cache_destroy(slab_priorities);
+ kmem_cache_destroy(slab_dependencies);
return -ENOMEM;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 2828e5808bcd5aae7fdcd169cac1efa2701fa2dd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121517-deserve-wharf-c2d0@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2828e5808bcd5aae7fdcd169cac1efa2701fa2dd Mon Sep 17 00:00:00 2001
From: Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
Date: Wed, 27 Nov 2024 20:10:42 +0000
Subject: [PATCH] drm/i915: Fix memory leak by correcting cache object name in
error handler
Replace "slab_priorities" with "slab_dependencies" in the error handler
to avoid a memory leak.
Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global")
Cc: <stable(a)vger.kernel.org> # v5.2+
Signed-off-by: Jiasheng Jiang <jiashengjiangcool(a)outlook.com>
Reviewed-by: Nirmoy Das <nirmoy.das(a)intel.com>
Reviewed-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti(a)linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241127201042.29620-1-jiashe…
(cherry picked from commit 9bc5e7dc694d3112bbf0fa4c46ef0fa0f114937a)
Signed-off-by: Tvrtko Ursulin <tursulin(a)ursulin.net>
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index 762127dd56c5..70a854557e6e 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -506,6 +506,6 @@ int __init i915_scheduler_module_init(void)
return 0;
err_priorities:
- kmem_cache_destroy(slab_priorities);
+ kmem_cache_destroy(slab_dependencies);
return -ENOMEM;
}
The patch titled
Subject: fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
fs-proc-task_mmu-fix-pagemap-flags-with-pmd-thp-entries-on-32bit.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: David Hildenbrand <david(a)redhat.com>
Subject: fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
Date: Tue, 17 Dec 2024 20:50:00 +0100
Entries (including flags) are u64, even on 32bit. So right now we are
cutting off the flags on 32bit. This way, for example, the cow selftest
complains about:
# ./cow
...
Bail Out! read and ioctl return unmatched results for populated: 0 1
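A self-contained illustration of the truncation (PM_MMAP_EXCLUSIVE is
assumed to be bit 56, as documented for pagemap; build with -m32 to see
the effect):

#include <stdint.h>
#include <stdio.h>

#define PM_MMAP_EXCLUSIVE	(1ULL << 56)	/* assumed bit position */

int main(void)
{
	uint64_t flags = PM_MMAP_EXCLUSIVE;
	unsigned long truncated = flags;	/* 32-bit: upper bits are lost */
	uint64_t cur_flags = flags;		/* what the fix uses */

	printf("truncated: %d, u64: %d\n",
	       !!(truncated & PM_MMAP_EXCLUSIVE),
	       !!(cur_flags & PM_MMAP_EXCLUSIVE));
	return 0;
}

On a 32-bit build this prints "truncated: 0, u64: 1", which is the kind of
mismatch the cow selftest trips over.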
Link: https://lkml.kernel.org/r/20241217195000.1734039-1-david@redhat.com
Fixes: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/proc/task_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/proc/task_mmu.c~fs-proc-task_mmu-fix-pagemap-flags-with-pmd-thp-entries-on-32bit
+++ a/fs/proc/task_mmu.c
@@ -1810,7 +1810,7 @@ static int pagemap_pmd_range(pmd_t *pmdp
}
for (; addr != end; addr += PAGE_SIZE, idx++) {
- unsigned long cur_flags = flags;
+ u64 cur_flags = flags;
pagemap_entry_t pme;
if (folio && (flags & PM_PRESENT) &&
_
Patches currently in -mm which might be from david(a)redhat.com are
mm-page_alloc-dont-call-pfn_to_page-on-possibly-non-existent-pfn-in-split_large_buddy.patch
fs-proc-task_mmu-fix-pagemap-flags-with-pmd-thp-entries-on-32bit.patch
docs-tmpfs-update-the-large-folios-policy-for-tmpfs-and-shmem.patch
mm-memory_hotplug-move-debug_pagealloc_map_pages-into-online_pages_range.patch
mm-page_isolation-dont-pass-gfp-flags-to-isolate_single_pageblock.patch
mm-page_isolation-dont-pass-gfp-flags-to-start_isolate_page_range.patch
mm-page_alloc-make-__alloc_contig_migrate_range-static.patch
mm-page_alloc-sort-out-the-alloc_contig_range-gfp-flags-mess.patch
mm-page_alloc-forward-the-gfp-flags-from-alloc_contig_range-to-post_alloc_hook.patch
powernv-memtrace-use-__gfp_zero-with-alloc_contig_pages.patch
mm-hugetlb-dont-map-folios-writable-without-vm_write-when-copying-during-fork.patch
fs-proc-vmcore-convert-vmcore_cb_lock-into-vmcore_mutex.patch
fs-proc-vmcore-replace-vmcoredd_mutex-by-vmcore_mutex.patch
fs-proc-vmcore-disallow-vmcore-modifications-while-the-vmcore-is-open.patch
fs-proc-vmcore-prefix-all-pr_-with-vmcore.patch
fs-proc-vmcore-move-vmcore-definitions-out-of-kcoreh.patch
fs-proc-vmcore-factor-out-allocating-a-vmcore-range-and-adding-it-to-a-list.patch
fs-proc-vmcore-factor-out-freeing-a-list-of-vmcore-ranges.patch
fs-proc-vmcore-introduce-proc_vmcore_device_ram-to-detect-device-ram-ranges-in-2nd-kernel.patch
virtio-mem-mark-device-ready-before-registering-callbacks-in-kdump-mode.patch
virtio-mem-remember-usable-region-size.patch
virtio-mem-support-config_proc_vmcore_device_ram.patch
s390-kdump-virtio-mem-kdump-support-config_proc_vmcore_device_ram.patch
mm-page_alloc-dont-use-__gfp_hardwall-when-migrating-pages-via-alloc_contig.patch
mm-memory_hotplug-dont-use-__gfp_hardwall-when-migrating-pages-via-memory-offlining.patch
Commit 973d1607d936 ("clk: mediatek: mt2701: use mtk_clk_simple_probe to
simplify driver") broke DT bindings as the highest index was reduced by
1 because the id count starts from 1 and not from 0.
Fix this, like for other drivers which had the same issue, by adding a
dummy clk at index 0.
Fixes: 973d1607d936 ("clk: mediatek: mt2701: use mtk_clk_simple_probe to simplify driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Daniel Golle <daniel(a)makrotopia.org>
---
drivers/clk/mediatek/clk-mt2701-vdec.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/clk/mediatek/clk-mt2701-vdec.c b/drivers/clk/mediatek/clk-mt2701-vdec.c
index 94db86f8d0a4..5299d92f3aba 100644
--- a/drivers/clk/mediatek/clk-mt2701-vdec.c
+++ b/drivers/clk/mediatek/clk-mt2701-vdec.c
@@ -31,6 +31,7 @@ static const struct mtk_gate_regs vdec1_cg_regs = {
GATE_MTK(_id, _name, _parent, &vdec1_cg_regs, _shift, &mtk_clk_gate_ops_setclr_inv)
static const struct mtk_gate vdec_clks[] = {
+ GATE_DUMMY(CLK_DUMMY, "vdec_dummy"),
GATE_VDEC0(CLK_VDEC_CKGEN, "vdec_cken", "vdec_sel", 0),
GATE_VDEC1(CLK_VDEC_LARB, "vdec_larb_cken", "mm_sel", 0),
};
--
2.47.1
Entries (including flags) are u64, even on 32bit. So right now we are
cutting off the flags on 32bit. This way, for example, the cow selftest
complains about:
# ./cow
...
Bail Out! read and ioctl return unmatched results for populated: 0 1
Fixes: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs")
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
fs/proc/task_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38a5a3e9cba20..f02cd362309a0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1810,7 +1810,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
}
for (; addr != end; addr += PAGE_SIZE, idx++) {
- unsigned long cur_flags = flags;
+ u64 cur_flags = flags;
pagemap_entry_t pme;
if (folio && (flags & PM_PRESENT) &&
--
2.47.1
From: Ajit Khaparde <ajit.khaparde(a)broadcom.com>
[ Upstream commit 524e057b2d66b61f9b63b6db30467ab7b0bb4796 ]
The Broadcom BCM5760X NIC may be a multi-function device.
While it does not advertise an ACS capability, peer-to-peer transactions
are not possible between the individual functions. So it is ok to treat
them as fully isolated.
Add an ACS quirk for this device so the functions can be in independent
IOMMU groups and attached individually to userspace applications using
VFIO.
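For context, a rough sketch of how such a quirk takes effect (the lookup
shape is an assumption modeled on the pci_dev_acs_enabled[] table touched
below, not the exact kernel code): the device-specific ACS check walks a
vendor/device table and lets the per-device hook report the requested ACS
flags as effectively enabled even though the function advertises no ACS
capability.

static const struct example_acs_quirk {
	u16 vendor;
	u16 device;
	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
} example_acs_quirks[] = {
	{ PCI_VENDOR_ID_BROADCOM, 0x1760, pci_quirk_mf_endpoint_acs },
	{ 0 }
};

static int example_dev_specific_acs_enabled(struct pci_dev *dev, u16 acs_flags)
{
	const struct example_acs_quirk *q;

	for (q = example_acs_quirks; q->acs_enabled; q++) {
		if (q->vendor == dev->vendor && q->device == dev->device)
			return q->acs_enabled(dev, acs_flags);
	}
	return -ENOTTY;	/* no quirk: rely on the real ACS capability */
}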
[kwilczynski: commit log]
Link: https://lore.kernel.org/linux-pci/20240510204228.73435-1-ajit.khaparde@broa…
Signed-off-by: Ajit Khaparde <ajit.khaparde(a)broadcom.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski(a)kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Reviewed-by: Andy Gospodarek <gospo(a)broadcom.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/pci/quirks.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4d4267105cd2b..842e8fecf0a9a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4971,6 +4971,10 @@ static const struct pci_dev_acs_enabled {
{ PCI_VENDOR_ID_BROADCOM, 0x1750, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_BROADCOM, 0x1751, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_BROADCOM, 0x1752, pci_quirk_mf_endpoint_acs },
+ { PCI_VENDOR_ID_BROADCOM, 0x1760, pci_quirk_mf_endpoint_acs },
+ { PCI_VENDOR_ID_BROADCOM, 0x1761, pci_quirk_mf_endpoint_acs },
+ { PCI_VENDOR_ID_BROADCOM, 0x1762, pci_quirk_mf_endpoint_acs },
+ { PCI_VENDOR_ID_BROADCOM, 0x1763, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_BROADCOM, 0xD714, pci_quirk_brcm_acs },
/* Amazon Annapurna Labs */
{ PCI_VENDOR_ID_AMAZON_ANNAPURNA_LABS, 0x0031, pci_quirk_al_acs },
--
2.43.0
The patch titled
Subject: maple_tree: fix mas_alloc_cyclic() second search
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area-fix.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)Oracle.com>
Subject: maple_tree: fix mas_alloc_cyclic() second search
Date: Mon, 16 Dec 2024 14:01:12 -0500
The first search may leave the maple state in an error state. Reset the
maple state before the second search so that the search has a chance of
executing correctly after an exhausted first search.
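For reference, a hedged usage sketch of the cyclic allocator being fixed
(the mtree_alloc_cyclic() signature is assumed from the maple tree API;
the names here are illustrative). A failed search above *next is what
triggers the second mas_empty_area() call that needs the reset.

#include <linux/maple_tree.h>

static DEFINE_MTREE(example_ids);
static unsigned long example_next = 1;

/* Allocate an id in [1, 1023], continuing after the last one handed out
 * and wrapping back to 1 once the upper part of the range is exhausted.
 */
static int example_alloc_id(void *item, unsigned long *id)
{
	return mtree_alloc_cyclic(&example_ids, id, item, 1, 1023,
				  &example_next, GFP_KERNEL);
}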
Link: https://lore.kernel.org/all/20241216060600.287B4C4CED0@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20241216190113.1226145-2-Liam.Howlett@oracle.com
Fixes: 9b6713cc7522 ("maple_tree: Add mtree_alloc_cyclic()")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reviewed-by: Yang Erkun <yangerkun(a)huawei.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Chuck Lever <chuck.lever(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/lib/maple_tree.c~maple_tree-reload-mas-before-the-second-call-for-mas_empty_area-fix
+++ a/lib/maple_tree.c
@@ -4346,7 +4346,6 @@ int mas_alloc_cyclic(struct ma_state *ma
{
unsigned long min = range_lo;
int ret = 0;
- struct ma_state m = *mas;
range_lo = max(min, *next);
ret = mas_empty_area(mas, range_lo, range_hi, 1);
@@ -4355,7 +4354,7 @@ int mas_alloc_cyclic(struct ma_state *ma
ret = 1;
}
if (ret < 0 && range_lo > min) {
- *mas = m;
+ mas_reset(mas);
ret = mas_empty_area(mas, min, range_hi, 1);
if (ret == 0)
ret = 1;
_
Patches currently in -mm which might be from Liam.Howlett(a)Oracle.com are
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area-fix.patch
test_maple_tree-test-exhausted-upper-limit-of-mtree_alloc_cyclic.patch
The patch titled
Subject: maple_tree: fix mas_alloc_cyclic() second search
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
maple_tree-fix-mas_alloc_cyclic-second-search.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)Oracle.com>
Subject: maple_tree: fix mas_alloc_cyclic() second search
Date: Mon, 16 Dec 2024 14:01:12 -0500
The first search may leave the maple state in an error state. Reset the
maple state before the second search so that the search has a chance of
executing correctly after an exhausted first search.
Link: https://lore.kernel.org/all/20241216060600.287B4C4CED0@smtp.kernel.org/
Link: https://lkml.kernel.org/r/20241216190113.1226145-2-Liam.Howlett@oracle.com
Fixes: 9b6713cc7522 ("maple_tree: Add mtree_alloc_cyclic()")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Reviewed-by: Yang Erkun <yangerkun(a)huawei.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Chuck Lever <chuck.lever(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/lib/maple_tree.c~maple_tree-fix-mas_alloc_cyclic-second-search
+++ a/lib/maple_tree.c
@@ -4346,7 +4346,6 @@ int mas_alloc_cyclic(struct ma_state *ma
{
unsigned long min = range_lo;
int ret = 0;
- struct ma_state m = *mas;
range_lo = max(min, *next);
ret = mas_empty_area(mas, range_lo, range_hi, 1);
@@ -4355,7 +4354,7 @@ int mas_alloc_cyclic(struct ma_state *ma
ret = 1;
}
if (ret < 0 && range_lo > min) {
- *mas = m;
+ mas_reset(mas);
ret = mas_empty_area(mas, min, range_hi, 1);
if (ret == 0)
ret = 1;
_
Patches currently in -mm which might be from Liam.Howlett(a)Oracle.com are
maple_tree-fix-mas_alloc_cyclic-second-search.patch
test_maple_tree-test-exhausted-upper-limit-of-mtree_alloc_cyclic.patch
The quilt patch titled
Subject: mm, compaction: don't use ALLOC_CMA in long term GUP flow
has been removed from the -mm tree. Its filename was
mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: yangge <yangge1116(a)126.com>
Subject: mm, compaction: don't use ALLOC_CMA in long term GUP flow
Date: Mon, 16 Dec 2024 19:54:04 +0800
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags in
__compaction_suitable()") allowed compaction to proceed when the free pages
required for compaction reside in CMA pageblocks, it's possible that
__compaction_suitable() always returns true, and in some cases that is not
acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of
memory. I have configured 16GB of CMA memory on each NUMA node, and
starting a 32GB virtual machine with device passthrough is extremely slow,
taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory. Long
term GUP cannot allocate memory from the CMA area, so at most 16 GB of
non-CMA memory on a NUMA node can be used as virtual machine memory. Since
there is 16 GB of free CMA memory on the NUMA node, the order-0 watermark
for compaction is always met, so __compaction_suitable() always returns
true, even if the node is unable to allocate non-CMA memory for the
virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should remove ALLOC_CMA
from both __compaction_suitable() and __isolate_free_page() in the long
term GUP flow. After this fix, starting a 32GB virtual machine with device
passthrough takes only a few seconds.
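A condensed sketch of why ALLOC_CMA matters here (assuming the accounting
done by __zone_watermark_ok(); not the actual function): without
ALLOC_CMA, free CMA pages do not count toward the watermark, so a node
whose free memory is mostly CMA stops looking suitable for a long-term
GUP allocation and the allocator can fall back to another node quickly.

static bool example_watermark_ok(unsigned long free_pages,
				 unsigned long free_cma_pages,
				 unsigned long watermark,
				 bool alloc_cma)
{
	if (!alloc_cma) {
		/* CMA pageblocks cannot serve this request */
		if (free_cma_pages >= free_pages)
			return false;
		free_pages -= free_cma_pages;
	}

	return free_pages > watermark;
}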
Link: https://lkml.kernel.org/r/1734350044-12928-1-git-send-email-yangge1116@126.…
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Signed-off-by: yangge <yangge1116(a)126.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 20 +++++++++++---------
mm/internal.h | 3 ++-
mm/page_alloc.c | 6 ++++--
mm/page_isolation.c | 3 ++-
mm/page_reporting.c | 2 +-
mm/vmscan.c | 4 ++--
7 files changed, 26 insertions(+), 18 deletions(-)
--- a/include/linux/compaction.h~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compac
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suita
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
--- a/mm/compaction.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/compaction.c
@@ -654,7 +654,7 @@ static unsigned long isolate_freepages_b
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, cc->alloc_flags);
if (!isolated)
break;
set_page_private(page, order);
@@ -1633,7 +1633,7 @@ static void fast_isolate_freepages(struc
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, cc->alloc_flags)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -2379,6 +2379,7 @@ static enum compact_result compact_finis
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
@@ -2393,25 +2394,26 @@ static bool __compaction_suitable(struct
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ alloc_flags & ALLOC_CMA, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2472,7 +2474,7 @@ bool compaction_zonelist_suitable(struct
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2497,7 +2499,7 @@ compaction_suit_allocation_order(struct
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
--- a/mm/internal.h~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/internal.h
@@ -662,7 +662,8 @@ static inline void clear_zone_contiguous
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
--- a/mm/page_alloc.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_alloc.c
@@ -2808,7 +2808,8 @@ void split_page(struct page *page, unsig
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -2822,7 +2823,8 @@ int __isolate_free_page(struct page *pag
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0,
+ alloc_flags & ALLOC_CMA))
return 0;
}
--- a/mm/page_isolation.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_isolation.c
@@ -229,7 +229,8 @@ static void unset_migratetype_isolate(st
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order,
+ ALLOC_CMA);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
--- a/mm/page_reporting.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_reporting.c
@@ -198,7 +198,7 @@ page_reporting_cycle(struct page_reporti
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, ALLOC_CMA)) {
next = page;
break;
}
--- a/mm/vmscan.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/vmscan.c
@@ -5861,7 +5861,7 @@ static inline bool should_continue_recla
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6089,7 +6089,7 @@ static inline bool compaction_ready(stru
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
_
Patches currently in -mm which might be from yangge1116(a)126.com are
The patch titled
Subject: kcov: mark in_softirq_really() as __always_inline
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
kcov-mark-in_softirq_really-as-__always_inline.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Arnd Bergmann <arnd(a)arndb.de>
Subject: kcov: mark in_softirq_really() as __always_inline
Date: Tue, 17 Dec 2024 08:18:10 +0100
If gcc decides not to inline in_softirq_really(), objtool warns about a
function call with UACCESS enabled:
kernel/kcov.o: warning: objtool: __sanitizer_cov_trace_pc+0x1e: call to in_softirq_really() with UACCESS enabled
kernel/kcov.o: warning: objtool: check_kcov_mode+0x11: call to in_softirq_really() with UACCESS enabled
Mark this as __always_inline to avoid the problem.
Link: https://lkml.kernel.org/r/20241217071814.2261620-1-arnd@kernel.org
Fixes: 7d4df2dad312 ("kcov: properly check for softirq context")
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
Reviewed-by: Marco Elver <elver(a)google.com>
Cc: Aleksandr Nogikh <nogikh(a)google.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/kcov.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/kcov.c~kcov-mark-in_softirq_really-as-__always_inline
+++ a/kernel/kcov.c
@@ -166,7 +166,7 @@ static void kcov_remote_area_put(struct
* Unlike in_serving_softirq(), this function returns false when called during
* a hardirq or an NMI that happened in the softirq context.
*/
-static inline bool in_softirq_really(void)
+static __always_inline bool in_softirq_really(void)
{
return in_serving_softirq() && !in_hardirq() && !in_nmi();
}
_
Patches currently in -mm which might be from arnd(a)arndb.de are
kcov-mark-in_softirq_really-as-__always_inline.patch
The patch titled
Subject: mm/kmemleak: fix sleeping function called from invalid context at print message
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-kmemleak-fix-sleeping-function-called-from-invalid-context-at-print-message.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Alessandro Carminati <acarmina(a)redhat.com>
Subject: mm/kmemleak: fix sleeping function called from invalid context at print message
Date: Tue, 17 Dec 2024 14:20:33 +0000
Address a bug in the kernel that triggers a "sleeping function called from
invalid context" warning when /sys/kernel/debug/kmemleak is printed under
specific conditions:
- CONFIG_PREEMPT_RT=y
- Set SELinux as the LSM for the system
- Set kptr_restrict to 1
- kmemleak buffer contains at least one item
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat
preempt_count: 1, expected: 0
RCU nest depth: 2, expected: 2
6 locks held by cat/136:
#0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30
#1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128
#3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0
#4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0
#5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0
irq event stamp: 136660
hardirqs last enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8
hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0
softirqs last enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8
softirqs last disabled at (0): [<0000000000000000>] 0x0
Preemption disabled at:
[<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0
CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G E 6.11.0-rt7+ #34
Tainted: [E]=UNSIGNED_MODULE
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xa0/0x128
show_stack+0x1c/0x30
dump_stack_lvl+0xe8/0x198
dump_stack+0x18/0x20
rt_spin_lock+0x8c/0x1a8
avc_perm_nonode+0xa0/0x150
cred_has_capability.isra.0+0x118/0x218
selinux_capable+0x50/0x80
security_capable+0x7c/0xd0
has_ns_capability_noaudit+0x94/0x1b0
has_capability_noaudit+0x20/0x30
restricted_pointer+0x21c/0x4b0
pointer+0x298/0x760
vsnprintf+0x330/0xf70
seq_printf+0x178/0x218
print_unreferenced+0x1a4/0x2d0
kmemleak_seq_show+0xd0/0x1e0
seq_read_iter+0x354/0xe30
seq_read+0x250/0x378
full_proxy_read+0xd8/0x148
vfs_read+0x190/0x918
ksys_read+0xf0/0x1e0
__arm64_sys_read+0x70/0xa8
invoke_syscall.constprop.0+0xd4/0x1d8
el0_svc+0x50/0x158
el0t_64_sync+0x17c/0x180
Printing both %pS and %pK on the same backtrace line is redundant, and %pS
can defeat the purpose of %pK in certain contexts.
%pS alone already provides the necessary information, and if it cannot
resolve the symbol, it falls back to printing the raw address, voiding
the original intent behind %pK.
Additionally, %pK requires a CAP_SYSLOG privilege check enforced through
the LSM, which can trigger a "sleeping function called from invalid
context" warning under PREEMPT_RT kernels when the check occurs in an
atomic context. This issue may also affect other LSMs.
This change avoids the unnecessary privilege check and resolves the
sleeping function warning without any loss of information.
Link: https://lkml.kernel.org/r/20241217142032.55793-1-acarmina@redhat.com
Fixes: 3a6f33d86baa ("mm/kmemleak: use %pK to display kernel pointers in backtrace")
Signed-off-by: Alessandro Carminati <acarmina(a)redhat.com>
Cc: Clément Léger <clement.leger(a)bootlin.com>
Cc: Alessandro Carminati <acarmina(a)redhat.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Eric Chanudet <echanude(a)redhat.com>
Cc: Gabriele Paoloni <gpaoloni(a)redhat.com>
Cc: Juri Lelli <juri.lelli(a)redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/kmemleak.c~mm-kmemleak-fix-sleeping-function-called-from-invalid-context-at-print-message
+++ a/mm/kmemleak.c
@@ -373,7 +373,7 @@ static void print_unreferenced(struct se
for (i = 0; i < nr_entries; i++) {
void *ptr = (void *)entries[i];
- warn_or_seq_printf(seq, " [<%pK>] %pS\n", ptr, ptr);
+ warn_or_seq_printf(seq, " %pS\n", ptr);
}
}
_
Patches currently in -mm which might be from acarmina(a)redhat.com are
mm-kmemleak-fix-sleeping-function-called-from-invalid-context-at-print-message.patch
From: Steven Rostedt <rostedt(a)goodmis.org>
The TP_printk() of a TRACE_EVENT() is a generic printf format that any
developer can create for their event. It may include pointers to strings
and such. A boot mapped buffer may contain data from a previous kernel
where the strings addresses are different.
One solution is to copy the event content and update the pointers by the
recorded delta, but a simpler solution (for now) is to just use the
print_fields() function to print these events. The print_fields() function
just iterates the fields and prints them according to what type they are,
and ignores the TP_printk() format from the event itself.
To understand the difference, when printing via TP_printk() the output
looks like this:
4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache
But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields)
the same event output looks like this:
4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0)
4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache
The print_fields() needed one update to handle this, and that's to add the
delta to the pointer strings. It also needs to handle %pS, but that is out
of scope of this fix. Currently, it only prints the raw address.
Ftrace events like stack trace and function tracing have their own methods
to print and already can handle the deltas. Those event types are less
than __TRACE_LAST_TYPE. If the event type is greater than that, then the
print_fields() output is forced.
Cc: stable(a)vger.kernel.org
Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace.c | 9 +++++++++
kernel/trace/trace_output.c | 3 ++-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..6581cb2bc67f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4353,6 +4353,15 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
if (event) {
if (tr->trace_flags & TRACE_ITER_FIELDS)
return print_event_fields(iter, event);
+ /*
+ * For TRACE_EVENT() events, the print_fmt is not
+ * safe to use if the array has delta offsets
+ * Force printing via the fields.
+ */
+ if ((tr->text_delta || tr->data_delta) &&
+ event->type > __TRACE_LAST_TYPE)
+ return print_event_fields(iter, event);
+
return event->funcs->trace(iter, sym_flags, event);
}
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index da748b7cbc4d..0a5d12dd860f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -853,6 +853,7 @@ static void print_fields(struct trace_iterator *iter, struct trace_event_call *c
struct list_head *head)
{
struct ftrace_event_field *field;
+ long delta = iter->tr->text_delta;
int offset;
int len;
int ret;
@@ -889,7 +890,7 @@ static void print_fields(struct trace_iterator *iter, struct trace_event_call *c
case FILTER_PTR_STRING:
if (!iter->fmt_size)
trace_iter_expand_format(iter);
- pos = *(void **)pos;
+ pos = (*(void **)pos) + delta;
ret = strncpy_from_kernel_nofault(iter->fmt, pos,
iter->fmt_size);
if (ret < 0)
--
2.45.2
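A hedged sketch of the pointer fixup this hunk adds (simplified, only the text delta is shown; the helper name is made up): a string pointer recorded by the previous kernel is only meaningful in the current kernel after shifting it by the delta computed when the boot-mapped buffer was attached.
/* Sketch only -- not the actual trace code. */
static const char *fixup_recorded_str(const char *recorded, long text_delta)
{
	/* same adjustment print_fields() now applies before copying the string */
	return recorded + text_delta;
}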
From: Steven Rostedt <rostedt(a)goodmis.org>
When a ring buffer is mapped across boots, a delta is saved between the
addresses of the previous kernel and the current kernel. But this does not
handle module events nor dynamic events.
Simply do not create module or dynamic events for a boot mapped instance.
This will keep them from ever being enabled and therefore not part of the
previous kernel trace.
Cc: stable(a)vger.kernel.org
Fixes: e645535a954ad ("tracing: Add option to use memmapped memory for trace boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 77e68efbd43e..d6359318d5c1 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2984,6 +2984,12 @@ trace_create_new_event(struct trace_event_call *call,
if (!event_in_systems(call, tr->system_names))
return NULL;
+ /* Boot mapped instances cannot use modules or dynamic events */
+ if (tr->flags & TRACE_ARRAY_FL_BOOT) {
+ if ((call->flags & TRACE_EVENT_FL_DYNAMIC) || call->module)
+ return NULL;
+ }
+
file = kmem_cache_alloc(file_cachep, GFP_TRACE);
if (!file)
return ERR_PTR(-ENOMEM);
--
2.45.2
As per the memory map table, the region for PCIe6a is 64 MByte. Hence, set
the size of the 32-bit non-prefetchable memory region beginning at address
0x70300000 to 0x3d00000 so that the BAR space assigned to BAR registers can
be allocated from 0x70300000 to 0x74000000.
Fixes: 7af141850012 ("arm64: dts: qcom: x1e80100: Fix up BAR spaces")
Cc: stable(a)vger.kernel.org
Signed-off-by: Qiang Yu <quic_qianyu(a)quicinc.com>
---
arch/arm64/boot/dts/qcom/x1e80100.dtsi | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/qcom/x1e80100.dtsi b/arch/arm64/boot/dts/qcom/x1e80100.dtsi
index f044921457d0..90ddac606719 100644
--- a/arch/arm64/boot/dts/qcom/x1e80100.dtsi
+++ b/arch/arm64/boot/dts/qcom/x1e80100.dtsi
@@ -3126,7 +3126,7 @@ pcie6a: pci@1bf8000 {
#address-cells = <3>;
#size-cells = <2>;
ranges = <0x01000000 0x0 0x00000000 0x0 0x70200000 0x0 0x100000>,
- <0x02000000 0x0 0x70300000 0x0 0x70300000 0x0 0x1d00000>;
+ <0x02000000 0x0 0x70300000 0x0 0x70300000 0x0 0x3d00000>;
bus-range = <0x00 0xff>;
dma-coherent;
--
2.34.1
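As a quick sanity check of the arithmetic above (a standalone sketch, not kernel code; the constants are taken from the commit message):
#include <assert.h>
int main(void)
{
	unsigned long start = 0x70300000UL;	/* 32-bit non-prefetchable window start */
	unsigned long size  = 0x03d00000UL;	/* new size set by the patch */
	/* the window now ends exactly at the 0x74000000 boundary stated above */
	assert(start + size == 0x74000000UL);
	return 0;
}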
File-scope "__pmic_glink_lock" mutex protects the filke-scope
"__pmic_glink", thus reference to it should be obtained under the lock,
just like pmic_glink_rpmsg_remove() is doing. Otherwise we have a race
during if PMIC GLINK device removal: the pmic_glink_rpmsg_probe()
function could store local reference before mutex in driver removal is
acquired.
Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org>
---
Changes in v2:
1. None
---
drivers/soc/qcom/pmic_glink.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index 9606222993fd..452f30a9354d 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -217,10 +217,11 @@ static void pmic_glink_pdr_callback(int state, char *svc_path, void *priv)
static int pmic_glink_rpmsg_probe(struct rpmsg_device *rpdev)
{
- struct pmic_glink *pg = __pmic_glink;
+ struct pmic_glink *pg;
int ret = 0;
mutex_lock(&__pmic_glink_lock);
+ pg = __pmic_glink;
if (!pg) {
ret = dev_err_probe(&rpdev->dev, -ENODEV, "no pmic_glink device to attach to\n");
goto out_unlock;
--
2.43.0
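The general rule being enforced here, as a hedged sketch (identifiers are illustrative, not taken from the driver): a file-scope pointer guarded by a file-scope mutex must be read after taking that mutex, never before.
#include <linux/errno.h>
#include <linux/mutex.h>
struct example;
static DEFINE_MUTEX(example_lock);	/* protects example_ptr */
static struct example *example_ptr;
static int example_probe(void)
{
	struct example *p;
	mutex_lock(&example_lock);
	p = example_ptr;		/* snapshot taken under the lock */
	if (!p) {
		mutex_unlock(&example_lock);
		return -ENODEV;
	}
	/* ... use p while the lock is held ... */
	mutex_unlock(&example_lock);
	return 0;
}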
On Wed, Oct 16, 2024 at 03:10:34PM +0200, Christian Brauner wrote:
> On Fri, 26 Apr 2024 16:05:48 +0800, Xuewen Yan wrote:
> > Now, the epoll only use wake_up() interface to wake up task.
> > However, sometimes, there are epoll users which want to use
> > the synchronous wakeup flag to hint the scheduler, such as
> > Android binder driver.
> > So add a wake_up_sync() define, and use the wake_up_sync()
> > when the sync is true in ep_poll_callback().
> >
> > [...]
>
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> It's encouraged to provide Acked-bys and Reviewed-bys even though the
> patch has now been applied. If possible patch trailers will be updated.
>
> Note that commit hashes shown below are subject to change due to rebase,
> trailer updates or similar. If in doubt, please check the listed branch.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.misc
This is a bug that's been present for all of time, so I think we should:
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable(a)vger.kernel.org
I sent a patch which adds a benchmark for nonblocking pipes using epoll:
https://lore.kernel.org/lkml/20241016190009.866615-1-bgeffon@google.com/
Using this new benchmark I get the following results without this fix
and with this fix:
$ tools/perf/perf bench sched pipe -n
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 12.194 [sec]
12.194376 usecs/op
82005 ops/sec
$ tools/perf/perf bench sched pipe -n
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes
Total time: 9.229 [sec]
9.229738 usecs/op
108345 ops/sec
>
> [1/1] epoll: Add synchronous wakeup support for ep_poll_callback
> https://git.kernel.org/vfs/vfs/c/2ce0e17660a7
From: Steven Rostedt <rostedt(a)goodmis.org>
A bug was discovered where the idle shadow stacks were not initialized
for offline CPUs when starting function graph tracer, and when they came
online they were not traced due to the missing shadow stack. To fix
this, the idle task shadow stack initialization was moved to using the
CPU hotplug callbacks. But that removed the initialization that was done
when function graph was enabled. The problem here is that the hotplug
callbacks are only called when CPUs come online, while the idle shadow
stack initialization only happens if function graph is currently
active. This caused CPUs that were already online when function graph was
enabled to not get their shadow stacks initialized.
The idle shadow stack initialization still needs to be done when the
function graph is registered, as they will not be allocated if function
graph is not registered.
Cc: stable(a)vger.kernel.org
Cc: Masami Hiramatsu <mhiramat(a)kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home
Fixes: 2c02f7375e65 ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Reported-by: Linus Walleij <linus.walleij(a)linaro.org>
Tested-by: Linus Walleij <linus.walleij(a)linaro.org>
Closes: https://lore.kernel.org/all/CACRpkdaTBrHwRbbrphVy-=SeDz6MSsXhTKypOtLrTQ+DgG…
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/fgraph.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 0bf78517b5d4..ddedcb50917f 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1215,7 +1215,7 @@ void fgraph_update_pid_func(void)
static int start_graph_tracing(void)
{
unsigned long **ret_stack_list;
- int ret;
+ int ret, cpu;
ret_stack_list = kcalloc(FTRACE_RETSTACK_ALLOC_SIZE,
sizeof(*ret_stack_list), GFP_KERNEL);
@@ -1223,6 +1223,12 @@ static int start_graph_tracing(void)
if (!ret_stack_list)
return -ENOMEM;
+ /* The cpu_boot init_task->ret_stack will never be freed */
+ for_each_online_cpu(cpu) {
+ if (!idle_task(cpu)->ret_stack)
+ ftrace_graph_init_idle_task(idle_task(cpu), cpu);
+ }
+
do {
ret = alloc_retstack_tasklist(ret_stack_list);
} while (ret == -EAGAIN);
--
2.45.2
Once device_add() has failed, we should call only put_device() to
decrement the reference count for cleanup. We should not call device_del()
before put_device().
As the comment of device_add() says, 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: c8e4c2397655 ("RDMA/srp: Rework the srp_add_port() error path")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v2:
- modified the bug description as suggestions.
---
drivers/infiniband/ulp/srp/ib_srp.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 2916e77f589b..7289ae0b83ac 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3978,7 +3978,6 @@ static struct srp_host *srp_add_port(struct srp_device *device, u32 port)
return host;
put_host:
- device_del(&host->dev);
put_device(&host->dev);
return NULL;
}
--
2.25.1
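The rule quoted from the device_add() comment, as a hedged sketch (the struct and function names are made up for illustration, they are not from the srp driver):
#include <linux/device.h>
#include <linux/slab.h>
struct my_port {
	struct device dev;
};
static void my_port_release(struct device *dev)
{
	kfree(container_of(dev, struct my_port, dev));
}
static int my_port_register(struct my_port *port, struct device *parent)
{
	int ret;
	device_initialize(&port->dev);
	port->dev.parent = parent;
	port->dev.release = my_port_release;
	dev_set_name(&port->dev, "my_port");
	ret = device_add(&port->dev);
	if (ret) {
		/* device_add() failed: drop the reference only, no device_del() */
		put_device(&port->dev);
		return ret;
	}
	/* after a successful add, later teardown uses device_del() + put_device() */
	return 0;
}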
The TI K3 ethernet port driver depends on the PAGE_POOL configuration option. There should be a 'select PAGE_POOL' under its Kconfig entry.
--- a/drivers/net/ethernet/ti/Kconfig
+++ b/drivers/net/ethernet/ti/Kconfig
@@ -114,6 +114,7 @@ config TI_K3_AM65_CPSW_NUSS
select TI_DAVINCI_MDIO
select PHYLINK
select TI_K3_CPPI_DESC_POOL
+ select PAGE_POOL
imply PHY_TI_GMII_SEL
depends on TI_K3_AM65_CPTS || !TI_K3_AM65_CPTS
help
Dashi Cao
In commit 892f7237b3ff ("arm64: Delay initialisation of
cpuinfo_arm64::reg_{zcr,smcr}") we moved access to ZCR, SMCR and SMIDR
later in the boot process in order to ensure that we don't attempt to
interact with them if SVE or SME is disabled on the command line.
Unfortunately when initialising the boot CPU in init_cpu_features() we work
on a copy of the struct cpuinfo_arm64 for the boot CPU used only during
boot, not the percpu copy used by the sysfs code. The expectation of the
feature identification code was that the ID registers would be read in
__cpuinfo_store_cpu() and the values not modified by init_cpu_features().
The main reason for the original change was to avoid early accesses to
ZCR on practical systems that were seen shipping with SVE reported in ID
registers but traps enabled at EL3 and handled as fatal errors; SME was
rolled in due to its similarity with SVE. Since then we have removed the
early accesses to ZCR and SMCR in commits:
abef0695f9665c3d ("arm64/sve: Remove ZCR pseudo register from cpufeature code")
391208485c3ad50f ("arm64/sve: Remove SMCR pseudo register from cpufeature code")
so only the SMIDR_EL1 part of the change remains. Since SMIDR_EL1 is
only trapped via FEAT_IDST and not the SME trap it is less likely to be
affected by similar issues, and the factors that lead to issues with SVE
are less likely to apply to SME.
Since we have not yet seen practical SME systems that need to use a
command line override (and are only just beginning to see SME systems at
all) let's just remove the override and store SMIDR_EL1 along with all
the other ID register reads in __cpuinfo_store_cpu().
This issue wasn't apparent when testing on emulated platforms that do not
report values in SMIDR_EL1.
Fixes: 892f7237b3ff ("arm64: Delay initialisation of cpuinfo_arm64::reg_{zcr,smcr}")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
Changes in v2:
- Move the ID register read back to __cpuinfo_store_cpu().
- Remove the command line option for SME ID register override.
- Link to v1: https://lore.kernel.org/r/20241214-arm64-fix-boot-cpu-smidr-v1-1-0745c40772…
---
Documentation/admin-guide/kernel-parameters.txt | 3 ---
arch/arm64/kernel/cpufeature.c | 13 -------------
arch/arm64/kernel/cpuinfo.c | 10 ++++++++++
arch/arm64/kernel/pi/idreg-override.c | 16 ----------------
4 files changed, 10 insertions(+), 32 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dc663c0ca67067d041cf9a3767117eec765ccca8..d29dee978e933245e0db0f654f17eef3e414bb64 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -458,9 +458,6 @@
arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
support
- arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
- Extension support
-
arm64.nosve [ARM64] Unconditionally disable Scalable Vector
Extension support
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 6ce71f444ed84f9056196bb21bbfac61c9687e30..818aca922ca6066eb4bdf79e153cccb24246c61b 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1167,12 +1167,6 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
id_aa64pfr1_sme(read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1))) {
unsigned long cpacr = cpacr_save_enable_kernel_sme();
- /*
- * We mask out SMPS since even if the hardware
- * supports priorities the kernel does not at present
- * and we block access to them.
- */
- info->reg_smidr = read_cpuid(SMIDR_EL1) & ~SMIDR_EL1_SMPS;
vec_init_vq_map(ARM64_VEC_SME);
cpacr_restore(cpacr);
@@ -1423,13 +1417,6 @@ void update_cpu_features(int cpu,
id_aa64pfr1_sme(read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1))) {
unsigned long cpacr = cpacr_save_enable_kernel_sme();
- /*
- * We mask out SMPS since even if the hardware
- * supports priorities the kernel does not at present
- * and we block access to them.
- */
- info->reg_smidr = read_cpuid(SMIDR_EL1) & ~SMIDR_EL1_SMPS;
-
/* Probe vector lengths */
if (!system_capabilities_finalized())
vec_update_vq_map(ARM64_VEC_SME);
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index d79e88fccdfce427507e7a34c5959ce6309cbd12..c45633b5ae233fe78607fce3d623efb28a9f341a 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -482,6 +482,16 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0))
info->reg_mpamidr = read_cpuid(MPAMIDR_EL1);
+ if (IS_ENABLED(CONFIG_ARM64_SME) &&
+ id_aa64pfr1_sme(info->reg_id_aa64pfr1)) {
+ /*
+ * We mask out SMPS since even if the hardware
+ * supports priorities the kernel does not at present
+ * and we block access to them.
+ */
+ info->reg_smidr = read_cpuid(SMIDR_EL1) & ~SMIDR_EL1_SMPS;
+ }
+
cpuinfo_detect_icache_policy(info);
}
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index 22159251eb3a6a5efea90ebda2910ebcfff52b8f..15dca48332c9b83b55752e474854d8fc45b0989b 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -122,21 +122,6 @@ static const struct ftr_set_desc pfr0 __prel64_initconst = {
},
};
-static bool __init pfr1_sme_filter(u64 val)
-{
- /*
- * Similarly to SVE, disabling SME also means disabling all
- * the features that are associated with it. Just set
- * id_aa64smfr0_el1 to 0 and don't look back.
- */
- if (!val) {
- id_aa64smfr0_override.val = 0;
- id_aa64smfr0_override.mask = GENMASK(63, 0);
- }
-
- return true;
-}
-
static const struct ftr_set_desc pfr1 __prel64_initconst = {
.name = "id_aa64pfr1",
.override = &id_aa64pfr1_override,
@@ -144,7 +129,6 @@ static const struct ftr_set_desc pfr1 __prel64_initconst = {
FIELD("bt", ID_AA64PFR1_EL1_BT_SHIFT, NULL ),
FIELD("gcs", ID_AA64PFR1_EL1_GCS_SHIFT, NULL),
FIELD("mte", ID_AA64PFR1_EL1_MTE_SHIFT, NULL),
- FIELD("sme", ID_AA64PFR1_EL1_SME_SHIFT, pfr1_sme_filter),
{}
},
};
---
base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
change-id: 20241213-arm64-fix-boot-cpu-smidr-386b8db292b2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
If a device uses MCP23xxx IO expander to receive IRQs, the following
bug can happen:
BUG: sleeping function called from invalid context
at kernel/locking/mutex.c:283
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, ...
preempt_count: 1, expected: 0
...
Call Trace:
...
__might_resched+0x104/0x10e
__might_sleep+0x3e/0x62
mutex_lock+0x20/0x4c
regmap_lock_mutex+0x10/0x18
regmap_update_bits_base+0x2c/0x66
mcp23s08_irq_set_type+0x1ae/0x1d6
__irq_set_trigger+0x56/0x172
__setup_irq+0x1e6/0x646
request_threaded_irq+0xb6/0x160
...
We observed the problem while experimenting with a touchscreen driver which
used MCP23017 IO expander (I2C).
The regmap in the pinctrl-mcp23s08 driver uses a mutex for protection from
concurrent accesses, which is the default for regmaps without .fast_io,
.disable_locking, etc.
mcp23s08_irq_set_type() calls regmap_update_bits_base(), and the latter
locks the mutex.
However, __setup_irq() locks desc->lock spinlock before calling these
functions. As a result, the system tries to lock the mutex while holding
the spinlock.
It seems, the internal regmap locks are not needed in this driver at all.
mcp->lock seems to protect the regmap from concurrent accesses already,
except, probably, in mcp_pinconf_get/set.
mcp23s08_irq_set_type() and mcp23s08_irq_mask/unmask() are called under
chip_bus_lock(), which calls mcp23s08_irq_bus_lock(). The latter takes
mcp->lock and enables regmap caching, so that the potentially slow I2C
accesses are deferred until chip_bus_unlock().
The accesses to the regmap from mcp23s08_probe_one() do not need additional
locking.
In all remaining places where the regmap is accessed, except
mcp_pinconf_get/set(), the driver already takes mcp->lock.
This patch adds locking in mcp_pinconf_get/set() and disables internal
locking in the regmap config. Among other things, it fixes the sleeping
in atomic context described above.
Fixes: 8f38910ba4f6 ("pinctrl: mcp23s08: switch to regmap caching")
Cc: stable(a)vger.kernel.org
Signed-off-by: Evgenii Shatokhin <e.shatokhin(a)yadro.com>
---
drivers/pinctrl/pinctrl-mcp23s08.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/pinctrl/pinctrl-mcp23s08.c b/drivers/pinctrl/pinctrl-mcp23s08.c
index d66c3a3e8429..b96e6368a956 100644
--- a/drivers/pinctrl/pinctrl-mcp23s08.c
+++ b/drivers/pinctrl/pinctrl-mcp23s08.c
@@ -86,6 +86,7 @@ const struct regmap_config mcp23x08_regmap = {
.num_reg_defaults = ARRAY_SIZE(mcp23x08_defaults),
.cache_type = REGCACHE_FLAT,
.max_register = MCP_OLAT,
+ .disable_locking = true, /* mcp->lock protects the regmap */
};
EXPORT_SYMBOL_GPL(mcp23x08_regmap);
@@ -132,6 +133,7 @@ const struct regmap_config mcp23x17_regmap = {
.num_reg_defaults = ARRAY_SIZE(mcp23x17_defaults),
.cache_type = REGCACHE_FLAT,
.val_format_endian = REGMAP_ENDIAN_LITTLE,
+ .disable_locking = true, /* mcp->lock protects the regmap */
};
EXPORT_SYMBOL_GPL(mcp23x17_regmap);
@@ -228,7 +230,9 @@ static int mcp_pinconf_get(struct pinctrl_dev *pctldev, unsigned int pin,
switch (param) {
case PIN_CONFIG_BIAS_PULL_UP:
+ mutex_lock(&mcp->lock);
ret = mcp_read(mcp, MCP_GPPU, &data);
+ mutex_unlock(&mcp->lock);
if (ret < 0)
return ret;
status = (data & BIT(pin)) ? 1 : 0;
@@ -257,7 +261,9 @@ static int mcp_pinconf_set(struct pinctrl_dev *pctldev, unsigned int pin,
switch (param) {
case PIN_CONFIG_BIAS_PULL_UP:
+ mutex_lock(&mcp->lock);
ret = mcp_set_bit(mcp, MCP_GPPU, pin, arg);
+ mutex_unlock(&mcp->lock);
break;
default:
dev_dbg(mcp->dev, "Invalid config param %04x\n", param);
--
2.34.1
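A hedged sketch of the resulting locking scheme (identifiers are illustrative): with .disable_locking set the regmap does no internal serialization, so every access must be wrapped by the driver's own mutex, just as the hunks above do for mcp_pinconf_get()/mcp_pinconf_set().
#include <linux/mutex.h>
#include <linux/regmap.h>
struct example_chip {
	struct regmap *map;
	struct mutex lock;	/* plays the role of mcp->lock */
};
static const struct regmap_config example_regmap_cfg = {
	.reg_bits = 8,
	.val_bits = 8,
	.cache_type = REGCACHE_FLAT,
	.disable_locking = true,	/* callers provide the locking */
};
static int example_read(struct example_chip *chip, unsigned int reg,
			unsigned int *val)
{
	int ret;
	mutex_lock(&chip->lock);
	ret = regmap_read(chip->map, reg, val);
	mutex_unlock(&chip->lock);
	return ret;
}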
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: e8b345babf2ace50f6bf380af77ee8ae415d81f2
Gitweb: https://git.kernel.org/tip/e8b345babf2ace50f6bf380af77ee8ae415d81f2
Author: Xin Li (Intel) <xin(a)zytor.com>
AuthorDate: Wed, 13 Nov 2024 09:59:34 -08:00
Committer: Dave Hansen <dave.hansen(a)linux.intel.com>
CommitterDate: Mon, 16 Dec 2024 15:44:19 -08:00
x86/fred: Clear WFE in missing-ENDBRANCH #CPs
An indirect branch instruction sets the CPU indirect branch tracker
(IBT) into WAIT_FOR_ENDBRANCH (WFE) state and WFE stays asserted
across the instruction boundary. When the decoder finds an
inappropriate instruction while WFE is set ENDBR, the CPU raises a #CP
fault.
For the "kernel IBT no ENDBR" selftest where #CPs are deliberately
triggered, the WFE state of the interrupted context needs to be
cleared to let execution continue. Otherwise when the CPU resumes
from the instruction that just caused the previous #CP, another
missing-ENDBRANCH #CP is raised and the CPU enters a dead loop.
This is not a problem with IDT because it doesn't preserve WFE and
IRET doesn't set WFE. But FRED provides space on the entry stack
(in an expanded CS area) to save and restore the WFE state, thus the
WFE state is no longer clobbered, so software must clear it.
Clear WFE to avoid dead looping in ibt_clear_fred_wfe() and the
!ibt_fatal code path when execution is allowed to continue.
Clobbering WFE in any other circumstance is a security-relevant bug.
[ dhansen: changelog rewording ]
Fixes: a5f6c2ace997 ("x86/shstk: Add user control-protection fault handler")
Signed-off-by: Xin Li (Intel) <xin(a)zytor.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/20241113175934.3897541-1-xin%40zytor.com
---
arch/x86/kernel/cet.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index d2c732a..303bf74 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -81,6 +81,34 @@ static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
static __ro_after_init bool ibt_fatal = true;
+/*
+ * By definition, all missing-ENDBRANCH #CPs are a result of WFE && !ENDBR.
+ *
+ * For the kernel IBT no ENDBR selftest where #CPs are deliberately triggered,
+ * the WFE state of the interrupted context needs to be cleared to let execution
+ * continue. Otherwise when the CPU resumes from the instruction that just
+ * caused the previous #CP, another missing-ENDBRANCH #CP is raised and the CPU
+ * enters a dead loop.
+ *
+ * This is not a problem with IDT because it doesn't preserve WFE and IRET doesn't
+ * set WFE. But FRED provides space on the entry stack (in an expanded CS area)
+ * to save and restore the WFE state, thus the WFE state is no longer clobbered,
+ * so software must clear it.
+ */
+static void ibt_clear_fred_wfe(struct pt_regs *regs)
+{
+ /*
+ * No need to do any FRED checks.
+ *
+ * For IDT event delivery, the high-order 48 bits of CS are pushed
+ * as 0s into the stack, and later IRET ignores these bits.
+ *
+ * For FRED, a test to check if fred_cs.wfe is set would be dropped
+ * by compilers.
+ */
+ regs->fred_cs.wfe = 0;
+}
+
static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
{
if ((error_code & CP_EC) != CP_ENDBR) {
@@ -90,6 +118,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (unlikely(regs->ip == (unsigned long)&ibt_selftest_noendbr)) {
regs->ax = 0;
+ ibt_clear_fred_wfe(regs);
return;
}
@@ -97,6 +126,7 @@ static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
if (!ibt_fatal) {
printk(KERN_DEFAULT CUT_HERE);
__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+ ibt_clear_fred_wfe(regs);
return;
}
BUG();
The following commit has been merged into the irq/urgent branch of tip:
Commit-ID: a60b990798eb17433d0283788280422b1bd94b18
Gitweb: https://git.kernel.org/tip/a60b990798eb17433d0283788280422b1bd94b18
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Sat, 14 Dec 2024 12:50:18 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Mon, 16 Dec 2024 10:59:47 +01:00
PCI/MSI: Handle lack of irqdomain gracefully
Alexandre observed a warning emitted from pci_msi_setup_msi_irqs() on a
RISCV platform which does not provide PCI/MSI support:
WARNING: CPU: 1 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x2c/0x32
__pci_enable_msix_range+0x30c/0x596
pci_msi_setup_msi_irqs+0x2c/0x32
pci_alloc_irq_vectors_affinity+0xb8/0xe2
RISCV uses hierarchical interrupt domains and correctly does not implement
the legacy fallback. The warning triggers from the legacy fallback stub.
That warning is bogus as the PCI/MSI layer knows whether a PCI/MSI parent
domain is associated with the device or not. There is a check for MSI-X,
which has a legacy assumption. But that legacy fallback assumption is only
valid when legacy support is enabled, but otherwise the check should simply
return -ENOTSUPP.
Loongarch tripped over the same problem and blindly enabled legacy support
without implementing the legacy fallbacks. There are weak implementations
which return an error, so the problem was papered over.
Correct pci_msi_domain_supports() to evaluate the legacy mode and add
the missing supported check into the MSI enable path to complete it.
Fixes: d2a463b29741 ("PCI/MSI: Reject multi-MSI early")
Reported-by: Alexandre Ghiti <alexghiti(a)rivosinc.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Alexandre Ghiti <alexghiti(a)rivosinc.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/all/87ed2a8ow5.ffs@tglx
---
drivers/pci/msi/irqdomain.c | 7 +++++--
drivers/pci/msi/msi.c | 4 ++++
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 5691257..d7ba879 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -350,8 +350,11 @@ bool pci_msi_domain_supports(struct pci_dev *pdev, unsigned int feature_mask,
domain = dev_get_msi_domain(&pdev->dev);
- if (!domain || !irq_domain_is_hierarchy(domain))
- return mode == ALLOW_LEGACY;
+ if (!domain || !irq_domain_is_hierarchy(domain)) {
+ if (IS_ENABLED(CONFIG_PCI_MSI_ARCH_FALLBACKS))
+ return mode == ALLOW_LEGACY;
+ return false;
+ }
if (!irq_domain_is_msi_parent(domain)) {
/*
diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
index 3a45879..2f647ca 100644
--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -433,6 +433,10 @@ int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec,
if (WARN_ON_ONCE(dev->msi_enabled))
return -EINVAL;
+ /* Test for the availability of MSI support */
+ if (!pci_msi_domain_supports(dev, 0, ALLOW_LEGACY))
+ return -ENOTSUPP;
+
nvec = pci_msi_vec_count(dev);
if (nvec < 0)
return nvec;
This patch series is to fix bugs and improve codes for drivers/of/*.
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v3:
- Drop 2 applied patches and pick up patch 4/7 again
- Fix build error for patch 6/7.
- Include of_private.h instead of function declaration for patch 2/7
- Correct tile and commit messages.
- Link to v2: https://lore.kernel.org/r/20241216-of_core_fix-v2-0-e69b8f60da63@quicinc.com
Changes in v2:
- Drop applied/conflict/TBD patches.
- Correct based on Rob's comments.
- Link to v1: https://lore.kernel.org/r/20241206-of_core_fix-v1-0-dc28ed56bec3@quicinc.com
---
Zijun Hu (7):
of: Correct child specifier used as input of the 2nd nexus node
of: Do not expose of_alias_scan() and correct its comments
of: Make of_property_present() applicable to all kinds of property
of: property: Use of_property_present() for of_fwnode_property_present()
of: Fix available buffer size calculating error in API of_device_uevent_modalias()
of: Fix potential wrong MODALIAS uevent value
of: Do not expose of_modalias()
drivers/of/base.c | 7 ++--
drivers/of/device.c | 33 +++++++--------
drivers/of/module.c | 109 +++++++++++++++++++++++++++++-------------------
drivers/of/of_private.h | 4 ++
drivers/of/pdt.c | 2 +
drivers/of/property.c | 2 +-
include/linux/of.h | 31 ++++++--------
7 files changed, 102 insertions(+), 86 deletions(-)
---
base-commit: 0f7ca6f69354e0c3923bbc28c92d0ecab4d50a3e
change-id: 20241206-of_core_fix-dc3021a06418
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>
Currently the BPF selftests fail to compile due to use of test
helpers that were not backported, namely:
- netlink_helpers.h
- __xlated()
The 1st patch adds netlink helper files, and the 2nd patch removes the
use of __xlated() helper.
Note this series simply fixes the compilation failure. Even with this
series applied, the BPF selftests fail to run to completion due to a
kernel panic in the dummy_st_ops tests.
Changes since v1 <https://lore.kernel.org/all/20241126072137.823699-1-shung-hsi.yu@suse.com>:
- drop dependencies of __xlated() helper, and opt to remove its use
instead.
Daniel Borkmann (1):
selftests/bpf: Add netlink helper library
Shung-Hsi Yu (1):
selftests/bpf: remove use of __xlated()
tools/testing/selftests/bpf/Makefile | 19 +-
tools/testing/selftests/bpf/netlink_helpers.c | 358 ++++++++++++++++++
tools/testing/selftests/bpf/netlink_helpers.h | 46 +++
.../selftests/bpf/progs/verifier_scalar_ids.c | 16 -
4 files changed, 418 insertions(+), 21 deletions(-)
create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h
--
2.47.1
Currently the BPF selftests fail to compile due to use of test
helpers that were not backported, namely:
- netlink_helpers.h
- __xlated()
The 1st patch adds netlink helper files, and the 2nd patch removes the
use of __xlated() helper.
Note this series simply fixes the compilation failure. Even with this
series applied, the BPF selftests fail to run to completion due to a
kernel panic in the dummy_st_ops tests.
Changes since v2 <https://lore.kernel.org/all/20241217072821.43545-1-shung-hsi.yu@suse.com>:
- minor reword of patch 2, dropping the "downstream patch" line and add a Fixes
tag
Changes since v1 <https://lore.kernel.org/all/20241126072137.823699-1-shung-hsi.yu@suse.com>:
- drop dependencies of __xlated() helper, and opt to remove its use instead.
Daniel Borkmann (1):
selftests/bpf: Add netlink helper library
Shung-Hsi Yu (1):
selftests/bpf: remove use of __xlated()
tools/testing/selftests/bpf/Makefile | 19 +-
tools/testing/selftests/bpf/netlink_helpers.c | 358 ++++++++++++++++++
tools/testing/selftests/bpf/netlink_helpers.h | 46 +++
.../selftests/bpf/progs/verifier_scalar_ids.c | 16 -
4 files changed, 418 insertions(+), 21 deletions(-)
create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h
--
2.47.1
From: Eduard Zingerman <eddyz87(a)gmail.com>
[ Upstream commit e9bd9c498cb0f5843996dbe5cbce7a1836a83c70 ]
Range propagation must not affect subreg_def marks, otherwise the
following example is rewritten by verifier incorrectly when
BPF_F_TEST_RND_HI32 flag is set:
  0: call bpf_ktime_get_ns                      call bpf_ktime_get_ns
  1: r0 &= 0x7fffffff          after verifier   r0 &= 0x7fffffff
  2: w1 = w0                      rewrites      w1 = w0
  3: if w0 < 10 goto +0        -------------->  r11 = 0x2f5674a6      (r)
  4: r1 >>= 32                                  r11 <<= 32            (r)
  5: r0 = r1                                    r1 |= r11             (r)
  6: exit;                                      if w0 < 0xa goto pc+0
                                                r1 >>= 32
                                                r0 = r1
                                                exit
(or zero extension of w1 at (2) is missing for architectures that
require zero extension for upper register half).
The following happens w/o this patch:
- r0 is marked as not a subreg at (0);
- w1 is marked as subreg at (2);
- w1 subreg_def is overridden at (3) by copy_register_state();
- w1 is read at (5) but mark_insn_zext() does not mark (2)
for zero extension, because w1 subreg_def is not set;
- because of BPF_F_TEST_RND_HI32 flag verifier inserts random
value for hi32 bits of (2) (marked (r));
- this random value is read at (5).
Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
Reported-by: Lonial Con <kongln9170(a)gmail.com>
Signed-off-by: Lonial Con <kongln9170(a)gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87(a)gmail.com>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Signed-off-by: Daniel Borkmann <daniel(a)iogearbox.net>
Acked-by: Daniel Borkmann <daniel(a)iogearbox.net>
Closes: https://lore.kernel.org/bpf/7e2aa30a62d740db182c170fdd8f81c596df280d.camel@…
Link: https://lore.kernel.org/bpf/20240924210844.1758441-1-eddyz87@gmail.com
shung-hsi.yu: sync_linked_regs() was named find_equal_scalars() before commit
4bf79f9be434 ("bpf: Track equal scalars history on per-instruction level"), and
the modification is done this way because there is only a single call to
copy_register_state() before commit 98d7ca374ba4 ("bpf: Track delta between
"linked" registers.").
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu(a)suse.com>
---
kernel/bpf/verifier.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3f47cfa17141..a3c3c66ca047 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14497,8 +14497,11 @@ static void find_equal_scalars(struct bpf_verifier_state *vstate,
struct bpf_reg_state *reg;
bpf_for_each_reg_in_vstate(vstate, state, reg, ({
- if (reg->type == SCALAR_VALUE && reg->id == known_reg->id)
+ if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) {
+ s32 saved_subreg_def = reg->subreg_def;
copy_register_state(reg, known_reg);
+ reg->subreg_def = saved_subreg_def;
+ }
}));
}
--
2.47.1
Due to a possible failure of the allocation of 'priv' via
netdev_priv(ndev), 'priv->rx_bd_v' could end up not being set, which
could lead to a NULL pointer dereference.
Move the while() loop that dereferences 'priv->rx_bd_v' after the check
for its validity.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 492caffa8a1a ("net: ethernet: nixge: Add support for National Instruments XGE netdev")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v2:
- modified the bug description as suggestions;
- modified the patch as the code style suggested.
---
drivers/net/ethernet/ni/nixge.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/ni/nixge.c b/drivers/net/ethernet/ni/nixge.c
index 230d5ff99dd7..41acce878af0 100644
--- a/drivers/net/ethernet/ni/nixge.c
+++ b/drivers/net/ethernet/ni/nixge.c
@@ -604,6 +604,9 @@ static int nixge_recv(struct net_device *ndev, int budget)
cur_p = &priv->rx_bd_v[priv->rx_bd_ci];
+ if (!priv->rx_bd_v)
+ return 0;
+
while ((cur_p->status & XAXIDMA_BD_STS_COMPLETE_MASK &&
budget > packets)) {
tail_p = priv->rx_bd_p + sizeof(*priv->rx_bd_v) *
--
2.25.1
From: yangge <yangge1116(a)126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allows compaction to proceed when the free
pages required for compaction reside in CMA pageblocks, it's
possible that __compaction_suitable() always returns true, and in
some cases that's not acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long term GUP cannot allocate memory from the CMA area, so a maximum
of 16 GB of non-CMA memory on a NUMA node can be used as virtual
machine memory. Since there is 16 GB of free CMA memory on the NUMA
node, the order-0 watermark is always met for compaction, so
__compaction_suitable() always returns true, even if the node is
unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should remove
ALLOC_CMA from both __compaction_suitable() and __isolate_free_page()
in the long term GUP flow. After this fix, starting a 32GB virtual machine
with device passthrough takes only a few seconds.
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
Reviewed-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
V6:
-- update cc->alloc_flags to keep the original loginc
V5:
- add 'alloc_flags' parameter for __isolate_free_page()
- remove 'usa_cma' variable
V4:
- rich the commit log description
V3:
- fix build errors
- add ALLOC_CMA both in should_continue_reclaim() and compaction_ready()
V2:
- using the 'cc->alloc_flags' to determin if 'ALLOC_CMA' is needed
- rich the commit log description
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 26 +++++++++++++++-----------
mm/internal.h | 3 ++-
mm/page_alloc.c | 7 +++++--
mm/page_isolation.c | 3 ++-
mm/page_reporting.c | 2 +-
mm/vmscan.c | 4 ++--
7 files changed, 31 insertions(+), 20 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..b4c3ac3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..d92ba6c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -655,7 +655,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, cc->alloc_flags);
if (!isolated)
break;
set_page_private(page, order);
@@ -1634,7 +1634,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, cc->alloc_flags)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -2381,6 +2381,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
@@ -2395,25 +2396,26 @@ static bool __compaction_suitable(struct zone *zone, int order,
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ alloc_flags & ALLOC_CMA, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2474,7 +2476,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2499,7 +2501,7 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
@@ -2893,6 +2895,7 @@ static int compact_node(pg_data_t *pgdat, bool proactive)
struct compact_control cc = {
.order = -1,
.mode = proactive ? MIGRATE_SYNC_LIGHT : MIGRATE_SYNC,
+ .alloc_flags = ALLOC_CMA,
.ignore_skip_hint = true,
.whole_zone = true,
.gfp_mask = GFP_KERNEL,
@@ -3037,7 +3040,7 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
ret = compaction_suit_allocation_order(zone,
pgdat->kcompactd_max_order,
- highest_zoneidx, ALLOC_WMARK_MIN);
+ highest_zoneidx, ALLOC_CMA | ALLOC_WMARK_MIN);
if (ret == COMPACT_CONTINUE)
return true;
}
@@ -3058,6 +3061,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
.search_order = pgdat->kcompactd_max_order,
.highest_zoneidx = pgdat->kcompactd_highest_zoneidx,
.mode = MIGRATE_SYNC_LIGHT,
+ .alloc_flags = ALLOC_CMA | ALLOC_WMARK_MIN,
.ignore_skip_hint = false,
.gfp_mask = GFP_KERNEL,
};
@@ -3078,7 +3082,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
continue;
ret = compaction_suit_allocation_order(zone,
- cc.order, zoneid, ALLOC_WMARK_MIN);
+ cc.order, zoneid, cc.alloc_flags);
if (ret != COMPACT_CONTINUE)
continue;
diff --git a/mm/internal.h b/mm/internal.h
index 3922788..6d257c8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,7 +662,8 @@ static inline void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..1bfdca3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2809,7 +2809,8 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0,
+ alloc_flags & ALLOC_CMA))
return 0;
}
@@ -6454,6 +6456,7 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
.order = -1,
.zone = page_zone(pfn_to_page(start)),
.mode = MIGRATE_SYNC,
+ .alloc_flags = ALLOC_CMA,
.ignore_skip_hint = true,
.no_set_skip_hint = true,
.alloc_contig = true,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c608e9d..a1f2c79 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -229,7 +229,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order,
+ ALLOC_CMA);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e..fd3813b 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -198,7 +198,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, ALLOC_CMA)) {
next = page;
break;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..33f5b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6043,7 +6043,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
--
2.7.4
xHC hosts from several vendors have the same issue where endpoints start
so slowly that a later queued 'Stop Endpoint' command may complete before
the endpoint is up and running.
The 'Stop Endpoint' command fails with a context state error as the endpoint
still appears as stopped.
See commit 42b758137601 ("usb: xhci: Limit Stop Endpoint retries") for
details.
CC: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-ring.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4cf5363875c7..09b05a62375e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1199,8 +1199,6 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
* Keep retrying until the EP starts and stops again, on
* chips where this is known to help. Wait for 100ms.
*/
- if (!(xhci->quirks & XHCI_NEC_HOST))
- break;
if (time_is_before_jiffies(ep->stop_time + msecs_to_jiffies(100)))
break;
fallthrough;
--
2.25.1
Sending it out to the mailing lists once more because AMD mail servers
tried to convert it to HTML :(
Am 17.12.24 um 01:26 schrieb Matthew Brost:
> On Fri, Nov 22, 2024 at 02:36:59PM +0000, Tvrtko Ursulin wrote:
>> [SNIP]
>>>>>> Do we have system wide workqueues for that? It seems a bit
>>>>>> overkill that amdgpu has to allocate one on his own.
>>>>> I wondered the same but did not find any. Only ones I am aware
>>>>> of are system_wq&co created in workqueue_init_early().
>>>> Gentle ping on this. I don't have any better ideas that creating a
>>>> new wq.
>>> It took me a moment to realize, but I now think this warning message is
>>> a false positive.
>>>
>>> What happens is that the code calls cancel_delayed_work_sync().
>>>
>>> If the work item never run because of lack of memory then it can just be
>>> canceled.
>>>
>>> If the work item is running then we will block for it to finish.
>>>
> Apologies for the late reply. Alex responded to another thread and CC'd
> me, which reminded me to reply here.
>
> The execution of the non-reclaim worker could have led to a few scenarios:
>
> - It might have triggered reclaim through its own memory allocation.
That is unrelated and has nothing to do with WQ_MEM_RECLAIM.
What we should do is make sure that the lockdep annotation covers all
workers that play a role in fence signaling.
> - It could have been running and then context-switched out, with reclaim
> being triggered elsewhere in the mean time, pausing the execution of
> the non-reclaim worker.
As far as I know non-reclaim workers are not paused because a reclaim
worker is running, that would be really new to me.
What happens is that here (from workqueue.c):
 * Workqueue rescuer thread function.  There's one rescuer for each
 * workqueue which has WQ_MEM_RECLAIM set.
 *
 * Regular work processing on a pool may block trying to create a new
 * worker which uses GFP_KERNEL allocation which has slight chance of
 * developing into deadlock if some works currently on the same queue
 * need to be processed to satisfy the GFP_KERNEL allocation.  This is
 * the problem rescuer solves.
> In either case, during reclaim, if you wait on a DMA fence that depends
> on the DRM scheduler worker, and that worker attempts to flush the above
> non-reclaim worker, it will result in a deadlock.
Well that is only partially correct.
It's true that the worker we wait for can't wait for DMA-fence or do
memory allocations that wait for DMA-fences. But WQ_MEM_RECLAIM is not
related to any DMA fence annotation.
What happens instead is that the kernel always keeps a kernel thread
pre-allocated so that it can guarantee that the worker can start without
allocating memory.
As soon as the worker runs there shouldn't be any difference in the
handling as far as I know.
> The annotation appears correct to me, and I believe Tvrtko's patch is
> indeed accurate. For what it's worth, we encountered several similar
> bugs in Xe that emerged once we added the correct work queue
> annotations.
I think you mean something different. This is the lockdep annotation for
the workers and not WQ_MEM_RECLAIM.
Regards,
Christian.
>>> There is no need to use WQ_MEM_RECLAIM for the workqueue or do I miss
>>> something?
>>>
>>> If I'm not completely mistaken you stumbled over a bug in the warning
>>> code instead :)
>> Hmm your thinking sounds convincing.
>>
>> Adding Tejun if he has time to help brainstorm this.
>>
> Tejun could likely provide insight into whether my above assessment is
> correct.
> Matt
>
>> Question is - does check_flush_dependency() need to skip the !WQ_MEM_RECLAIM
>> flushing WQ_MEM_RECLAIM warning *if* the work is already running *and* it
>> was called from cancel_delayed_work_sync()?
>>
>> Regards,
>>
>> Tvrtko
>>
>>>>>> Apart from that looks good to me.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Tvrtko Ursulin<tvrtko.ursulin(a)igalia.com>
>>>>>>> References: 746ae46c1113 ("drm/sched: Mark scheduler
>>>>>>> work queues with WQ_MEM_RECLAIM")
>>>>>>> Fixes: a6149f039369 ("drm/sched: Convert drm scheduler
>>>>>>> to use a work queue rather than kthread")
>>>>>>> Cc:stable@vger.kernel.org
>>>>>>> Cc: Matthew Brost<matthew.brost(a)intel.com>
>>>>>>> Cc: Danilo Krummrich<dakr(a)kernel.org>
>>>>>>> Cc: Philipp Stanner<pstanner(a)redhat.com>
>>>>>>> Cc: Alex Deucher<alexander.deucher(a)amd.com>
>>>>>>> Cc: Christian König<christian.koenig(a)amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 25
>>>>>>> +++++++++++++++++++++++++
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 5 +++--
>>>>>>> 3 files changed, 30 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> index 7645e498faa4..a6aad687537e 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>> @@ -268,6 +268,8 @@ extern int amdgpu_agp;
>>>>>>> extern int amdgpu_wbrf;
>>>>>>> +extern struct workqueue_struct *amdgpu_reclaim_wq;
>>>>>>> +
>>>>>>> #define AMDGPU_VM_MAX_NUM_CTX 4096
>>>>>>> #define AMDGPU_SG_THRESHOLD (256*1024*1024)
>>>>>>> #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> index 38686203bea6..f5b7172e8042 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>> @@ -255,6 +255,8 @@ struct amdgpu_watchdog_timer
>>>>>>> amdgpu_watchdog_timer = {
>>>>>>> .period = 0x0, /* default to 0x0 (timeout disable) */
>>>>>>> };
>>>>>>> +struct workqueue_struct *amdgpu_reclaim_wq;
>>>>>>> +
>>>>>>> /**
>>>>>>> * DOC: vramlimit (int)
>>>>>>> * Restrict the total amount of VRAM in MiB for
>>>>>>> testing. The default is 0 (Use full VRAM).
>>>>>>> @@ -2971,6 +2973,21 @@ static struct pci_driver
>>>>>>> amdgpu_kms_pci_driver = {
>>>>>>> .dev_groups = amdgpu_sysfs_groups,
>>>>>>> };
>>>>>>> +static int amdgpu_wq_init(void)
>>>>>>> +{
>>>>>>> + amdgpu_reclaim_wq =
>>>>>>> + alloc_workqueue("amdgpu-reclaim", WQ_MEM_RECLAIM, 0);
>>>>>>> + if (!amdgpu_reclaim_wq)
>>>>>>> + return -ENOMEM;
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void amdgpu_wq_fini(void)
>>>>>>> +{
>>>>>>> + destroy_workqueue(amdgpu_reclaim_wq);
>>>>>>> +}
>>>>>>> +
>>>>>>> static int __init amdgpu_init(void)
>>>>>>> {
>>>>>>> int r;
>>>>>>> @@ -2978,6 +2995,10 @@ static int __init amdgpu_init(void)
>>>>>>> if (drm_firmware_drivers_only())
>>>>>>> return -EINVAL;
>>>>>>> + r = amdgpu_wq_init();
>>>>>>> + if (r)
>>>>>>> + goto error_wq;
>>>>>>> +
>>>>>>> r = amdgpu_sync_init();
>>>>>>> if (r)
>>>>>>> goto error_sync;
>>>>>>> @@ -3006,6 +3027,9 @@ static int __init amdgpu_init(void)
>>>>>>> amdgpu_sync_fini();
>>>>>>> error_sync:
>>>>>>> + amdgpu_wq_fini();
>>>>>>> +
>>>>>>> +error_wq:
>>>>>>> return r;
>>>>>>> }
>>>>>>> @@ -3017,6 +3041,7 @@ static void __exit amdgpu_exit(void)
>>>>>>> amdgpu_acpi_release();
>>>>>>> amdgpu_sync_fini();
>>>>>>> amdgpu_fence_slab_fini();
>>>>>>> + amdgpu_wq_fini();
>>>>>>> mmu_notifier_synchronize();
>>>>>>> amdgpu_xcp_drv_release();
>>>>>>> }
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> index 2f3f09dfb1fd..f8fd71d9382f 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>>>>> @@ -790,8 +790,9 @@ void amdgpu_gfx_off_ctrl(struct
>>>>>>> amdgpu_device *adev, bool enable)
>>>>>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>>>>>> adev->gfx.gfx_off_state = true;
>>>>>>> } else {
>>>>>>> - schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
>>>>>>> - delay);
>>>>>>> + queue_delayed_work(amdgpu_reclaim_wq,
>>>>>>> + &adev->gfx.gfx_off_delay_work,
>>>>>>> + delay);
>>>>>>> }
>>>>>>> }
>>>>>>> } else {
On certain i.MX8 series parts [1], PPS channel 0 is routed internally
to eDMA, and the external PPS pin is available on channel 1. In
addition, on certain boards the PPS may be wired on the PCB to an
EVENTOUTn pin other than 0. On these systems the PPS channel must be
configurable from the Device Tree.
[1] https://lore.kernel.org/all/ZrPYOWA3FESx197L@lizhi-Precision-Tower-5810/
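As a rough sketch of what "configurable from the Device Tree" could look
like on the driver side (the property name and the helper below are
assumptions for illustration, not necessarily what the series implements):

#include <linux/of.h>
#include <linux/types.h>

#define FEC_DEFAULT_PPS_CHANNEL	0

/* Hypothetical helper: read the PPS channel from DT, default to channel 0. */
static u32 fec_ptp_pps_channel(const struct device_node *np)
{
	u32 channel = FEC_DEFAULT_PPS_CHANNEL;

	/* Property name used here purely for illustration. */
	of_property_read_u32(np, "fsl,pps-channel", &channel);

	return channel;
}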
Francesco Dolcini (3):
dt-bindings: net: fec: add pps channel property
net: fec: refactor PPS channel configuration
net: fec: make PPS channel configurable
Documentation/devicetree/bindings/net/fsl,fec.yaml | 7 +++++++
drivers/net/ethernet/freescale/fec_ptp.c | 11 ++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
--
2.34.1
On Sun, Dec 15, 2024 at 11:54:50AM -0500, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> module: Convert default symbol namespace to string literal
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> module-convert-default-symbol-namespace-to-string-li.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
IIUC, if you take this one, you would want to take more patches that fix
documentation generation and other noticed regressions.
--
With Best Regards,
Andy Shevchenko
On Sun, Dec 15, 2024 at 11:54:56AM -0500, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
>
> to the 6.12-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> gpio-idio-16-actually-make-use-of-the-gpio_idio_16-s.patch
> and it can be found in the queue-6.12 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 8845b746c447c715080e448d62aeed25f73fb205
> Author: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> Date: Tue Dec 3 18:26:30 2024 +0100
>
> gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
>
> [ Upstream commit 9ac4b58fcef0f9fc03fa6e126a5f53c1c71ada8a ]
>
> DEFAULT_SYMBOL_NAMESPACE must already be defined when <linux/export.h>
> is included. So move the define above the include block.
>
> Fixes: b9b1fc1ae119 ("gpio: idio-16: Introduce the ACCES IDIO-16 GPIO library module")
> Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
> Acked-by: William Breathitt Gray <wbg(a)kernel.org>
> Link: https://lore.kernel.org/r/20241203172631.1647792-2-u.kleine-koenig@baylibre…
> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski(a)linaro.org>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Hmm, I don't think the advantages here are very relevant. The only
problem fixed here is that the symbols provided by the driver are not in
the expected namespace. So this is nothing a user would wail about as
everything works as intended. The big upside of dropping this patch is
that you can (I think) also drop the backport of commit ceb8bf2ceaa7
("module: Convert default symbol namespace to string literal").
Even if you want to fix the namespace issue in the gpio-idio-16 driver,
I'd suggest just moving the definition of DEFAULT_SYMBOL_NAMESPACE
(without the quotes) for stable, as I think this is easy enough not to
justify taking the intrusive ceb8bf2ceaa7. Also, ceb8bf2ceaa7 might be an
annoyance for out-of-tree code. I know we don't care much about these,
but I still think not adding to their burden is another small argument for
not taking ceb8bf2ceaa7.
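For clarity, a minimal sketch of the ordering fix being suggested for
stable (illustrative file and symbol, using the pre-ceb8bf2ceaa7 token
form of the define rather than the string literal):

/* The define must come before anything that pulls in <linux/export.h>. */
#define DEFAULT_SYMBOL_NAMESPACE GPIO_IDIO_16

#include <linux/export.h>
#include <linux/module.h>

int idio_16_example_call(void)
{
	return 0;
}
EXPORT_SYMBOL_GPL(idio_16_example_call);	/* lands in GPIO_IDIO_16 */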
Best regards
Uwe
The reference count of the device incremented in device_initialize() is
not decremented when device_add() fails. Add a put_device() call before
returning from the function to decrement the reference count for cleanup.
Otherwise it could cause a memory leak.
As the comment of device_add() says, if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count.
Found by code review.
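For context, a minimal sketch of the generic pattern the commit message
describes (not the raid_class code itself):

#include <linux/device.h>

static int example_register(struct device *dev)
{
	int err;

	device_initialize(dev);		/* takes the initial reference */

	err = device_add(dev);
	if (err) {
		/*
		 * device_add() failed, so the device was never made visible;
		 * only the reference from device_initialize() needs to be
		 * dropped, which is done with put_device(), not device_del()
		 * or device_unregister().
		 */
		put_device(dev);
		return err;
	}

	/* On success, tear down later with device_del() + put_device(). */
	return 0;
}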
Cc: stable(a)vger.kernel.org
Fixes: ed542bed126c ("[SCSI] raid class: handle component-add errors")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
drivers/scsi/raid_class.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/scsi/raid_class.c b/drivers/scsi/raid_class.c
index 898a0bdf8df6..2cb2949a78c6 100644
--- a/drivers/scsi/raid_class.c
+++ b/drivers/scsi/raid_class.c
@@ -251,6 +251,7 @@ int raid_component_add(struct raid_template *r,struct device *raid_dev,
list_del(&rc->node);
rd->component_count--;
put_device(component_dev);
+ put_device(&rc->dev);
kfree(rc);
return err;
}
--
2.25.1
The reference count of the device incremented in device_initialize() is
not decremented when device_add() fails. Add a put_device() call before
returning from the function to decrement the reference count for cleanup.
Otherwise it could cause a memory leak.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 53d2a715c240 ("phy: Add Tegra XUSB pad controller support")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
drivers/phy/tegra/xusb.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/phy/tegra/xusb.c b/drivers/phy/tegra/xusb.c
index 79d4814d758d..c89df95aa6ca 100644
--- a/drivers/phy/tegra/xusb.c
+++ b/drivers/phy/tegra/xusb.c
@@ -548,16 +548,16 @@ static int tegra_xusb_port_init(struct tegra_xusb_port *port,
err = dev_set_name(&port->dev, "%s-%u", name, index);
if (err < 0)
- goto unregister;
+ goto put_device;
err = device_add(&port->dev);
if (err < 0)
- goto unregister;
+ goto put_device;
return 0;
-unregister:
- device_unregister(&port->dev);
+put_device:
+ put_device(&port->dev);
return err;
}
--
2.25.1
The reference count of the device incremented in device_initialize() is
not decremented when device_add() fails. Add a put_device() call before
returning from the function.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: f2b44cde7e16 ("virtio: split device_register into device_initialize and device_add")
Signed-off-by: Ma Ke <make_ruc2021(a)163.com>
---
Changes in v2:
- modified the fixes tag according to suggestions;
- modified the bug description.
---
drivers/virtio/virtio.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index b9095751e43b..ac721b5597e8 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -503,6 +503,7 @@ int register_virtio_device(struct virtio_device *dev)
out_of_node_put:
of_node_put(dev->dev.of_node);
+ put_device(&dev->dev);
out_ida_remove:
ida_free(&virtio_index_ida, dev->index);
out:
--
2.25.1
Currently the BPF selftests fail to compile (with
tools/testing/selftests/bpf/vmtest.sh) due to the use of test helpers that
were not backported, namely:
- netlink_helpers.h
- __xlated()
The 1st patch in the series adds netlink_helpers.h, and the 8th patch in
the series adds the __xlated() helper. Patches 2-7 are pulled in as context
for the __xlated() helper.
Cupertino Miranda (1):
selftests/bpf: Support checks against a regular expression
Daniel Borkmann (1):
selftests/bpf: Add netlink helper library
Eduard Zingerman (5):
selftests/bpf: extract utility function for BPF disassembly
selftests/bpf: print correct offset for pseudo calls in disasm_insn()
selftests/bpf: no need to track next_match_pos in struct test_loader
selftests/bpf: extract test_loader->expect_msgs as a data structure
selftests/bpf: allow checking xlated programs in verifier_* tests
Hou Tao (1):
selftests/bpf: Factor out get_xlated_program() helper
tools/testing/selftests/bpf/Makefile | 20 +-
tools/testing/selftests/bpf/disasm_helpers.c | 69 ++++
tools/testing/selftests/bpf/disasm_helpers.h | 12 +
tools/testing/selftests/bpf/netlink_helpers.c | 358 ++++++++++++++++++
tools/testing/selftests/bpf/netlink_helpers.h | 46 +++
.../selftests/bpf/prog_tests/ctx_rewrite.c | 118 +-----
tools/testing/selftests/bpf/progs/bpf_misc.h | 16 +-
tools/testing/selftests/bpf/test_loader.c | 235 +++++++++---
tools/testing/selftests/bpf/test_progs.h | 1 -
tools/testing/selftests/bpf/test_verifier.c | 47 +--
tools/testing/selftests/bpf/testing_helpers.c | 43 +++
tools/testing/selftests/bpf/testing_helpers.h | 6 +
12 files changed, 759 insertions(+), 212 deletions(-)
create mode 100644 tools/testing/selftests/bpf/disasm_helpers.c
create mode 100644 tools/testing/selftests/bpf/disasm_helpers.h
create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h
--
2.47.0
From: yangge <yangge1116(a)126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the free
pages required for compaction reside in CMA pageblocks, it's
possible that __compaction_suitable() always returns true, and in
some cases that's not acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.
During the start-up of the virtual machine, it calls
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long-term GUP cannot allocate memory from the CMA area, so a maximum
of 16 GB of non-CMA memory on a NUMA node can be used as virtual
machine memory. Since there is 16 GB of free CMA memory on the NUMA
node, the watermark for order-0 is always met for compaction, so
__compaction_suitable() always returns true, even if the node is
unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should remove
ALLOC_CMA from both __compaction_suitable() and __isolate_free_page()
in the long-term GUP flow. After this fix, starting a 32GB virtual machine
with device passthrough takes only a few seconds.
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
V5:
- add 'alloc_flags' parameter for __isolate_free_page()
- remove 'usa_cma' variable
V4:
- rich the commit log description
V3:
- fix build errors
- add ALLOC_CMA both in should_continue_reclaim() and compaction_ready()
V2:
- using the 'cc->alloc_flags' to determine if 'ALLOC_CMA' is needed
- rich the commit log description
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 20 +++++++++++---------
mm/internal.h | 3 ++-
mm/page_alloc.c | 6 ++++--
mm/page_isolation.c | 3 ++-
mm/page_reporting.c | 2 +-
mm/vmscan.c | 4 ++--
7 files changed, 26 insertions(+), 18 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..b4c3ac3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..b10d921 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -655,7 +655,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, cc->alloc_flags);
if (!isolated)
break;
set_page_private(page, order);
@@ -1634,7 +1634,7 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, cc->alloc_flags)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -2381,6 +2381,7 @@ static enum compact_result compact_finished(struct compact_control *cc)
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
@@ -2395,25 +2396,26 @@ static bool __compaction_suitable(struct zone *zone, int order,
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ alloc_flags & ALLOC_CMA, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2474,7 +2476,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2499,7 +2501,7 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
diff --git a/mm/internal.h b/mm/internal.h
index 3922788..6d257c8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,7 +662,8 @@ static inline void clear_zone_contiguous(struct zone *zone)
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..ecb2fd7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2809,7 +2809,8 @@ void split_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0,
+ alloc_flags & ALLOC_CMA))
return 0;
}
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c608e9d..a1f2c79 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -229,7 +229,8 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order,
+ ALLOC_CMA);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e..fd3813b 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -198,7 +198,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, ALLOC_CMA)) {
next = page;
break;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..33f5b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6043,7 +6043,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
--
2.7.4
Recently we found that fifo_read() and fifo_write() are broken in our
5.15 kernel after rebasing to the latest 5.15.y. 5.15.y integrated
commit e635f652696e ("serial: sc16is7xx: convert from _raw_ to
_noinc_ regmap functions for FIFO"), but it forgot to integrate a
prerequisite commit, 3837a0379533 ("serial: sc16is7xx: improve regmap
debugfs by using one regmap per port").
On top of the prerequisite commit, there are also 4 commits fixing it,
so in total I backported 5 patches to 5.15.y to fix this regression.
0002-xxx and 0004-xxx apply cleanly to 5.15.y; the remaining
3 patches needed some conflict resolution.
Hugo Villeneuve (5):
serial: sc16is7xx: improve regmap debugfs by using one regmap per port
serial: sc16is7xx: remove wasteful static buffer in
sc16is7xx_regmap_name()
serial: sc16is7xx: remove global regmap from struct sc16is7xx_port
serial: sc16is7xx: remove unused line structure member
serial: sc16is7xx: change EFR lock to operate on each channels
drivers/tty/serial/sc16is7xx.c | 185 +++++++++++++++++++--------------
1 file changed, 107 insertions(+), 78 deletions(-)
--
2.34.1
From: Steven Rostedt <rostedt(a)goodmis.org>
The TP_printk() portion of a trace event is executed at the time an event
is read from the trace. This can happen seconds, minutes, hours, days,
months, possibly even years after the event was recorded. If the print
format contains a dereference to a string via "%s", and that string was
allocated, there's a chance that string could be freed before it is read
by the trace file.
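As a hypothetical illustration of the hazard (not an event from this patch
set): an event that only records a pointer and prints it with "%s" depends
on that memory still being valid at read time, while __string()/__get_str()
copies the text into the ring buffer along with the event.

/* Hypothetical event, for illustration only. */
TRACE_EVENT(example_open,

	TP_PROTO(const char *name),

	TP_ARGS(name),

	TP_STRUCT__entry(
		__field(const char *, name_ptr)	/* stores only the pointer  */
		__string(name, name)		/* reserves room for a copy */
	),

	TP_fast_assign(
		__entry->name_ptr = name;
		__assign_str(name);		/* (name, name) on older kernels */
	),

	/*
	 * "%s" on name_ptr dereferences memory outside the ring buffer, which
	 * may be freed by the time the trace is read; __get_str(name) reads
	 * the copy stored with the event and is always safe.
	 */
	TP_printk("ptr=%s copy=%s", __entry->name_ptr, __get_str(name))
);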
To protect against such bugs, there are two functions that verify the
event. The first one is test_event_printk(), which is called when the
event is created. It reads the TP_printk() format as well as its arguments
to make sure nothing may be dereferencing a pointer that was not copied
into the ring buffer along with the event. If it is, it will trigger a
WARN_ON().
For strings that use "%s", it is not so easy. The string may not reside in
the ring buffer but may still be valid. Strings that are static and part
of the kernel proper which will not be freed for the life of the running
system, are safe to dereference. But to know if it is a pointer to a
static string or to something on the heap can not be determined until the
event is triggered.
This brings us to the second function that tests for the bad dereferencing
of strings, trace_check_vprintf(). It would walk through the printf format
looking for "%s", and when it finds one, it would validate that the pointer
is safe to read. If not, it would produce a WARN_ON() as well and write
"[UNSAFE-MEMORY]" into the trace output.
The problem with this is how it used va_list to have vsnprintf() handle
all the cases that it didn't need to check. Instead of re-implementing
vsnprintf(), it would make a copy of the format up to the %s part, and
call vsnprintf() with the current va_list ap variable, where the ap would
then be ready to point at the string in question.
For architectures that passed va_list by reference this was possible. For
architectures that passed it by copy it was not. A test_can_verify()
function was used to differentiate between the two, and if it wasn't
possible, it would disable it.
Even for architectures where this was feasible, it was a stretch to rely
on such an undocumented method, which could cause issues later on
with new compiler optimizations.
Instead, the first function test_event_printk() was updated to look at
"%s" as well. If the "%s" argument is a pointer outside the event in the
ring buffer, it would find the field type of the event that is the problem
and mark the structure with a new flag called "needs_test". The event
itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that
this event has a field that needs to be verified before the event can be
printed using the printf format.
When the event fields are created from the field type structure, the
fields would copy the field type's "needs_test" value.
Finally, before being printed, a new function ignore_event() is called
which will check if the event has the TEST_STR flag set (if not, it
returns false). If the flag is set, it then iterates through the event's
fields looking for the ones that have the "needs_test" flag set.
Then it uses the offset field from the field structure to find the pointer
in the ring buffer event. It runs the tests to make sure that pointer is
safe to print and if not, it triggers the WARN_ON() and also adds to the
trace output that the event in question has an unsafe memory access.
ignore_event() makes trace_check_vprintf() obsolete, so it is
removed.
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9…
Cc: stable(a)vger.kernel.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
include/linux/trace_events.h | 6 +-
kernel/trace/trace.c | 255 ++++++++---------------------------
kernel/trace/trace.h | 6 +-
kernel/trace/trace_events.c | 32 +++--
kernel/trace/trace_output.c | 6 +-
5 files changed, 88 insertions(+), 217 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2a5df5b62cfc..91b8ffbdfa8c 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -273,7 +273,8 @@ struct trace_event_fields {
const char *name;
const int size;
const int align;
- const int is_signed;
+ const unsigned int is_signed:1;
+ unsigned int needs_test:1;
const int filter_type;
const int len;
};
@@ -324,6 +325,7 @@ enum {
TRACE_EVENT_FL_EPROBE_BIT,
TRACE_EVENT_FL_FPROBE_BIT,
TRACE_EVENT_FL_CUSTOM_BIT,
+ TRACE_EVENT_FL_TEST_STR_BIT,
};
/*
@@ -340,6 +342,7 @@ enum {
* CUSTOM - Event is a custom event (to be attached to an exsiting tracepoint)
* This is set when the custom event has not been attached
* to a tracepoint yet, then it is cleared when it is.
+ * TEST_STR - The event has a "%s" that points to a string outside the event
*/
enum {
TRACE_EVENT_FL_CAP_ANY = (1 << TRACE_EVENT_FL_CAP_ANY_BIT),
@@ -352,6 +355,7 @@ enum {
TRACE_EVENT_FL_EPROBE = (1 << TRACE_EVENT_FL_EPROBE_BIT),
TRACE_EVENT_FL_FPROBE = (1 << TRACE_EVENT_FL_FPROBE_BIT),
TRACE_EVENT_FL_CUSTOM = (1 << TRACE_EVENT_FL_CUSTOM_BIT),
+ TRACE_EVENT_FL_TEST_STR = (1 << TRACE_EVENT_FL_TEST_STR_BIT),
};
#define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..7cc18b9bce27 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3611,17 +3611,12 @@ char *trace_iter_expand_format(struct trace_iterator *iter)
}
/* Returns true if the string is safe to dereference from an event */
-static bool trace_safe_str(struct trace_iterator *iter, const char *str,
- bool star, int len)
+static bool trace_safe_str(struct trace_iterator *iter, const char *str)
{
unsigned long addr = (unsigned long)str;
struct trace_event *trace_event;
struct trace_event_call *event;
- /* Ignore strings with no length */
- if (star && !len)
- return true;
-
/* OK if part of the event data */
if ((addr >= (unsigned long)iter->ent) &&
(addr < (unsigned long)iter->ent + iter->ent_size))
@@ -3661,181 +3656,69 @@ static bool trace_safe_str(struct trace_iterator *iter, const char *str,
return false;
}
-static DEFINE_STATIC_KEY_FALSE(trace_no_verify);
-
-static int test_can_verify_check(const char *fmt, ...)
-{
- char buf[16];
- va_list ap;
- int ret;
-
- /*
- * The verifier is dependent on vsnprintf() modifies the va_list
- * passed to it, where it is sent as a reference. Some architectures
- * (like x86_32) passes it by value, which means that vsnprintf()
- * does not modify the va_list passed to it, and the verifier
- * would then need to be able to understand all the values that
- * vsnprintf can use. If it is passed by value, then the verifier
- * is disabled.
- */
- va_start(ap, fmt);
- vsnprintf(buf, 16, "%d", ap);
- ret = va_arg(ap, int);
- va_end(ap);
-
- return ret;
-}
-
-static void test_can_verify(void)
-{
- if (!test_can_verify_check("%d %d", 0, 1)) {
- pr_info("trace event string verifier disabled\n");
- static_branch_inc(&trace_no_verify);
- }
-}
-
/**
- * trace_check_vprintf - Check dereferenced strings while writing to the seq buffer
+ * ignore_event - Check dereferenced fields while writing to the seq buffer
* @iter: The iterator that holds the seq buffer and the event being printed
- * @fmt: The format used to print the event
- * @ap: The va_list holding the data to print from @fmt.
*
- * This writes the data into the @iter->seq buffer using the data from
- * @fmt and @ap. If the format has a %s, then the source of the string
- * is examined to make sure it is safe to print, otherwise it will
- * warn and print "[UNSAFE MEMORY]" in place of the dereferenced string
- * pointer.
+ * At boot up, test_event_printk() will flag any event that dereferences
+ * a string with "%s" that does exist in the ring buffer. It may still
+ * be valid, as the string may point to a static string in the kernel
+ * rodata that never gets freed. But if the string pointer is pointing
+ * to something that was allocated, there's a chance that it can be freed
+ * by the time the user reads the trace. This would cause a bad memory
+ * access by the kernel and possibly crash the system.
+ *
+ * This function will check if the event has any fields flagged as needing
+ * to be checked at runtime and perform those checks.
+ *
+ * If it is found that a field is unsafe, it will write into the @iter->seq
+ * a message stating what was found to be unsafe.
+ *
+ * @return: true if the event is unsafe and should be ignored,
+ * false otherwise.
*/
-void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
- va_list ap)
+bool ignore_event(struct trace_iterator *iter)
{
- long text_delta = 0;
- long data_delta = 0;
- const char *p = fmt;
- const char *str;
- bool good;
- int i, j;
+ struct ftrace_event_field *field;
+ struct trace_event *trace_event;
+ struct trace_event_call *event;
+ struct list_head *head;
+ struct trace_seq *seq;
+ const void *ptr;
- if (WARN_ON_ONCE(!fmt))
- return;
+ trace_event = ftrace_find_event(iter->ent->type);
- if (static_branch_unlikely(&trace_no_verify))
- goto print;
+ seq = &iter->seq;
- /*
- * When the kernel is booted with the tp_printk command line
- * parameter, trace events go directly through to printk().
- * It also is checked by this function, but it does not
- * have an associated trace_array (tr) for it.
- */
- if (iter->tr) {
- text_delta = iter->tr->text_delta;
- data_delta = iter->tr->data_delta;
+ if (!trace_event) {
+ trace_seq_printf(seq, "EVENT ID %d NOT FOUND?\n", iter->ent->type);
+ return true;
}
- /* Don't bother checking when doing a ftrace_dump() */
- if (iter->fmt == static_fmt_buf)
- goto print;
-
- while (*p) {
- bool star = false;
- int len = 0;
-
- j = 0;
-
- /*
- * We only care about %s and variants
- * as well as %p[sS] if delta is non-zero
- */
- for (i = 0; p[i]; i++) {
- if (i + 1 >= iter->fmt_size) {
- /*
- * If we can't expand the copy buffer,
- * just print it.
- */
- if (!trace_iter_expand_format(iter))
- goto print;
- }
-
- if (p[i] == '\\' && p[i+1]) {
- i++;
- continue;
- }
- if (p[i] == '%') {
- /* Need to test cases like %08.*s */
- for (j = 1; p[i+j]; j++) {
- if (isdigit(p[i+j]) ||
- p[i+j] == '.')
- continue;
- if (p[i+j] == '*') {
- star = true;
- continue;
- }
- break;
- }
- if (p[i+j] == 's')
- break;
-
- if (text_delta && p[i+1] == 'p' &&
- ((p[i+2] == 's' || p[i+2] == 'S')))
- break;
-
- star = false;
- }
- j = 0;
- }
- /* If no %s found then just print normally */
- if (!p[i])
- break;
-
- /* Copy up to the %s, and print that */
- strncpy(iter->fmt, p, i);
- iter->fmt[i] = '\0';
- trace_seq_vprintf(&iter->seq, iter->fmt, ap);
+ event = container_of(trace_event, struct trace_event_call, event);
+ if (!(event->flags & TRACE_EVENT_FL_TEST_STR))
+ return false;
- /* Add delta to %pS pointers */
- if (p[i+1] == 'p') {
- unsigned long addr;
- char fmt[4];
+ head = trace_get_fields(event);
+ if (!head) {
+ trace_seq_printf(seq, "FIELDS FOR EVENT '%s' NOT FOUND?\n",
+ trace_event_name(event));
+ return true;
+ }
- fmt[0] = '%';
- fmt[1] = 'p';
- fmt[2] = p[i+2]; /* Either %ps or %pS */
- fmt[3] = '\0';
+ /* Offsets are from the iter->ent that points to the raw event */
+ ptr = iter->ent;
- addr = va_arg(ap, unsigned long);
- addr += text_delta;
- trace_seq_printf(&iter->seq, fmt, (void *)addr);
+ list_for_each_entry(field, head, link) {
+ const char *str;
+ bool good;
- p += i + 3;
+ if (!field->needs_test)
continue;
- }
- /*
- * If iter->seq is full, the above call no longer guarantees
- * that ap is in sync with fmt processing, and further calls
- * to va_arg() can return wrong positional arguments.
- *
- * Ensure that ap is no longer used in this case.
- */
- if (iter->seq.full) {
- p = "";
- break;
- }
-
- if (star)
- len = va_arg(ap, int);
-
- /* The ap now points to the string data of the %s */
- str = va_arg(ap, const char *);
+ str = *(const char **)(ptr + field->offset);
- good = trace_safe_str(iter, str, star, len);
-
- /* Could be from the last boot */
- if (data_delta && !good) {
- str += data_delta;
- good = trace_safe_str(iter, str, star, len);
- }
+ good = trace_safe_str(iter, str);
/*
* If you hit this warning, it is likely that the
@@ -3846,44 +3729,14 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
* instead. See samples/trace_events/trace-events-sample.h
* for reference.
*/
- if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
- fmt, seq_buf_str(&iter->seq.seq))) {
- int ret;
-
- /* Try to safely read the string */
- if (star) {
- if (len + 1 > iter->fmt_size)
- len = iter->fmt_size - 1;
- if (len < 0)
- len = 0;
- ret = copy_from_kernel_nofault(iter->fmt, str, len);
- iter->fmt[len] = 0;
- star = false;
- } else {
- ret = strncpy_from_kernel_nofault(iter->fmt, str,
- iter->fmt_size);
- }
- if (ret < 0)
- trace_seq_printf(&iter->seq, "(0x%px)", str);
- else
- trace_seq_printf(&iter->seq, "(0x%px:%s)",
- str, iter->fmt);
- str = "[UNSAFE-MEMORY]";
- strcpy(iter->fmt, "%s");
- } else {
- strncpy(iter->fmt, p + i, j + 1);
- iter->fmt[j+1] = '\0';
+ if (WARN_ONCE(!good, "event '%s' has unsafe pointer field '%s'",
+ trace_event_name(event), field->name)) {
+ trace_seq_printf(seq, "EVENT %s: HAS UNSAFE POINTER FIELD '%s'\n",
+ trace_event_name(event), field->name);
+ return true;
}
- if (star)
- trace_seq_printf(&iter->seq, iter->fmt, len, str);
- else
- trace_seq_printf(&iter->seq, iter->fmt, str);
-
- p += i + j + 1;
}
- print:
- if (*p)
- trace_seq_vprintf(&iter->seq, p, ap);
+ return false;
}
const char *trace_event_format(struct trace_iterator *iter, const char *fmt)
@@ -10777,8 +10630,6 @@ __init static int tracer_alloc_buffers(void)
register_snapshot_cmd();
- test_can_verify();
-
return 0;
out_free_pipe_cpumask:
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 266740b4e121..9691b47b5f3d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -667,9 +667,8 @@ void trace_buffer_unlock_commit_nostack(struct trace_buffer *buffer,
bool trace_is_tracepoint_string(const char *str);
const char *trace_event_format(struct trace_iterator *iter, const char *fmt);
-void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
- va_list ap) __printf(2, 0);
char *trace_iter_expand_format(struct trace_iterator *iter);
+bool ignore_event(struct trace_iterator *iter);
int trace_empty(struct trace_iterator *iter);
@@ -1413,7 +1412,8 @@ struct ftrace_event_field {
int filter_type;
int offset;
int size;
- int is_signed;
+ unsigned int is_signed:1;
+ unsigned int needs_test:1;
int len;
};
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 521ad2fd1fe7..1545cc8b49d0 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -82,7 +82,7 @@ static int system_refcount_dec(struct event_subsystem *system)
}
static struct ftrace_event_field *
-__find_event_field(struct list_head *head, char *name)
+__find_event_field(struct list_head *head, const char *name)
{
struct ftrace_event_field *field;
@@ -114,7 +114,8 @@ trace_find_event_field(struct trace_event_call *call, char *name)
static int __trace_define_field(struct list_head *head, const char *type,
const char *name, int offset, int size,
- int is_signed, int filter_type, int len)
+ int is_signed, int filter_type, int len,
+ int need_test)
{
struct ftrace_event_field *field;
@@ -133,6 +134,7 @@ static int __trace_define_field(struct list_head *head, const char *type,
field->offset = offset;
field->size = size;
field->is_signed = is_signed;
+ field->needs_test = need_test;
field->len = len;
list_add(&field->link, head);
@@ -151,13 +153,13 @@ int trace_define_field(struct trace_event_call *call, const char *type,
head = trace_get_fields(call);
return __trace_define_field(head, type, name, offset, size,
- is_signed, filter_type, 0);
+ is_signed, filter_type, 0, 0);
}
EXPORT_SYMBOL_GPL(trace_define_field);
static int trace_define_field_ext(struct trace_event_call *call, const char *type,
const char *name, int offset, int size, int is_signed,
- int filter_type, int len)
+ int filter_type, int len, int need_test)
{
struct list_head *head;
@@ -166,13 +168,13 @@ static int trace_define_field_ext(struct trace_event_call *call, const char *typ
head = trace_get_fields(call);
return __trace_define_field(head, type, name, offset, size,
- is_signed, filter_type, len);
+ is_signed, filter_type, len, need_test);
}
#define __generic_field(type, item, filter_type) \
ret = __trace_define_field(&ftrace_generic_fields, #type, \
#item, 0, 0, is_signed_type(type), \
- filter_type, 0); \
+ filter_type, 0, 0); \
if (ret) \
return ret;
@@ -181,7 +183,8 @@ static int trace_define_field_ext(struct trace_event_call *call, const char *typ
"common_" #item, \
offsetof(typeof(ent), item), \
sizeof(ent.item), \
- is_signed_type(type), FILTER_OTHER, 0); \
+ is_signed_type(type), FILTER_OTHER, \
+ 0, 0); \
if (ret) \
return ret;
@@ -332,6 +335,7 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
/* Return true if the string is safe */
static bool process_string(const char *fmt, int len, struct trace_event_call *call)
{
+ struct trace_event_fields *field;
const char *r, *e, *s;
e = fmt + len;
@@ -372,8 +376,16 @@ static bool process_string(const char *fmt, int len, struct trace_event_call *ca
if (process_pointer(fmt, len, call))
return true;
- /* Make sure the field is found, and consider it OK for now if it is */
- return find_event_field(fmt, call) != NULL;
+ /* Make sure the field is found */
+ field = find_event_field(fmt, call);
+ if (!field)
+ return false;
+
+ /* Test this field's string before printing the event */
+ call->flags |= TRACE_EVENT_FL_TEST_STR;
+ field->needs_test = 1;
+
+ return true;
}
/*
@@ -2586,7 +2598,7 @@ event_define_fields(struct trace_event_call *call)
ret = trace_define_field_ext(call, field->type, field->name,
offset, field->size,
field->is_signed, field->filter_type,
- field->len);
+ field->len, field->needs_test);
if (WARN_ON_ONCE(ret)) {
pr_err("error code is %d\n", ret);
break;
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index da748b7cbc4d..03d56f711ad1 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -317,10 +317,14 @@ EXPORT_SYMBOL(trace_raw_output_prep);
void trace_event_printf(struct trace_iterator *iter, const char *fmt, ...)
{
+ struct trace_seq *s = &iter->seq;
va_list ap;
+ if (ignore_event(iter))
+ return;
+
va_start(ap, fmt);
- trace_check_vprintf(iter, trace_event_format(iter, fmt), ap);
+ trace_seq_vprintf(s, trace_event_format(iter, fmt), ap);
va_end(ap);
}
EXPORT_SYMBOL(trace_event_printf);
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The test_event_printk() code makes sure that when a trace event is
registered, any dereferenced pointers in from the event's TP_printk() are
pointing to content in the ring buffer. But currently it does not handle
"%s", as there's cases where the string pointer saved in the ring buffer
points to a static string in the kernel that will never be freed. As that
is a valid case, the pointer needs to be checked at runtime.
Currently the runtime check is done via trace_check_vprintf(), but to not
have to replicate everything in vsnprintf() it does some logic with the
va_list that may not be reliable across architectures. In order to get rid
of that logic, more work in the test_event_printk() needs to be done. Some
of the strings can be validated at this time when it is obvious the string
is valid because the string will be saved in the ring buffer content.
Do all the validation of strings in the ring buffer at boot in
test_event_printk(), and make sure that the fields of the strings that
point into the kernel are accessible. This will allow adding checks at
runtime that validate the fields themselves and do not rely on parsing
the TP_printk() format at runtime.
Cc: stable(a)vger.kernel.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 104 ++++++++++++++++++++++++++++++------
1 file changed, 89 insertions(+), 15 deletions(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index df75c06bb23f..521ad2fd1fe7 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -244,19 +244,16 @@ int trace_event_get_offsets(struct trace_event_call *call)
return tail->offset + tail->size;
}
-/*
- * Check if the referenced field is an array and return true,
- * as arrays are OK to dereference.
- */
-static bool test_field(const char *fmt, struct trace_event_call *call)
+
+static struct trace_event_fields *find_event_field(const char *fmt,
+ struct trace_event_call *call)
{
struct trace_event_fields *field = call->class->fields_array;
- const char *array_descriptor;
const char *p = fmt;
int len;
if (!(len = str_has_prefix(fmt, "REC->")))
- return false;
+ return NULL;
fmt += len;
for (p = fmt; *p; p++) {
if (!isalnum(*p) && *p != '_')
@@ -267,11 +264,26 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
for (; field->type; field++) {
if (strncmp(field->name, fmt, len) || field->name[len])
continue;
- array_descriptor = strchr(field->type, '[');
- /* This is an array and is OK to dereference. */
- return array_descriptor != NULL;
+
+ return field;
}
- return false;
+ return NULL;
+}
+
+/*
+ * Check if the referenced field is an array and return true,
+ * as arrays are OK to dereference.
+ */
+static bool test_field(const char *fmt, struct trace_event_call *call)
+{
+ struct trace_event_fields *field;
+
+ field = find_event_field(fmt, call);
+ if (!field)
+ return false;
+
+ /* This is an array and is OK to dereference. */
+ return strchr(field->type, '[') != NULL;
}
/* Look for a string within an argument */
@@ -317,6 +329,53 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
return false;
}
+/* Return true if the string is safe */
+static bool process_string(const char *fmt, int len, struct trace_event_call *call)
+{
+ const char *r, *e, *s;
+
+ e = fmt + len;
+
+ /*
+ * There are several helper functions that return strings.
+ * If the argument contains a function, then assume its field is valid.
+ * It is considered that the argument has a function if it has:
+ * alphanumeric or '_' before a parenthesis.
+ */
+ s = fmt;
+ do {
+ r = strstr(s, "(");
+ if (!r || r >= e)
+ break;
+ for (int i = 1; r - i >= s; i++) {
+ char ch = *(r - i);
+ if (isspace(ch))
+ continue;
+ if (isalnum(ch) || ch == '_')
+ return true;
+ /* Anything else, this isn't a function */
+ break;
+ }
+ /* A function could be wrapped in parethesis, try the next one */
+ s = r + 1;
+ } while (s < e);
+
+ /*
+ * If there's any strings in the argument consider this arg OK as it
+ * could be: REC->field ? "foo" : "bar" and we don't want to get into
+ * verifying that logic here.
+ */
+ if (find_print_string(fmt, "\"", e))
+ return true;
+
+ /* Dereferenced strings are also valid like any other pointer */
+ if (process_pointer(fmt, len, call))
+ return true;
+
+ /* Make sure the field is found, and consider it OK for now if it is */
+ return find_event_field(fmt, call) != NULL;
+}
+
/*
* Examine the print fmt of the event looking for unsafe dereference
* pointers using %p* that could be recorded in the trace event and
@@ -326,6 +385,7 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
static void test_event_printk(struct trace_event_call *call)
{
u64 dereference_flags = 0;
+ u64 string_flags = 0;
bool first = true;
const char *fmt;
int parens = 0;
@@ -416,8 +476,16 @@ static void test_event_printk(struct trace_event_call *call)
star = true;
continue;
}
- if ((fmt[i + j] == 's') && star)
- arg++;
+ if ((fmt[i + j] == 's')) {
+ if (star)
+ arg++;
+ if (WARN_ONCE(arg == 63,
+ "Too many args for event: %s",
+ trace_event_name(call)))
+ return;
+ dereference_flags |= 1ULL << arg;
+ string_flags |= 1ULL << arg;
+ }
break;
}
break;
@@ -464,7 +532,10 @@ static void test_event_printk(struct trace_event_call *call)
}
if (dereference_flags & (1ULL << arg)) {
- if (process_pointer(fmt + start_arg, e - start_arg, call))
+ if (string_flags & (1ULL << arg)) {
+ if (process_string(fmt + start_arg, e - start_arg, call))
+ dereference_flags &= ~(1ULL << arg);
+ } else if (process_pointer(fmt + start_arg, e - start_arg, call))
dereference_flags &= ~(1ULL << arg);
}
@@ -476,7 +547,10 @@ static void test_event_printk(struct trace_event_call *call)
}
if (dereference_flags & (1ULL << arg)) {
- if (process_pointer(fmt + start_arg, i - start_arg, call))
+ if (string_flags & (1ULL << arg)) {
+ if (process_string(fmt + start_arg, i - start_arg, call))
+ dereference_flags &= ~(1ULL << arg);
+ } else if (process_pointer(fmt + start_arg, i - start_arg, call))
dereference_flags &= ~(1ULL << arg);
}
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The process_pointer() helper function looks to see if various trace event
macros are used. These macros are for storing data in the event. This
makes it safe to dereference as the dereference will then point into the
event on the ring buffer where the content of the data stays with the
event itself.
A few helper functions were missing. Those were:
__get_rel_dynamic_array()
__get_dynamic_array_len()
__get_rel_dynamic_array_len()
__get_rel_sockaddr()
Also add a helper function find_print_string() to avoid needing a
middle-man variable to test if the string exists.
Cc: stable(a)vger.kernel.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 14e160a5b905..df75c06bb23f 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -274,6 +274,15 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
return false;
}
+/* Look for a string within an argument */
+static bool find_print_string(const char *arg, const char *str, const char *end)
+{
+ const char *r;
+
+ r = strstr(arg, str);
+ return r && r < end;
+}
+
/* Return true if the argument pointer is safe */
static bool process_pointer(const char *fmt, int len, struct trace_event_call *call)
{
@@ -292,9 +301,17 @@ static bool process_pointer(const char *fmt, int len, struct trace_event_call *c
a = strchr(fmt, '&');
if ((a && (a < r)) || test_field(r, call))
return true;
- } else if ((r = strstr(fmt, "__get_dynamic_array(")) && r < e) {
+ } else if (find_print_string(fmt, "__get_dynamic_array(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_rel_dynamic_array(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_dynamic_array_len(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_rel_dynamic_array_len(", e)) {
+ return true;
+ } else if (find_print_string(fmt, "__get_sockaddr(", e)) {
return true;
- } else if ((r = strstr(fmt, "__get_sockaddr(")) && r < e) {
+ } else if (find_print_string(fmt, "__get_rel_sockaddr(", e)) {
return true;
}
return false;
--
2.45.2
From: Steven Rostedt <rostedt(a)goodmis.org>
The test_event_printk() analyzes print formats of trace events looking for
cases where it may dereference a pointer that is not in the ring buffer
which can possibly be a bug when the trace event is read from the ring
buffer and the content of that pointer no longer exists.
The function needs to accurately go from one print format argument to the
next. It handles quotes and parentheses that may be included in an
argument. When it finds the start of the next argument, it uses a simple
"c = strchr(fmt + i, ',')" to find the end of that argument!
In order to include "%s" dereferencing, it needs to process the entire
content of the print format argument and not just the content up to the first
',' it finds. As there may be content like:
({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char
*access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux"
}; union kvm_mmu_page_role role; role.word = REC->role;
trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe
%sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level,
role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "",
access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? ""
: "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ?
"unsync" : "sync", 0); saved_ptr; })
Which is an example of a full argument of an existing event. As the code
already handles finding the next print format argument, process the
argument at the end of it and not the start of it. This way it has both
the start of the argument as well as the end of it.
Add a helper function "process_pointer()" that will do the processing during
the loop as well as at the end. It also makes the code cleaner and easier
to read.
Cc: stable(a)vger.kernel.org
Fixes: 5013f454a352c ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
kernel/trace/trace_events.c | 82 ++++++++++++++++++++++++-------------
1 file changed, 53 insertions(+), 29 deletions(-)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 77e68efbd43e..14e160a5b905 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -265,8 +265,7 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
len = p - fmt;
for (; field->type; field++) {
- if (strncmp(field->name, fmt, len) ||
- field->name[len])
+ if (strncmp(field->name, fmt, len) || field->name[len])
continue;
array_descriptor = strchr(field->type, '[');
/* This is an array and is OK to dereference. */
@@ -275,6 +274,32 @@ static bool test_field(const char *fmt, struct trace_event_call *call)
return false;
}
+/* Return true if the argument pointer is safe */
+static bool process_pointer(const char *fmt, int len, struct trace_event_call *call)
+{
+ const char *r, *e, *a;
+
+ e = fmt + len;
+
+ /* Find the REC-> in the argument */
+ r = strstr(fmt, "REC->");
+ if (r && r < e) {
+ /*
+ * Addresses of events on the buffer, or an array on the buffer is
+ * OK to dereference. There's ways to fool this, but
+ * this is to catch common mistakes, not malicious code.
+ */
+ a = strchr(fmt, '&');
+ if ((a && (a < r)) || test_field(r, call))
+ return true;
+ } else if ((r = strstr(fmt, "__get_dynamic_array(")) && r < e) {
+ return true;
+ } else if ((r = strstr(fmt, "__get_sockaddr(")) && r < e) {
+ return true;
+ }
+ return false;
+}
+
/*
* Examine the print fmt of the event looking for unsafe dereference
* pointers using %p* that could be recorded in the trace event and
@@ -285,12 +310,12 @@ static void test_event_printk(struct trace_event_call *call)
{
u64 dereference_flags = 0;
bool first = true;
- const char *fmt, *c, *r, *a;
+ const char *fmt;
int parens = 0;
char in_quote = 0;
int start_arg = 0;
int arg = 0;
- int i;
+ int i, e;
fmt = call->print_fmt;
@@ -403,42 +428,41 @@ static void test_event_printk(struct trace_event_call *call)
case ',':
if (in_quote || parens)
continue;
+ e = i;
i++;
while (isspace(fmt[i]))
i++;
- start_arg = i;
- if (!(dereference_flags & (1ULL << arg)))
- goto next_arg;
- /* Find the REC-> in the argument */
- c = strchr(fmt + i, ',');
- r = strstr(fmt + i, "REC->");
- if (r && (!c || r < c)) {
- /*
- * Addresses of events on the buffer,
- * or an array on the buffer is
- * OK to dereference.
- * There's ways to fool this, but
- * this is to catch common mistakes,
- * not malicious code.
- */
- a = strchr(fmt + i, '&');
- if ((a && (a < r)) || test_field(r, call))
+ /*
+ * If start_arg is zero, then this is the start of the
+ * first argument. The processing of the argument happens
+ * when the end of the argument is found, as it needs to
+ * handle paranthesis and such.
+ */
+ if (!start_arg) {
+ start_arg = i;
+ /* Balance out the i++ in the for loop */
+ i--;
+ continue;
+ }
+
+ if (dereference_flags & (1ULL << arg)) {
+ if (process_pointer(fmt + start_arg, e - start_arg, call))
dereference_flags &= ~(1ULL << arg);
- } else if ((r = strstr(fmt + i, "__get_dynamic_array(")) &&
- (!c || r < c)) {
- dereference_flags &= ~(1ULL << arg);
- } else if ((r = strstr(fmt + i, "__get_sockaddr(")) &&
- (!c || r < c)) {
- dereference_flags &= ~(1ULL << arg);
}
- next_arg:
- i--;
+ start_arg = i;
arg++;
+ /* Balance out the i++ in the for loop */
+ i--;
}
}
+ if (dereference_flags & (1ULL << arg)) {
+ if (process_pointer(fmt + start_arg, i - start_arg, call))
+ dereference_flags &= ~(1ULL << arg);
+ }
+
/*
* If you triggered the below warning, the trace event reported
* uses an unsafe dereference pointer %p*. As the data stored
--
2.45.2
commit 54bbee190d42166209185d89070c58a343bf514b upstream.
DDI0487K.a D13.3.1 describes the PMU overflow condition, which evaluates
to true if any counter's global enable (PMCR_EL0.E), overflow flag
(PMOVSSET_EL0[n]), and interrupt enable (PMINTENSET_EL1[n]) are all 1.
Of note, this does not require a counter to be enabled
(i.e. PMCNTENSET_EL0[n] = 1) to generate an overflow.
Align kvm_pmu_overflow_status() with the reality of the architecture
and stop using PMCNTENSET_EL0 as part of the overflow condition. The
bug was discovered while running an SBSA PMU test [*], which only sets
PMCR.E, PMOVSSET<0>, PMINTENSET<0>, and expects an overflow interrupt.
Cc: stable(a)vger.kernel.org
Fixes: 76d883c4e640 ("arm64: KVM: Add access handler for PMOVSSET and PMOVSCLR register")
Link: https://github.com/ARM-software/sbsa-acs/blob/master/test_pool/pmu/operatin…
Signed-off-by: Raghavendra Rao Ananta <rananta(a)google.com>
[ oliver: massaged changelog ]
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20241120005230.2335682-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
---
virt/kvm/arm/pmu.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 4c08fd009768..9732c5e0ed22 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -357,7 +357,6 @@ static u64 kvm_pmu_overflow_status(struct kvm_vcpu *vcpu)
if ((__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E)) {
reg = __vcpu_sys_reg(vcpu, PMOVSSET_EL0);
- reg &= __vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
reg &= __vcpu_sys_reg(vcpu, PMINTENSET_EL1);
reg &= kvm_pmu_valid_counter_mask(vcpu);
}
base-commit: 6708005a36cfdb32aa991ebd9ea172e1183231ef
--
2.47.1.613.gc27f4b7a9f-goog
Hi,
This is to report that after jumping from the generic 5.15.167 kernel to
5.15.170 I am apparently observing ext4 damage.
After a few days of regular daily use of 5.15.170, one morning my
ext4 partition refused to mount, complaining about a corrupted system
area (-117).
There were no unusual events preceding this. The device in question is
a laptop with healthy battery, also connected to AC permanently.
The laptop is privately owned by me, in daily use at home, so I am
100% aware of everything happening with it.
The filesystem in question lives on md raid1 with very asymmetric
members (SSD+HDD), so in the event of an emergency CPU halt or some
other abnormal stop while the filesystem was actively writing data, one
would not expect the raid members to stay in perfect sync.
After the incident, I've run the raid1 check multiple times, run
memtest multiple times from different boot media, and of course
consulted smartctl.
Nothing. No issues whatsoever except for this spontaneous ext4 damage.
Looking at git log for ext4 changes between 5.15.167 and 5.15.170
shows a few commits. All landed in 5.15.168.
Interestingly, one of them is a comeback of the (in)famous
91562895f803 "properly sync file size update after O_SYNC ..." which
caused some blowup 1 year ago due to "subtle interaction".
I've no idea whether 91562895f803 is related to this damage or not,
but it most definitely looks like some problem was introduced between
5.15.167 and 5.15.170 anyway.
And because there are apparently zero ext4 commits in 5.15 since
5.15.168 at the moment, I thought I'd report it.
Please CC me if you want me to see your reply and/or need more info
(I'm not subscribed to the normal flow).
Take care,
Nick
Hi Greg,
Due to the backport of 2320c9e to the 6.1.120, 6.6.66 and 6.12.5 kernel
series we have the following bug report:
https://gitlab.freedesktop.org/drm/amd/-/issues/3831
The commit 47f402a3e08113e0f5d8e1e6fcc197667a16022f should be backported
to 6.1, 6.6 and 6.12 stable branches.
--
Best, Philip
The patch titled
Subject: mm: hugetlb: independent PMD page table shared count
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-hugetlb-independent-pmd-page-table-shared-count.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Liu Shixin <liushixin2(a)huawei.com>
Subject: mm: hugetlb: independent PMD page table shared count
Date: Mon, 16 Dec 2024 15:11:47 +0800
The folio refcount may be increased unexpectedly through try_get_folio() by
callers such as split_huge_pages. In huge_pmd_unshare(), we use the refcount
to check whether a pmd page table is shared. The check is incorrect if
the refcount is increased by the above callers, and this can cause the page
table to be leaked:
BUG: Bad page state in process sh pfn:109324
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x66 pfn:0x109324
flags: 0x17ffff800000000(node=0|zone=2|lastcpupid=0xfffff)
page_type: f2(table)
raw: 017ffff800000000 0000000000000000 0000000000000000 0000000000000000
raw: 0000000000000066 0000000000000000 00000000f2000000 0000000000000000
page dumped because: nonzero mapcount
...
CPU: 31 UID: 0 PID: 7515 Comm: sh Kdump: loaded Tainted: G B 6.13.0-rc2master+ #7
Tainted: [B]=BAD_PAGE
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x80/0xf8
dump_stack+0x18/0x28
bad_page+0x8c/0x130
free_page_is_bad_report+0xa4/0xb0
free_unref_page+0x3cc/0x620
__folio_put+0xf4/0x158
split_huge_pages_all+0x1e0/0x3e8
split_huge_pages_write+0x25c/0x2d8
full_proxy_write+0x64/0xd8
vfs_write+0xcc/0x280
ksys_write+0x70/0x110
__arm64_sys_write+0x24/0x38
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x34/0x128
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x190/0x198
The issue may be triggered by damon, offline_page, page_idle, etc., which
will increase the refcount of the page table.
Fix it by introducing an independent PMD page table shared count. As
described by the comment, pt_index/pt_mm/pt_frag_refcount are used for s390
gmap, x86 pgds and powerpc, while pt_share_count is used for x86/arm64/riscv
pmds, so we can reuse the field as pt_share_count.
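As a simplified sketch of the idea (illustrative names only, not the kernel code), the PMD page table carries its own share counter, so an unrelated folio reference can no longer be mistaken for a sharer:
#include <stdatomic.h>
#include <stdbool.h>
/* Illustrative sketch: sharing is tracked in a dedicated counter,
 * independent of the folio refcount that try_get_folio() may bump. */
struct pmd_table {
        atomic_int share_count;         /* number of additional sharers */
};
static void pmd_table_share(struct pmd_table *pt)
{
        atomic_fetch_add(&pt->share_count, 1);
}
/* Returns true if a share was dropped; false means the table was not
 * shared and the caller must tear down the mapping itself. */
static bool pmd_table_unshare(struct pmd_table *pt)
{
        if (atomic_load(&pt->share_count) == 0)
                return false;
        atomic_fetch_sub(&pt->share_count, 1);
        return true;
}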
Link: https://lkml.kernel.org/r/20241216071147.3984217-1-liushixin2@huawei.com
Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Ken Chen <kenneth.w.chen(a)intel.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Nanyong Sun <sunnanyong(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 1 +
include/linux/mm_types.h | 30 ++++++++++++++++++++++++++++++
mm/hugetlb.c | 16 +++++++---------
3 files changed, 38 insertions(+), 9 deletions(-)
--- a/include/linux/mm.h~mm-hugetlb-independent-pmd-page-table-shared-count
+++ a/include/linux/mm.h
@@ -3125,6 +3125,7 @@ static inline bool pagetable_pmd_ctor(st
if (!pmd_ptlock_init(ptdesc))
return false;
__folio_set_pgtable(folio);
+ ptdesc_pmd_pts_init(ptdesc);
lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
}
--- a/include/linux/mm_types.h~mm-hugetlb-independent-pmd-page-table-shared-count
+++ a/include/linux/mm_types.h
@@ -445,6 +445,7 @@ FOLIO_MATCH(compound_head, _head_2a);
* @pt_index: Used for s390 gmap.
* @pt_mm: Used for x86 pgds.
* @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
+ * @pt_share_count: Used for HugeTLB PMD page table share count.
* @_pt_pad_2: Padding to ensure proper alignment.
* @ptl: Lock for the page table.
* @__page_type: Same as page->page_type. Unused for page tables.
@@ -471,6 +472,9 @@ struct ptdesc {
pgoff_t pt_index;
struct mm_struct *pt_mm;
atomic_t pt_frag_refcount;
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+ atomic_t pt_share_count;
+#endif
};
union {
@@ -516,6 +520,32 @@ static_assert(sizeof(struct ptdesc) <= s
const struct page *: (const struct ptdesc *)(p), \
struct page *: (struct ptdesc *)(p)))
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+ atomic_set(&ptdesc->pt_share_count, 0);
+}
+
+static inline void ptdesc_pmd_pts_inc(struct ptdesc *ptdesc)
+{
+ atomic_inc(&ptdesc->pt_share_count);
+}
+
+static inline void ptdesc_pmd_pts_dec(struct ptdesc *ptdesc)
+{
+ atomic_dec(&ptdesc->pt_share_count);
+}
+
+static inline int ptdesc_pmd_pts_count(struct ptdesc *ptdesc)
+{
+ return atomic_read(&ptdesc->pt_share_count);
+}
+#else
+static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
+{
+}
+#endif
+
/*
* Used for sizing the vmemmap region on some architectures
*/
--- a/mm/hugetlb.c~mm-hugetlb-independent-pmd-page-table-shared-count
+++ a/mm/hugetlb.c
@@ -7211,7 +7211,7 @@ pte_t *huge_pmd_share(struct mm_struct *
spte = hugetlb_walk(svma, saddr,
vma_mmu_pagesize(svma));
if (spte) {
- get_page(virt_to_page(spte));
+ ptdesc_pmd_pts_inc(virt_to_ptdesc(spte));
break;
}
}
@@ -7226,7 +7226,7 @@ pte_t *huge_pmd_share(struct mm_struct *
(pmd_t *)((unsigned long)spte & PAGE_MASK));
mm_inc_nr_pmds(mm);
} else {
- put_page(virt_to_page(spte));
+ ptdesc_pmd_pts_dec(virt_to_ptdesc(spte));
}
spin_unlock(&mm->page_table_lock);
out:
@@ -7238,10 +7238,6 @@ out:
/*
* unmap huge page backed by shared pte.
*
- * Hugetlb pte page is ref counted at the time of mapping. If pte is shared
- * indicated by page_count > 1, unmap is achieved by clearing pud and
- * decrementing the ref count. If count == 1, the pte page is not shared.
- *
* Called with page table lock held.
*
* returns: 1 successfully unmapped a shared pte page
@@ -7250,18 +7246,20 @@ out:
int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
+ unsigned long sz = huge_page_size(hstate_vma(vma));
pgd_t *pgd = pgd_offset(mm, addr);
p4d_t *p4d = p4d_offset(pgd, addr);
pud_t *pud = pud_offset(p4d, addr);
i_mmap_assert_write_locked(vma->vm_file->f_mapping);
hugetlb_vma_assert_locked(vma);
- BUG_ON(page_count(virt_to_page(ptep)) == 0);
- if (page_count(virt_to_page(ptep)) == 1)
+ if (sz != PMD_SIZE)
+ return 0;
+ if (!ptdesc_pmd_pts_count(virt_to_ptdesc(ptep)))
return 0;
pud_clear(pud);
- put_page(virt_to_page(ptep));
+ ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
mm_dec_nr_pmds(mm);
return 1;
}
_
Patches currently in -mm which might be from liushixin2(a)huawei.com are
mm-hugetlb-independent-pmd-page-table-shared-count.patch
The patch titled
Subject: mm, compaction: don't use ALLOC_CMA in long term GUP flow
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: yangge <yangge1116(a)126.com>
Subject: mm, compaction: don't use ALLOC_CMA in long term GUP flow
Date: Mon, 16 Dec 2024 19:54:04 +0800
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags in
__compaction_suitable()") allowed compaction to proceed when the free pages
required for compaction reside in the CMA pageblocks, it's possible that
__compaction_suitable() always returns true, and in some cases that is not
acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of
memory. I have configured 16GB of CMA memory on each NUMA node, and
starting a 32GB virtual machine with device passthrough is extremely slow,
taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory. Long
term GUP cannot allocate memory from the CMA area, so a maximum of 16 GB of
non-CMA memory on a NUMA node can be used as virtual machine memory. Since
there is 16 GB of free CMA memory on the NUMA node, the watermark for order-0
is always met for compaction, so __compaction_suitable() always returns
true, even if the node is unable to allocate non-CMA memory for the
virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should remove ALLOC_CMA
from both __compaction_suitable() and __isolate_free_page() in the long term
GUP flow. After this fix, starting a 32GB virtual machine with device
passthrough takes only a few seconds.
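As a simplified illustration of the intent (names and flag values are illustrative, not the mm code), free CMA pages should only count towards the compaction watermark when the allocation itself is allowed to use CMA:
#include <stdbool.h>
#define ALLOC_CMA 0x1u          /* illustrative flag value */
/* A long term GUP allocation does not pass ALLOC_CMA, so a node whose
 * free memory is mostly CMA no longer looks "suitable" to it. */
static bool watermark_ok(unsigned long free_pages,
                         unsigned long free_cma_pages,
                         unsigned long watermark,
                         unsigned int alloc_flags)
{
        unsigned long usable = free_pages;

        if (!(alloc_flags & ALLOC_CMA))
                usable -= free_cma_pages;       /* CMA pages are off limits */

        return usable > watermark;
}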
Link: https://lkml.kernel.org/r/1734350044-12928-1-git-send-email-yangge1116@126.…
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Signed-off-by: yangge <yangge1116(a)126.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 20 +++++++++++---------
mm/internal.h | 3 ++-
mm/page_alloc.c | 6 ++++--
mm/page_isolation.c | 3 ++-
mm/page_reporting.c | 2 +-
mm/vmscan.c | 4 ++--
7 files changed, 26 insertions(+), 18 deletions(-)
--- a/include/linux/compaction.h~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compac
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suita
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
--- a/mm/compaction.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/compaction.c
@@ -654,7 +654,7 @@ static unsigned long isolate_freepages_b
/* Found a free page, will break it into order-0 pages */
order = buddy_order(page);
- isolated = __isolate_free_page(page, order);
+ isolated = __isolate_free_page(page, order, cc->alloc_flags);
if (!isolated)
break;
set_page_private(page, order);
@@ -1633,7 +1633,7 @@ static void fast_isolate_freepages(struc
/* Isolate the page if available */
if (page) {
- if (__isolate_free_page(page, order)) {
+ if (__isolate_free_page(page, order, cc->alloc_flags)) {
set_page_private(page, order);
nr_isolated = 1 << order;
nr_scanned += nr_isolated - 1;
@@ -2379,6 +2379,7 @@ static enum compact_result compact_finis
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
@@ -2393,25 +2394,26 @@ static bool __compaction_suitable(struct
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ alloc_flags & ALLOC_CMA, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2472,7 +2474,7 @@ bool compaction_zonelist_suitable(struct
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2497,7 +2499,7 @@ compaction_suit_allocation_order(struct
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
--- a/mm/internal.h~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/internal.h
@@ -662,7 +662,8 @@ static inline void clear_zone_contiguous
zone->contiguous = false;
}
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
--- a/mm/page_alloc.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_alloc.c
@@ -2808,7 +2808,8 @@ void split_page(struct page *page, unsig
}
EXPORT_SYMBOL_GPL(split_page);
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+ unsigned int alloc_flags)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
@@ -2822,7 +2823,8 @@ int __isolate_free_page(struct page *pag
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ if (!zone_watermark_ok(zone, 0, watermark, 0,
+ alloc_flags & ALLOC_CMA))
return 0;
}
--- a/mm/page_isolation.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_isolation.c
@@ -229,7 +229,8 @@ static void unset_migratetype_isolate(st
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
- isolated_page = !!__isolate_free_page(page, order);
+ isolated_page = !!__isolate_free_page(page, order,
+ ALLOC_CMA);
/*
* Isolating a free page in an isolated pageblock
* is expected to always work as watermarks don't
--- a/mm/page_reporting.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_reporting.c
@@ -198,7 +198,7 @@ page_reporting_cycle(struct page_reporti
/* Attempt to pull page from list and place in scatterlist */
if (*offset) {
- if (!__isolate_free_page(page, order)) {
+ if (!__isolate_free_page(page, order, ALLOC_CMA)) {
next = page;
break;
}
--- a/mm/vmscan.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/vmscan.c
@@ -5861,7 +5861,7 @@ static inline bool should_continue_recla
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6089,7 +6089,7 @@ static inline bool compaction_ready(stru
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
_
Patches currently in -mm which might be from yangge1116(a)126.com are
mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow.patch
This is a note to let you know that I've just added the patch titled
iio: adc: ti-ads1119: fix sample size in scan struct for triggered buffer
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 54d394905c92b9ecc65c1f9b2692c8e10716d8e1 Mon Sep 17 00:00:00 2001
From: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Date: Mon, 2 Dec 2024 20:18:44 +0100
Subject: iio: adc: ti-ads1119: fix sample size in scan struct for triggered
buffer
This device returns signed, 16-bit samples as stated in its datasheet
(see 8.5.2 Data Format). That is in line with the scan_type definition
for the IIO_VOLTAGE channel, but 'unsigned int' is being used to read
and push the data to userspace.
Given that the size of that type depends on the architecture (at least
2 bytes to store values up to 65535, but its actual size is often 4
bytes), use the 's16' type to provide the same structure in all cases.
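As a small standalone illustration of the point (struct layout and names are illustrative, not the driver code), the same raw word changes meaning and size depending on the field type:
#include <stdint.h>
#include <stdio.h>
struct scan_wrong { unsigned int sample; int64_t timestamp; };
struct scan_right { int16_t sample; int64_t timestamp __attribute__((aligned(8))); };
int main(void)
{
        uint16_t raw = 0xFF38;          /* -200 in two's complement */

        printf("as unsigned int: %u\n", (unsigned int)raw);    /* 65336 */
        printf("as int16_t:      %d\n", (int)(int16_t)raw);    /* -200  */
        printf("sample field size: %zu vs %zu bytes\n",
               sizeof(((struct scan_wrong *)0)->sample),
               sizeof(((struct scan_right *)0)->sample));
        return 0;
}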
Fixes: a9306887eba4 ("iio: adc: ti-ads1119: Add driver")
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Reviewed-by: Francesco Dolcini <francesco.dolcini(a)toradex.com>
Link: https://patch.msgid.link/20241202-ti-ads1119_s16_chan-v1-1-fafe3136dc90@gma…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/ti-ads1119.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iio/adc/ti-ads1119.c b/drivers/iio/adc/ti-ads1119.c
index 2615a275acb3..c268e27eec12 100644
--- a/drivers/iio/adc/ti-ads1119.c
+++ b/drivers/iio/adc/ti-ads1119.c
@@ -500,7 +500,7 @@ static irqreturn_t ads1119_trigger_handler(int irq, void *private)
struct iio_dev *indio_dev = pf->indio_dev;
struct ads1119_state *st = iio_priv(indio_dev);
struct {
- unsigned int sample;
+ s16 sample;
s64 timestamp __aligned(8);
} scan;
unsigned int index;
--
2.47.1
This is a note to let you know that I've just added the patch titled
iio: inkern: call iio_device_put() only on mapped devices
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 64f43895b4457532a3cc524ab250b7a30739a1b1 Mon Sep 17 00:00:00 2001
From: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Date: Wed, 4 Dec 2024 20:13:42 +0900
Subject: iio: inkern: call iio_device_put() only on mapped devices
In the error path of iio_channel_get_all(), iio_device_put() is called
on all IIO devices, which can cause a refcount imbalance. Fix this error
by calling iio_device_put() only on IIO devices whose refcounts were
previously incremented by iio_device_get().
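The pattern being fixed, shown as a generic sketch (illustrative names, not the IIO code): on failure, drop only the references that were actually taken:
/* Illustrative sketch: unwind over the entries acquired so far
 * ("mapind"), not over the full requested count ("nummaps"). */
struct ref { int count; };
static int ref_get(struct ref *r) { r->count++; return 0; }
static void ref_put(struct ref *r) { r->count--; }
static int get_all(struct ref *refs, int nummaps)
{
        int mapind;

        for (mapind = 0; mapind < nummaps; mapind++) {
                if (ref_get(&refs[mapind]) < 0)
                        goto error_free;
        }
        return 0;

error_free:
        while (--mapind >= 0)           /* only what we took */
                ref_put(&refs[mapind]);
        return -1;
}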
Fixes: 314be14bb893 ("iio: Rename _st_ functions to loose the bit that meant the staging version.")
Signed-off-by: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Link: https://patch.msgid.link/20241204111342.1246706-1-joe@pf.is.s.u-tokyo.ac.jp
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/inkern.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iio/inkern.c b/drivers/iio/inkern.c
index 136b225b6bc8..9050a59129e6 100644
--- a/drivers/iio/inkern.c
+++ b/drivers/iio/inkern.c
@@ -500,7 +500,7 @@ struct iio_channel *iio_channel_get_all(struct device *dev)
return_ptr(chans);
error_free_chans:
- for (i = 0; i < nummaps; i++)
+ for (i = 0; i < mapind; i++)
iio_device_put(chans[i].indio_dev);
return ERR_PTR(ret);
}
--
2.47.1
This is a note to let you know that I've just added the patch titled
iio: adc: at91: call input_free_device() on allocated iio_dev
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From de6a73bad1743e9e81ea5a24c178c67429ff510b Mon Sep 17 00:00:00 2001
From: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Date: Sat, 7 Dec 2024 13:30:45 +0900
Subject: iio: adc: at91: call input_free_device() on allocated iio_dev
The current implementation of at91_ts_register() calls input_free_device()
on st->ts_input; however, the err label can be reached before the
allocated input device is stored to st->ts_input. Thus call
input_free_device() on input instead of st->ts_input.
Fixes: 84882b060301 ("iio: adc: at91_adc: Add support for touchscreens without TSMR")
Signed-off-by: Joe Hattori <joe(a)pf.is.s.u-tokyo.ac.jp>
Link: https://patch.msgid.link/20241207043045.1255409-1-joe@pf.is.s.u-tokyo.ac.jp
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/at91_adc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iio/adc/at91_adc.c b/drivers/iio/adc/at91_adc.c
index a3f0a2321666..5927756b749a 100644
--- a/drivers/iio/adc/at91_adc.c
+++ b/drivers/iio/adc/at91_adc.c
@@ -979,7 +979,7 @@ static int at91_ts_register(struct iio_dev *idev,
return ret;
err:
- input_free_device(st->ts_input);
+ input_free_device(input);
return ret;
}
--
2.47.1
This is a note to let you know that I've just added the patch titled
iio: adc: ad7173: fix using shared static info struct
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 36a44e05cd807a54e5ffad4b96d0d67f68ad8576 Mon Sep 17 00:00:00 2001
From: David Lechner <dlechner(a)baylibre.com>
Date: Wed, 27 Nov 2024 14:01:53 -0600
Subject: iio: adc: ad7173: fix using shared static info struct
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Fix a possible race condition during driver probe in the ad7173 driver
due to using a shared static info struct. If more than one instance of
the driver is probed at the same time, some of the info could be
overwritten by the other instance, leading to incorrect operation.
To fix this, make the static info struct const so that it is read-only
and make a copy of the info struct for each instance of the driver that
can be modified.
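A generic sketch of this approach (illustrative types and names, not the ad7173 code): the shared template stays const and each probed instance modifies only its own copy:
/* Illustrative sketch: per-instance copy of a shared const template,
 * so concurrent probes cannot clobber each other's settings. */
struct chip_info {
        int irq_line;
        int num_slots;
};
static const struct chip_info chip_info_template = {
        .irq_line  = -1,
        .num_slots = 1,
};
struct chip_state {
        struct chip_info info;          /* private, writable copy */
};
static void chip_probe_one(struct chip_state *st, int irq, int slots)
{
        st->info = chip_info_template;  /* start from shared defaults */
        st->info.irq_line  = irq;       /* per-instance adjustments */
        st->info.num_slots = slots;
}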
Reported-by: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
Fixes: 76a1e6a42802 ("iio: adc: ad7173: add AD7173 driver")
Signed-off-by: David Lechner <dlechner(a)baylibre.com>
Tested-by: Guillaume Ranquet <granquet(a)baylibre.com>
Link: https://patch.msgid.link/20241127-iio-adc-ad7313-fix-non-const-info-struct-…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/ad7173.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/iio/adc/ad7173.c b/drivers/iio/adc/ad7173.c
index 8a0c931ca83a..8b03c1e5567e 100644
--- a/drivers/iio/adc/ad7173.c
+++ b/drivers/iio/adc/ad7173.c
@@ -200,6 +200,7 @@ struct ad7173_channel {
struct ad7173_state {
struct ad_sigma_delta sd;
+ struct ad_sigma_delta_info sigma_delta_info;
const struct ad7173_device_info *info;
struct ad7173_channel *channels;
struct regulator_bulk_data regulators[3];
@@ -753,7 +754,7 @@ static int ad7173_disable_one(struct ad_sigma_delta *sd, unsigned int chan)
return ad_sd_write_reg(sd, AD7173_REG_CH(chan), 2, 0);
}
-static struct ad_sigma_delta_info ad7173_sigma_delta_info = {
+static const struct ad_sigma_delta_info ad7173_sigma_delta_info = {
.set_channel = ad7173_set_channel,
.append_status = ad7173_append_status,
.disable_all = ad7173_disable_all,
@@ -1403,7 +1404,7 @@ static int ad7173_fw_parse_device_config(struct iio_dev *indio_dev)
if (ret < 0)
return dev_err_probe(dev, ret, "Interrupt 'rdy' is required\n");
- ad7173_sigma_delta_info.irq_line = ret;
+ st->sigma_delta_info.irq_line = ret;
return ad7173_fw_parse_channel_config(indio_dev);
}
@@ -1436,8 +1437,9 @@ static int ad7173_probe(struct spi_device *spi)
spi->mode = SPI_MODE_3;
spi_setup(spi);
- ad7173_sigma_delta_info.num_slots = st->info->num_configs;
- ret = ad_sd_init(&st->sd, indio_dev, spi, &ad7173_sigma_delta_info);
+ st->sigma_delta_info = ad7173_sigma_delta_info;
+ st->sigma_delta_info.num_slots = st->info->num_configs;
+ ret = ad_sd_init(&st->sd, indio_dev, spi, &st->sigma_delta_info);
if (ret)
return ret;
--
2.47.1
This is a note to let you know that I've just added the patch titled
iio: adc: ti-ads1298: Add NULL check in ads1298_init
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From bcb394bb28e55312cace75362b8e489eb0e02a30 Mon Sep 17 00:00:00 2001
From: Charles Han <hanchunchao(a)inspur.com>
Date: Mon, 18 Nov 2024 17:02:08 +0800
Subject: iio: adc: ti-ads1298: Add NULL check in ads1298_init
devm_kasprintf() can return a NULL pointer on failure. A check on the
return value of such a call in ads1298_init() is missing. Add it.
Fixes: 00ef7708fa60 ("iio: adc: ti-ads1298: Add driver")
Signed-off-by: Charles Han <hanchunchao(a)inspur.com>
Link: https://patch.msgid.link/20241118090208.14586-1-hanchunchao@inspur.com
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/ti-ads1298.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/iio/adc/ti-ads1298.c b/drivers/iio/adc/ti-ads1298.c
index 36d43495f603..03f762415fa5 100644
--- a/drivers/iio/adc/ti-ads1298.c
+++ b/drivers/iio/adc/ti-ads1298.c
@@ -613,6 +613,8 @@ static int ads1298_init(struct iio_dev *indio_dev)
}
indio_dev->name = devm_kasprintf(dev, GFP_KERNEL, "ads129%u%s",
indio_dev->num_channels, suffix);
+ if (!indio_dev->name)
+ return -ENOMEM;
/* Enable internal test signal, double amplitude, double frequency */
ret = regmap_write(priv->regmap, ADS1298_REG_CONFIG2,
--
2.47.1
This is a note to let you know that I've just added the patch titled
iio: gyro: fxas21002c: Fix missing data update in trigger handler
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From fa13ac6cdf9b6c358e7d77c29fb60145c7a87965 Mon Sep 17 00:00:00 2001
From: Carlos Song <carlos.song(a)nxp.com>
Date: Sat, 16 Nov 2024 10:29:45 -0500
Subject: iio: gyro: fxas21002c: Fix missing data update in trigger handler
The fxas21002c_trigger_handler() may fail to acquire sample data because
runtime PM enters the autosuspend state and the sensor cannot return
sample data in standby mode.
Resume the sensor before reading the sample data into the buffer within the
trigger handler. After the data is read, place the sensor back into the
autosuspend state.
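The flow being introduced, as a simplified standalone sketch (all names are illustrative stand-ins, not the driver's runtime-PM helpers): wake the device before the read, put it back to autosuspend afterwards:
#include <stdbool.h>
struct dev_state { bool awake; };
static int dev_resume(struct dev_state *d)       { d->awake = true;  return 0; }
static void dev_autosuspend(struct dev_state *d) { d->awake = false; }
static int dev_read_sample(struct dev_state *d, int *out)
{
        if (!d->awake)
                return -1;              /* standby: no sample data */
        *out = 42;                      /* stand-in for the real read */
        return 0;
}
static int trigger_handler(struct dev_state *d)
{
        int sample, ret;

        ret = dev_resume(d);            /* leave autosuspend first */
        if (ret)
                return ret;

        ret = dev_read_sample(d, &sample);

        dev_autosuspend(d);             /* let the device idle again */
        return ret;
}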
Fixes: a0701b6263ae ("iio: gyro: add core driver for fxas21002c")
Signed-off-by: Carlos Song <carlos.song(a)nxp.com>
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
Link: https://patch.msgid.link/20241116152945.4006374-1-Frank.Li@nxp.com
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/gyro/fxas21002c_core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/iio/gyro/fxas21002c_core.c b/drivers/iio/gyro/fxas21002c_core.c
index 0391c78c2f18..754c8a564ba4 100644
--- a/drivers/iio/gyro/fxas21002c_core.c
+++ b/drivers/iio/gyro/fxas21002c_core.c
@@ -730,14 +730,21 @@ static irqreturn_t fxas21002c_trigger_handler(int irq, void *p)
int ret;
mutex_lock(&data->lock);
- ret = regmap_bulk_read(data->regmap, FXAS21002C_REG_OUT_X_MSB,
- data->buffer, CHANNEL_SCAN_MAX * sizeof(s16));
+ ret = fxas21002c_pm_get(data);
if (ret < 0)
goto out_unlock;
+ ret = regmap_bulk_read(data->regmap, FXAS21002C_REG_OUT_X_MSB,
+ data->buffer, CHANNEL_SCAN_MAX * sizeof(s16));
+ if (ret < 0)
+ goto out_pm_put;
+
iio_push_to_buffers_with_timestamp(indio_dev, data->buffer,
data->timestamp);
+out_pm_put:
+ fxas21002c_pm_put(data);
+
out_unlock:
mutex_unlock(&data->lock);
--
2.47.1
This is a note to let you know that I've just added the patch titled
iio: adc: ad7124: Disable all channels at probe time
to my char-misc git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
in the char-misc-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 4be339af334c283a1a1af3cb28e7e448a0aa8a7c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Uwe=20Kleine-K=C3=B6nig?= <u.kleine-koenig(a)baylibre.com>
Date: Mon, 4 Nov 2024 11:19:04 +0100
Subject: iio: adc: ad7124: Disable all channels at probe time
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When two channels are enabled during a measurement, two measurements are
done and reported sequentially in the DATA register. As the code
triggered by reading one of the sysfs properties expects that only one
channel is enabled, it only reads the first data set, which might or might
not belong to the intended channel.
To prevent this situation, disable all channels during probe. This fixes
a problem in practice because the reset default for channel 0 is
enabled, so all measurements before the first measurement on channel 0
(which disables channel 0 at the end) might report wrong values.
Fixes: 7b8d045e497a ("iio: adc: ad7124: allow more than 8 channels")
Reviewed-by: Nuno Sa <nuno.sa(a)analog.com>
Signed-off-by: Uwe Kleine-König <u.kleine-koenig(a)baylibre.com>
Link: https://patch.msgid.link/20241104101905.845737-2-u.kleine-koenig@baylibre.c…
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
---
drivers/iio/adc/ad7124.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iio/adc/ad7124.c b/drivers/iio/adc/ad7124.c
index 7314fb32bdec..3d678c420cbf 100644
--- a/drivers/iio/adc/ad7124.c
+++ b/drivers/iio/adc/ad7124.c
@@ -917,6 +917,9 @@ static int ad7124_setup(struct ad7124_state *st)
* set all channels to this default value.
*/
ad7124_set_channel_odr(st, i, 10);
+
+ /* Disable all channels to prevent unintended conversions. */
+ ad_sd_write_reg(&st->sd, AD7124_CHANNEL(i), 2, 0);
}
ret = ad_sd_write_reg(&st->sd, AD7124_ADC_CONTROL, 2, st->adc_control);
--
2.47.1
In commit 892f7237b3ff ("arm64: Delay initialisation of
cpuinfo_arm64::reg_{zcr,smcr}") we moved access to ZCR, SMCR and SMIDR
later in the boot process in order to ensure that we don't attempt to
interact with them if SVE or SME is disabled on the command line.
Unfortunately, when initialising the boot CPU in init_cpu_features() we work
on a copy of struct cpuinfo_arm64 for the boot CPU that is used only during
boot, not the percpu copy used by the sysfs code.
Fix this by moving the handling for SMIDR_EL1 for the boot CPU to
cpuinfo_store_boot_cpu() so it can operate on the percpu copy of the data.
This reduces the potential for error that could come from having both the
percpu and boot CPU copies in init_cpu_features().
This issue wasn't apparent when testing on emulated platforms that do not
report values in this ID register.
Fixes: 892f7237b3ff ("arm64: Delay initialisation of cpuinfo_arm64::reg_{zcr,smcr}")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
arch/arm64/kernel/cpufeature.c | 6 ------
arch/arm64/kernel/cpuinfo.c | 11 +++++++++++
2 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 6ce71f444ed84f9056196bb21bbfac61c9687e30..b88102fd2c20f77e25af6df513fda09a484e882e 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1167,12 +1167,6 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info)
id_aa64pfr1_sme(read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1))) {
unsigned long cpacr = cpacr_save_enable_kernel_sme();
- /*
- * We mask out SMPS since even if the hardware
- * supports priorities the kernel does not at present
- * and we block access to them.
- */
- info->reg_smidr = read_cpuid(SMIDR_EL1) & ~SMIDR_EL1_SMPS;
vec_init_vq_map(ARM64_VEC_SME);
cpacr_restore(cpacr);
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index d79e88fccdfce427507e7a34c5959ce6309cbd12..b7d403da71e5a01ed3943eb37e7a00af238771a2 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -499,4 +499,15 @@ void __init cpuinfo_store_boot_cpu(void)
boot_cpu_data = *info;
init_cpu_features(&boot_cpu_data);
+
+ /* SMIDR_EL1 needs to be stored in the percpu data for sysfs */
+ if (IS_ENABLED(CONFIG_ARM64_SME) &&
+ id_aa64pfr1_sme(read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1))) {
+ /*
+ * We mask out SMPS since even if the hardware
+ * supports priorities the kernel does not at present
+ * and we block access to them.
+ */
+ info->reg_smidr = read_cpuid(SMIDR_EL1) & ~SMIDR_EL1_SMPS;
+ }
}
---
base-commit: fac04efc5c793dccbd07e2d59af9f90b7fc0dca4
change-id: 20241213-arm64-fix-boot-cpu-smidr-386b8db292b2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: James Morse <james.morse(a)arm.com>
commit 6685f5d572c22e1003e7c0d089afe1c64340ab1f upstream.
commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling. A previous patch supplied the missing trap
handling.
Existing VMs that have the MPAM field of ID_AA64PFR0_EL1 set need to
be migratable, but there is little point enabling the MPAM CPU
interface on new VMs until there is something a guest can do with it.
Clear the MPAM field from the guest's ID_AA64PFR0_EL1 and, on hardware
that supports MPAM, politely ignore the VMM's attempts to set this bit.
Guests exposed to this bug have the sanitised value of the MPAM field,
so only the correct value needs to be ignored. This means the field
can continue to be used to block migration to incompatible hardware
(between MPAM=1 and MPAM=5), and the VMM can't rely on the field
being ignored.
Signed-off-by: James Morse <james.morse(a)arm.com>
Co-developed-by: Joey Gouly <joey.gouly(a)arm.com>
Signed-off-by: Joey Gouly <joey.gouly(a)arm.com>
Reviewed-by: Gavin Shan <gshan(a)redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi(a)huawei.com>
Reviewed-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-7-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton(a)linux.dev>
[maz: adapted to lack of ID_FILTERED()]
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
---
arch/arm64/kvm/sys_regs.c | 55 ++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index ff8c4e1b847ed..fbed433283c9b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1535,6 +1535,7 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MTEX);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_DF2);
val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_PFAR);
+ val &= ~ARM64_FEATURE_MASK(ID_AA64PFR1_EL1_MPAM_frac);
break;
case SYS_ID_AA64PFR2_EL1:
/* We only expose FPMR */
@@ -1724,6 +1725,13 @@ static u64 read_sanitised_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
val &= ~ID_AA64PFR0_EL1_AMU_MASK;
+ /*
+ * MPAM is disabled by default as KVM also needs a set of PARTID to
+ * program the MPAMVPMx_EL2 PARTID remapping registers with. But some
+ * older kernels let the guest see the ID bit.
+ */
+ val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
+
return val;
}
@@ -1834,6 +1842,42 @@ static int set_id_dfr0_el1(struct kvm_vcpu *vcpu,
return set_id_reg(vcpu, rd, val);
}
+static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd, u64 user_val)
+{
+ u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
+ u64 mpam_mask = ID_AA64PFR0_EL1_MPAM_MASK;
+
+ /*
+ * Commit 011e5f5bf529f ("arm64/cpufeature: Add remaining feature bits
+ * in ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to
+ * guests, but didn't add trap handling. KVM doesn't support MPAM and
+ * always returns an UNDEF for these registers. The guest must see 0
+ * for this field.
+ *
+ * But KVM must also accept values from user-space that were provided
+ * by KVM. On CPUs that support MPAM, permit user-space to write
+ * the sanitizied value to ID_AA64PFR0_EL1.MPAM, but ignore this field.
+ */
+ if ((hw_val & mpam_mask) == (user_val & mpam_mask))
+ user_val &= ~ID_AA64PFR0_EL1_MPAM_MASK;
+
+ return set_id_reg(vcpu, rd, user_val);
+}
+
+static int set_id_aa64pfr1_el1(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *rd, u64 user_val)
+{
+ u64 hw_val = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);
+ u64 mpam_mask = ID_AA64PFR1_EL1_MPAM_frac_MASK;
+
+ /* See set_id_aa64pfr0_el1 for comment about MPAM */
+ if ((hw_val & mpam_mask) == (user_val & mpam_mask))
+ user_val &= ~ID_AA64PFR1_EL1_MPAM_frac_MASK;
+
+ return set_id_reg(vcpu, rd, user_val);
+}
+
/*
* cpufeature ID register user accessors
*
@@ -2377,7 +2421,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_ID_AA64PFR0_EL1),
.access = access_id_reg,
.get_user = get_id_reg,
- .set_user = set_id_reg,
+ .set_user = set_id_aa64pfr0_el1,
.reset = read_sanitised_id_aa64pfr0_el1,
.val = ~(ID_AA64PFR0_EL1_AMU |
ID_AA64PFR0_EL1_MPAM |
@@ -2385,7 +2429,12 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64PFR0_EL1_RAS |
ID_AA64PFR0_EL1_AdvSIMD |
ID_AA64PFR0_EL1_FP), },
- ID_WRITABLE(ID_AA64PFR1_EL1, ~(ID_AA64PFR1_EL1_PFAR |
+ { SYS_DESC(SYS_ID_AA64PFR1_EL1),
+ .access = access_id_reg,
+ .get_user = get_id_reg,
+ .set_user = set_id_aa64pfr1_el1,
+ .reset = kvm_read_sanitised_id_reg,
+ .val = ~(ID_AA64PFR1_EL1_PFAR |
ID_AA64PFR1_EL1_DF2 |
ID_AA64PFR1_EL1_MTEX |
ID_AA64PFR1_EL1_THE |
@@ -2397,7 +2446,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
ID_AA64PFR1_EL1_RES0 |
ID_AA64PFR1_EL1_MPAM_frac |
ID_AA64PFR1_EL1_RAS_frac |
- ID_AA64PFR1_EL1_MTE)),
+ ID_AA64PFR1_EL1_MTE), },
ID_WRITABLE(ID_AA64PFR2_EL1, ID_AA64PFR2_EL1_FPMR),
ID_UNALLOCATED(4,3),
ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0),
--
2.39.2
commit b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
avoids checking number_of_same_symbols() for module symbols in
__trace_kprobe_create(), but create_local_trace_kprobe() should avoid this
check too. Doing this check leads to ENOENT for module_name:symbol_name
constructions passed via perf_event_open.
There is no bug in newer kernels, as it was fixed more generally by
commit 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
Link: https://lore.kernel.org/linux-trace-kernel/20240705161030.b3ddb33a8167013b9…
Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
v1 -> v2:
* Reword commit title and message
* Send for stable instead of mainline
kernel/trace/trace_kprobe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 12d997bb3e78..94cb09d44115 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1814,7 +1814,7 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
int ret;
char *event;
- if (func) {
+ if (func && !strchr(func, ':')) {
unsigned int count;
count = number_of_same_symbols(func);
--
2.34.1
If the clock tu->ick was not enabled in tahvo_usb_probe,
it may still hold a non-error pointer, potentially causing
the clock to be incorrectly disabled later in the function.
Use the devm_clk_get_enabled helper function to ensure proper call balance
for tu->ick.
Found by Linux Verification Center (linuxtesting.org) with Klever.
Fixes: 9ba96ae5074c ("usb: omap1: Tahvo USB transceiver driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Vitalii Mordan <mordan(a)ispras.ru>
---
v2: Corrected a typo in the error handling of the devm_clk_get_enabled
call. This issue was reported by Dan Carpenter <dan.carpenter(a)linaro.org>.
drivers/usb/phy/phy-tahvo.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/drivers/usb/phy/phy-tahvo.c b/drivers/usb/phy/phy-tahvo.c
index ae7bf3ff89ee..4182e86dc450 100644
--- a/drivers/usb/phy/phy-tahvo.c
+++ b/drivers/usb/phy/phy-tahvo.c
@@ -341,9 +341,11 @@ static int tahvo_usb_probe(struct platform_device *pdev)
mutex_init(&tu->serialize);
- tu->ick = devm_clk_get(&pdev->dev, "usb_l4_ick");
- if (!IS_ERR(tu->ick))
- clk_enable(tu->ick);
+ tu->ick = devm_clk_get_enabled(&pdev->dev, "usb_l4_ick");
+ if (IS_ERR(tu->ick)) {
+ dev_err(&pdev->dev, "failed to get and enable clock\n");
+ return PTR_ERR(tu->ick);
+ }
/*
* Set initial state, so that we generate kevents only on state changes.
@@ -353,15 +355,14 @@ static int tahvo_usb_probe(struct platform_device *pdev)
tu->extcon = devm_extcon_dev_allocate(&pdev->dev, tahvo_cable);
if (IS_ERR(tu->extcon)) {
dev_err(&pdev->dev, "failed to allocate memory for extcon\n");
- ret = PTR_ERR(tu->extcon);
- goto err_disable_clk;
+ return PTR_ERR(tu->extcon);
}
ret = devm_extcon_dev_register(&pdev->dev, tu->extcon);
if (ret) {
dev_err(&pdev->dev, "could not register extcon device: %d\n",
ret);
- goto err_disable_clk;
+ return ret;
}
/* Set the initial cable state. */
@@ -384,7 +385,7 @@ static int tahvo_usb_probe(struct platform_device *pdev)
if (ret < 0) {
dev_err(&pdev->dev, "cannot register USB transceiver: %d\n",
ret);
- goto err_disable_clk;
+ return ret;
}
dev_set_drvdata(&pdev->dev, tu);
@@ -405,9 +406,6 @@ static int tahvo_usb_probe(struct platform_device *pdev)
err_remove_phy:
usb_remove_phy(&tu->phy);
-err_disable_clk:
- if (!IS_ERR(tu->ick))
- clk_disable(tu->ick);
return ret;
}
@@ -418,8 +416,6 @@ static void tahvo_usb_remove(struct platform_device *pdev)
free_irq(tu->irq, tu);
usb_remove_phy(&tu->phy);
- if (!IS_ERR(tu->ick))
- clk_disable(tu->ick);
}
static struct platform_driver tahvo_usb_driver = {
--
2.25.1
If the clock tu->ick was not enabled in tahvo_usb_probe,
it may still hold a non-error pointer, potentially causing
the clock to be incorrectly disabled later in the function.
Use the devm_clk_get_enabled helper function to ensure proper call balance
for tu->ick.
Found by Linux Verification Center (linuxtesting.org) with Klever.
Fixes: 9ba96ae5074c ("usb: omap1: Tahvo USB transceiver driver")
Cc: stable(a)vger.kernel.org
Signed-off-by: Vitalii Mordan <mordan(a)ispras.ru>
---
drivers/usb/phy/phy-tahvo.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/drivers/usb/phy/phy-tahvo.c b/drivers/usb/phy/phy-tahvo.c
index ae7bf3ff89ee..d393308d23d4 100644
--- a/drivers/usb/phy/phy-tahvo.c
+++ b/drivers/usb/phy/phy-tahvo.c
@@ -341,9 +341,11 @@ static int tahvo_usb_probe(struct platform_device *pdev)
mutex_init(&tu->serialize);
- tu->ick = devm_clk_get(&pdev->dev, "usb_l4_ick");
- if (!IS_ERR(tu->ick))
- clk_enable(tu->ick);
+ tu->ick = devm_clk_get_enabled(&pdev->dev, "usb_l4_ick");
+ if (!IS_ERR(tu->ick)) {
+ dev_err(&pdev->dev, "failed to get and enable clock\n");
+ return PTR_ERR(tu->ick);
+ }
/*
* Set initial state, so that we generate kevents only on state changes.
@@ -353,15 +355,14 @@ static int tahvo_usb_probe(struct platform_device *pdev)
tu->extcon = devm_extcon_dev_allocate(&pdev->dev, tahvo_cable);
if (IS_ERR(tu->extcon)) {
dev_err(&pdev->dev, "failed to allocate memory for extcon\n");
- ret = PTR_ERR(tu->extcon);
- goto err_disable_clk;
+ return PTR_ERR(tu->extcon);
}
ret = devm_extcon_dev_register(&pdev->dev, tu->extcon);
if (ret) {
dev_err(&pdev->dev, "could not register extcon device: %d\n",
ret);
- goto err_disable_clk;
+ return ret;
}
/* Set the initial cable state. */
@@ -384,7 +385,7 @@ static int tahvo_usb_probe(struct platform_device *pdev)
if (ret < 0) {
dev_err(&pdev->dev, "cannot register USB transceiver: %d\n",
ret);
- goto err_disable_clk;
+ return ret;
}
dev_set_drvdata(&pdev->dev, tu);
@@ -405,9 +406,6 @@ static int tahvo_usb_probe(struct platform_device *pdev)
err_remove_phy:
usb_remove_phy(&tu->phy);
-err_disable_clk:
- if (!IS_ERR(tu->ick))
- clk_disable(tu->ick);
return ret;
}
@@ -418,8 +416,6 @@ static void tahvo_usb_remove(struct platform_device *pdev)
free_irq(tu->irq, tu);
usb_remove_phy(&tu->phy);
- if (!IS_ERR(tu->ick))
- clk_disable(tu->ick);
}
static struct platform_driver tahvo_usb_driver = {
--
2.25.1
From: Chuck Lever <chuck.lever(a)oracle.com>
Testing shows that the EBUSY error return from mtree_alloc_cyclic()
leaks into user space. The ERRORS section of "man creat(2)" says:
> EBUSY O_EXCL was specified in flags and pathname refers
> to a block device that is in use by the system
> (e.g., it is mounted).
ENOSPC is closer to what applications expect in this situation.
Note that the normal range of simple directory offset values is
2..2^63, so hitting this error is going to be rare to impossible.
Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets")
Cc: <stable(a)vger.kernel.org> # v6.9+
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Reviewed-by: Yang Erkun <yangerkun(a)huawei.com>
Signed-off-by: Chuck Lever <chuck.lever(a)oracle.com>
---
fs/libfs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 748ac5923154..f6d04c69f195 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -292,7 +292,9 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
LONG_MAX, &octx->next_offset, GFP_KERNEL);
- if (ret < 0)
+ if (unlikely(ret == -EBUSY))
+ return -ENOSPC;
+ if (unlikely(ret < 0))
return ret;
offset_set(dentry, offset);
--
2.47.0
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 978c4486cca5c7b9253d3ab98a88c8e769cb9bbd
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024121506-pancreas-mosaic-0ae0@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 978c4486cca5c7b9253d3ab98a88c8e769cb9bbd Mon Sep 17 00:00:00 2001
From: Jiri Olsa <jolsa(a)kernel.org>
Date: Sun, 8 Dec 2024 15:25:07 +0100
Subject: [PATCH] bpf,perf: Fix invalid prog_array access in
perf_event_detach_bpf_prog
Syzbot reported [1] crash that happens for following tracing scenario:
- create tracepoint perf event with attr.inherit=1, attach it to the
process and set bpf program to it
- attached process forks -> chid creates inherited event
the new child event shares the parent's bpf program and tp_event
(hence prog_array) which is global for tracepoint
- exit both process and its child -> release both events
- first perf_event_detach_bpf_prog call will release tp_event->prog_array
and second perf_event_detach_bpf_prog will crash, because
tp_event->prog_array is NULL
The fix makes sure the perf_event_detach_bpf_prog checks prog_array
is valid before it tries to remove the bpf program from it.
[1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc9…
Fixes: 0ee288e69d03 ("bpf,perf: Fix perf_event_detach_bpf_prog error handling")
Reported-by: syzbot+2e0d2840414ce817aaac(a)syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa(a)kernel.org>
Signed-off-by: Andrii Nakryiko <andrii(a)kernel.org>
Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a403b05a7091..1b8db5aee9d3 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2250,6 +2250,9 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
goto unlock;
old_array = bpf_event_rcu_dereference(event->tp_event->prog_array);
+ if (!old_array)
+ goto put;
+
ret = bpf_prog_array_copy(old_array, event->prog, NULL, 0, &new_array);
if (ret < 0) {
bpf_prog_array_delete_safe(old_array, event->prog);
@@ -2258,6 +2261,7 @@ void perf_event_detach_bpf_prog(struct perf_event *event)
bpf_prog_array_free_sleepable(old_array);
}
+put:
/*
* It could be that the bpf_prog is not sleepable (and will be freed
* via normal RCU), but is called from a point that supports sleepable
From: yangge <yangge1116(a)126.com>
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the free
pages required for compaction reside in the CMA pageblocks, it's
possible that __compaction_suitable() always returns true, and in
some cases that is not acceptable.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long term GUP cannot allocate memory from the CMA area, so a maximum
of 16 GB of non-CMA memory on a NUMA node can be used as virtual
machine memory. Since there is 16 GB of free CMA memory on the NUMA
node, the watermark for order-0 is always met for compaction, so
__compaction_suitable() always returns true, even if the node is
unable to allocate non-CMA memory for the virtual machine.
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should remove
ALLOC_CMA from both __compaction_suitable() and __isolate_free_page()
in the long term GUP flow. After this fix, starting a 32GB virtual machine
with device passthrough takes only a few seconds.
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
V4:
- rich the commit log description
V3:
- fix build errors
- add ALLOC_CMA both in should_continue_reclaim() and compaction_ready()
V2:
- using the 'cc->alloc_flags' to determin if 'ALLOC_CMA' is needed
- rich the commit log description
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 18 +++++++++++-------
mm/page_alloc.c | 4 +++-
mm/vmscan.c | 4 ++--
4 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..b4c3ac3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..585f5ab 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2381,9 +2381,11 @@ static enum compact_result compact_finished(struct compact_control *cc)
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
+ bool use_cma;
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
@@ -2395,25 +2397,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
+ use_cma = !!(alloc_flags & ALLOC_CMA);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ use_cma ? ALLOC_CMA : 0, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2474,7 +2478,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2499,7 +2503,7 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..9a5dfda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2813,6 +2813,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
+ bool pin;
if (!is_migrate_isolate(mt)) {
unsigned long watermark;
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ pin = !!(current->flags & PF_MEMALLOC_PIN);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, pin ? 0 : ALLOC_CMA))
return 0;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..33f5b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6043,7 +6043,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
--
2.7.4
Ensure a non-interruptible wait is used when moving a bo to
XE_PL_SYSTEM. This prevents dma_mappings from being removed prematurely
while a GPU job is still in progress, even if the CPU receives a
signal during the operation.
Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT")
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.11+
Suggested-by: Matthew Auld <matthew.auld(a)intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
Reviewed-by: Matthew Auld <matthew.auld(a)intel.com>
---
drivers/gpu/drm/xe/xe_bo.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 283cd0294570..06931df876ab 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -733,7 +733,7 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
new_mem->mem_type == XE_PL_SYSTEM) {
long timeout = dma_resv_wait_timeout(ttm_bo->base.resv,
DMA_RESV_USAGE_BOOKKEEP,
- true,
+ false,
MAX_SCHEDULE_TIMEOUT);
if (timeout < 0) {
ret = timeout;
--
2.46.0
This change is specific to Hyper-V VM users.
If the Virtual Machine Connection window is focused,
a Hyper-V VM user can unintentionally touch the keyboard/mouse
when the VM is hibernating or resuming, and consequently the
hibernation or resume operation can be aborted unexpectedly.
Fix the issue by no longer registering the keyboard/mouse as
wakeup devices (see the other two patches for the
changes to drivers/input/serio/hyperv-keyboard.c and
drivers/hid/hid-hyperv.c).
The keyboard/mouse were registered as wakeup devices because the
VM needs to be woken up from the Suspend-to-Idle state after
a user runs "echo freeze > /sys/power/state". It seems like
the Suspend-to-Idle feature has no real users in practice, so
let's stop supporting it by returning -EOPNOTSUPP if a
user tries to use it.
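After this change, a userspace attempt to enter Suspend-to-Idle should
fail cleanly. A minimal check could look like the sketch below
(illustration only: it assumes a Hyper-V guest running a kernel with
this patch, must run as root, and the exact errno seen by userspace
depends on how the PM core propagates the callback failure):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/power/state", O_WRONLY);

	if (fd < 0) {
		perror("open /sys/power/state");
		return 1;
	}
	/* Expect the write to fail once vmbus_freeze() rejects the
	 * Suspend-to-Idle transition with -EOPNOTSUPP. */
	if (write(fd, "freeze", 6) < 0)
		printf("freeze rejected: %s\n", strerror(errno));
	close(fd);
	return 0;
}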
Fixes: 1a06d017fb3f ("Drivers: hv: vmbus: Fix Suspend-to-Idle for Generation-2 VM")
Cc: stable(a)vger.kernel.org
Signed-off-by: Saurabh Sengar <ssengar(a)linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis(a)linux.microsoft.com>
---
Changes in v4:
* No change
Changes in v3:
* Add "Cc: stable(a)vger.kernel.org" in sign-off area.
Changes in v2:
* Add "#define vmbus_freeze NULL" when CONFIG_PM_SLEEP is not
enabled.
---
drivers/hv/vmbus_drv.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 6d89d37b069a..4df6b12bf6a1 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -900,6 +900,19 @@ static void vmbus_shutdown(struct device *child_device)
}
#ifdef CONFIG_PM_SLEEP
+/*
+ * vmbus_freeze - Suspend-to-Idle
+ */
+static int vmbus_freeze(struct device *child_device)
+{
+/*
+ * Do not support Suspend-to-Idle ("echo freeze > /sys/power/state") as
+ * that would require registering the Hyper-V synthetic mouse/keyboard
+ * devices as wakeup devices, which can abort hibernation/resume unexpectedly.
+ */
+ return -EOPNOTSUPP;
+}
+
/*
* vmbus_suspend - Suspend a vmbus device
*/
@@ -938,6 +951,7 @@ static int vmbus_resume(struct device *child_device)
return drv->resume(dev);
}
#else
+#define vmbus_freeze NULL
#define vmbus_suspend NULL
#define vmbus_resume NULL
#endif /* CONFIG_PM_SLEEP */
@@ -969,7 +983,7 @@ static void vmbus_device_release(struct device *device)
*/
static const struct dev_pm_ops vmbus_pm = {
- .suspend_noirq = NULL,
+ .suspend_noirq = vmbus_freeze,
.resume_noirq = NULL,
.freeze_noirq = vmbus_suspend,
.thaw_noirq = vmbus_resume,
--
2.34.1
The quilt patch titled
Subject: mm/hugetlb: change ENOSPC to ENOMEM in alloc_hugetlb_folio
has been removed from the -mm tree. Its filename was
mm-hugetlb-change-enospc-to-enomem-in-alloc_hugetlb_folio.patch
This patch was dropped because it was nacked
------------------------------------------------------
From: Dafna Hirschfeld <dafna.hirschfeld(a)intel.com>
Subject: mm/hugetlb: change ENOSPC to ENOMEM in alloc_hugetlb_folio
Date: Sun, 1 Dec 2024 03:03:41 +0200
The error ENOSPC is translated by vmf_error() to VM_FAULT_SIGBUS, which
is further translated into EFAULT in e.g. pin/get_user_pages(). But when
running out of pages/hugepages we expect to see ENOMEM and not EFAULT.
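For context, the translation chain can be sketched as follows
(illustration only; the real vmf_error() lives in include/linux/mm.h,
uses the kernel's vm_fault_t codes and may handle additional cases such
as hwpoison):

#include <errno.h>

#define FAULT_OOM	1	/* stands in for VM_FAULT_OOM */
#define FAULT_SIGBUS	2	/* stands in for VM_FAULT_SIGBUS */

/*
 * Sketch: only -ENOMEM becomes an OOM fault; everything else,
 * including -ENOSPC, becomes SIGBUS, which pin/get_user_pages()
 * then reports to the caller as EFAULT.
 */
static int sketch_vmf_error(int err)
{
	if (err == -ENOMEM)
		return FAULT_OOM;
	return FAULT_SIGBUS;
}

Returning -ENOMEM from alloc_hugetlb_folio() therefore lets the fault
path surface an OOM condition instead of SIGBUS/EFAULT.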
Link: https://lkml.kernel.org/r/20241201010341.1382431-1-dafna.hirschfeld@intel.c…
Fixes: 8f34af6f93ae ("mm, hugetlb: move the error handle logic out of normal code path")
Signed-off-by: Dafna Hirschfeld <dafna.hirschfeld(a)intel.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c~mm-hugetlb-change-enospc-to-enomem-in-alloc_hugetlb_folio
+++ a/mm/hugetlb.c
@@ -3113,7 +3113,7 @@ out_end_reservation:
if (!memcg_charge_ret)
mem_cgroup_cancel_charge(memcg, nr_pages);
mem_cgroup_put(memcg);
- return ERR_PTR(-ENOSPC);
+ return ERR_PTR(-ENOMEM);
}
int alloc_bootmem_huge_page(struct hstate *h, int nid)
_
Patches currently in -mm which might be from dafna.hirschfeld(a)intel.com are
The patch titled
Subject: maple_tree: reload mas before the second call for mas_empty_area
has been added to the -mm mm-unstable branch. Its filename is
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yang Erkun <yangerkun(a)huawei.com>
Subject: maple_tree: reload mas before the second call for mas_empty_area
Date: Sat, 14 Dec 2024 17:30:05 +0800
Change the LONG_MAX in simple_offset_add() to 1024, and then do the following:
[root@fedora ~]# mkdir /tmp/dir
[root@fedora ~]# for i in {1..1024}; do touch /tmp/dir/$i; done
touch: cannot touch '/tmp/dir/1024': Device or resource busy
[root@fedora ~]# rm /tmp/dir/123
[root@fedora ~]# touch /tmp/dir/1024
[root@fedora ~]# rm /tmp/dir/100
[root@fedora ~]# touch /tmp/dir/1025
touch: cannot touch '/tmp/dir/1025': Device or resource busy
After we delete file 100, its offset actually becomes an empty entry, but
the later create fails unexpectedly.
mas_alloc_cyclic() has two chances to find an empty entry. It first
searches the range [range_lo, range_hi]; if no empty entry exists and
range_lo > min, it retries with the range [min, range_hi]. However, the
first call to mas_empty_area() may mark the mas as EBUSY, so the second
call to mas_empty_area() fails immediately. Fix this by reloading the
mas before the second call to mas_empty_area().
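The shape of the fix is a snapshot-and-restore retry: save the search
state before the first pass and restore it before the wrap-around pass.
A standalone sketch of the pattern (the cursor type and find_empty()
helper below are hypothetical stand-ins, not maple tree APIs):

struct cursor { int status; };		/* hypothetical search state */

static int find_empty(struct cursor *c, unsigned long lo, unsigned long hi)
{
	(void)hi;
	if (c->status < 0)		/* a poisoned cursor fails immediately */
		return c->status;
	if (lo > 0) {			/* pretend the high range is full */
		c->status = -16;	/* latched failure, mimicking EBUSY */
		return c->status;
	}
	return 0;			/* the low range has room */
}

static int alloc_cyclic(struct cursor *c, unsigned long lo, unsigned long hi,
			unsigned long min)
{
	struct cursor saved = *c;		/* snapshot before the first pass */
	int ret = find_empty(c, lo, hi);

	if (ret < 0 && lo > min) {
		*c = saved;			/* restore: drop the latched error */
		ret = find_empty(c, min, hi);	/* wrap around to the low range */
	}
	return ret;
}

Without the restore, the latched error from the first pass makes the
second pass fail as well, which is exactly the behaviour seen in the
reproducer above.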
Link: https://lkml.kernel.org/r/20241214093005.72284-1-yangerkun@huaweicloud.com
Fixes: 9b6713cc7522 ("maple_tree: Add mtree_alloc_cyclic()")
Signed-off-by: Yang Erkun <yangerkun(a)huawei.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Chuck Lever <chuck.lever(a)oracle.com>
Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 2 ++
1 file changed, 2 insertions(+)
--- a/lib/maple_tree.c~maple_tree-reload-mas-before-the-second-call-for-mas_empty_area
+++ a/lib/maple_tree.c
@@ -4335,6 +4335,7 @@ int mas_alloc_cyclic(struct ma_state *ma
{
unsigned long min = range_lo;
int ret = 0;
+ struct ma_state m = *mas;
range_lo = max(min, *next);
ret = mas_empty_area(mas, range_lo, range_hi, 1);
@@ -4343,6 +4344,7 @@ int mas_alloc_cyclic(struct ma_state *ma
ret = 1;
}
if (ret < 0 && range_lo > min) {
+ *mas = m;
ret = mas_empty_area(mas, min, range_hi, 1);
if (ret == 0)
ret = 1;
_
Patches currently in -mm which might be from yangerkun(a)huawei.com are
maple_tree-reload-mas-before-the-second-call-for-mas_empty_area.patch
The patch titled
Subject: ocfs2: handle a symlink read error correctly
has been added to the -mm mm-nonmm-unstable branch. Its filename is
ocfs2-handle-a-symlink-read-error-correctly.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-nonmm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Subject: ocfs2: handle a symlink read error correctly
Date: Thu, 5 Dec 2024 17:16:29 +0000
Patch series "Convert ocfs2 to use folios".
Mark did a conversion of ocfs2 to use folios and sent it to me as a
giant patch for review ;-)
So I've redone it as individual patches, and credited Mark for the patches
where his code is substantially the same. It's not a bad way to do it;
his patch had some bugs and my patches had some bugs. Hopefully all our
bugs were different from each other. And hopefully Mark likes all the
changes I made to his code!
This patch (of 23):
If we can't read the buffer, be sure to unlock the page before returning.
Link: https://lkml.kernel.org/r/20241205171653.3179945-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20241205171653.3179945-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: Mark Tinguely <mark.tinguely(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/symlink.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/symlink.c~ocfs2-handle-a-symlink-read-error-correctly
+++ a/fs/ocfs2/symlink.c
@@ -65,7 +65,7 @@ static int ocfs2_fast_symlink_read_folio
if (status < 0) {
mlog_errno(status);
- return status;
+ goto out;
}
fe = (struct ocfs2_dinode *) bh->b_data;
@@ -76,9 +76,10 @@ static int ocfs2_fast_symlink_read_folio
memcpy(kaddr, link, len + 1);
kunmap_atomic(kaddr);
SetPageUptodate(page);
+out:
unlock_page(page);
brelse(bh);
- return 0;
+ return status;
}
const struct address_space_operations ocfs2_fast_symlink_aops = {
_
Patches currently in -mm which might be from willy(a)infradead.org are
vmalloc-fix-accounting-with-i915.patch
mm-page_alloc-cache-page_zone-result-in-free_unref_page.patch
mm-make-alloc_pages_mpol-static.patch
mm-page_alloc-export-free_frozen_pages-instead-of-free_unref_page.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-post_alloc_hook.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-prep_new_page.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-get_page_from_freelist.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_cpuset_fallback.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_may_oom.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_direct_compact.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_direct_reclaim.patch
mm-page_alloc-move-set_page_refcounted-to-callers-of-__alloc_pages_slowpath.patch
mm-page_alloc-move-set_page_refcounted-to-end-of-__alloc_pages.patch
mm-page_alloc-add-__alloc_frozen_pages.patch
mm-mempolicy-add-alloc_frozen_pages.patch
slab-allocate-frozen-pages.patch
ocfs2-handle-a-symlink-read-error-correctly.patch
ocfs2-convert-ocfs2_page_mkwrite-to-use-a-folio.patch
ocfs2-pass-mmap_folio-around-instead-of-mmap_page.patch
ocfs2-convert-ocfs2_read_inline_data-to-take-a-folio.patch
ocfs2-use-a-folio-in-ocfs2_fast_symlink_read_folio.patch
ocfs2-remove-ocfs2_start_walk_page_trans-prototype.patch
The patch titled
Subject: mm/readahead: fix large folio support in async readahead
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-readahead-fix-large-folio-support-in-async-readahead.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yafang Shao <laoar.shao(a)gmail.com>
Subject: mm/readahead: fix large folio support in async readahead
Date: Fri, 6 Dec 2024 16:30:25 +0800
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After I tuned it to 2MB, large folios worked
as expected. However, I believe the large folio behavior should not
depend on the value of read_ahead_kb. It would be more robust if the
kernel could automatically adapt to it.
With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
/proc/meminfo are as follows:
- before this patch
FileHugePages: 18432 kB
FilePmdMapped: 4096 kB
- after this patch
FileHugePages: 1067008 kB
FilePmdMapped: 1048576 kB
This shows that after applying the patch, the entire 1GB file is mapped to
huge pages. The stable list is CCed, as without this patch, large folios
don't function optimally in the readahead path.
It's worth noting that if read_ahead_kb is set to a larger value that
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail
to map to hugepages.
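A quick numeric sketch of the interaction (illustrative arithmetic only,
using the values from the report: 128KB read_ahead_kb, 4KB pages, one
2MB PMD-sized folio; the real get_next_ra_size() also has a 4x growth
step for small windows):

#include <stdio.h>

int main(void)
{
	unsigned long max_pages = 128 / 4;	/* read_ahead_kb=128KB -> 32 pages */
	unsigned long cur = 512;		/* current window: one 2MB folio */

	/* previous behaviour: the next window is capped at max_pages */
	unsigned long old_next = cur <= max_pages / 2 ? 2 * cur : max_pages;

	/* patched behaviour: never shrink below the current window */
	unsigned long new_next = old_next > cur ? old_next : cur;

	printf("old=%lu pages, new=%lu pages\n", old_next, new_next);
	return 0;
}

This prints old=32, new=512: with the old code the async readahead
window collapses below a single PMD-sized folio, which matches the small
FilePmdMapped value above; keeping the larger of the two sizes lets the
window stay at least one folio large.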
Link: https://lkml.kernel.org/r/20241108141710.9721-1-laoar.shao@gmail.com
Link: https://lkml.kernel.org/r/20241206083025.3478-1-laoar.shao@gmail.com
Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Signed-off-by: Yafang Shao <laoar.shao(a)gmail.com>
Tested-by: kernel test robot <oliver.sang(a)intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/readahead.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/readahead.c~mm-readahead-fix-large-folio-support-in-async-readahead
+++ a/mm/readahead.c
@@ -646,7 +646,11 @@ void page_cache_async_ra(struct readahea
1UL << order);
if (index == expected) {
ra->start += ra->size;
- ra->size = get_next_ra_size(ra, max_pages);
+ /*
+ * In the case of MADV_HUGEPAGE, the actual size might exceed
+ * the readahead window.
+ */
+ ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
ra->async_size = ra->size;
goto readit;
}
_
Patches currently in -mm which might be from laoar.shao(a)gmail.com are
mm-readahead-fix-large-folio-support-in-async-readahead.patch
The quilt patch titled
Subject: mm/readahead: fix large folio support in async readahead
has been removed from the -mm tree. Its filename was
mm-readahead-fix-large-folio-support-in-async-readahead.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Yafang Shao <laoar.shao(a)gmail.com>
Subject: mm/readahead: fix large folio support in async readahead
Date: Fri, 8 Nov 2024 22:17:10 +0800
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After I tuned it to 2MB, large folios worked
as expected. However, I believe the large folio behavior should not
depend on the value of read_ahead_kb. It would be more robust if the
kernel could automatically adapt to it.
With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
/proc/meminfo are as follows:
- before this patch
FileHugePages: 18432 kB
FilePmdMapped: 4096 kB
- after this patch
FileHugePages: 1067008 kB
FilePmdMapped: 1048576 kB
This shows that after applying the patch, the entire 1GB file is mapped to
huge pages. The stable list is CCed, as without this patch, large folios
don't function optimally in the readahead path.
It's worth noting that if read_ahead_kb is set to a larger value that
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail
to map to hugepages.
Link: https://lkml.kernel.org/r/20241108141710.9721-1-laoar.shao@gmail.com
Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Suggested-by: Matthew Wilcox <willy(a)infradead.org>
Signed-off-by: Yafang Shao <laoar.shao(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/readahead.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/readahead.c~mm-readahead-fix-large-folio-support-in-async-readahead
+++ a/mm/readahead.c
@@ -390,6 +390,8 @@ static unsigned long get_next_ra_size(st
return 4 * cur;
if (cur <= max / 2)
return 2 * cur;
+ if (cur > max)
+ return cur;
return max;
}
_
Patches currently in -mm which might be from laoar.shao(a)gmail.com are
mm-readahead-fix-large-folio-support-in-async-readahead-v3.patch
The patch titled
Subject: mm, compaction: don't use ALLOC_CMA in long term GUP flow
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: yangge <yangge1116(a)126.com>
Subject: mm, compaction: don't use ALLOC_CMA in long term GUP flow
Date: Sun, 15 Dec 2024 18:01:07 +0800
Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags in
__compaction_suitable()") allowed compaction to proceed when the free pages
required for compaction reside in CMA pageblocks, it's possible for
__compaction_suitable() to always return true, which is not acceptable in
some cases.
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of
memory. I have configured 16GB of CMA memory on each NUMA node, and
starting a 32GB virtual machine with device passthrough is extremely slow,
taking almost an hour.
During the start-up of the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory. Long
term GUP cannot allocate memory from the CMA area, so at most 16 GB of
non-CMA memory on a NUMA node can be used as virtual machine memory. Since
there is 16 GB of free CMA memory on the NUMA node, the order-0 watermark
is always met for compaction, so __compaction_suitable() always returns
true, even if the node is unable to allocate non-CMA memory for the
virtual machine.
For costly allocations, because __compaction_suitable() always returns
true, __alloc_pages_slowpath() can't exit at the appropriate place,
resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
if (compact_result == COMPACT_SKIPPED ||
compact_result == COMPACT_DEFERRED)
goto nopage; // should exit __alloc_pages_slowpath() from here
In order to quickly fall back to a remote node, we should drop ALLOC_CMA
in both __compaction_suitable() and __isolate_free_page() in the long term
GUP flow. After this fix, starting a 32GB virtual machine with device
passthrough takes only a few seconds.
Link: https://lkml.kernel.org/r/1734256867-19614-1-git-send-email-yangge1116@126.…
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Signed-off-by: yangge <yangge1116(a)126.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/compaction.h | 6 ++++--
mm/compaction.c | 18 +++++++++++-------
mm/page_alloc.c | 4 +++-
mm/vmscan.c | 4 ++--
4 files changed, 20 insertions(+), 12 deletions(-)
--- a/include/linux/compaction.h~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compac
struct page **page);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx);
+ int highest_zoneidx,
+ unsigned int alloc_flags);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
@@ -108,7 +109,8 @@ static inline void reset_isolation_suita
}
static inline bool compaction_suitable(struct zone *zone, int order,
- int highest_zoneidx)
+ int highest_zoneidx,
+ unsigned int alloc_flags)
{
return false;
}
--- a/mm/compaction.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/compaction.c
@@ -2379,9 +2379,11 @@ static enum compact_result compact_finis
static bool __compaction_suitable(struct zone *zone, int order,
int highest_zoneidx,
+ unsigned int alloc_flags,
unsigned long wmark_target)
{
unsigned long watermark;
+ bool use_cma;
/*
* Watermarks for order-0 must be met for compaction to be able to
* isolate free pages for migration targets. This means that the
@@ -2393,25 +2395,27 @@ static bool __compaction_suitable(struct
* even if compaction succeeds.
* For costly orders, we require low watermark instead of min for
* compaction to proceed to increase its chances.
- * ALLOC_CMA is used, as pages in CMA pageblocks are considered
- * suitable migration targets
+ * In addition to long term GUP flow, ALLOC_CMA is used, as pages in
+ * CMA pageblocks are considered suitable migration targets
*/
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);
+ use_cma = !!(alloc_flags & ALLOC_CMA);
return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
- ALLOC_CMA, wmark_target);
+ use_cma ? ALLOC_CMA : 0, wmark_target);
}
/*
* compaction_suitable: Is this suitable to run compaction on this zone now?
*/
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+ unsigned int alloc_flags)
{
enum compact_result compact_result;
bool suitable;
- suitable = __compaction_suitable(zone, order, highest_zoneidx,
+ suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
zone_page_state(zone, NR_FREE_PAGES));
/*
* fragmentation index determines if allocation failures are due to
@@ -2472,7 +2476,7 @@ bool compaction_zonelist_suitable(struct
available = zone_reclaimable_pages(zone) / order;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
if (__compaction_suitable(zone, order, ac->highest_zoneidx,
- available))
+ alloc_flags, available))
return true;
}
@@ -2497,7 +2501,7 @@ compaction_suit_allocation_order(struct
alloc_flags))
return COMPACT_SUCCESS;
- if (!compaction_suitable(zone, order, highest_zoneidx))
+ if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
return COMPACT_SKIPPED;
return COMPACT_CONTINUE;
--- a/mm/page_alloc.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/page_alloc.c
@@ -2812,6 +2812,7 @@ int __isolate_free_page(struct page *pag
{
struct zone *zone = page_zone(page);
int mt = get_pageblock_migratetype(page);
+ bool pin;
if (!is_migrate_isolate(mt)) {
unsigned long watermark;
@@ -2822,7 +2823,8 @@ int __isolate_free_page(struct page *pag
* exists.
*/
watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+ pin = !!(current->flags & PF_MEMALLOC_PIN);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, pin ? 0 : ALLOC_CMA))
return 0;
}
--- a/mm/vmscan.c~mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow
+++ a/mm/vmscan.c
@@ -5861,7 +5861,7 @@ static inline bool should_continue_recla
sc->reclaim_idx, 0))
return false;
- if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
}
@@ -6089,7 +6089,7 @@ static inline bool compaction_ready(stru
return true;
/* Compaction cannot yet proceed. Do reclaim. */
- if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+ if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
return false;
/*
_
Patches currently in -mm which might be from yangge1116(a)126.com are
mm-compaction-dont-use-alloc_cma-in-long-term-gup-flow.patch
This patch series fixes bugs and improves code in drivers/of/*.
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
---
Changes in v2:
- Drop applied/conflict/TBD patches.
- Correct based on Rob's comments.
- Link to v1: https://lore.kernel.org/r/20241206-of_core_fix-v1-0-dc28ed56bec3@quicinc.com
---
Zijun Hu (7):
of: Fix API of_find_node_opts_by_path() finding OF device node failure
of: unittest: Add a test case for API of_find_node_opts_by_path()
of: Correct child specifier used as input of the 2nd nexus node
of: Exchange implementation between of_property_present() and of_property_read_bool()
of: Fix available buffer size calculating error in API of_device_uevent_modalias()
of: Fix potential wrong MODALIAS uevent value
of: Do not expose of_alias_scan() and correct its comments
drivers/of/base.c | 13 +++---
drivers/of/device.c | 33 +++++++---------
drivers/of/module.c | 103 +++++++++++++++++++++++++++++-------------------
drivers/of/of_private.h | 2 +
drivers/of/pdt.c | 2 +
drivers/of/unittest.c | 9 +++++
include/linux/of.h | 29 +++++++-------
7 files changed, 109 insertions(+), 82 deletions(-)
---
base-commit: 0f7ca6f69354e0c3923bbc28c92d0ecab4d50a3e
change-id: 20241206-of_core_fix-dc3021a06418
Best regards,
--
Zijun Hu <quic_zijuhu(a)quicinc.com>