October 2024 - Linux-stable-mirror

by Josey Swihart

Hello, I am offering my late husband?s Yamaha piano to anyone who would truly appreciate it. If you or someone you know would be interested in receiving this instrument for free, please do not hesitate to contact me. Warm regards, Josey

1 year, 2 months

1
0
0 0

[PATCH 5.10] nvme: fix metadata handling in nvme-passthrough

by Puranjay Mohan

[ Upstream commit 7c2fd76048e95dd267055b5f5e0a48e6e7c81fd9 ] On an NVMe namespace that does not support metadata, it is possible to send an IO command with metadata through io-passthru. This allows issues like [1] to trigger in the completion code path. nvme_map_user_request() doesn't check if the namespace supports metadata before sending it forward. It also allows admin commands with metadata to be processed as it ignores metadata when bdev == NULL and may report success. Reject an IO command with metadata when the NVMe namespace doesn't support it and reject an admin command if it has metadata. [1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/ Suggested-by: Christoph Hellwig <hch(a)lst.de> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Reviewed-by: Sagi Grimberg <sagi(a)grimberg.me> Reviewed-by: Anuj Gupta <anuj20.g(a)samsung.com> Signed-off-by: Keith Busch <kbusch(a)kernel.org> [ Move the changes from nvme_map_user_request() to nvme_submit_user_cmd() to make it work on 5.10 ] Signed-off-by: Puranjay Mohan <pjy(a)amazon.com> --- drivers/nvme/host/core.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 30a642c8f5374..bee55902fe6ce 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1121,11 +1121,16 @@ static int nvme_submit_user_cmd(struct request_queue *q, bool write = nvme_is_write(cmd); struct nvme_ns *ns = q->queuedata; struct gendisk *disk = ns ? ns->disk : NULL; + bool supports_metadata = disk && blk_get_integrity(disk); + bool has_metadata = meta_buffer && meta_len; struct request *req; struct bio *bio = NULL; void *meta = NULL; int ret; + if (has_metadata && !supports_metadata) + return -EINVAL; + req = nvme_alloc_request(q, cmd, 0); if (IS_ERR(req)) return PTR_ERR(req); @@ -1141,7 +1146,7 @@ static int nvme_submit_user_cmd(struct request_queue *q, goto out; bio = req->bio; bio->bi_disk = disk; - if (disk && meta_buffer && meta_len) { + if (has_metadata) { meta = nvme_add_user_metadata(bio, meta_buffer, meta_len, meta_seed, write); if (IS_ERR(meta)) { -- 2.40.1

1 year, 2 months

1
0
0 0

[PATCH 01/10] drm/amd/display: temp w/a for dGPU to enter idle optimizations

by Wayne Lin

From: Aurabindo Pillai <aurabindo.pillai(a)amd.com> [Why&How] vblank immediate disable currently does not work for all asics. On DCN401, the vblank interrupts never stop coming, and hence we never get a chance to trigger idle optimizations. Add a workaround to enable immediate disable only on APUs for now. This adds a 2-frame delay for triggering idle optimization, which is a negligible overhead. Fixes: db11e20a1144 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") Fixes: 6dfb3a42a914 ("drm/amd/display: use a more lax vblank enable policy for DCN35+") Cc: Mario Limonciello <mario.limonciello(a)amd.com> Cc: Alex Deucher <alexander.deucher(a)amd.com> Cc: stable(a)vger.kernel.org Reviewed-by: Harry Wentland <harry.wentland(a)amd.com> Reviewed-by: Rodrigo Siqueira <rodrigo.siqueira(a)amd.com> Signed-off-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com> Signed-off-by: Wayne Lin <wayne.lin(a)amd.com> --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index a4882b16ace2..6ea54eb5d68d 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -8379,7 +8379,8 @@ static void manage_dm_interrupts(struct amdgpu_device *adev, if (amdgpu_ip_version(adev, DCE_HWIP, 0) < IP_VERSION(3, 5, 0) || acrtc_state->stream->link->psr_settings.psr_version < - DC_PSR_VERSION_UNSUPPORTED) { + DC_PSR_VERSION_UNSUPPORTED || + !(adev->flags & AMD_IS_APU)) { timing = &acrtc_state->stream->timing; /* at least 2 frames */ -- 2.37.3

1 year, 2 months

2
1
0 0

[PATCH v2] mm/page_alloc: Let GFP_ATOMIC order-0 allocs access highatomic reserves

by Matt Fleming

From: Matt Fleming <mfleming(a)cloudflare.com> Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to fail even though free pages are available in the highatomic reserves. GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock() since it's only run from reclaim. Given that such allocations will pass the watermarks in __zone_watermark_unusable_free(), it makes sense to fallback to highatomic reserves the same way that ALLOC_OOM can. This fixes order-0 page allocation failures observed on Cloudflare's fleet when handling network packets: kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-7 CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1 Hardware name: MACHINE Call Trace: <IRQ> dump_stack_lvl+0x3c/0x50 warn_alloc+0x13a/0x1c0 __alloc_pages_slowpath.constprop.0+0xc9d/0xd10 __alloc_pages+0x327/0x340 __napi_alloc_skb+0x16d/0x1f0 bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en] bnxt_rx_pkt+0x201/0x15e0 [bnxt_en] __bnxt_poll_work+0x156/0x2b0 [bnxt_en] bnxt_poll+0xd9/0x1c0 [bnxt_en] __napi_poll+0x2b/0x1b0 bpf_trampoline_6442524138+0x7d/0x1000 __napi_poll+0x5/0x1b0 net_rx_action+0x342/0x740 handle_softirqs+0xcf/0x2b0 irq_exit_rcu+0x6c/0x90 sysvec_apic_timer_interrupt+0x72/0x90 </IRQ> Fixes: 1d91df85f399 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs") Suggested-by: Vlastimil Babka <vbabka(a)suse.cz> Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz> Cc: Andrew Morton <akpm(a)linux-foundation.org> Cc: Mel Gorman <mgorman(a)techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com> Cc: linux-mm(a)kvack.org Cc: stable(a)vger.kernel.org # v5.9+ Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytz… Signed-off-by: Matt Fleming <mfleming(a)cloudflare.com> --- mm/page_alloc.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) v2: Update comment and add Fixes, Reviewed-by, and Cc: stable tags. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8afab64814dc..94a2ffe28008 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2893,12 +2893,12 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, page = __rmqueue(zone, order, migratetype, alloc_flags); /* - * If the allocation fails, allow OOM handling access - * to HIGHATOMIC reserves as failing now is worse than - * failing a high-order atomic allocation in the - * future. + * If the allocation fails, allow OOM handling and + * order-0 (atomic) allocs access to HIGHATOMIC + * reserves as failing now is worse than failing a + * high-order atomic allocation in the future. */ - if (!page && (alloc_flags & ALLOC_OOM)) + if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK))) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { -- 2.34.1

1 year, 2 months

1
0
0 0

backport mseal and mseal_test to 6.10

by Jeff Xu

Hi Greg, How are you? What is the process to backport Pedro's recent mseal fixes to 6.10 ? Specifically those 5 commits: 67203f3f2a63d429272f0c80451e5fcc469fdb46 selftests/mm: add mseal test for no-discard madvise 4d1b3416659be70a2251b494e85e25978de06519 mm: move can_modify_vma to mm/vma.h 4a2dd02b09160ee43f96c759fafa7b56dfc33816 mm/mprotect: replace can_modify_mm with can_modify_vma 23c57d1fa2b9530e38f7964b4e457fed5a7a0ae8 mseal: replace can_modify_mm_madv with a vma variant f28bdd1b17ec187eaa34845814afaaff99832762 selftests/mm: add more mseal traversal tests There will be merge conflicts, I can backport them to 5.10 and test to help the backporting process. Those 5 fixes are needed for two reasons: maintain the consistency of mseal's semantics across releases, and for ease of backporting future fixes. PS: There are also three other commits for munmap and remap (see below), they have dependency on Michael Ellerman's arch_unmap() patch [1] and maybe uprobe change [2]. If Michael and Oleg are OK with backporting their patches, then great ! Otherwise, since those commits below don't change mseal's semantics, I think it is OK to just backport above 5 patches. df2a7df9a9aa32c3df227de346693e6e802c8591 mm/munmap: replace can_modify_mm with can_modify_vma 38075679b5f157eeacd46c900e9cfc684bdbc167 mm/mremap: replace can_modify_mm with can_modify_vma 5b3db2b812a1f86dfab587324d198a5d10c7a5cf mm: remove can_modify_mm() [1] https://lore.kernel.org/all/20240812082605.743814-1-mpe@ellerman.id.au/ [2] https://lore.kernel.org/all/20240911131320.GA3448@redhat.com/ Thanks! -Jeff

1 year, 2 months

3
8
0 0

[PATCH 05/10] drm/amd/display: temp w/a for DP Link Layer compliance

by Wayne Lin

From: Aurabindo Pillai <aurabindo.pillai(a)amd.com> [Why&How] Disabling P-State support on full updates for DCN401 results in introducing additional communication with SMU. A UCLK hard min message to SMU takes 4 seconds to go through, which was due to DCN not allowing pstate switch, which was caused by incorrect value for TTU watermark before blanking the HUBP prior to DPG on for servicing the test request. Fix the issue temporarily by disallowing pstate changes for compliance test while test request handler is reworked for a proper fix. Fixes: 67ea53a4bd9d ("drm/amd/display: Disable DCN401 UCLK P-State support on full updates") Cc: Mario Limonciello <mario.limonciello(a)amd.com> Cc: Alex Deucher <alexander.deucher(a)amd.com> Cc: stable(a)vger.kernel.org Reviewed-by: Dillon Varone <dillon.varone(a)amd.com> Signed-off-by: Aurabindo Pillai <aurabindo.pillai(a)amd.com> Signed-off-by: Wayne Lin <wayne.lin(a)amd.com> --- .../drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c index 8eaf292bc4eb..b0fea0856866 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c @@ -46,6 +46,7 @@ #include "dm_helpers.h" #include "ddc_service_types.h" +#include "clk_mgr.h" static u32 edid_extract_panel_id(struct edid *edid) { @@ -1181,6 +1182,8 @@ bool dm_helpers_dp_handle_test_pattern_request( struct pipe_ctx *pipe_ctx = NULL; struct amdgpu_dm_connector *aconnector = link->priv; struct drm_device *dev = aconnector->base.dev; + struct dc_state *dc_state = ctx->dc->current_state; + struct clk_mgr *clk_mgr = ctx->dc->clk_mgr; int i; for (i = 0; i < MAX_PIPES; i++) { @@ -1281,6 +1284,16 @@ bool dm_helpers_dp_handle_test_pattern_request( pipe_ctx->stream->test_pattern.type = test_pattern; pipe_ctx->stream->test_pattern.color_space = test_pattern_color_space; + /* Temp W/A for compliance test failure */ + dc_state->bw_ctx.bw.dcn.clk.p_state_change_support = false; + dc_state->bw_ctx.bw.dcn.clk.dramclk_khz = clk_mgr->dc_mode_softmax_enabled ? + clk_mgr->bw_params->dc_mode_softmax_memclk : clk_mgr->bw_params->max_memclk_mhz; + dc_state->bw_ctx.bw.dcn.clk.idle_dramclk_khz = dc_state->bw_ctx.bw.dcn.clk.dramclk_khz; + ctx->dc->clk_mgr->funcs->update_clocks( + ctx->dc->clk_mgr, + dc_state, + false); + dc_link_dp_set_test_pattern( (struct dc_link *) link, test_pattern, -- 2.37.3

1 year, 2 months

1
0
0 0

[PATCH ipsec v3] xfrm: fix one more kernel-infoleak in algo dumping

by Petr Vaganov

During fuzz testing, the following issue was discovered: BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x598/0x2a30 _copy_to_iter+0x598/0x2a30 __skb_datagram_iter+0x168/0x1060 skb_copy_datagram_iter+0x5b/0x220 netlink_recvmsg+0x362/0x1700 sock_recvmsg+0x2dc/0x390 __sys_recvfrom+0x381/0x6d0 __x64_sys_recvfrom+0x130/0x200 x64_sys_call+0x32c8/0x3cc0 do_syscall_64+0xd8/0x1c0 entry_SYSCALL_64_after_hwframe+0x79/0x81 Uninit was stored to memory at: copy_to_user_state_extra+0xcc1/0x1e00 dump_one_state+0x28c/0x5f0 xfrm_state_walk+0x548/0x11e0 xfrm_dump_sa+0x1e0/0x840 netlink_dump+0x943/0x1c40 __netlink_dump_start+0x746/0xdb0 xfrm_user_rcv_msg+0x429/0xc00 netlink_rcv_skb+0x613/0x780 xfrm_netlink_rcv+0x77/0xc0 netlink_unicast+0xe90/0x1280 netlink_sendmsg+0x126d/0x1490 __sock_sendmsg+0x332/0x3d0 ____sys_sendmsg+0x863/0xc30 ___sys_sendmsg+0x285/0x3e0 __x64_sys_sendmsg+0x2d6/0x560 x64_sys_call+0x1316/0x3cc0 do_syscall_64+0xd8/0x1c0 entry_SYSCALL_64_after_hwframe+0x79/0x81 Uninit was created at: __kmalloc+0x571/0xd30 attach_auth+0x106/0x3e0 xfrm_add_sa+0x2aa0/0x4230 xfrm_user_rcv_msg+0x832/0xc00 netlink_rcv_skb+0x613/0x780 xfrm_netlink_rcv+0x77/0xc0 netlink_unicast+0xe90/0x1280 netlink_sendmsg+0x126d/0x1490 __sock_sendmsg+0x332/0x3d0 ____sys_sendmsg+0x863/0xc30 ___sys_sendmsg+0x285/0x3e0 __x64_sys_sendmsg+0x2d6/0x560 x64_sys_call+0x1316/0x3cc0 do_syscall_64+0xd8/0x1c0 entry_SYSCALL_64_after_hwframe+0x79/0x81 Bytes 328-379 of 732 are uninitialized Memory access of size 732 starts at ffff88800e18e000 Data copied to user address 00007ff30f48aff0 CPU: 2 PID: 18167 Comm: syz-executor.0 Not tainted 6.8.11 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Fixes copying of xfrm algorithms where some random data of the structure fields can end up in userspace. Padding in structures may be filled with random (possibly sensitve) data and should never be given directly to user-space. A similar issue was resolved in the commit 8222d5910dae ("xfrm: Zero padding when dumping algos and encap") Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: c7a5899eb26e ("xfrm: redact SA secret with lockdown confidentiality") Cc: stable(a)vger.kernel.org Co-developed-by: Boris Tonofa <b.tonofa(a)ideco.ru> Signed-off-by: Boris Tonofa <b.tonofa(a)ideco.ru> Signed-off-by: Petr Vaganov <p.vaganov(a)ideco.ru> --- v3: Corrected commit description "This patch fixes copying..." to "Fixes copying..." according to accepted rules of Linux kernel commits, as suggested by Markus Elfring <elfring(a)users.sourceforge.net>. v2: Fixed typo in sizeof(sizeof(ap->alg_name) expression. The third argument for the strscpy_pad macro was chosen by analogy with those in other functions of this file - as did commit 8222d5910dae ("xfrm: Zero padding when dumping algos and encap"). I still think it would be better to leave the strscpy_pad() macro with three arguments in this patch, mainly so that it would be consistent with the existing similar code in this module. Regarding strncpy() above, we don't think it is a real issue since the uncopied parts should be already padded with zeroes by nla_reserve(). net/xfrm/xfrm_user.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 55f039ec3d59..0083faabe8be 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -1098,7 +1098,9 @@ static int copy_to_user_auth(struct xfrm_algo_auth *auth, struct sk_buff *skb) if (!nla) return -EMSGSIZE; ap = nla_data(nla); - memcpy(ap, auth, sizeof(struct xfrm_algo_auth)); + strscpy_pad(ap->alg_name, auth->alg_name, sizeof(ap->alg_name)); + ap->alg_key_len = auth->alg_key_len; + ap->alg_trunc_len = auth->alg_trunc_len; if (redact_secret && auth->alg_key_len) memset(ap->alg_key, 0, (auth->alg_key_len + 7) / 8); else -- 2.46.2

1 year, 2 months

3
2
0 0

[PATCH V4] x86/apic: Always explicitly disarm TSC-deadline timer

by Zhang Rui

New processors have become pickier about the local APIC timer state before entering low power modes. These low power modes are used (for example) when you close your laptop lid and suspend. If you put your laptop in a bag in this unnecessarily-high-power state, it is likely to get quite toasty while it quickly sucks the battery dry. The problem boils down to some CPUs' inability to power down until the kernel fully disables the local APIC timer. The current kernel code works in one-shot and periodic modes but does not work for deadline mode. Deadline mode has been the supported and preferred mode on Intel CPUs for over a decade and uses an MSR to drive the timer instead of an APIC register. Disable the TSC Deadline timer in lapic_timer_shutdown() by writing to MSR_IA32_TSC_DEADLINE when in TSC-deadline mode. Also avoid writing to the initial-count register (APIC_TMICT) which is ignored in TSC-deadline mode. Note: The APIC_LVTT|=APIC_LVT_MASKED operation should theoretically be enough to tell the hardware that the timer will not fire in any of the timer modes. But mitigating AMD erratum 411[1] also requires clearing out APIC_TMICT. Solely setting APIC_LVT_MASKED is also ineffective in practice on Intel Lunar Lake systems, which is the motivation for this change. 1. 411 Processor May Exit Message-Triggered C1E State Without an Interrupt if Local APIC Timer Reaches Zero - https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/revisio… Cc: stable(a)vger.kernel.org Fixes: 279f1461432c ("x86: apic: Use tsc deadline for oneshot when available") Suggested-by: Dave Hansen <dave.hansen(a)intel.com> Signed-off-by: Zhang Rui <rui.zhang(a)intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada(a)linux.intel.com> Tested-by: Todd Brandt <todd.e.brandt(a)intel.com> --- V2 - Improve changelog V3 - Subject and changelog rewrite - Check LAPIC Timer mode using APIC_LVTT value instead of extra CPU feature flag check - Avoid APIC_TMICT write which is ignored in TSC-deadline mode V4 - Add back Fixes tag and stable tag which was missing in V3 - Update patch recipients --- arch/x86/kernel/apic/apic.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 6513c53c9459..5436a4083065 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -440,7 +440,19 @@ static int lapic_timer_shutdown(struct clock_event_device *evt) v = apic_read(APIC_LVTT); v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR); apic_write(APIC_LVTT, v); - apic_write(APIC_TMICT, 0); + + /* + * Setting APIC_LVT_MASKED should be enough to tell the + * hardware that this timer will never fire. But AMD + * erratum 411 and some Intel CPU behavior circa 2024 + * say otherwise. Time for belt and suspenders programming, + * mask the timer and zero the counter registers: + */ + if (v & APIC_LVT_TIMER_TSCDEADLINE) + wrmsrl(MSR_IA32_TSC_DEADLINE, 0); + else + apic_write(APIC_TMICT, 0); + return 0; } -- 2.34.1

1 year, 2 months

1
0
0 0

+ mm-swapfile-skip-hugetlb-pages-for-unuse_vma.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled Subject: mm/swapfile: skip HugeTLB pages for unuse_vma has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-swapfile-skip-hugetlb-pages-for-unuse_vma.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche… This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Liu Shixin <liushixin2(a)huawei.com> Subject: mm/swapfile: skip HugeTLB pages for unuse_vma Date: Tue, 15 Oct 2024 09:45:21 +0800 I got a bad pud error and lost a 1GB HugeTLB when calling swapoff. The problem can be reproduced by the following steps: 1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory. 2. Swapout the above anonymous memory. 3. run swapoff and we will get a bad pud error in kernel message: mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7) We can tell that pud_clear_bad is called by pud_none_or_clear_bad in unuse_pud_range() by ftrace. And therefore the HugeTLB pages will never be freed because we lost it from page table. We can skip HugeTLB pages for unuse_vma to fix it. Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage") Signed-off-by: Liu Shixin <liushixin2(a)huawei.com> Acked-by: Muchun Song <muchun.song(a)linux.dev> Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/mm/swapfile.c~mm-swapfile-skip-hugetlb-pages-for-unuse_vma +++ a/mm/swapfile.c @@ -2313,7 +2313,7 @@ static int unuse_mm(struct mm_struct *mm mmap_read_lock(mm); for_each_vma(vmi, vma) { - if (vma->anon_vma) { + if (vma->anon_vma && !is_vm_hugetlb_page(vma)) { ret = unuse_vma(vma, type); if (ret) break; _ Patches currently in -mm which might be from liushixin2(a)huawei.com are mm-swapfile-skip-hugetlb-pages-for-unuse_vma.patch

1 year, 2 months

1
0
0 0

[PATCH] mm/swapfile: skip HugeTLB pages for unuse_vma

by Liu Shixin

I got a bad pud error and lost a 1GB HugeTLB when calling swapoff. The problem can be reproduced by the following steps: 1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory. 2. Swapout the above anonymous memory. 3. run swapoff and we will get a bad pud error in kernel message: mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7) We can tell that pud_clear_bad is called by pud_none_or_clear_bad in unuse_pud_range() by ftrace. And therefore the HugeTLB pages will never be freed because we lost it from page table. We can skip HugeTLB pages for unuse_vma to fix it. Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage") Signed-off-by: Liu Shixin <liushixin2(a)huawei.com> --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 0cded32414a1..f4ef91513fc9 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2312,7 +2312,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type) mmap_read_lock(mm); for_each_vma(vmi, vma) { - if (vma->anon_vma) { + if (vma->anon_vma && !is_vm_hugetlb_page(vma)) { ret = unuse_vma(vma, type); if (ret) break; -- 2.34.1

1 year, 2 months

3
2
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror October 2024