The patch titled
Subject: mm/numa: no task_numa_fault() call if PTE is changed
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-numa-no-task_numa_fault-call-if-pte-is-changed.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Zi Yan <ziy@nvidia.com>
Subject: mm/numa: no task_numa_fault() call if PTE is changed
Date: Fri, 9 Aug 2024 10:59:04 -0400
When handling a numa page fault, task_numa_fault() should be called only
by the process that restores the page table of the faulted folio, to
avoid duplicated stats counting.  Commit b99a342d4f11 ("NUMA balancing:
reduce TLB flush via delaying mapping on hint page fault") restructured
do_numa_page() but did not avoid the task_numa_fault() call in the second
page table check after a numa migration failure.  Fix it by making all
!pte_same() paths return immediately.
This issue can cause task_numa_fault() to be called more often than
necessary and lead to unexpected numa balancing results (it is hard to
tell whether the issue causes a positive or negative performance impact,
due to the duplicated numa fault counting).
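To illustrate the control-flow change, here is a minimal standalone C
sketch (do_numa_page_old/new, count_numa_fault and the parameters are
stand-ins for illustration, not the real kernel symbols):

#define NUMA_NO_NODE (-1)

static void count_numa_fault(void) { /* stand-in for task_numa_fault() */ }

/* Old shape: the !pte_same() path after a migration failure jumped to
 * the common exit label, which still counted the fault. */
static int do_numa_page_old(int pte_changed, int nid)
{
	if (pte_changed)
		goto out;	/* another task already restored the PTE ... */
	/* ... normal path: restore the PTE ourselves ... */
out:
	if (nid != NUMA_NO_NODE)
		count_numa_fault();	/* ... yet both paths count the fault */
	return 0;
}

/* Fixed shape: a changed PTE means the task that changed it does the
 * accounting, so return without counting. */
static int do_numa_page_new(int pte_changed, int nid)
{
	if (pte_changed)
		return 0;
	if (nid != NUMA_NO_NODE)
		count_numa_fault();
	return 0;
}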
Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com
Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.inte…
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/memory.c | 33 ++++++++++++++++-----------------
1 file changed, 16 insertions(+), 17 deletions(-)
--- a/mm/memory.c~mm-numa-no-task_numa_fault-call-if-pte-is-changed
+++ a/mm/memory.c
@@ -5295,7 +5295,7 @@ static vm_fault_t do_numa_page(struct vm
if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
+ return 0;
}
pte = pte_modify(old_pte, vma->vm_page_prot);
@@ -5358,23 +5358,19 @@ static vm_fault_t do_numa_page(struct vm
if (!migrate_misplaced_folio(folio, vma, target_nid)) {
nid = target_nid;
flags |= TNF_MIGRATED;
- } else {
- flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (unlikely(!vmf->pte))
- goto out;
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
- }
- goto out_map;
+ task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ return 0;
}
-out:
- if (nid != NUMA_NO_NODE)
- task_numa_fault(last_cpupid, nid, nr_pages, flags);
- return 0;
+ flags |= TNF_MIGRATE_FAIL;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ return 0;
+ if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
out_map:
/*
* Make it present again, depending on how arch implements
@@ -5387,7 +5383,10 @@ out_map:
numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
writable);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
+
+ if (nid != NUMA_NO_NODE)
+ task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ return 0;
}
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
_
Patches currently in -mm which might be from ziy@nvidia.com are
mm-numa-no-task_numa_fault-call-if-pte-is-changed.patch
mm-numa-no-task_numa_fault-call-if-pmd-is-changed.patch
memory-tiering-read-last_cpupid-correctly-in-do_huge_pmd_numa_page.patch
memory-tiering-introduce-folio_use_access_time-check.patch
memory-tiering-count-pgpromote_success-when-mem-tiering-is-enabled.patch
The quilt patch titled
Subject: crash: Fix riscv64 crash memory reserve dead loop
has been removed from the -mm tree. Its filename was
crash-fix-riscv64-crash-memory-reserve-dead-loop.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Jinjie Ruan <ruanjinjie@huawei.com>
Subject: crash: Fix riscv64 crash memory reserve dead loop
Date: Fri, 2 Aug 2024 17:01:05 +0800
On a RISCV64 QEMU machine with 512MB of memory, the cmdline
"crashkernel=500M,high" will cause the system to stall as below:
Zone ranges:
DMA32 [mem 0x0000000080000000-0x000000009fffffff]
Normal empty
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000080000000-0x000000008005ffff]
node 0: [mem 0x0000000080060000-0x000000009fffffff]
Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
(stall here)
Commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop
bug") fixed this on 32-bit architectures.  However, the problem is not
completely solved.  If CRASH_ADDR_LOW_MAX == CRASH_ADDR_HIGH_MAX on a
64-bit architecture, for example when system memory equals
CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop still occurs:
-> reserve_crashkernel_generic() and high is true
   -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fails
   -> alloc at [0, CRASH_ADDR_LOW_MAX] fails, and the sequence repeats
      (because CRASH_ADDR_LOW_MAX == CRASH_ADDR_HIGH_MAX).
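A minimal standalone C sketch of the retry logic (alloc_range and the
constant values are illustrative stand-ins, not the kernel code) shows
why equal limits never make progress:

#include <stdbool.h>
#include <stdio.h>

/* On this configuration the two limits coincide (e.g. all RAM is low). */
#define CRASH_ADDR_LOW_MAX  0xa0000000ULL
#define CRASH_ADDR_HIGH_MAX 0xa0000000ULL

static bool alloc_range(unsigned long long base, unsigned long long end)
{
	(void)base; (void)end;
	return false;	/* simulate: no room for the requested crashkernel */
}

int main(void)
{
	unsigned long long search_base = CRASH_ADDR_LOW_MAX;
	unsigned long long search_end  = CRASH_ADDR_HIGH_MAX;
	bool high = true;

retry:
	if (!alloc_range(search_base, search_end)) {
		/* the fallback removed by this patch */
		if (high && search_end == CRASH_ADDR_HIGH_MAX) {
			search_end  = CRASH_ADDR_LOW_MAX; /* same value: no progress */
			search_base = 0;
			puts("retrying in low memory...");	/* printed forever */
			goto retry;
		}
		puts("cannot allocate crashkernel");
	}
	return 0;
}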
Before the refactor in commit 9c08a2a139fe ("x86: kdump: use generic
interface to simplify crashkernel reservation code"), x86 did not try to
reserve crash memory at low addresses if the allocation above high 4G
failed.  However, before the refactor in commit fdc268232dbba ("arm64:
kdump: use generic interface to simplify crashkernel reservation"), arm64
did try to reserve crash memory at low addresses if the allocation above
high 4G failed.  For 64-bit systems this fallback attempt does more harm
than good; remove it to fix this bug and align with the native x86
implementation.
After this patch, it prints:
cannot allocate crashkernel (size:0x1f400000)
Link: https://lkml.kernel.org/r/20240802090105.3871929-1-ruanjinjie@huawei.com
Fixes: 39365395046f ("riscv: kdump: use generic interface to simplify crashkernel reservation")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/crash_reserve.c | 9 ---------
1 file changed, 9 deletions(-)
--- a/kernel/crash_reserve.c~crash-fix-riscv64-crash-memory-reserve-dead-loop
+++ a/kernel/crash_reserve.c
@@ -416,15 +416,6 @@ retry:
goto retry;
}
- /*
- * For crashkernel=size[KMG],high, if the first attempt was
- * for high memory, fall back to low memory.
- */
- if (high && search_end == CRASH_ADDR_HIGH_MAX) {
- search_end = CRASH_ADDR_LOW_MAX;
- search_base = 0;
- goto retry;
- }
pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
crash_size);
return;
_
Patches currently in -mm which might be from ruanjinjie@huawei.com are
crash-fix-x86_32-crash-memory-reserve-dead-loop-bug.patch
crash-fix-x86_32-crash-memory-reserve-dead-loop.patch
arm-use-generic-interface-to-simplify-crashkernel-reservation.patch
crash-fix-crash-memory-reserve-exceed-system-memory-bug.patch
From: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
[why & how]
When commit 9d84c7ef8a87 ("drm/amd/display: Correct cursor position on
horizontal mirror") was introduced, it used the wrong calculation for the
X position copy.  This commit uses the correct calculation, based on the
original patch.
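For illustration only (made-up numbers; the actual meaning of temp_x and
viewport_width is defined in dcn10_set_cursor_position() below): with
viewport_width = 1920 and temp_x = 100, the old formula yielded
pos_cpy.x = 2 * 1920 - 100 = 3740, while the corrected one yields
pos_cpy.x = 100 + 1920 = 2020.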
Fixes: 9d84c7ef8a87 ("drm/amd/display: Correct cursor position on horizontal mirror")
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>
---
drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
index 802902f54d09..01dffed4d30b 100644
--- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
@@ -3687,7 +3687,7 @@ void dcn10_set_cursor_position(struct pipe_ctx *pipe_ctx)
(int)hubp->curs_attr.width || pos_cpy.x
<= (int)hubp->curs_attr.width +
pipe_ctx->plane_state->src_rect.x) {
- pos_cpy.x = 2 * viewport_width - temp_x;
+ pos_cpy.x = temp_x + viewport_width;
}
}
} else {
--
2.34.1
When handling a numa page fault, task_numa_fault() should be called only
by the process that restores the page table of the faulted folio, to
avoid duplicated stats counting.  Commit b99a342d4f11 ("NUMA balancing:
reduce TLB flush via delaying mapping on hint page fault") restructured
do_numa_page() but did not avoid the task_numa_fault() call in the second
page table check after a numa migration failure.  Fix it by making all
!pte_same() paths return immediately.
This issue can cause task_numa_fault() to be called more often than
necessary and lead to unexpected numa balancing results (it is hard to
tell whether the issue causes a positive or negative performance impact,
due to the duplicated numa fault counting).
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.inte…
Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault")
Cc: <stable@vger.kernel.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/memory.c | 33 ++++++++++++++++-----------------
1 file changed, 16 insertions(+), 17 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 67496dc5064f..bf791da57cab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5461,7 +5461,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
+ return 0;
}
pte = pte_modify(old_pte, vma->vm_page_prot);
@@ -5523,23 +5523,19 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (!migrate_misplaced_folio(folio, vma, target_nid)) {
nid = target_nid;
flags |= TNF_MIGRATED;
- } else {
- flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (unlikely(!vmf->pte))
- goto out;
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
- }
- goto out_map;
+ task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ return 0;
}
-out:
- if (nid != NUMA_NO_NODE)
- task_numa_fault(last_cpupid, nid, nr_pages, flags);
- return 0;
+ flags |= TNF_MIGRATE_FAIL;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ return 0;
+ if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+ }
out_map:
/*
* Make it present again, depending on how arch implements
@@ -5552,7 +5548,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
writable);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
+
+ if (nid != NUMA_NO_NODE)
+ task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ return 0;
}
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
--
2.43.0
From: Long Li <longli@microsoft.com>
After napi_complete_done() is called while NAPI is polling in the current
process context, another NAPI may be scheduled and start running in
softirq on another CPU, and it may ring the doorbell before the current
CPU does.  Combined with unnecessary rings when there is no need to arm
the CQ, this triggers error paths in the hardware.
This patch fixes the race by calling napi_complete_done() after the
doorbell ring.  It also limits the number of unnecessary rings when there
is no need to arm.  MANA hardware specifies that there must be one
doorbell ring every 8 CQ wraparounds.  This driver guarantees one
doorbell ring as soon as the number of consumed CQEs exceeds 4 CQ
wraparounds.  In practical workloads, 4 CQ wraparounds proves to be large
enough that this limit is rarely exceeded before all the napi weight is
consumed.
To implement this, add a per-CQ counter cq->work_done_since_doorbell,
and ring the doorbell as soon as 4 CQ wraparounds' worth of CQEs have
been consumed.
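As a rough sketch of the resulting completion path (ring_doorbell,
napi_done and struct cq_state are illustrative stand-ins; the real logic
is mana_cq_handler() in the diff below):

#define SET_ARM_BIT 1
#define COMP_ENTRY_SIZE 64		/* illustrative CQE size in bytes */

struct cq_state {
	int work_done_since_doorbell;	/* CQEs consumed since the last ring */
	int queue_size;			/* CQ size in bytes */
};

/* stand-ins for mana_gd_ring_cq() and napi_complete_done() */
static void ring_doorbell(int arm_bit) { (void)arm_bit; }
static void napi_done(int work) { (void)work; }

static void cq_handler(struct cq_state *cq, int w, int budget)
{
	cq->work_done_since_doorbell += w;

	if (w < budget) {
		/* Done polling: ring and arm first, complete NAPI only
		 * afterwards, so a NAPI instance rescheduled on another
		 * CPU cannot ring the doorbell before this CPU does.
		 */
		ring_doorbell(SET_ARM_BIT);
		cq->work_done_since_doorbell = 0;
		napi_done(w);
	} else if (cq->work_done_since_doorbell >
		   cq->queue_size / COMP_ENTRY_SIZE * 4) {
		/* The hardware requires a ring at least every 8 CQ
		 * wraparounds; ring without arming once 4 wraparounds'
		 * worth of CQEs have been consumed.
		 */
		ring_doorbell(0);
		cq->work_done_since_doorbell = 0;
	}
}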
Cc: stable@vger.kernel.org
Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
---
change in v2:
Added more details to comments to explain the patch
drivers/net/ethernet/microsoft/mana/mana_en.c | 24 ++++++++++++-------
include/net/mana/mana.h | 1 +
2 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d2f07e179e86..f83211f9e737 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1788,7 +1788,6 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
{
struct mana_cq *cq = context;
- u8 arm_bit;
int w;
WARN_ON_ONCE(cq->gdma_cq != gdma_queue);
@@ -1799,16 +1798,23 @@ static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
mana_poll_tx_cq(cq);
w = cq->work_done;
-
- if (w < cq->budget &&
- napi_complete_done(&cq->napi, w)) {
- arm_bit = SET_ARM_BIT;
- } else {
- arm_bit = 0;
+ cq->work_done_since_doorbell += w;
+
+ if (w < cq->budget) {
+ mana_gd_ring_cq(gdma_queue, SET_ARM_BIT);
+ cq->work_done_since_doorbell = 0;
+ napi_complete_done(&cq->napi, w);
+ } else if (cq->work_done_since_doorbell >
+ cq->gdma_cq->queue_size / COMP_ENTRY_SIZE * 4) {
+ /* MANA hardware requires at least one doorbell ring every 8
+ * wraparounds of CQ even if there is no need to arm the CQ.
+ * This driver rings the doorbell as soon as we have exceeded
+ * 4 wraparounds.
+ */
+ mana_gd_ring_cq(gdma_queue, 0);
+ cq->work_done_since_doorbell = 0;
}
- mana_gd_ring_cq(gdma_queue, arm_bit);
-
return w;
}
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 6439fd8b437b..7caa334f4888 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -275,6 +275,7 @@ struct mana_cq {
/* NAPI data */
struct napi_struct napi;
int work_done;
+ int work_done_since_doorbell;
int budget;
};
--
2.17.1
When handling a numa page fault, task_numa_fault() should be called only
by the process that restores the page table of the faulted folio, to
avoid duplicated stats counting.  Commit b99a342d4f11 ("NUMA balancing:
reduce TLB flush via delaying mapping on hint page fault") restructured
do_numa_page() and do_huge_pmd_numa_page() but did not avoid the
task_numa_fault() call in the second page table check after a numa
migration failure.  Fix it by making all !pte_same()/!pmd_same() paths
return immediately.
This issue can cause task_numa_fault() to be called more often than
necessary and lead to unexpected numa balancing results (it is hard to
tell whether the issue causes a positive or negative performance impact,
due to the duplicated numa fault counting).
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.inte…
Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault")
Cc: <stable@vger.kernel.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/huge_memory.c | 5 +++--
mm/memory.c | 5 +++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0024266dea0a..a3c018f2b554 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1734,10 +1734,11 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
goto out_map;
}
-out:
+count_fault:
if (nid != NUMA_NO_NODE)
task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+out:
return 0;
out_map:
@@ -1749,7 +1750,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
spin_unlock(vmf->ptl);
- goto out;
+ goto count_fault;
}
/*
diff --git a/mm/memory.c b/mm/memory.c
index 67496dc5064f..503d493263df 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5536,9 +5536,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
goto out_map;
}
-out:
+count_fault:
if (nid != NUMA_NO_NODE)
task_numa_fault(last_cpupid, nid, nr_pages, flags);
+out:
return 0;
out_map:
/*
@@ -5552,7 +5553,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
numa_rebuild_single_mapping(vmf, vma, vmf->address, vmf->pte,
writable);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
+ goto count_fault;
}
static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
--
2.43.0