The patch titled
Subject: mm/page_alloc: Separate THP PCP into movable and non-movable categories
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-page_alloc-separate-thp-pcp-into-movable-and-non-movable-categories.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: yangge <yangge1116(a)126.com>
Subject: mm/page_alloc: Separate THP PCP into movable and non-movable categories
Date: Thu, 20 Jun 2024 08:59:50 +0800
Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type of
pages in THP-sized PCP list, it's possible that non-movable allocation
requests may get a CMA page from the list, in some cases, it's not
acceptable.
If a large number of CMA memory are configured in system (for example, the
CMA memory accounts for 50% of the system memory), starting a virtual
machine with device passthrough will get stuck. During starting the
virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM,
...) to pin memory. Normally if a page is present and in CMA area,
pin_user_pages_remote() will migrate the page from CMA area to non-CMA
area because of FOLL_LONGTERM flag. But if non-movable allocation
requests return CMA memory, migrate_longterm_unpinnable_pages() will
migrate a CMA page to another CMA page, which will fail to pass the check
in check_and_migrate_movable_pages() and cause migration endless.
Call trace:
pin_user_pages_remote
--__gup_longterm_locked // endless loops in this function
----_get_user_pages_locked
----check_and_migrate_movable_pages
------migrate_longterm_unpinnable_pages
--------alloc_migration_target
This problem will also have a negative impact on CMA itself. For example,
when CMA is borrowed by THP, and we need to reclaim it through cma_alloc()
or dma_alloc_coherent(), we must move those pages out to ensure CMA's
users can retrieve that contigous memory. Currently, CMA's memory is
occupied by non-movable pages, meaning we can't relocate them. As a
result, cma_alloc() is more likely to fail.
To fix the problem above, we add one PCP list for THP, which will not
introduce a new cacheline for struct per_cpu_pages. THP will have 2 PCP
lists, one PCP list is used by MOVABLE allocation, and the other PCP list
is used by UNMOVABLE allocation. MOVABLE allocation contains GPF_MOVABLE,
and UNMOVABLE allocation contains GFP_UNMOVABLE and GFP_RECLAIMABLE.
Link: https://lkml.kernel.org/r/1718845190-4456-1-git-send-email-yangge1116@126.c…
Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge <yangge1116(a)126.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <21cnbao(a)gmail.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmzone.h | 9 ++++-----
mm/page_alloc.c | 9 +++++++--
2 files changed, 11 insertions(+), 7 deletions(-)
--- a/include/linux/mmzone.h~mm-page_alloc-separate-thp-pcp-into-movable-and-non-movable-categories
+++ a/include/linux/mmzone.h
@@ -654,13 +654,12 @@ enum zone_watermarks {
};
/*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
- * for THP which will usually be GFP_MOVABLE. Even if it is another type,
- * it should not contribute to serious fragmentation causing THP allocation
- * failures.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. Two additional lists
+ * are added for THP. One PCP list is used by GPF_MOVABLE, and the other PCP list
+ * is used by GFP_UNMOVABLE and GFP_RECLAIMABLE.
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
#else
#define NR_PCP_THP 0
#endif
--- a/mm/page_alloc.c~mm-page_alloc-separate-thp-pcp-into-movable-and-non-movable-categories
+++ a/mm/page_alloc.c
@@ -504,10 +504,15 @@ out:
static inline unsigned int order_to_pindex(int migratetype, int order)
{
+ bool __maybe_unused movable;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order > PAGE_ALLOC_COSTLY_ORDER) {
VM_BUG_ON(order != HPAGE_PMD_ORDER);
- return NR_LOWORDER_PCP_LISTS;
+
+ movable = migratetype == MIGRATE_MOVABLE;
+
+ return NR_LOWORDER_PCP_LISTS + movable;
}
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -521,7 +526,7 @@ static inline int pindex_to_order(unsign
int order = pindex / MIGRATE_PCPTYPES;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex == NR_LOWORDER_PCP_LISTS)
+ if (pindex >= NR_LOWORDER_PCP_LISTS)
order = HPAGE_PMD_ORDER;
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
_
Patches currently in -mm which might be from yangge1116(a)126.com are
mm-page_alloc-separate-thp-pcp-into-movable-and-non-movable-categories.patch
mm-page_alloc-add-one-pcp-list-for-thp.patch
On Wed, Jun 19, 2024 at 3:48 PM Steve French <smfrench(a)gmail.com> wrote:
>
> tentatively merged into cifs-2.6.git for-next pending testing and any additional review
Steve, Thanks! I guess you missed an email from mm-commits.
A couple of hours ago, this was pulled into mm-hotfixes-unstable, likely
for the same purpose. Will this cause any conflicts when both changes hit
linux-next?
https://lore.kernel.org/mm-commits/20240618195943.EC07BC3277B@smtp.kernel.o…
Will we just keep one?
>
> On Tue, Jun 18, 2024 at 3:56 AM Barry Song <21cnbao(a)gmail.com> wrote:
>>
>> From: Barry Song <v-songbaohua(a)oppo.com>
>>
>> Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS
>> swap-space"), we can plug multiple pages then unplug them all together.
>> That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it
>> actually equals the size of iov_iter_npages(iter, INT_MAX).
>>
>> Note this issue has nothing to do with large folios as we don't support
>> THP_SWPOUT to non-block devices.
>>
>> Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space")
>> Reported-by: Christoph Hellwig <hch(a)lst.de>
>> Closes: https://lore.kernel.org/linux-mm/20240614100329.1203579-1-hch@lst.de/
>> Cc: NeilBrown <neilb(a)suse.de>
>> Cc: Anna Schumaker <anna(a)kernel.org>
>> Cc: Steve French <sfrench(a)samba.org>
>> Cc: Trond Myklebust <trondmy(a)kernel.org>
>> Cc: Chuanhua Han <hanchuanhua(a)oppo.com>
>> Cc: Ryan Roberts <ryan.roberts(a)arm.com>
>> Cc: Chris Li <chrisl(a)kernel.org>
>> Cc: "Huang, Ying" <ying.huang(a)intel.com>
>> Cc: Jeff Layton <jlayton(a)kernel.org>
>> Cc: Paulo Alcantara <pc(a)manguebit.com>
>> Cc: Ronnie Sahlberg <ronniesahlberg(a)gmail.com>
>> Cc: Shyam Prasad N <sprasad(a)microsoft.com>
>> Cc: Tom Talpey <tom(a)talpey.com>
>> Cc: Bharath SM <bharathsm(a)microsoft.com>
>> Cc: <stable(a)vger.kernel.org>
>> Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
>> ---
>> -v2:
>> * drop the assertion instead of fixing the assertion.
>> per the comments of Willy, Christoph in nfs thread.
>>
>> fs/smb/client/file.c | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
>> index 9d5c2440abfc..1e269e0bc75b 100644
>> --- a/fs/smb/client/file.c
>> +++ b/fs/smb/client/file.c
>> @@ -3200,8 +3200,6 @@ static int cifs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
>> {
>> ssize_t ret;
>>
>> - WARN_ON_ONCE(iov_iter_count(iter) != PAGE_SIZE);
>> -
>> if (iov_iter_rw(iter) == READ)
>> ret = netfs_unbuffered_read_iter_locked(iocb, iter);
>> else
>> --
>> 2.34.1
>>
>>
>
>
> --
> Thanks,
>
> Steve
From: Christoph Hellwig <hch(a)lst.de>
Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS
swap-space"), we can plug multiple pages then unplug them all together.
That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it
actually equals the size of iov_iter_npages(iter, INT_MAX).
Note this issue has nothing to do with large folios as we don't support
THP_SWPOUT to non-block devices.
Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space")
Reported-by: Christoph Hellwig <hch(a)lst.de>
Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/
Cc: NeilBrown <neilb(a)suse.de>
Cc: Anna Schumaker <anna(a)kernel.org>
Cc: Steve French <sfrench(a)samba.org>
Cc: Trond Myklebust <trondmy(a)kernel.org>
Cc: Chuanhua Han <hanchuanhua(a)oppo.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: "Huang, Ying" <ying.huang(a)intel.com>
Cc: Jeff Layton <jlayton(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Martin Wege <martin.l.wege(a)gmail.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
[Barry: figure out the cause and correct the commit message]
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
---
fs/nfs/direct.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index bb2f583eb28b..90079ca134dd 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -141,8 +141,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
{
ssize_t ret;
- VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
-
if (iov_iter_rw(iter) == READ)
ret = nfs_file_direct_read(iocb, iter, true);
else
--
2.34.1
From: yangge <yangge1116(a)126.com>
Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type
of pages in THP-sized PCP list, it's possible that non-movable
allocation requests may get a CMA page from the list, in some cases,
it's not acceptable.
If a large number of CMA memory are configured in system (for
example, the CMA memory accounts for 50% of the system memory),
starting a virtual machine with device passthrough will get stuck.
During starting the virtual machine, it will call
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory. Normally
if a page is present and in CMA area, pin_user_pages_remote() will
migrate the page from CMA area to non-CMA area because of
FOLL_LONGTERM flag. But if non-movable allocation requests return
CMA memory, migrate_longterm_unpinnable_pages() will migrate a CMA
page to another CMA page, which will fail to pass the check in
check_and_migrate_movable_pages() and cause migration endless.
Call trace:
pin_user_pages_remote
--__gup_longterm_locked // endless loops in this function
----_get_user_pages_locked
----check_and_migrate_movable_pages
------migrate_longterm_unpinnable_pages
--------alloc_migration_target
This problem will also have a negative impact on CMA itself. For
example, when CMA is borrowed by THP, and we need to reclaim it
through cma_alloc() or dma_alloc_coherent(), we must move those
pages out to ensure CMA's users can retrieve that contigous memory.
Currently, CMA's memory is occupied by non-movable pages, meaning
we can't relocate them. As a result, cma_alloc() is more likely to
fail.
To fix the problem above, we add one PCP list for THP, which will
not introduce a new cacheline for struct per_cpu_pages. THP will
have 2 PCP lists, one PCP list is used by MOVABLE allocation, and
the other PCP list is used by UNMOVABLE allocation. MOVABLE
allocation contains GPF_MOVABLE, and UNMOVABLE allocation contains
GFP_UNMOVABLE and GFP_RECLAIMABLE.
Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: yangge <yangge1116(a)126.com>
---
V2:
- Change the commit title
- Add Cc to stable
include/linux/mmzone.h | 9 ++++-----
mm/page_alloc.c | 9 +++++++--
2 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b7546dd..cb7f265 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -656,13 +656,12 @@ enum zone_watermarks {
};
/*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
- * for THP which will usually be GFP_MOVABLE. Even if it is another type,
- * it should not contribute to serious fragmentation causing THP allocation
- * failures.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. Two additional lists
+ * are added for THP. One PCP list is used by GPF_MOVABLE, and the other PCP list
+ * is used by GFP_UNMOVABLE and GFP_RECLAIMABLE.
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
#else
#define NR_PCP_THP 0
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f416a0..0a837e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -504,10 +504,15 @@ static void bad_page(struct page *page, const char *reason)
static inline unsigned int order_to_pindex(int migratetype, int order)
{
+ bool __maybe_unused movable;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order > PAGE_ALLOC_COSTLY_ORDER) {
VM_BUG_ON(order != HPAGE_PMD_ORDER);
- return NR_LOWORDER_PCP_LISTS;
+
+ movable = migratetype == MIGRATE_MOVABLE;
+
+ return NR_LOWORDER_PCP_LISTS + movable;
}
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -521,7 +526,7 @@ static inline int pindex_to_order(unsigned int pindex)
int order = pindex / MIGRATE_PCPTYPES;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex == NR_LOWORDER_PCP_LISTS)
+ if (pindex >= NR_LOWORDER_PCP_LISTS)
order = HPAGE_PMD_ORDER;
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
--
2.7.4
From: Dmitry Safonov <0x7f454c46(a)gmail.com>
It seems I introduced it together with TCP_AO_CMDF_AO_REQUIRED, on
version 5 [1] of TCP-AO patches. Quite frustrative that having all these
selftests that I've written, running kmemtest & kcov was always in todo.
[1]: https://lore.kernel.org/netdev/20230215183335.800122-5-dima@arista.com/
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Closes: https://lore.kernel.org/netdev/20240617072451.1403e1d2@kernel.org/
Fixes: 0aadc73995d0 ("net/tcp: Prevent TCP-MD5 with TCP-AO being set")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dmitry Safonov <0x7f454c46(a)gmail.com>
---
net/ipv4/tcp_ao.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index 37c42b63ff99..09c0fa6756b7 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -1968,8 +1968,10 @@ static int tcp_ao_info_cmd(struct sock *sk, unsigned short int family,
first = true;
}
- if (cmd.ao_required && tcp_ao_required_verify(sk))
- return -EKEYREJECTED;
+ if (cmd.ao_required && tcp_ao_required_verify(sk)) {
+ err = -EKEYREJECTED;
+ goto out;
+ }
/* For sockets in TCP_CLOSED it's possible set keys that aren't
* matching the future peer (address/port/VRF/etc),
---
base-commit: 92e5605a199efbaee59fb19e15d6cc2103a04ec2
change-id: 20240619-tcp-ao-required-leak-22b4a9b6d743
Best regards,
--
Dmitry Safonov <0x7f454c46(a)gmail.com>
On some systems the processor thermal device interrupt is shared with
other PCI devices. In this case return IRQ_NONE from the interrupt
handler when the interrupt is not for the processor thermal device.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada(a)linux.intel.com>
Fixes: f0658708e863 ("thermal: int340x: processor_thermal: Use non MSI interrupts by default")
Cc: <stable(a)vger.kernel.org> # v6.7+
---
This was only observed on a non production system. So not urgent.
.../intel/int340x_thermal/processor_thermal_device_pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/thermal/intel/int340x_thermal/processor_thermal_device_pci.c b/drivers/thermal/intel/int340x_thermal/processor_thermal_device_pci.c
index 14e34eabc419..4a1bfebb1b8e 100644
--- a/drivers/thermal/intel/int340x_thermal/processor_thermal_device_pci.c
+++ b/drivers/thermal/intel/int340x_thermal/processor_thermal_device_pci.c
@@ -150,7 +150,7 @@ static irqreturn_t proc_thermal_irq_handler(int irq, void *devid)
{
struct proc_thermal_pci *pci_info = devid;
struct proc_thermal_device *proc_priv;
- int ret = IRQ_HANDLED;
+ int ret = IRQ_NONE;
u32 status;
proc_priv = pci_info->proc_priv;
@@ -175,6 +175,7 @@ static irqreturn_t proc_thermal_irq_handler(int irq, void *devid)
/* Disable enable interrupt flag */
proc_thermal_mmio_write(pci_info, PROC_THERMAL_MMIO_INT_ENABLE_0, 0);
pkg_thermal_schedule_work(&pci_info->work);
+ ret = IRQ_HANDLED;
}
pci_write_config_byte(pci_info->pdev, 0xdc, 0x01);
--
2.44.0