On Sun, Jul 07, 2024 at 10:55:51AM -0400, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> wifi: cfg80211: restrict NL80211_ATTR_TXQ_QUANTUM values
>
> to the 6.6-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> wifi-cfg80211-restrict-nl80211_attr_txq_quantum-valu.patch
> and it can be found in the queue-6.6 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit 0014eb2dd000fba5b30a3cb883b750bd344f050d
> Author: Eric Dumazet <edumazet(a)google.com>
> Date: Sat Jun 15 16:08:00 2024 +0000
>
> wifi: cfg80211: restrict NL80211_ATTR_TXQ_QUANTUM values
>
> [ Upstream commit d1cba2ea8121e7fdbe1328cea782876b1dd80993 ]
Breaks the build, so I've dropped it now.
On Fri, Jul 05, 2024 at 03:34:04PM -0400, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> drm/amdgpu: fix the warning about the expression (int)size - len
>
> to the 6.1-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> drm-amdgpu-fix-the-warning-about-the-expression-int-.patch
> and it can be found in the queue-6.1 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
>
>
>
> commit c71e1d31c7d6735e32dcfd0043970d9fadb00b82
> Author: Jesse Zhang <jesse.zhang(a)amd.com>
> Date: Thu Apr 25 15:16:40 2024 +0800
>
> drm/amdgpu: fix the warning about the expression (int)size - len
>
> [ Upstream commit ea686fef5489ef7a2450a9fdbcc732b837fb46a8 ]
>
> Converting size from size_t to int may overflow.
> v2: keep reverse xmas tree order (Christian)
>
> Signed-off-by: Jesse Zhang <jesse.zhang(a)amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Nope, this breaks the build on 6.1, which is kind of worse than fixing a
build warning :(
I'll go drop this now.
thanks,
greg k-h
From: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Since balancing mode was added in
bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes"),
it was possible to set this mode but it wouldn't be shown in
/proc/<pid>/numa_maps since there was no support for it in the
mpol_to_str() helper.
Furthermore, because the balancing mode sets the MPOL_F_MORON flag, it
would be displayed as 'default' due a workaround introduced a few years
earlier in
8790c71a18e5 ("mm/mempolicy.c: fix mempolicy printing in numa_maps").
To tidy this up we implement two changes:
Replace the MPOL_F_MORON check by pointer comparison against the
preferred_node_policy array. By doing this we generalise the current
special casing and replace the incorrect 'default' with the correct
'bind' for the mode.
Secondly, we add a string representation and corresponding handling for
the MPOL_F_NUMA_BALANCING flag.
With the two changes together we start showing the balancing flag when it
is set and therefore complete the fix.
Representation format chosen is to separate multiple flags with vertical
bars, following what existed long time ago in kernel 2.6.25. But as
between then and now there wasn't a way to display multiple flags, this
patch does not change the format in practice.
Some /proc/<pid>/numa_maps output examples:
555559580000 bind=balancing:0-1,3 file=...
555585800000 bind=balancing|static:0,2 file=...
555635240000 prefer=relative:0 file=
v2:
* Fully fix by introducing MPOL_F_KERNEL.
v3:
* Abandoned the MPOL_F_KERNEL approach in favour of pointer comparisons.
* Removed lookup generalisation for easier backporting.
* Replaced commas as separator with vertical bars.
* Added a few more words about the string format in the commit message.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com>
Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes")
References: 8790c71a18e5 ("mm/mempolicy.c: fix mempolicy printing in numa_maps")
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Andi Kleen <ak(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: David Rientjes <rientjes(a)google.com>
Cc: <stable(a)vger.kernel.org> # v5.12+
---
mm/mempolicy.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aec756ae5637..1bfb6c73a39c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -3293,8 +3293,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
* @pol: pointer to mempolicy to be formatted
*
* Convert @pol into a string. If @buffer is too short, truncate the string.
- * Recommend a @maxlen of at least 32 for the longest mode, "interleave", the
- * longest flag, "relative", and to display at least a few node ids.
+ * Recommend a @maxlen of at least 42 for the longest mode, "weighted
+ * interleave", the longest flag, "balancing", and to display at least a few
+ * node ids.
*/
void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
{
@@ -3303,7 +3304,10 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
unsigned short mode = MPOL_DEFAULT;
unsigned short flags = 0;
- if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MORON)) {
+ if (pol &&
+ pol != &default_policy &&
+ !(pol >= &preferred_node_policy[0] &&
+ pol <= &preferred_node_policy[MAX_NUMNODES - 1])) {
mode = pol->mode;
flags = pol->flags;
}
@@ -3331,12 +3335,18 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
p += snprintf(p, buffer + maxlen - p, "=");
/*
- * Currently, the only defined flags are mutually exclusive
+ * Static and relative are mutually exclusive.
*/
if (flags & MPOL_F_STATIC_NODES)
p += snprintf(p, buffer + maxlen - p, "static");
else if (flags & MPOL_F_RELATIVE_NODES)
p += snprintf(p, buffer + maxlen - p, "relative");
+
+ if (flags & MPOL_F_NUMA_BALANCING) {
+ if (hweight16(flags & MPOL_MODE_FLAGS) > 1)
+ p += snprintf(p, buffer + maxlen - p, "|");
+ p += snprintf(p, buffer + maxlen - p, "balancing");
+ }
}
if (!nodes_empty(nodes))
--
2.44.0
The page cache of the atomic file keeps new data pages which will be
stored in the COW file. It can also keep old data pages when GCing the
atomic file. In this case, new data can be overwritten by old data if a
GC thread sets the old data page as dirty after new data page was
evicted.
Also, since all writes to the atomic file are redirected to COW inodes,
GC for the atomic file is not working well as below.
f2fs_gc(gc_type=FG_GC)
- select A as a victim segment
do_garbage_collect
- iget atomic file's inode for block B
move_data_page
f2fs_do_write_data_page
- use dn of cow inode
- set fio->old_blkaddr from cow inode
- seg_freed is 0 since block B is still valid
- goto gc_more and A is selected as victim again
To solve the problem, let's separate GC writes and updates in the atomic
file by using the meta inode for GC writes.
Fixes: 3db1de0e582c ("f2fs: change the current atomic write way")
Cc: stable(a)vger.kernel.org #v5.19+
Reviewed-by: Sungjong Seo <sj1557.seo(a)samsung.com>
Reviewed-by: Yeongjin Gil <youngjin.gil(a)samsung.com>
Signed-off-by: Sunmin Jeong <s_min.jeong(a)samsung.com>
---
v2:
- replace post_read to meta_gc
fs/f2fs/data.c | 4 ++--
fs/f2fs/f2fs.h | 7 ++++++-
fs/f2fs/gc.c | 6 +++---
fs/f2fs/segment.c | 6 +++---
4 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b6dcb3bcaef7..9a213d03005d 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2693,7 +2693,7 @@ int f2fs_do_write_data_page(struct f2fs_io_info *fio)
}
/* wait for GCed page writeback via META_MAPPING */
- if (fio->post_read)
+ if (fio->meta_gc)
f2fs_wait_on_block_writeback(inode, fio->old_blkaddr);
/*
@@ -2788,7 +2788,7 @@ int f2fs_write_single_data_page(struct page *page, int *submitted,
.submitted = 0,
.compr_blocks = compr_blocks,
.need_lock = compr_blocks ? LOCK_DONE : LOCK_RETRY,
- .post_read = f2fs_post_read_required(inode) ? 1 : 0,
+ .meta_gc = f2fs_meta_inode_gc_required(inode) ? 1 : 0,
.io_type = io_type,
.io_wbc = wbc,
.bio = bio,
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index f7ee6c5e371e..796ae11c0fa3 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1211,7 +1211,7 @@ struct f2fs_io_info {
unsigned int in_list:1; /* indicate fio is in io_list */
unsigned int is_por:1; /* indicate IO is from recovery or not */
unsigned int encrypted:1; /* indicate file is encrypted */
- unsigned int post_read:1; /* require post read */
+ unsigned int meta_gc:1; /* require meta inode GC */
enum iostat_type io_type; /* io type */
struct writeback_control *io_wbc; /* writeback control */
struct bio **bio; /* bio for ipu */
@@ -4263,6 +4263,11 @@ static inline bool f2fs_post_read_required(struct inode *inode)
f2fs_compressed_file(inode);
}
+static inline bool f2fs_meta_inode_gc_required(struct inode *inode)
+{
+ return f2fs_post_read_required(inode) || f2fs_is_atomic_file(inode);
+}
+
/*
* compress.c
*/
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index ef667fec9a12..cb3006551ab5 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -1589,7 +1589,7 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
start_bidx = f2fs_start_bidx_of_node(nofs, inode) +
ofs_in_node;
- if (f2fs_post_read_required(inode)) {
+ if (f2fs_meta_inode_gc_required(inode)) {
int err = ra_data_block(inode, start_bidx);
f2fs_up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
@@ -1640,7 +1640,7 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
start_bidx = f2fs_start_bidx_of_node(nofs, inode)
+ ofs_in_node;
- if (f2fs_post_read_required(inode))
+ if (f2fs_meta_inode_gc_required(inode))
err = move_data_block(inode, start_bidx,
gc_type, segno, off);
else
@@ -1648,7 +1648,7 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
segno, off);
if (!err && (gc_type == FG_GC ||
- f2fs_post_read_required(inode)))
+ f2fs_meta_inode_gc_required(inode)))
submitted++;
if (locked) {
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 4db1add43e36..77ef46b384b4 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3851,7 +3851,7 @@ int f2fs_inplace_write_data(struct f2fs_io_info *fio)
goto drop_bio;
}
- if (fio->post_read)
+ if (fio->meta_gc)
f2fs_truncate_meta_inode_pages(sbi, fio->new_blkaddr, 1);
stat_inc_inplace_blocks(fio->sbi);
@@ -4021,7 +4021,7 @@ void f2fs_wait_on_block_writeback(struct inode *inode, block_t blkaddr)
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
struct page *cpage;
- if (!f2fs_post_read_required(inode))
+ if (!f2fs_meta_inode_gc_required(inode))
return;
if (!__is_valid_data_blkaddr(blkaddr))
@@ -4040,7 +4040,7 @@ void f2fs_wait_on_block_writeback_range(struct inode *inode, block_t blkaddr,
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
block_t i;
- if (!f2fs_post_read_required(inode))
+ if (!f2fs_meta_inode_gc_required(inode))
return;
for (i = 0; i < len; i++)
--
2.25.1
The flapping-irq detector still has a timebomb.
A pathological workload, or test script,
can arm the spurious-irq timebomb described in
4f27c00bf80f ("Improve behaviour of spurious IRQ detect")
This leads to irqs being moved the much slower polled mode,
despite the actual unhandled-irq rate being well under the
99.9k/100k threshold that the code appears to check.
How?
- Queued completion handler, like nvme, servicing events
as they appear in the queue, even if the irq corresponding
to the event has not yet been seen.
- queues frequently empty, so seeing "spurious" irqs
whenever the last events of a threaded handler's
while (events_queued()) process_them();
ends with those events' irqs posted while thread was scanning.
In this case the while() has consumed last event(s),
so next handler says IRQ_NONE.
- In each run of "unhandled" irqs, exactly one IRQ_NONE response
is promoted from IRQ_NONE to IRQ_HANDLED, by note_interrupt()'s
SPURIOUS_DEFERRED logic.
- Any 2+ unhandled-irq runs will increment irqs_unhandled.
The time_after() check in note_interrupt() resets irqs_unhandled
to 1 after an idle period, but if irqs are never spaced more
than HZ/10 apart, irqs_unhandled keeps growing.
- During processing of long completion queues, the non-threaded
handlers will return IRQ_WAKE_THREAD, for potentially thousands
of per-event irqs. These bypass note_interrupt()'s irq_count++ logic,
so do not count as handled, and do not invoke the flapping-irq
logic.
- When the _counted_ irq_count reaches the 100k threshold,
it's possible for irqs_unhandled > 99.9k to force a move
to polling mode, even though many millions of _WAKE_THREAD
irqs have been handled without being counted.
Solution: include IRQ_WAKE_THREAD events in irq_count.
Only when IRQ_NONE responses outweigh (IRQ_HANDLED + IRQ_WAKE_THREAD)
by the old 99:1 ratio will an irq be moved to polling mode.
Fixes: 4f27c00bf80f ("Improve behaviour of spurious IRQ detect")
Cc: stable(a)vger.kernel.org
Signed-off-by: Pete Swain <swine(a)google.com>
---
kernel/irq/spurious.c | 68 +++++++++++++++++++++----------------------
1 file changed, 34 insertions(+), 34 deletions(-)
diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 02b2daf07441..ac596c8dc4b1 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -321,44 +321,44 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
*/
if (!(desc->threads_handled_last & SPURIOUS_DEFERRED)) {
desc->threads_handled_last |= SPURIOUS_DEFERRED;
- return;
- }
- /*
- * Check whether one of the threaded handlers
- * returned IRQ_HANDLED since the last
- * interrupt happened.
- *
- * For simplicity we just set bit 31, as it is
- * set in threads_handled_last as well. So we
- * avoid extra masking. And we really do not
- * care about the high bits of the handled
- * count. We just care about the count being
- * different than the one we saw before.
- */
- handled = atomic_read(&desc->threads_handled);
- handled |= SPURIOUS_DEFERRED;
- if (handled != desc->threads_handled_last) {
- action_ret = IRQ_HANDLED;
- /*
- * Note: We keep the SPURIOUS_DEFERRED
- * bit set. We are handling the
- * previous invocation right now.
- * Keep it for the current one, so the
- * next hardware interrupt will
- * account for it.
- */
- desc->threads_handled_last = handled;
} else {
/*
- * None of the threaded handlers felt
- * responsible for the last interrupt
+ * Check whether one of the threaded handlers
+ * returned IRQ_HANDLED since the last
+ * interrupt happened.
*
- * We keep the SPURIOUS_DEFERRED bit
- * set in threads_handled_last as we
- * need to account for the current
- * interrupt as well.
+ * For simplicity we just set bit 31, as it is
+ * set in threads_handled_last as well. So we
+ * avoid extra masking. And we really do not
+ * care about the high bits of the handled
+ * count. We just care about the count being
+ * different than the one we saw before.
*/
- action_ret = IRQ_NONE;
+ handled = atomic_read(&desc->threads_handled);
+ handled |= SPURIOUS_DEFERRED;
+ if (handled != desc->threads_handled_last) {
+ action_ret = IRQ_HANDLED;
+ /*
+ * Note: We keep the SPURIOUS_DEFERRED
+ * bit set. We are handling the
+ * previous invocation right now.
+ * Keep it for the current one, so the
+ * next hardware interrupt will
+ * account for it.
+ */
+ desc->threads_handled_last = handled;
+ } else {
+ /*
+ * None of the threaded handlers felt
+ * responsible for the last interrupt
+ *
+ * We keep the SPURIOUS_DEFERRED bit
+ * set in threads_handled_last as we
+ * need to account for the current
+ * interrupt as well.
+ */
+ action_ret = IRQ_NONE;
+ }
}
} else {
/*
--
2.45.2.627.g7a2c4fd464-goog
When the source address is selected, the scope must be checked. For
example, if a loopback address is assigned to the vrf device, it must not
be chosen for packets sent outside.
CC: stable(a)vger.kernel.org
Fixes: afbac6010aec ("net: ipv6: Address selection needs to consider L3 domains")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
---
net/ipv6/addrconf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5c424a0e7232..4f2c5cc31015 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1873,7 +1873,8 @@ int ipv6_dev_get_saddr(struct net *net, const struct net_device *dst_dev,
master, &dst,
scores, hiscore_idx);
- if (scores[hiscore_idx].ifa)
+ if (scores[hiscore_idx].ifa &&
+ scores[hiscore_idx].scopedist >= 0)
goto out;
}
--
2.43.1
Since the introduction of mTHP, the docuementation has stated that
khugepaged would be enabled when any mTHP size is enabled, and disabled
when all mTHP sizes are disabled. There are 2 problems with this; 1.
this is not what was implemented by the code and 2. this is not the
desirable behavior.
Desirable behavior is for khugepaged to be enabled when any PMD-sized
THP is enabled, anon or file. (Note that file THP is still controlled by
the top-level control so we must always consider that, as well as the
PMD-size mTHP control for anon). khugepaged only supports collapsing to
PMD-sized THP so there is no value in enabling it when PMD-sized THP is
disabled. So let's change the code and documentation to reflect this
policy.
Further, per-size enabled control modification events were not
previously forwarded to khugepaged to give it an opportunity to start or
stop. Consequently the following was resulting in khugepaged eroneously
not being activated:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface")
Closes: https://lore.kernel.org/linux-mm/7a0bbe69-1e3d-4263-b206-da007791a5c4@redha…
Cc: stable(a)vger.kernel.org
---
Hi All,
Applies on top of mm-unstable from a couple of days ago (9bb8753acdd8). No
regressions observed in mm selftests.
When fixing this I also noticed that khugepaged doesn't get (and never has been)
activated/deactivated by `shmem_enabled=`. I've concluded that this is
definitely a (separate) bug. But I'm waiting for the conclusion of the
conversation at [3] before fixing, so will send separately.
Changes since v1 [1]
====================
- hugepage_pmd_enabled() now considers CONFIG_READ_ONLY_THP_FOR_FS as part of
decision; means that for kernels without this config, khugepaged will not be
started when only the top-level control is enabled.
Changes since v2 [2]
====================
- Make hugepage_pmd_enabled() out-of-line static in khugepaged.c (per Andrew)
- Refactor hugepage_pmd_enabled() for better readability (per Andrew)
[1] https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.…
[2] https://lore.kernel.org/linux-mm/20240704091051.2411934-1-ryan.roberts@arm.…
[3] https://lore.kernel.org/linux-mm/65c37315-2741-481f-b433-cec35ef1af35@arm.c…
Thanks,
Ryan
Documentation/admin-guide/mm/transhuge.rst | 11 ++++----
include/linux/huge_mm.h | 12 --------
mm/huge_memory.c | 7 +++++
mm/khugepaged.c | 33 +++++++++++++++++-----
4 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 709fe10b60f4..fc321d40b8ac 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -202,12 +202,11 @@ PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
-khugepaged will be automatically started when one or more hugepage
-sizes are enabled (either by directly setting "always" or "madvise",
-or by setting "inherit" while the top-level enabled is set to "always"
-or "madvise"), and it'll be automatically shutdown when the last
-hugepage size is disabled (either by directly setting "never", or by
-setting "inherit" while the top-level enabled is set to "never").
+khugepaged will be automatically started when PMD-sized THP is enabled
+(either of the per-size anon control or the top-level control are set
+to "always" or "madvise"), and it'll be automatically shutdown when
+PMD-sized THP is disabled (when both the per-size anon control and the
+top-level control are "never")
Khugepaged controls
-------------------
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d155c7a4792..107da5c4eba4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,18 +128,6 @@ static inline bool hugepage_global_always(void)
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
-static inline bool hugepage_flags_enabled(void)
-{
- /*
- * We cover both the anon and the file-backed case here; we must return
- * true if globally enabled, even when all anon sizes are set to never.
- * So we don't need to look at huge_anon_orders_inherit.
- */
- return hugepage_global_enabled() ||
- READ_ONCE(huge_anon_orders_always) ||
- READ_ONCE(huge_anon_orders_madvise);
-}
-
static inline int highest_order(unsigned long orders)
{
return fls_long(orders) - 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 251d6932130f..085f5e973231 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -502,6 +502,13 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj,
} else
ret = -EINVAL;
+ if (ret > 0) {
+ int err;
+
+ err = start_stop_khugepaged();
+ if (err)
+ ret = err;
+ }
return ret;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 409f67a817f1..a5ec03ef8722 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -413,6 +413,26 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
test_bit(MMF_DISABLE_THP, &mm->flags);
}
+static bool hugepage_pmd_enabled(void)
+{
+ /*
+ * We cover both the anon and the file-backed case here; file-backed
+ * hugepages, when configured in, are determined by the global control.
+ * Anon pmd-sized hugepages are determined by the pmd-size control.
+ */
+ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+ hugepage_global_enabled())
+ return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+ return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+ return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ hugepage_global_enabled())
+ return true;
+ return false;
+}
+
void __khugepaged_enter(struct mm_struct *mm)
{
struct khugepaged_mm_slot *mm_slot;
@@ -449,7 +469,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
- hugepage_flags_enabled()) {
+ hugepage_pmd_enabled()) {
if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
PMD_ORDER))
__khugepaged_enter(vma->vm_mm);
@@ -2462,8 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) &&
- hugepage_flags_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
}
static int khugepaged_wait_event(void)
@@ -2536,7 +2555,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_flags_enabled())
+ if (hugepage_pmd_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2567,7 +2586,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_flags_enabled()) {
+ if (!hugepage_pmd_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2617,7 +2636,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_flags_enabled()) {
+ if (hugepage_pmd_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2643,7 +2662,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_flags_enabled() && khugepaged_thread)
+ if (hugepage_pmd_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.43.0
The quilt patch titled
Subject: mm: fix crashes from deferred split racing folio migration
has been removed from the -mm tree. Its filename was
mm-fix-crashes-from-deferred-split-racing-folio-migration.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: fix crashes from deferred split racing folio migration
Date: Tue, 2 Jul 2024 00:40:55 -0700 (PDT)
Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.
6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.
Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
while the old folio's reference count is temporarily frozen to 0 - adding
such a freeze in the !mapping case too (originally, folio lock and
unmapping and no swap cache left an anon folio unreachable, so no freezing
was needed there: but the deferred split queue offers a way to reach it).
Link: https://lkml.kernel.org/r/29c83d1a-11ca-b6c9-f92e-6ccb322af510@google.com
Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reviewed-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 11 -----------
mm/migrate.c | 13 +++++++++++++
2 files changed, 13 insertions(+), 11 deletions(-)
--- a/mm/memcontrol.c~mm-fix-crashes-from-deferred-split-racing-folio-migration
+++ a/mm/memcontrol.c
@@ -7823,17 +7823,6 @@ void mem_cgroup_migrate(struct folio *ol
/* Transfer the charge and the css ref */
commit_charge(new, memcg);
- /*
- * If the old folio is a large folio and is in the split queue, it needs
- * to be removed from the split queue now, in case getting an incorrect
- * split queue in destroy_large_folio() after the memcg of the old folio
- * is cleared.
- *
- * In addition, the old folio is about to be freed after migration, so
- * removing from the split queue a bit earlier seems reasonable.
- */
- if (folio_test_large(old) && folio_test_large_rmappable(old))
- folio_undo_large_rmappable(old);
old->memcg_data = 0;
}
--- a/mm/migrate.c~mm-fix-crashes-from-deferred-split-racing-folio-migration
+++ a/mm/migrate.c
@@ -415,6 +415,15 @@ int folio_migrate_mapping(struct address
if (folio_ref_count(folio) != expected_count)
return -EAGAIN;
+ /* Take off deferred split queue while frozen and memcg set */
+ if (folio_test_large(folio) &&
+ folio_test_large_rmappable(folio)) {
+ if (!folio_ref_freeze(folio, expected_count))
+ return -EAGAIN;
+ folio_undo_large_rmappable(folio);
+ folio_ref_unfreeze(folio, expected_count);
+ }
+
/* No turning back from here */
newfolio->index = folio->index;
newfolio->mapping = folio->mapping;
@@ -433,6 +442,10 @@ int folio_migrate_mapping(struct address
return -EAGAIN;
}
+ /* Take off deferred split queue while frozen and memcg set */
+ if (folio_test_large(folio) && folio_test_large_rmappable(folio))
+ folio_undo_large_rmappable(folio);
+
/*
* Now we know that no one else is looking at the folio:
* no turning back from here.
_
Patches currently in -mm which might be from hughd(a)google.com are