The flapping-irq detector still has a timebomb.
A pathological workload or test script can arm the spurious-irq
timebomb described in
4f27c00bf80f ("Improve behaviour of spurious IRQ detect").
This leads to irqs being moved to the much slower polled mode,
despite the actual unhandled-irq rate being well under the
99.9k/100k threshold that the code appears to check.
How?
- A queued completion handler, like nvme's, services events as they
  appear in the queue, even if the irq corresponding to an event has
  not yet been seen.
- The queues are frequently empty, so "spurious" irqs show up whenever
  the last events of a threaded handler's
      while (events_queued()) process_them();
  loop are consumed while their irqs are still being posted. In that
  case the while() has already consumed those last event(s), so the
  next handler invocation returns IRQ_NONE.
- In each run of "unhandled" irqs, exactly one IRQ_NONE response is
  promoted to IRQ_HANDLED by note_interrupt()'s SPURIOUS_DEFERRED
  logic.
- Any run of two or more unhandled irqs will therefore increment
  irqs_unhandled. The time_after() check in note_interrupt() resets
  irqs_unhandled to 1 after an idle period, but if irqs are never
  spaced more than HZ/10 apart, irqs_unhandled keeps growing.
- While long completion queues are being processed, the non-threaded
  handlers return IRQ_WAKE_THREAD for potentially thousands of
  per-event irqs. These bypass note_interrupt()'s irq_count++ logic,
  so they are not counted as handled and never invoke the flapping-irq
  logic.
- When the _counted_ irq_count reaches the 100k threshold,
it's possible for irqs_unhandled > 99.9k to force a move
to polling mode, even though many millions of _WAKE_THREAD
irqs have been handled without being counted.
Solution: include IRQ_WAKE_THREAD events in irq_count.
Only when IRQ_NONE responses outweigh (IRQ_HANDLED + IRQ_WAKE_THREAD)
by the old 99.9k-in-100k ratio will an irq be moved to polling mode.
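For reviewers, a toy userspace illustration of the accounting change
(not kernel code; the traffic numbers are invented, and the
SPURIOUS_DEFERRED promotion and the HZ/10 idle reset are deliberately
omitted) shows how the denominator shifts:

  #include <stdio.h>

  int main(void)
  {
      long wake_thread = 5000000; /* per-event irqs, handled by the thread */
      long handled     = 50;      /* hardirq-handled, counted today */
      long unhandled   = 99950;   /* "spurious" tail irqs, queue already drained */

      long old_counted = handled + unhandled;               /* WAKE_THREAD bypasses irq_count */
      long new_counted = handled + unhandled + wake_thread; /* proposed accounting */

      printf("old: %ld/%ld unhandled (%.2f%%)\n",
             unhandled, old_counted, 100.0 * unhandled / old_counted);
      printf("new: %ld/%ld unhandled (%.3f%%)\n",
             unhandled, new_counted, 100.0 * unhandled / new_counted);
      return 0;
  }

Only the first ratio (99.95%) crosses the 99.9k-in-100k threshold; with
IRQ_WAKE_THREAD counted, the same traffic stays far below it.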
Fixes: 4f27c00bf80f ("Improve behaviour of spurious IRQ detect")
Cc: stable@vger.kernel.org
Signed-off-by: Pete Swain <swine@google.com>
---
kernel/irq/spurious.c | 68 +++++++++++++++++++++----------------------
1 file changed, 34 insertions(+), 34 deletions(-)
diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 02b2daf07441..ac596c8dc4b1 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -321,44 +321,44 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret)
*/
if (!(desc->threads_handled_last & SPURIOUS_DEFERRED)) {
desc->threads_handled_last |= SPURIOUS_DEFERRED;
- return;
- }
- /*
- * Check whether one of the threaded handlers
- * returned IRQ_HANDLED since the last
- * interrupt happened.
- *
- * For simplicity we just set bit 31, as it is
- * set in threads_handled_last as well. So we
- * avoid extra masking. And we really do not
- * care about the high bits of the handled
- * count. We just care about the count being
- * different than the one we saw before.
- */
- handled = atomic_read(&desc->threads_handled);
- handled |= SPURIOUS_DEFERRED;
- if (handled != desc->threads_handled_last) {
- action_ret = IRQ_HANDLED;
- /*
- * Note: We keep the SPURIOUS_DEFERRED
- * bit set. We are handling the
- * previous invocation right now.
- * Keep it for the current one, so the
- * next hardware interrupt will
- * account for it.
- */
- desc->threads_handled_last = handled;
} else {
/*
- * None of the threaded handlers felt
- * responsible for the last interrupt
+ * Check whether one of the threaded handlers
+ * returned IRQ_HANDLED since the last
+ * interrupt happened.
*
- * We keep the SPURIOUS_DEFERRED bit
- * set in threads_handled_last as we
- * need to account for the current
- * interrupt as well.
+ * For simplicity we just set bit 31, as it is
+ * set in threads_handled_last as well. So we
+ * avoid extra masking. And we really do not
+ * care about the high bits of the handled
+ * count. We just care about the count being
+ * different than the one we saw before.
*/
- action_ret = IRQ_NONE;
+ handled = atomic_read(&desc->threads_handled);
+ handled |= SPURIOUS_DEFERRED;
+ if (handled != desc->threads_handled_last) {
+ action_ret = IRQ_HANDLED;
+ /*
+ * Note: We keep the SPURIOUS_DEFERRED
+ * bit set. We are handling the
+ * previous invocation right now.
+ * Keep it for the current one, so the
+ * next hardware interrupt will
+ * account for it.
+ */
+ desc->threads_handled_last = handled;
+ } else {
+ /*
+ * None of the threaded handlers felt
+ * responsible for the last interrupt
+ *
+ * We keep the SPURIOUS_DEFERRED bit
+ * set in threads_handled_last as we
+ * need to account for the current
+ * interrupt as well.
+ */
+ action_ret = IRQ_NONE;
+ }
}
} else {
/*
--
2.45.2.627.g7a2c4fd464-goog
When the source address is selected, its scope must be checked. For
example, if a loopback address is assigned to the vrf device, it must not
be chosen for packets sent outside the box.
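A rough sketch of the rule being enforced (a standalone simplification,
not the kernel's ipv6_dev_get_saddr() scoring):

  #include <stdbool.h>
  #include <stdio.h>

  /* RFC 4007-style ordering: a larger value means a wider scope. */
  enum scope { SCOPE_HOST = 1, SCOPE_LINK = 2, SCOPE_SITE = 5, SCOPE_GLOBAL = 14 };

  /* A candidate source is only acceptable if its scope covers the
   * destination's scope. */
  static bool scope_ok(enum scope candidate, enum scope dst)
  {
      return candidate >= dst;
  }

  int main(void)
  {
      /* ::1 assigned to the vrf device vs. an off-box (global) destination. */
      printf("host-scope source for global dst: %s\n",
             scope_ok(SCOPE_HOST, SCOPE_GLOBAL) ? "usable" : "must be skipped");
      printf("global source for global dst: %s\n",
             scope_ok(SCOPE_GLOBAL, SCOPE_GLOBAL) ? "usable" : "must be skipped");
      return 0;
  }
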
CC: stable@vger.kernel.org
Fixes: afbac6010aec ("net: ipv6: Address selection needs to consider L3 domains")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
net/ipv6/addrconf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5c424a0e7232..4f2c5cc31015 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1873,7 +1873,8 @@ int ipv6_dev_get_saddr(struct net *net, const struct net_device *dst_dev,
master, &dst,
scores, hiscore_idx);
- if (scores[hiscore_idx].ifa)
+ if (scores[hiscore_idx].ifa &&
+ scores[hiscore_idx].scopedist >= 0)
goto out;
}
--
2.43.1
By default, an address assigned to the output interface is selected when
the source address is not specified. This is problematic when a route,
configured in a vrf, uses an interface from another vrf (aka route leak).
The original vrf does not own the selected source address.
Let's add a check against the output interface and call the appropriate
function to select the source address.
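The intended decision, restated as a standalone sketch (hypothetical
types and helpers for illustration only; the real change is the
fib_select_path() hunk below):

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical, simplified view of the devices involved. */
  struct netdev {
      const char *name;
      const char *vrf;    /* NULL: not enslaved to any VRF */
  };

  static const char *pick_saddr(const struct netdev *out_dev, const char *flow_vrf,
                                const char *route_prefsrc, const char *vrf_addr)
  {
      /* No VRF in play, or the nexthop device belongs to the VRF doing the
       * lookup: the route's preferred source is owned by that VRF. */
      if (!flow_vrf || (out_dev->vrf && !strcmp(out_dev->vrf, flow_vrf)))
          return route_prefsrc;

      /* Route leak: the nexthop device lives in another VRF, so fall back
       * to an address owned by the originating VRF device. */
      return vrf_addr;
  }

  int main(void)
  {
      struct netdev leak_dev = { .name = "eth1", .vrf = "blue" };

      /* Lookup done in VRF "red"; the route leaks out via "blue"'s eth1. */
      printf("selected saddr: %s\n",
             pick_saddr(&leak_dev, "red", "10.0.0.1", "192.168.0.1"));
      return 0;
  }
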
CC: stable@vger.kernel.org
Fixes: 8cbb512c923d ("net: Add source address lookup op for VRF")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
net/ipv4/fib_semantics.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f669da98d11d..459082f4936d 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -2270,6 +2270,13 @@ void fib_select_path(struct net *net, struct fib_result *res,
fib_select_default(fl4, res);
check_saddr:
- if (!fl4->saddr)
- fl4->saddr = fib_result_prefsrc(net, res);
+ if (!fl4->saddr) {
+ struct net_device *l3mdev = dev_get_by_index_rcu(net, fl4->flowi4_l3mdev);
+
+ if (!l3mdev ||
+ l3mdev_master_dev_rcu(FIB_RES_DEV(*res)) == l3mdev)
+ fl4->saddr = fib_result_prefsrc(net, res);
+ else
+ fl4->saddr = inet_select_addr(l3mdev, 0, RT_SCOPE_LINK);
+ }
}
--
2.43.1
These igc bug fixes are sent as a patch series because:
1. The two patches below remove the reliance on the qbv_count field:
"igc: Fix qbv_config_change_errors logics"
"igc: Fix reset adapter logics when tx mode change"
The qbv_count field will be removed in a future patch via iwl-next.
2. The patch "igc: Fix qbv tx latency by setting gtxoffset" reuses the
function igc_tsn_will_tx_mode_change(), which was created in the patch
"igc: Fix reset adapter logics when tx mode change".
v1: https://patchwork.kernel.org/project/netdevbpf/cover/20240702040926.3327530…
Changelog:
v1 -> v2
- Instead of casting to bool, use !! (Simon)
- Simplify the new functions: instead of "if ... return true; else return false;",
use a single return. (Simon)
- Remove the patch "igc: Remove unused qbv_coun" from this series, which targets
iwl-net; that patch will be sent to iwl-next instead. (Simon)
Faizal Rahim (3):
igc: Fix qbv_config_change_errors logics
igc: Fix reset adapter logics when tx mode change
igc: Fix qbv tx latency by setting gtxoffset
drivers/net/ethernet/intel/igc/igc_main.c | 8 +++--
drivers/net/ethernet/intel/igc/igc_tsn.c | 41 ++++++++++++++++-------
drivers/net/ethernet/intel/igc/igc_tsn.h | 1 +
3 files changed, 36 insertions(+), 14 deletions(-)
--
2.25.1
Since the introduction of mTHP, the documentation has stated that
khugepaged would be enabled when any mTHP size is enabled, and disabled
when all mTHP sizes are disabled. There are 2 problems with this: 1)
this is not what the code actually implements, and 2) it is not the
desirable behavior.
Desirable behavior is for khugepaged to be enabled when any PMD-sized
THP is enabled, anon or file. (Note that file THP is still controlled by
the top-level control so we must always consider that, as well as the
PMD-size mTHP control for anon). khugepaged only supports collapsing to
PMD-sized THP so there is no value in enabling it when PMD-sized THP is
disabled. So let's change the code and documentation to reflect this
policy.
Further, per-size enabled control modification events were not
previously forwarded to khugepaged to give it an opportunity to start or
stop. Consequently the following resulted in khugepaged erroneously
not being activated:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface")
Closes: https://lore.kernel.org/linux-mm/7a0bbe69-1e3d-4263-b206-da007791a5c4@redha…
Cc: stable@vger.kernel.org
---
Hi All,
Applies on top of mm-unstable from a couple of days ago (9bb8753acdd8). No
regressions observed in mm selftests.
When fixing this I also noticed that khugepaged is not (and never has been)
activated/deactivated by `shmem_enabled=`. I've concluded that this is
definitely a (separate) bug, but I'm waiting for the conclusion of the
conversation at [3] before fixing it, so that fix will be sent separately.
Changes since v1 [1]
====================
- hugepage_pmd_enabled() now considers CONFIG_READ_ONLY_THP_FOR_FS as part of
decision; means that for kernels without this config, khugepaged will not be
started when only the top-level control is enabled.
Changes since v2 [2]
====================
- Make hugepage_pmd_enabled() out-of-line static in khugepaged.c (per Andrew)
- Refactor hugepage_pmd_enabled() for better readability (per Andrew)
[1] https://lore.kernel.org/linux-mm/20240702144617.2291480-1-ryan.roberts@arm.…
[2] https://lore.kernel.org/linux-mm/20240704091051.2411934-1-ryan.roberts@arm.…
[3] https://lore.kernel.org/linux-mm/65c37315-2741-481f-b433-cec35ef1af35@arm.c…
Thanks,
Ryan
Documentation/admin-guide/mm/transhuge.rst | 11 ++++----
include/linux/huge_mm.h | 12 --------
mm/huge_memory.c | 7 +++++
mm/khugepaged.c | 33 +++++++++++++++++-----
4 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 709fe10b60f4..fc321d40b8ac 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -202,12 +202,11 @@ PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
-khugepaged will be automatically started when one or more hugepage
-sizes are enabled (either by directly setting "always" or "madvise",
-or by setting "inherit" while the top-level enabled is set to "always"
-or "madvise"), and it'll be automatically shutdown when the last
-hugepage size is disabled (either by directly setting "never", or by
-setting "inherit" while the top-level enabled is set to "never").
+khugepaged will be automatically started when PMD-sized THP is enabled
+(either of the per-size anon control or the top-level control are set
+to "always" or "madvise"), and it'll be automatically shutdown when
+PMD-sized THP is disabled (when both the per-size anon control and the
+top-level control are "never")
Khugepaged controls
-------------------
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4d155c7a4792..107da5c4eba4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,18 +128,6 @@ static inline bool hugepage_global_always(void)
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
-static inline bool hugepage_flags_enabled(void)
-{
- /*
- * We cover both the anon and the file-backed case here; we must return
- * true if globally enabled, even when all anon sizes are set to never.
- * So we don't need to look at huge_anon_orders_inherit.
- */
- return hugepage_global_enabled() ||
- READ_ONCE(huge_anon_orders_always) ||
- READ_ONCE(huge_anon_orders_madvise);
-}
-
static inline int highest_order(unsigned long orders)
{
return fls_long(orders) - 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 251d6932130f..085f5e973231 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -502,6 +502,13 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj,
} else
ret = -EINVAL;
+ if (ret > 0) {
+ int err;
+
+ err = start_stop_khugepaged();
+ if (err)
+ ret = err;
+ }
return ret;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 409f67a817f1..a5ec03ef8722 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -413,6 +413,26 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
test_bit(MMF_DISABLE_THP, &mm->flags);
}
+static bool hugepage_pmd_enabled(void)
+{
+ /*
+ * We cover both the anon and the file-backed case here; file-backed
+ * hugepages, when configured in, are determined by the global control.
+ * Anon pmd-sized hugepages are determined by the pmd-size control.
+ */
+ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+ hugepage_global_enabled())
+ return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+ return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+ return true;
+ if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ hugepage_global_enabled())
+ return true;
+ return false;
+}
+
void __khugepaged_enter(struct mm_struct *mm)
{
struct khugepaged_mm_slot *mm_slot;
@@ -449,7 +469,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
- hugepage_flags_enabled()) {
+ hugepage_pmd_enabled()) {
if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
PMD_ORDER))
__khugepaged_enter(vma->vm_mm);
@@ -2462,8 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) &&
- hugepage_flags_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
}
static int khugepaged_wait_event(void)
@@ -2536,7 +2555,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_flags_enabled())
+ if (hugepage_pmd_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2567,7 +2586,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_flags_enabled()) {
+ if (!hugepage_pmd_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2617,7 +2636,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_flags_enabled()) {
+ if (hugepage_pmd_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2643,7 +2662,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_flags_enabled() && khugepaged_thread)
+ if (hugepage_pmd_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.43.0