The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: d125d1349abeb46945dc5e98f7824bf688266f13
Gitweb: https://git.kernel.org/tip/d125d1349abeb46945dc5e98f7824bf688266f13
Author: Thomas Gleixner <tglx(a)linutronix.de>
AuthorDate: Thu, 09 Feb 2023 23:25:49 +01:00
Committer: Thomas Gleixner <tglx(a)linutronix.de>
CommitterDate: Tue, 14 Feb 2023 11:18:35 +01:00
alarmtimer: Prevent starvation by small intervals and SIG_IGN
syzbot reported a RCU stall which is caused by setting up an alarmtimer
with a very small interval and ignoring the signal. The reproducer arms the
alarm timer with a relative expiry of 8ns and an interval of 9ns. Not a
problem per se, but that's an issue when the signal is ignored because then
the timer is immediately rearmed because there is no way to delay that
rearming to the signal delivery path. See posix_timer_fn() and commit
58229a189942 ("posix-timers: Prevent softirq starvation by small intervals
and SIG_IGN") for details.
The reproducer does not set SIG_IGN explicitely, but it sets up the timers
signal with SIGCONT. That has the same effect as explicitely setting
SIG_IGN for a signal as SIGCONT is ignored if there is no handler set and
the task is not ptraced.
The log clearly shows that:
[pid 5102] --- SIGCONT {si_signo=SIGCONT, si_code=SI_TIMER, si_timerid=0, si_overrun=316014, si_int=0, si_ptr=NULL} ---
It works because the tasks are traced and therefore the signal is queued so
the tracer can see it, which delays the restart of the timer to the signal
delivery path. But then the tracer is killed:
[pid 5087] kill(-5102, SIGKILL <unfinished ...>
...
./strace-static-x86_64: Process 5107 detached
and after it's gone the stall can be observed:
syzkaller login: [ 79.439102][ C0] hrtimer: interrupt took 68471 ns
[ 184.460538][ C1] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
...
[ 184.658237][ C1] rcu: Stack dump where RCU GP kthread last ran:
[ 184.664574][ C1] Sending NMI from CPU 1 to CPUs 0:
[ 184.669821][ C0] NMI backtrace for cpu 0
[ 184.669831][ C0] CPU: 0 PID: 5108 Comm: syz-executor192 Not tainted 6.2.0-rc6-next-20230203-syzkaller #0
...
[ 184.670036][ C0] Call Trace:
[ 184.670041][ C0] <IRQ>
[ 184.670045][ C0] alarmtimer_fired+0x327/0x670
posix_timer_fn() prevents that by checking whether the interval for
timers which have the signal ignored is smaller than a jiffie and
artifically delay it by shifting the next expiry out by a jiffie. That's
accurate vs. the overrun accounting, but slightly inaccurate
vs. timer_gettimer(2).
The comment in that function says what needs to be done and there was a fix
available for the regular userspace induced SIG_IGN mechanism, but that did
not work due to the implicit ignore for SIGCONT and similar signals. This
needs to be worked on, but for now the only available workaround is to do
exactly what posix_timer_fn() does:
Increase the interval of self-rearming timers, which have their signal
ignored, to at least a jiffie.
Interestingly this has been fixed before via commit ff86bf0c65f1
("alarmtimer: Rate limit periodic intervals") already, but that fix got
lost in a later rework.
Reported-by: syzbot+b9564ba6e8e00694511b(a)syzkaller.appspotmail.com
Fixes: f2c45807d399 ("alarmtimer: Switch over to generic set/get/rearm routine")
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Acked-by: John Stultz <jstultz(a)google.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/87k00q1no2.ffs@tglx
---
kernel/time/alarmtimer.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 5897828..7e5dff6 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -470,11 +470,35 @@ u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval)
}
EXPORT_SYMBOL_GPL(alarm_forward);
-u64 alarm_forward_now(struct alarm *alarm, ktime_t interval)
+static u64 __alarm_forward_now(struct alarm *alarm, ktime_t interval, bool throttle)
{
struct alarm_base *base = &alarm_bases[alarm->type];
+ ktime_t now = base->get_ktime();
+
+ if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && throttle) {
+ /*
+ * Same issue as with posix_timer_fn(). Timers which are
+ * periodic but the signal is ignored can starve the system
+ * with a very small interval. The real fix which was
+ * promised in the context of posix_timer_fn() never
+ * materialized, but someone should really work on it.
+ *
+ * To prevent DOS fake @now to be 1 jiffie out which keeps
+ * the overrun accounting correct but creates an
+ * inconsistency vs. timer_gettime(2).
+ */
+ ktime_t kj = NSEC_PER_SEC / HZ;
+
+ if (interval < kj)
+ now = ktime_add(now, kj);
+ }
+
+ return alarm_forward(alarm, now, interval);
+}
- return alarm_forward(alarm, base->get_ktime(), interval);
+u64 alarm_forward_now(struct alarm *alarm, ktime_t interval)
+{
+ return __alarm_forward_now(alarm, interval, false);
}
EXPORT_SYMBOL_GPL(alarm_forward_now);
@@ -551,9 +575,10 @@ static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
if (posix_timer_event(ptr, si_private) && ptr->it_interval) {
/*
* Handle ignored signals and rearm the timer. This will go
- * away once we handle ignored signals proper.
+ * away once we handle ignored signals proper. Ensure that
+ * small intervals cannot starve the system.
*/
- ptr->it_overrun += alarm_forward_now(alarm, ptr->it_interval);
+ ptr->it_overrun += __alarm_forward_now(alarm, ptr->it_interval, true);
++ptr->it_requeue_pending;
ptr->it_active = 1;
result = ALARMTIMER_RESTART;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
73bdf65ea748 ("migrate: hugetlb: check for hugetlb shared PMD in node migration")
7ce82f4c3f3e ("mm/migration: return errno when isolate_huge_page failed")
1b7f7e58decc ("mm/gup: Convert check_and_migrate_movable_pages() to use a folio")
f9f38f78c5d5 ("mm: refactor check_and_migrate_movable_pages")
5ac95884a784 ("mm/migrate: enable returning precise migrate_pages() success count")
c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling")
5db4f15c4fd7 ("mm: memory: add orig_pmd to struct vm_fault")
8f34f1eac382 ("mm/userfaultfd: fix uffd-wp special cases for fork()")
25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
f68749ec342b ("mm/gup: longterm pin migration cleanup")
d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone")
1a08ae36cf8b ("mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN")
6e7f34ebb8d2 ("mm/gup: check for isolation errors")
f0f4463837da ("mm/gup: return an error on migration failure")
83c02c23d074 ("mm/gup: check every subpage of a compound page during isolation")
c991ffef7bce ("mm/gup: don't pin migrated cma pages in movable zone")
7ee820ee7238 ("Revert "mm: migrate: skip shared exec THP for NUMA balancing"")
ae37c7ff79f1 ("mm: make alloc_contig_range handle in-use hugetlb pages")
369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages")
c2ad7a1ffeaf ("mm,compaction: let isolate_migratepages_{range,block} return error codes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 73bdf65ea74857d7fb2ec3067a3cec0e261b1462 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz(a)oracle.com>
Date: Thu, 26 Jan 2023 14:27:21 -0800
Subject: [PATCH] migrate: hugetlb: check for hugetlb shared PMD in node
migration
migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
move pages shared with another process to a different node. page_mapcount
> 1 is being used to determine if a hugetlb page is shared. However, a
hugetlb page will have a mapcount of 1 if mapped by multiple processes via
a shared PMD. As a result, hugetlb pages shared by multiple processes and
mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found
consider the page shared.
Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()")
Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: James Houghton <jthoughton(a)google.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola(a)gmail.com>
Cc: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..f940395667c8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -600,7 +600,8 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
- (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+ (flags & MPOL_MF_MOVE && page_mapcount(page) == 1 &&
+ !hugetlb_pmd_shared(pte))) {
if (isolate_hugetlb(page, qp->pagelist) &&
(flags & MPOL_MF_STRICT))
/*
From: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
It's unusual that we have enumeration by class in the middle of the table.
It might potentially be problematic in the future if we add another entry
after it.
So, move class matching entry to be the last in the ID table.
[ Upstream commit 0b85f59d30b91bd2b93ea7ef0816a4b7e7039e8c ]
Without this change, quirks set in driver_data added after the catch-all
are ignored.
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Reviewed-by: Keith Busch <kbusch(a)kernel.org>
Reviewed-by: Sagi Grimberg <sagi(a)grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni(a)wdc.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Gwendal Grignou <gwendal(a)chromium.org>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5d62d1042c0e6..a58711c488509 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3199,7 +3199,6 @@ static const struct pci_device_id nvme_id_table[] = {
NVME_QUIRK_IGNORE_DEV_SUBNQN, },
{ PCI_DEVICE(0x1c5c, 0x1504), /* SK Hynix PC400 */
.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
- { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ PCI_DEVICE(0x2646, 0x2263), /* KINGSTON A2000 NVMe SSD */
.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001),
@@ -3209,6 +3208,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_SINGLE_VECTOR |
NVME_QUIRK_128_BYTES_SQES |
NVME_QUIRK_SHARED_TAGS },
+
+ { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
{ 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);
--
2.39.1.519.gcb327c4b5f-goog
Mark the Tiger Lake UP{3,4} AHCI controller as "low_power". This enables
S0ix to work out of the box. Otherwise this isn't working unless the
user manually sets /sys/class/scsi_host/*/link_power_management_policy.
Intel lists a total of 4 SATA controller IDs in [1] for those mobile
PCHs. This commit just adds the "AHCI" variant since I only tested
those.
[1]: https://cdrdv2.intel.com/v1/dl/getContent/631119
Signed-off-by: Simon Gaiser <simon(a)invisiblethingslab.com>
CC: stable(a)vger.kernel.org
---
As noted above this doesn't include the other PCI IDs listed by Intel
for those PCHs (RAID modes). Also the same is probably needed for newer
generations. But for both I don't have hardware to test handy right now,
so only included what I have actually tested.
Added stable to CC, since on systems using S0ix this prevents S0ix
residency and therefore leads to such high power consumption that
suspend is effectively broken.
drivers/ata/ahci.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 14a1c0d14916..3bb9bb483fe3 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -421,6 +421,7 @@ static const struct pci_device_id ahci_pci_tbl[] = {
{ PCI_VDEVICE(INTEL, 0x34d3), board_ahci_low_power }, /* Ice Lake LP AHCI */
{ PCI_VDEVICE(INTEL, 0x02d3), board_ahci_low_power }, /* Comet Lake PCH-U AHCI */
{ PCI_VDEVICE(INTEL, 0x02d7), board_ahci_low_power }, /* Comet Lake PCH RAID */
+ { PCI_VDEVICE(INTEL, 0xa0d3), board_ahci_low_power }, /* Tiger Lake UP{3,4} AHCI */
/* JMicron 360/1/3/5/6, match class to avoid IDE function */
{ PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
--
2.39.1