The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From c7fd89a6407ea3a44a2a2fa12d290162c42499c4 Mon Sep 17 00:00:00 2001
From: James Hogan <jhogan(a)kernel.org>
Date: Fri, 10 Nov 2017 11:46:54 +0000
Subject: [PATCH] MIPS: Fix odd fp register warnings with MIPS64r2
Building 32-bit MIPS64r2 kernels produces warnings like the following
on certain toolchains (such as GNU assembler 2.24.90, but not GNU
assembler 2.28.51) since commit 22b8ba765a72 ("MIPS: Fix MIPS64 FP
save/restore on 32-bit kernels"), due to the exposure of fpu_save_16odd
from fpu_save_double and fpu_restore_16odd from fpu_restore_double:
arch/mips/kernel/r4k_fpu.S:47: Warning: float register should be even, was 1
...
arch/mips/kernel/r4k_fpu.S:59: Warning: float register should be even, was 1
...
This appears to be because .set mips64r2 does not change the FPU ABI to
64-bit when -march=mips64r2 (or e.g. -march=xlp) is provided on the
command line on that toolchain, from the default FPU ABI of 32-bit due
to the -mabi=32. This makes access to the odd FPU registers invalid.
Fix by explicitly changing the FPU ABI with .set fp=64 directives in
fpu_save_16odd and fpu_restore_16odd, and moving the undefine of fp up
in asmmacro.h so fp doesn't turn into $30.
Fixes: 22b8ba765a72 ("MIPS: Fix MIPS64 FP save/restore on 32-bit kernels")
Signed-off-by: James Hogan <jhogan(a)kernel.org>
Cc: Ralf Baechle <ralf(a)linux-mips.org>
Cc: Paul Burton <paul.burton(a)imgtec.com>
Cc: linux-mips(a)linux-mips.org
Cc: <stable(a)vger.kernel.org> # 4.0+: 22b8ba765a72: MIPS: Fix MIPS64 FP save/restore on 32-bit kernels
Cc: <stable(a)vger.kernel.org> # 4.0+
Patchwork: https://patchwork.linux-mips.org/patch/17656/
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index b815d7b3bd27..feb069cbf44e 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -19,6 +19,9 @@
#include <asm/asmmacro-64.h>
#endif
+/* preprocessor replaces the fp in ".set fp=64" with $30 otherwise */
+#undef fp
+
/*
* Helper macros for generating raw instruction encodings.
*/
@@ -105,6 +108,7 @@
.macro fpu_save_16odd thread
.set push
.set mips64r2
+ .set fp=64
SET_HARDFLOAT
sdc1 $f1, THREAD_FPR1(\thread)
sdc1 $f3, THREAD_FPR3(\thread)
@@ -163,6 +167,7 @@
.macro fpu_restore_16odd thread
.set push
.set mips64r2
+ .set fp=64
SET_HARDFLOAT
ldc1 $f1, THREAD_FPR1(\thread)
ldc1 $f3, THREAD_FPR3(\thread)
@@ -234,9 +239,6 @@
.endm
#ifdef TOOLCHAIN_SUPPORTS_MSA
-/* preprocessor replaces the fp in ".set fp=64" with $30 otherwise */
-#undef fp
-
.macro _cfcmsa rd, cs
.set push
.set mips32r2
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 22b8ba765a726d90e9830ff6134c32b04f12c10f Mon Sep 17 00:00:00 2001
From: James Hogan <jhogan(a)kernel.org>
Date: Mon, 3 Jul 2017 23:41:47 +0100
Subject: [PATCH] MIPS: Fix MIPS64 FP save/restore on 32-bit kernels
32-bit kernels can be configured to support MIPS64, in which case
neither CONFIG_64BIT or CONFIG_CPU_MIPS32_R* will be set. This causes
the CP0_Status.FR checks at the point of floating point register save
and restore to be compiled out, which results in odd FP registers not
being saved or restored to the task or signal context even when
CP0_Status.FR is set.
Fix the ifdefs to use CONFIG_CPU_MIPSR2 and CONFIG_CPU_MIPSR6, which are
enabled for the relevant revisions of either MIPS32 or MIPS64, along
with some other CPUs such as Octeon (r2), Loongson1 (r2), XLP (r2),
Loongson 3A R2.
The suspect code originates from commit 597ce1723e0f ("MIPS: Support for
64-bit FP with O32 binaries") in v3.14, however the code in
__enable_fpu() was consistent and refused to set FR=1, falling back to
software FPU emulation. This was suboptimal but should be functionally
correct.
Commit fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6
CPU") in v4.2 (and stable tagged back to 4.0) later introduced the bug
by updating __enable_fpu() to set FR=1 but failing to update the other
similar ifdefs to enable FR=1 state handling.
Fixes: fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6 CPU")
Signed-off-by: James Hogan <jhogan(a)kernel.org>
Cc: Ralf Baechle <ralf(a)linux-mips.org>
Cc: Paul Burton <paul.burton(a)imgtec.com>
Cc: linux-mips(a)linux-mips.org
Cc: <stable(a)vger.kernel.org> # 4.0+
Patchwork: https://patchwork.linux-mips.org/patch/16739/
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 83054f79f72a..b815d7b3bd27 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -126,8 +126,8 @@
.endm
.macro fpu_save_double thread status tmp
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
sll \tmp, \status, 5
bgez \tmp, 10f
fpu_save_16odd \thread
@@ -184,8 +184,8 @@
.endm
.macro fpu_restore_double thread status tmp
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
sll \tmp, \status, 5
bgez \tmp, 10f # 16 register mode?
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 0a83b1708b3c..8e3a6020c613 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -40,8 +40,8 @@
*/
LEAF(_save_fp)
EXPORT_SYMBOL(_save_fp)
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
mfc0 t0, CP0_STATUS
#endif
fpu_save_double a0 t0 t1 # clobbers t1
@@ -52,8 +52,8 @@ EXPORT_SYMBOL(_save_fp)
* Restore a thread's fp context.
*/
LEAF(_restore_fp)
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
mfc0 t0, CP0_STATUS
#endif
fpu_restore_double a0 t0 t1 # clobbers t1
@@ -246,11 +246,11 @@ LEAF(_save_fp_context)
cfc1 t1, fcr31
.set pop
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
.set push
SET_HARDFLOAT
-#ifdef CONFIG_CPU_MIPS32_R2
+#ifdef CONFIG_CPU_MIPSR2
.set mips32r2
.set fp=64
mfc0 t0, CP0_STATUS
@@ -314,11 +314,11 @@ LEAF(_save_fp_context)
LEAF(_restore_fp_context)
EX lw t1, 0(a1)
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
.set push
SET_HARDFLOAT
-#ifdef CONFIG_CPU_MIPS32_R2
+#ifdef CONFIG_CPU_MIPSR2
.set mips32r2
.set fp=64
mfc0 t0, CP0_STATUS
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 22b8ba765a726d90e9830ff6134c32b04f12c10f Mon Sep 17 00:00:00 2001
From: James Hogan <jhogan(a)kernel.org>
Date: Mon, 3 Jul 2017 23:41:47 +0100
Subject: [PATCH] MIPS: Fix MIPS64 FP save/restore on 32-bit kernels
32-bit kernels can be configured to support MIPS64, in which case
neither CONFIG_64BIT or CONFIG_CPU_MIPS32_R* will be set. This causes
the CP0_Status.FR checks at the point of floating point register save
and restore to be compiled out, which results in odd FP registers not
being saved or restored to the task or signal context even when
CP0_Status.FR is set.
Fix the ifdefs to use CONFIG_CPU_MIPSR2 and CONFIG_CPU_MIPSR6, which are
enabled for the relevant revisions of either MIPS32 or MIPS64, along
with some other CPUs such as Octeon (r2), Loongson1 (r2), XLP (r2),
Loongson 3A R2.
The suspect code originates from commit 597ce1723e0f ("MIPS: Support for
64-bit FP with O32 binaries") in v3.14, however the code in
__enable_fpu() was consistent and refused to set FR=1, falling back to
software FPU emulation. This was suboptimal but should be functionally
correct.
Commit fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6
CPU") in v4.2 (and stable tagged back to 4.0) later introduced the bug
by updating __enable_fpu() to set FR=1 but failing to update the other
similar ifdefs to enable FR=1 state handling.
Fixes: fcc53b5f6c38 ("MIPS: fpu.h: Allow 64-bit FPU on a 64-bit MIPS R6 CPU")
Signed-off-by: James Hogan <jhogan(a)kernel.org>
Cc: Ralf Baechle <ralf(a)linux-mips.org>
Cc: Paul Burton <paul.burton(a)imgtec.com>
Cc: linux-mips(a)linux-mips.org
Cc: <stable(a)vger.kernel.org> # 4.0+
Patchwork: https://patchwork.linux-mips.org/patch/16739/
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 83054f79f72a..b815d7b3bd27 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -126,8 +126,8 @@
.endm
.macro fpu_save_double thread status tmp
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
sll \tmp, \status, 5
bgez \tmp, 10f
fpu_save_16odd \thread
@@ -184,8 +184,8 @@
.endm
.macro fpu_restore_double thread status tmp
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
sll \tmp, \status, 5
bgez \tmp, 10f # 16 register mode?
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 0a83b1708b3c..8e3a6020c613 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -40,8 +40,8 @@
*/
LEAF(_save_fp)
EXPORT_SYMBOL(_save_fp)
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
mfc0 t0, CP0_STATUS
#endif
fpu_save_double a0 t0 t1 # clobbers t1
@@ -52,8 +52,8 @@ EXPORT_SYMBOL(_save_fp)
* Restore a thread's fp context.
*/
LEAF(_restore_fp)
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
mfc0 t0, CP0_STATUS
#endif
fpu_restore_double a0 t0 t1 # clobbers t1
@@ -246,11 +246,11 @@ LEAF(_save_fp_context)
cfc1 t1, fcr31
.set pop
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
.set push
SET_HARDFLOAT
-#ifdef CONFIG_CPU_MIPS32_R2
+#ifdef CONFIG_CPU_MIPSR2
.set mips32r2
.set fp=64
mfc0 t0, CP0_STATUS
@@ -314,11 +314,11 @@ LEAF(_save_fp_context)
LEAF(_restore_fp_context)
EX lw t1, 0(a1)
-#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2) || \
- defined(CONFIG_CPU_MIPS32_R6)
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPSR2) || \
+ defined(CONFIG_CPU_MIPSR6)
.set push
SET_HARDFLOAT
-#ifdef CONFIG_CPU_MIPS32_R2
+#ifdef CONFIG_CPU_MIPSR2
.set mips32r2
.set fp=64
mfc0 t0, CP0_STATUS
On Mon, Nov 27, 2017 at 02:13:00PM +0000, James Hogan wrote:
> On Mon, Nov 27, 2017 at 01:56:49PM +0100, Greg KH wrote:
> > On Mon, Nov 27, 2017 at 12:40:36PM +0000, James Hogan wrote:
> > > Hi Greg,
> > >
> > > On Mon, Nov 27, 2017 at 01:35:46PM +0100, gregkh(a)linuxfoundation.org wrote:
> > > > The patch below was submitted to be applied to the 4.9-stable tree.
> > > >
> > > > I fail to see how this patch meets the stable kernel rules as found at
> > > > Documentation/process/stable-kernel-rules.rst.
> > > >
> > > > I could be totally wrong, and if so, please respond to
> > > > <stable(a)vger.kernel.org> and let me know why this patch should be
> > > > applied. Otherwise, it is now dropped from my patch queues, never to be
> > > > seen again.
> > >
> > > I should have adjusted the commit message. KERN_WARN doesn't exist so it
> > > actually fixes a build error as well as switching to pr_warn().
> >
> > What build error? I have not heard of this breaking the build on 4.9
> > for the past year, is it in some config that no one uses? :)
>
> The LEDE project has been carrying the patch [1] since February when
> they added 4.9 support (their 4.4 support had a slightly earlier version
> of the driver added with just a plain printk, no KERN_WARN).
>
> They have both CONFIG_SOC_MT7620 and CONFIG_PCI=y in their ralink mt7620
> config [2], and they are keeping up to date with stable releases [3], so
> I have no doubt they would appreciate having the patch applied to
> upstream stable to reduce their delta.
>
> The only defconfigs in mainline which enable this platform
> (CONFIG_SOC_MT7620) are omega2p_defconfig and vocore2_defconfig, which
> were added in August by Harvey to help widen our internal continuous
> build & boot test coverage. Neither defconfig enables CONFIG_PCI yet
> which is required to see the build failure below, but regardless it is a
> valid configuration which LEDE is actively using.
>
> arch/mips/pci/pci-mt7620.c: In function ‘wait_pciephy_busy’:
> arch/mips/pci/pci-mt7620.c:123:11: error: ‘KERN_WARN’ undeclared (first use in this function)
> printk(KERN_WARN "PCIE-PHY retry failed.\n");
> ^~~~~~~~~
>
> John: I'm not familiar with the hardware, but would it be appropriate to
> add CONFIG_PCI=y to either of those 2 defconfigs (omega2p_defconfig and
> vocore2_defconfig) so this driver gets some upstream build[/boot]
> testing?
>
> Anyway, hopefully that helps allay stable backport concerns.
Yes, thanks, that explains it a lot better, now queued up.
greg k-h
hi Greg,
Please include the commit below in -stable, once it gets upstream.
Maybe also add:
Fixes: b6366f048e0c: ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
because it fixes a pretty old bug.
Thanks,
Ingo
----- Forwarded message from "tip-bot for Steven Rostedt (Red Hat)" <tipbot(a)zytor.com> -----
Date: Tue, 10 Oct 2017 04:02:17 -0700
From: "tip-bot for Steven Rostedt (Red Hat)" <tipbot(a)zytor.com>
To: linux-tip-commits(a)vger.kernel.org
Cc: rostedt(a)goodmis.org, williams(a)redhat.com, mingo(a)kernel.org, tglx(a)linutronix.de, peterz(a)infradead.org, linux-kernel(a)vger.kernel.org, jkacur(a)redhat.com, bristot(a)redhat.com,
torvalds(a)linux-foundation.org, efault(a)gmx.de, hpa(a)zytor.com, swood(a)redhat.com
Subject: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic
Commit-ID: 4bdced5c9a2922521e325896a7bbbf0132c94e56
Gitweb: https://git.kernel.org/tip/4bdced5c9a2922521e325896a7bbbf0132c94e56
Author: Steven Rostedt (Red Hat) <rostedt(a)goodmis.org>
AuthorDate: Fri, 6 Oct 2017 14:05:04 -0400
Committer: Ingo Molnar <mingo(a)kernel.org>
CommitDate: Tue, 10 Oct 2017 11:45:40 +0200
sched/rt: Simplify the IPI based RT balancing logic
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mike Galbraith <efault(a)gmx.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Scott Wood <swood(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
---
kernel/sched/rt.c | 316 ++++++++++++++++++------------------------------
kernel/sched/sched.h | 24 ++--
kernel/sched/topology.c | 6 +
3 files changed, 138 insertions(+), 208 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0af5ca9..fda2799 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -73,10 +73,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
void init_rt_rq(struct rt_rq *rt_rq)
{
struct rt_prio_array *array;
@@ -96,13 +92,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
-
-#ifdef HAVE_RT_PUSH_IPI
- rt_rq->push_flags = 0;
- rt_rq->push_cpu = nr_cpu_ids;
- raw_spin_lock_init(&rt_rq->push_lock);
- init_irq_work(&rt_rq->push_work, push_irq_work_func);
-#endif
#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1875,241 +1864,166 @@ static void push_rt_tasks(struct rq *rq)
}
#ifdef HAVE_RT_PUSH_IPI
+
/*
- * The search for the next cpu always starts at rq->cpu and ends
- * when we reach rq->cpu again. It will never return rq->cpu.
- * This returns the next cpu to check, or nr_cpu_ids if the loop
- * is complete.
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * All CPUs with overloaded RT tasks need to be notified as there is currently
+ * no way to know which of these CPUs have the highest priority task waiting
+ * to run. Instead of trying to take a spinlock on each of these CPUs,
+ * which has shown to cause large latency when done on machines with many
+ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
+ * RT tasks waiting to run.
+ *
+ * Just sending an IPI to each of the CPUs is also an issue, as on large
+ * count CPU machines, this can cause an IPI storm on a CPU, especially
+ * if its the only CPU with multiple RT tasks queued, and a large number
+ * of CPUs scheduling a lower priority task at the same time.
+ *
+ * Each root domain has its own irq work function that can iterate over
+ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
+ * tassk must be checked if there's one or many CPUs that are lowering
+ * their priority, there's a single irq work iterator that will try to
+ * push off RT tasks that are waiting to run.
+ *
+ * When a CPU schedules a lower priority task, it will kick off the
+ * irq work iterator that will jump to each CPU with overloaded RT tasks.
+ * As it only takes the first CPU that schedules a lower priority task
+ * to start the process, the rto_start variable is incremented and if
+ * the atomic result is one, then that CPU will try to take the rto_lock.
+ * This prevents high contention on the lock as the process handles all
+ * CPUs scheduling lower priority tasks.
+ *
+ * All CPUs that are scheduling a lower priority task will increment the
+ * rt_loop_next variable. This will make sure that the irq work iterator
+ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
+ * priority task, even if the iterator is in the middle of a scan. Incrementing
+ * the rt_loop_next will cause the iterator to perform another scan.
*
- * rq->rt.push_cpu holds the last cpu returned by this function,
- * or if this is the first instance, it must hold rq->cpu.
*/
static int rto_next_cpu(struct rq *rq)
{
- int prev_cpu = rq->rt.push_cpu;
+ struct root_domain *rd = rq->rd;
+ int next;
int cpu;
- cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
-
/*
- * If the previous cpu is less than the rq's CPU, then it already
- * passed the end of the mask, and has started from the beginning.
- * We end if the next CPU is greater or equal to rq's CPU.
+ * When starting the IPI RT pushing, the rto_cpu is set to -1,
+ * rt_next_cpu() will simply return the first CPU found in
+ * the rto_mask.
+ *
+ * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
+ * will return the next CPU found in the rto_mask.
+ *
+ * If there are no more CPUs left in the rto_mask, then a check is made
+ * against rto_loop and rto_loop_next. rto_loop is only updated with
+ * the rto_lock held, but any CPU may increment the rto_loop_next
+ * without any locking.
*/
- if (prev_cpu < rq->cpu) {
- if (cpu >= rq->cpu)
- return nr_cpu_ids;
+ for (;;) {
- } else if (cpu >= nr_cpu_ids) {
- /*
- * We passed the end of the mask, start at the beginning.
- * If the result is greater or equal to the rq's CPU, then
- * the loop is finished.
- */
- cpu = cpumask_first(rq->rd->rto_mask);
- if (cpu >= rq->cpu)
- return nr_cpu_ids;
- }
- rq->rt.push_cpu = cpu;
+ /* When rto_cpu is -1 this acts like cpumask_first() */
+ cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
- /* Return cpu to let the caller know if the loop is finished or not */
- return cpu;
-}
+ rd->rto_cpu = cpu;
-static int find_next_push_cpu(struct rq *rq)
-{
- struct rq *next_rq;
- int cpu;
+ if (cpu < nr_cpu_ids)
+ return cpu;
- while (1) {
- cpu = rto_next_cpu(rq);
- if (cpu >= nr_cpu_ids)
- break;
- next_rq = cpu_rq(cpu);
+ rd->rto_cpu = -1;
+
+ /*
+ * ACQUIRE ensures we see the @rto_mask changes
+ * made prior to the @next value observed.
+ *
+ * Matches WMB in rt_set_overload().
+ */
+ next = atomic_read_acquire(&rd->rto_loop_next);
- /* Make sure the next rq can push to this rq */
- if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ if (rd->rto_loop == next)
break;
+
+ rd->rto_loop = next;
}
- return cpu;
+ return -1;
}
-#define RT_PUSH_IPI_EXECUTING 1
-#define RT_PUSH_IPI_RESTART 2
+static inline bool rto_start_trylock(atomic_t *v)
+{
+ return !atomic_cmpxchg_acquire(v, 0, 1);
+}
-/*
- * When a high priority task schedules out from a CPU and a lower priority
- * task is scheduled in, a check is made to see if there's any RT tasks
- * on other CPUs that are waiting to run because a higher priority RT task
- * is currently running on its CPU. In this case, the CPU with multiple RT
- * tasks queued on it (overloaded) needs to be notified that a CPU has opened
- * up that may be able to run one of its non-running queued RT tasks.
- *
- * On large CPU boxes, there's the case that several CPUs could schedule
- * a lower priority task at the same time, in which case it will look for
- * any overloaded CPUs that it could pull a task from. To do this, the runqueue
- * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
- * for a single overloaded CPU's runqueue lock can produce a large latency.
- * (This has actually been observed on large boxes running cyclictest).
- * Instead of taking the runqueue lock of the overloaded CPU, each of the
- * CPUs that scheduled a lower priority task simply sends an IPI to the
- * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
- * lots of contention. The overloaded CPU will look to push its non-running
- * RT task off, and if it does, it can then ignore the other IPIs coming
- * in, and just pass those IPIs off to any other overloaded CPU.
- *
- * When a CPU schedules a lower priority task, it only sends an IPI to
- * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
- * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
- * RT overloaded tasks, would cause 100 IPIs to go out at once.
- *
- * The overloaded RT CPU, when receiving an IPI, will try to push off its
- * overloaded RT tasks and then send an IPI to the next CPU that has
- * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
- * have completed. Just because a CPU may have pushed off its own overloaded
- * RT task does not mean it should stop sending the IPI around to other
- * overloaded CPUs. There may be another RT task waiting to run on one of
- * those CPUs that are of higher priority than the one that was just
- * pushed.
- *
- * An optimization that could possibly be made is to make a CPU array similar
- * to the cpupri array mask of all running RT tasks, but for the overloaded
- * case, then the IPI could be sent to only the CPU with the highest priority
- * RT task waiting, and that CPU could send off further IPIs to the CPU with
- * the next highest waiting task. Since the overloaded case is much less likely
- * to happen, the complexity of this implementation may not be worth it.
- * Instead, just send an IPI around to all overloaded CPUs.
- *
- * The rq->rt.push_flags holds the status of the IPI that is going around.
- * A run queue can only send out a single IPI at a time. The possible flags
- * for rq->rt.push_flags are:
- *
- * (None or zero): No IPI is going around for the current rq
- * RT_PUSH_IPI_EXECUTING: An IPI for the rq is being passed around
- * RT_PUSH_IPI_RESTART: The priority of the running task for the rq
- * has changed, and the IPI should restart
- * circulating the overloaded CPUs again.
- *
- * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
- * before sending to the next CPU.
- *
- * Instead of having all CPUs that schedule a lower priority task send
- * an IPI to the same "first" CPU in the RT overload mask, they send it
- * to the next overloaded CPU after their own CPU. This helps distribute
- * the work when there's more than one overloaded CPU and multiple CPUs
- * scheduling in lower priority tasks.
- *
- * When a rq schedules a lower priority task than what was currently
- * running, the next CPU with overloaded RT tasks is examined first.
- * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
- * priority task, it will send an IPI first to CPU 5, then CPU 5 will
- * send to CPU 1 if it is still overloaded. CPU 1 will clear the
- * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
- *
- * The first CPU to notice IPI_RESTART is set, will clear that flag and then
- * send an IPI to the next overloaded CPU after the rq->cpu and not the next
- * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
- * schedules a lower priority task, and the IPI_RESTART gets set while the
- * handling is being done on CPU 5, it will clear the flag and send it back to
- * CPU 4 instead of CPU 1.
- *
- * Note, the above logic can be disabled by turning off the sched_feature
- * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
- * taken by the CPU requesting a pull and the waiting RT task will be pulled
- * by that CPU. This may be fine for machines with few CPUs.
- */
-static void tell_cpu_to_push(struct rq *rq)
+static inline void rto_start_unlock(atomic_t *v)
{
- int cpu;
+ atomic_set_release(v, 0);
+}
- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
- raw_spin_lock(&rq->rt.push_lock);
- /* Make sure it's still executing */
- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
- /*
- * Tell the IPI to restart the loop as things have
- * changed since it started.
- */
- rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
- raw_spin_unlock(&rq->rt.push_lock);
- return;
- }
- raw_spin_unlock(&rq->rt.push_lock);
- }
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu = -1;
- /* When here, there's no IPI going around */
+ /* Keep the loop going if the IPI is currently active */
+ atomic_inc(&rq->rd->rto_loop_next);
- rq->rt.push_cpu = rq->cpu;
- cpu = find_next_push_cpu(rq);
- if (cpu >= nr_cpu_ids)
+ /* Only one CPU can initiate a loop at a time */
+ if (!rto_start_trylock(&rq->rd->rto_loop_start))
return;
- rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+ raw_spin_lock(&rq->rd->rto_lock);
+
+ /*
+ * The rto_cpu is updated under the lock, if it has a valid cpu
+ * then the IPI is still running and will continue due to the
+ * update to loop_next, and nothing needs to be done here.
+ * Otherwise it is finishing up and an ipi needs to be sent.
+ */
+ if (rq->rd->rto_cpu < 0)
+ cpu = rto_next_cpu(rq);
- irq_work_queue_on(&rq->rt.push_work, cpu);
+ raw_spin_unlock(&rq->rd->rto_lock);
+
+ rto_start_unlock(&rq->rd->rto_loop_start);
+
+ if (cpu >= 0)
+ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
}
/* Called from hardirq context */
-static void try_to_push_tasks(void *arg)
+void rto_push_irq_work_func(struct irq_work *work)
{
- struct rt_rq *rt_rq = arg;
- struct rq *rq, *src_rq;
- int this_cpu;
+ struct rq *rq;
int cpu;
- this_cpu = rt_rq->push_cpu;
+ rq = this_rq();
- /* Paranoid check */
- BUG_ON(this_cpu != smp_processor_id());
-
- rq = cpu_rq(this_cpu);
- src_rq = rq_of_rt_rq(rt_rq);
-
-again:
+ /*
+ * We do not need to grab the lock to check for has_pushable_tasks.
+ * When it gets updated, a check is made if a push is possible.
+ */
if (has_pushable_tasks(rq)) {
raw_spin_lock(&rq->lock);
- push_rt_task(rq);
+ push_rt_tasks(rq);
raw_spin_unlock(&rq->lock);
}
- /* Pass the IPI to the next rt overloaded queue */
- raw_spin_lock(&rt_rq->push_lock);
- /*
- * If the source queue changed since the IPI went out,
- * we need to restart the search from that CPU again.
- */
- if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
- rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
- rt_rq->push_cpu = src_rq->cpu;
- }
+ raw_spin_lock(&rq->rd->rto_lock);
- cpu = find_next_push_cpu(src_rq);
+ /* Pass the IPI to the next rt overloaded queue */
+ cpu = rto_next_cpu(rq);
- if (cpu >= nr_cpu_ids)
- rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
- raw_spin_unlock(&rt_rq->push_lock);
+ raw_spin_unlock(&rq->rd->rto_lock);
- if (cpu >= nr_cpu_ids)
+ if (cpu < 0)
return;
- /*
- * It is possible that a restart caused this CPU to be
- * chosen again. Don't bother with an IPI, just see if we
- * have more to push.
- */
- if (unlikely(cpu == rq->cpu))
- goto again;
-
/* Try the next RT overloaded CPU */
- irq_work_queue_on(&rt_rq->push_work, cpu);
-}
-
-static void push_irq_work_func(struct irq_work *work)
-{
- struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
-
- try_to_push_tasks(rt_rq);
+ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
}
#endif /* HAVE_RT_PUSH_IPI */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a81c978..8aa24b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -505,7 +505,7 @@ static inline int rt_bandwidth_enabled(void)
}
/* RT IPI pull logic requires IRQ_WORK */
-#ifdef CONFIG_IRQ_WORK
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
# define HAVE_RT_PUSH_IPI
#endif
@@ -527,12 +527,6 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
-#ifdef HAVE_RT_PUSH_IPI
- int push_flags;
- int push_cpu;
- struct irq_work push_work;
- raw_spinlock_t push_lock;
-#endif
#endif /* CONFIG_SMP */
int rt_queued;
@@ -641,6 +635,19 @@ struct root_domain {
struct dl_bw dl_bw;
struct cpudl cpudl;
+#ifdef HAVE_RT_PUSH_IPI
+ /*
+ * For IPI pull requests, loop across the rto_mask.
+ */
+ struct irq_work rto_push_work;
+ raw_spinlock_t rto_lock;
+ /* These are only updated and read within rto_lock */
+ int rto_loop;
+ int rto_cpu;
+ /* These atomics are updated outside of a lock */
+ atomic_t rto_loop_next;
+ atomic_t rto_loop_start;
+#endif
/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
@@ -658,6 +665,9 @@ extern void init_defrootdomain(void);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+#ifdef HAVE_RT_PUSH_IPI
+extern void rto_push_irq_work_func(struct irq_work *work);
+#endif
#endif /* CONFIG_SMP */
/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f51d123..e50450c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -268,6 +268,12 @@ static int init_rootdomain(struct root_domain *rd)
if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
goto free_dlo_mask;
+#ifdef HAVE_RT_PUSH_IPI
+ rd->rto_cpu = -1;
+ raw_spin_lock_init(&rd->rto_lock);
+ init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+#endif
+
init_dl_bw(&rd->dl_bw);
if (cpudl_init(&rd->cpudl) != 0)
goto free_rto_mask;
----- End forwarded message -----
This is a note to let you know that I've just added the patch titled
sched/rt: Simplify the IPI based RT balancing logic
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 4bdced5c9a2922521e325896a7bbbf0132c94e56 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" <rostedt(a)goodmis.org>
Date: Fri, 6 Oct 2017 14:05:04 -0400
Subject: sched/rt: Simplify the IPI based RT balancing logic
From: Steven Rostedt (Red Hat) <rostedt(a)goodmis.org>
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mike Galbraith <efault(a)gmx.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Scott Wood <swood(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/sched/rt.c | 316 +++++++++++++++++-------------------------------
kernel/sched/sched.h | 24 ++-
kernel/sched/topology.c | 6
3 files changed, 138 insertions(+), 208 deletions(-)
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -74,10 +74,6 @@ static void start_rt_bandwidth(struct rt
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
void init_rt_rq(struct rt_rq *rt_rq)
{
struct rt_prio_array *array;
@@ -97,13 +93,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
-
-#ifdef HAVE_RT_PUSH_IPI
- rt_rq->push_flags = 0;
- rt_rq->push_cpu = nr_cpu_ids;
- raw_spin_lock_init(&rt_rq->push_lock);
- init_irq_work(&rt_rq->push_work, push_irq_work_func);
-#endif
#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1876,241 +1865,166 @@ static void push_rt_tasks(struct rq *rq)
}
#ifdef HAVE_RT_PUSH_IPI
+
/*
- * The search for the next cpu always starts at rq->cpu and ends
- * when we reach rq->cpu again. It will never return rq->cpu.
- * This returns the next cpu to check, or nr_cpu_ids if the loop
- * is complete.
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * All CPUs with overloaded RT tasks need to be notified as there is currently
+ * no way to know which of these CPUs have the highest priority task waiting
+ * to run. Instead of trying to take a spinlock on each of these CPUs,
+ * which has shown to cause large latency when done on machines with many
+ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
+ * RT tasks waiting to run.
+ *
+ * Just sending an IPI to each of the CPUs is also an issue, as on large
+ * count CPU machines, this can cause an IPI storm on a CPU, especially
+ * if its the only CPU with multiple RT tasks queued, and a large number
+ * of CPUs scheduling a lower priority task at the same time.
+ *
+ * Each root domain has its own irq work function that can iterate over
+ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
+ * tassk must be checked if there's one or many CPUs that are lowering
+ * their priority, there's a single irq work iterator that will try to
+ * push off RT tasks that are waiting to run.
+ *
+ * When a CPU schedules a lower priority task, it will kick off the
+ * irq work iterator that will jump to each CPU with overloaded RT tasks.
+ * As it only takes the first CPU that schedules a lower priority task
+ * to start the process, the rto_start variable is incremented and if
+ * the atomic result is one, then that CPU will try to take the rto_lock.
+ * This prevents high contention on the lock as the process handles all
+ * CPUs scheduling lower priority tasks.
+ *
+ * All CPUs that are scheduling a lower priority task will increment the
+ * rt_loop_next variable. This will make sure that the irq work iterator
+ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
+ * priority task, even if the iterator is in the middle of a scan. Incrementing
+ * the rt_loop_next will cause the iterator to perform another scan.
*
- * rq->rt.push_cpu holds the last cpu returned by this function,
- * or if this is the first instance, it must hold rq->cpu.
*/
static int rto_next_cpu(struct rq *rq)
{
- int prev_cpu = rq->rt.push_cpu;
+ struct root_domain *rd = rq->rd;
+ int next;
int cpu;
- cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
-
/*
- * If the previous cpu is less than the rq's CPU, then it already
- * passed the end of the mask, and has started from the beginning.
- * We end if the next CPU is greater or equal to rq's CPU.
+ * When starting the IPI RT pushing, the rto_cpu is set to -1,
+ * rt_next_cpu() will simply return the first CPU found in
+ * the rto_mask.
+ *
+ * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
+ * will return the next CPU found in the rto_mask.
+ *
+ * If there are no more CPUs left in the rto_mask, then a check is made
+ * against rto_loop and rto_loop_next. rto_loop is only updated with
+ * the rto_lock held, but any CPU may increment the rto_loop_next
+ * without any locking.
*/
- if (prev_cpu < rq->cpu) {
- if (cpu >= rq->cpu)
- return nr_cpu_ids;
+ for (;;) {
- } else if (cpu >= nr_cpu_ids) {
- /*
- * We passed the end of the mask, start at the beginning.
- * If the result is greater or equal to the rq's CPU, then
- * the loop is finished.
- */
- cpu = cpumask_first(rq->rd->rto_mask);
- if (cpu >= rq->cpu)
- return nr_cpu_ids;
- }
- rq->rt.push_cpu = cpu;
+ /* When rto_cpu is -1 this acts like cpumask_first() */
+ cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
- /* Return cpu to let the caller know if the loop is finished or not */
- return cpu;
-}
+ rd->rto_cpu = cpu;
-static int find_next_push_cpu(struct rq *rq)
-{
- struct rq *next_rq;
- int cpu;
+ if (cpu < nr_cpu_ids)
+ return cpu;
- while (1) {
- cpu = rto_next_cpu(rq);
- if (cpu >= nr_cpu_ids)
- break;
- next_rq = cpu_rq(cpu);
+ rd->rto_cpu = -1;
+
+ /*
+ * ACQUIRE ensures we see the @rto_mask changes
+ * made prior to the @next value observed.
+ *
+ * Matches WMB in rt_set_overload().
+ */
+ next = atomic_read_acquire(&rd->rto_loop_next);
- /* Make sure the next rq can push to this rq */
- if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ if (rd->rto_loop == next)
break;
+
+ rd->rto_loop = next;
}
- return cpu;
+ return -1;
}
-#define RT_PUSH_IPI_EXECUTING 1
-#define RT_PUSH_IPI_RESTART 2
+static inline bool rto_start_trylock(atomic_t *v)
+{
+ return !atomic_cmpxchg_acquire(v, 0, 1);
+}
-/*
- * When a high priority task schedules out from a CPU and a lower priority
- * task is scheduled in, a check is made to see if there's any RT tasks
- * on other CPUs that are waiting to run because a higher priority RT task
- * is currently running on its CPU. In this case, the CPU with multiple RT
- * tasks queued on it (overloaded) needs to be notified that a CPU has opened
- * up that may be able to run one of its non-running queued RT tasks.
- *
- * On large CPU boxes, there's the case that several CPUs could schedule
- * a lower priority task at the same time, in which case it will look for
- * any overloaded CPUs that it could pull a task from. To do this, the runqueue
- * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
- * for a single overloaded CPU's runqueue lock can produce a large latency.
- * (This has actually been observed on large boxes running cyclictest).
- * Instead of taking the runqueue lock of the overloaded CPU, each of the
- * CPUs that scheduled a lower priority task simply sends an IPI to the
- * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
- * lots of contention. The overloaded CPU will look to push its non-running
- * RT task off, and if it does, it can then ignore the other IPIs coming
- * in, and just pass those IPIs off to any other overloaded CPU.
- *
- * When a CPU schedules a lower priority task, it only sends an IPI to
- * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
- * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
- * RT overloaded tasks, would cause 100 IPIs to go out at once.
- *
- * The overloaded RT CPU, when receiving an IPI, will try to push off its
- * overloaded RT tasks and then send an IPI to the next CPU that has
- * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
- * have completed. Just because a CPU may have pushed off its own overloaded
- * RT task does not mean it should stop sending the IPI around to other
- * overloaded CPUs. There may be another RT task waiting to run on one of
- * those CPUs that are of higher priority than the one that was just
- * pushed.
- *
- * An optimization that could possibly be made is to make a CPU array similar
- * to the cpupri array mask of all running RT tasks, but for the overloaded
- * case, then the IPI could be sent to only the CPU with the highest priority
- * RT task waiting, and that CPU could send off further IPIs to the CPU with
- * the next highest waiting task. Since the overloaded case is much less likely
- * to happen, the complexity of this implementation may not be worth it.
- * Instead, just send an IPI around to all overloaded CPUs.
- *
- * The rq->rt.push_flags holds the status of the IPI that is going around.
- * A run queue can only send out a single IPI at a time. The possible flags
- * for rq->rt.push_flags are:
- *
- * (None or zero): No IPI is going around for the current rq
- * RT_PUSH_IPI_EXECUTING: An IPI for the rq is being passed around
- * RT_PUSH_IPI_RESTART: The priority of the running task for the rq
- * has changed, and the IPI should restart
- * circulating the overloaded CPUs again.
- *
- * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
- * before sending to the next CPU.
- *
- * Instead of having all CPUs that schedule a lower priority task send
- * an IPI to the same "first" CPU in the RT overload mask, they send it
- * to the next overloaded CPU after their own CPU. This helps distribute
- * the work when there's more than one overloaded CPU and multiple CPUs
- * scheduling in lower priority tasks.
- *
- * When a rq schedules a lower priority task than what was currently
- * running, the next CPU with overloaded RT tasks is examined first.
- * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
- * priority task, it will send an IPI first to CPU 5, then CPU 5 will
- * send to CPU 1 if it is still overloaded. CPU 1 will clear the
- * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
- *
- * The first CPU to notice IPI_RESTART is set, will clear that flag and then
- * send an IPI to the next overloaded CPU after the rq->cpu and not the next
- * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
- * schedules a lower priority task, and the IPI_RESTART gets set while the
- * handling is being done on CPU 5, it will clear the flag and send it back to
- * CPU 4 instead of CPU 1.
- *
- * Note, the above logic can be disabled by turning off the sched_feature
- * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
- * taken by the CPU requesting a pull and the waiting RT task will be pulled
- * by that CPU. This may be fine for machines with few CPUs.
- */
-static void tell_cpu_to_push(struct rq *rq)
+static inline void rto_start_unlock(atomic_t *v)
{
- int cpu;
+ atomic_set_release(v, 0);
+}
- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
- raw_spin_lock(&rq->rt.push_lock);
- /* Make sure it's still executing */
- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
- /*
- * Tell the IPI to restart the loop as things have
- * changed since it started.
- */
- rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
- raw_spin_unlock(&rq->rt.push_lock);
- return;
- }
- raw_spin_unlock(&rq->rt.push_lock);
- }
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu = -1;
- /* When here, there's no IPI going around */
+ /* Keep the loop going if the IPI is currently active */
+ atomic_inc(&rq->rd->rto_loop_next);
- rq->rt.push_cpu = rq->cpu;
- cpu = find_next_push_cpu(rq);
- if (cpu >= nr_cpu_ids)
+ /* Only one CPU can initiate a loop at a time */
+ if (!rto_start_trylock(&rq->rd->rto_loop_start))
return;
- rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+ raw_spin_lock(&rq->rd->rto_lock);
- irq_work_queue_on(&rq->rt.push_work, cpu);
+ /*
+ * The rto_cpu is updated under the lock, if it has a valid cpu
+ * then the IPI is still running and will continue due to the
+ * update to loop_next, and nothing needs to be done here.
+ * Otherwise it is finishing up and an ipi needs to be sent.
+ */
+ if (rq->rd->rto_cpu < 0)
+ cpu = rto_next_cpu(rq);
+
+ raw_spin_unlock(&rq->rd->rto_lock);
+
+ rto_start_unlock(&rq->rd->rto_loop_start);
+
+ if (cpu >= 0)
+ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
}
/* Called from hardirq context */
-static void try_to_push_tasks(void *arg)
+void rto_push_irq_work_func(struct irq_work *work)
{
- struct rt_rq *rt_rq = arg;
- struct rq *rq, *src_rq;
- int this_cpu;
+ struct rq *rq;
int cpu;
- this_cpu = rt_rq->push_cpu;
+ rq = this_rq();
- /* Paranoid check */
- BUG_ON(this_cpu != smp_processor_id());
-
- rq = cpu_rq(this_cpu);
- src_rq = rq_of_rt_rq(rt_rq);
-
-again:
+ /*
+ * We do not need to grab the lock to check for has_pushable_tasks.
+ * When it gets updated, a check is made if a push is possible.
+ */
if (has_pushable_tasks(rq)) {
raw_spin_lock(&rq->lock);
- push_rt_task(rq);
+ push_rt_tasks(rq);
raw_spin_unlock(&rq->lock);
}
- /* Pass the IPI to the next rt overloaded queue */
- raw_spin_lock(&rt_rq->push_lock);
- /*
- * If the source queue changed since the IPI went out,
- * we need to restart the search from that CPU again.
- */
- if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
- rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
- rt_rq->push_cpu = src_rq->cpu;
- }
+ raw_spin_lock(&rq->rd->rto_lock);
- cpu = find_next_push_cpu(src_rq);
+ /* Pass the IPI to the next rt overloaded queue */
+ cpu = rto_next_cpu(rq);
- if (cpu >= nr_cpu_ids)
- rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
- raw_spin_unlock(&rt_rq->push_lock);
+ raw_spin_unlock(&rq->rd->rto_lock);
- if (cpu >= nr_cpu_ids)
+ if (cpu < 0)
return;
- /*
- * It is possible that a restart caused this CPU to be
- * chosen again. Don't bother with an IPI, just see if we
- * have more to push.
- */
- if (unlikely(cpu == rq->cpu))
- goto again;
-
/* Try the next RT overloaded CPU */
- irq_work_queue_on(&rt_rq->push_work, cpu);
-}
-
-static void push_irq_work_func(struct irq_work *work)
-{
- struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
-
- try_to_push_tasks(rt_rq);
+ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
}
#endif /* HAVE_RT_PUSH_IPI */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -502,7 +502,7 @@ static inline int rt_bandwidth_enabled(v
}
/* RT IPI pull logic requires IRQ_WORK */
-#ifdef CONFIG_IRQ_WORK
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
# define HAVE_RT_PUSH_IPI
#endif
@@ -524,12 +524,6 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
-#ifdef HAVE_RT_PUSH_IPI
- int push_flags;
- int push_cpu;
- struct irq_work push_work;
- raw_spinlock_t push_lock;
-#endif
#endif /* CONFIG_SMP */
int rt_queued;
@@ -638,6 +632,19 @@ struct root_domain {
struct dl_bw dl_bw;
struct cpudl cpudl;
+#ifdef HAVE_RT_PUSH_IPI
+ /*
+ * For IPI pull requests, loop across the rto_mask.
+ */
+ struct irq_work rto_push_work;
+ raw_spinlock_t rto_lock;
+ /* These are only updated and read within rto_lock */
+ int rto_loop;
+ int rto_cpu;
+ /* These atomics are updated outside of a lock */
+ atomic_t rto_loop_next;
+ atomic_t rto_loop_start;
+#endif
/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
@@ -655,6 +662,9 @@ extern void init_defrootdomain(void);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+#ifdef HAVE_RT_PUSH_IPI
+extern void rto_push_irq_work_func(struct irq_work *work);
+#endif
#endif /* CONFIG_SMP */
/*
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -269,6 +269,12 @@ static int init_rootdomain(struct root_d
if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
goto free_dlo_mask;
+#ifdef HAVE_RT_PUSH_IPI
+ rd->rto_cpu = -1;
+ raw_spin_lock_init(&rd->rto_lock);
+ init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+#endif
+
init_dl_bw(&rd->dl_bw);
if (cpudl_init(&rd->cpudl) != 0)
goto free_rto_mask;
Patches currently in stable-queue which might be from rostedt(a)goodmis.org are
queue-4.14/sched-make-resched_cpu-unconditional.patch
queue-4.14/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 4bdced5c9a2922521e325896a7bbbf0132c94e56 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Red Hat)" <rostedt(a)goodmis.org>
Date: Fri, 6 Oct 2017 14:05:04 -0400
Subject: [PATCH] sched/rt: Simplify the IPI based RT balancing logic
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Cc: Clark Williams <williams(a)redhat.com>
Cc: Daniel Bristot de Oliveira <bristot(a)redhat.com>
Cc: John Kacur <jkacur(a)redhat.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Mike Galbraith <efault(a)gmx.de>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Scott Wood <swood(a)redhat.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0af5ca9e3e3f..fda27991699a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -73,10 +73,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
-static void push_irq_work_func(struct irq_work *work);
-#endif
-
void init_rt_rq(struct rt_rq *rt_rq)
{
struct rt_prio_array *array;
@@ -96,13 +92,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
plist_head_init(&rt_rq->pushable_tasks);
-
-#ifdef HAVE_RT_PUSH_IPI
- rt_rq->push_flags = 0;
- rt_rq->push_cpu = nr_cpu_ids;
- raw_spin_lock_init(&rt_rq->push_lock);
- init_irq_work(&rt_rq->push_work, push_irq_work_func);
-#endif
#endif /* CONFIG_SMP */
/* We start is dequeued state, because no RT tasks are queued */
rt_rq->rt_queued = 0;
@@ -1875,241 +1864,166 @@ static void push_rt_tasks(struct rq *rq)
}
#ifdef HAVE_RT_PUSH_IPI
+
/*
- * The search for the next cpu always starts at rq->cpu and ends
- * when we reach rq->cpu again. It will never return rq->cpu.
- * This returns the next cpu to check, or nr_cpu_ids if the loop
- * is complete.
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there's any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
+ * All CPUs with overloaded RT tasks need to be notified as there is currently
+ * no way to know which of these CPUs have the highest priority task waiting
+ * to run. Instead of trying to take a spinlock on each of these CPUs,
+ * which has shown to cause large latency when done on machines with many
+ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
+ * RT tasks waiting to run.
+ *
+ * Just sending an IPI to each of the CPUs is also an issue, as on large
+ * count CPU machines, this can cause an IPI storm on a CPU, especially
+ * if its the only CPU with multiple RT tasks queued, and a large number
+ * of CPUs scheduling a lower priority task at the same time.
+ *
+ * Each root domain has its own irq work function that can iterate over
+ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
+ * tassk must be checked if there's one or many CPUs that are lowering
+ * their priority, there's a single irq work iterator that will try to
+ * push off RT tasks that are waiting to run.
+ *
+ * When a CPU schedules a lower priority task, it will kick off the
+ * irq work iterator that will jump to each CPU with overloaded RT tasks.
+ * As it only takes the first CPU that schedules a lower priority task
+ * to start the process, the rto_start variable is incremented and if
+ * the atomic result is one, then that CPU will try to take the rto_lock.
+ * This prevents high contention on the lock as the process handles all
+ * CPUs scheduling lower priority tasks.
+ *
+ * All CPUs that are scheduling a lower priority task will increment the
+ * rt_loop_next variable. This will make sure that the irq work iterator
+ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
+ * priority task, even if the iterator is in the middle of a scan. Incrementing
+ * the rt_loop_next will cause the iterator to perform another scan.
*
- * rq->rt.push_cpu holds the last cpu returned by this function,
- * or if this is the first instance, it must hold rq->cpu.
*/
static int rto_next_cpu(struct rq *rq)
{
- int prev_cpu = rq->rt.push_cpu;
+ struct root_domain *rd = rq->rd;
+ int next;
int cpu;
- cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
-
/*
- * If the previous cpu is less than the rq's CPU, then it already
- * passed the end of the mask, and has started from the beginning.
- * We end if the next CPU is greater or equal to rq's CPU.
+ * When starting the IPI RT pushing, the rto_cpu is set to -1,
+ * rt_next_cpu() will simply return the first CPU found in
+ * the rto_mask.
+ *
+ * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
+ * will return the next CPU found in the rto_mask.
+ *
+ * If there are no more CPUs left in the rto_mask, then a check is made
+ * against rto_loop and rto_loop_next. rto_loop is only updated with
+ * the rto_lock held, but any CPU may increment the rto_loop_next
+ * without any locking.
*/
- if (prev_cpu < rq->cpu) {
- if (cpu >= rq->cpu)
- return nr_cpu_ids;
+ for (;;) {
- } else if (cpu >= nr_cpu_ids) {
- /*
- * We passed the end of the mask, start at the beginning.
- * If the result is greater or equal to the rq's CPU, then
- * the loop is finished.
- */
- cpu = cpumask_first(rq->rd->rto_mask);
- if (cpu >= rq->cpu)
- return nr_cpu_ids;
- }
- rq->rt.push_cpu = cpu;
+ /* When rto_cpu is -1 this acts like cpumask_first() */
+ cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
- /* Return cpu to let the caller know if the loop is finished or not */
- return cpu;
-}
+ rd->rto_cpu = cpu;
-static int find_next_push_cpu(struct rq *rq)
-{
- struct rq *next_rq;
- int cpu;
+ if (cpu < nr_cpu_ids)
+ return cpu;
- while (1) {
- cpu = rto_next_cpu(rq);
- if (cpu >= nr_cpu_ids)
- break;
- next_rq = cpu_rq(cpu);
+ rd->rto_cpu = -1;
+
+ /*
+ * ACQUIRE ensures we see the @rto_mask changes
+ * made prior to the @next value observed.
+ *
+ * Matches WMB in rt_set_overload().
+ */
+ next = atomic_read_acquire(&rd->rto_loop_next);
- /* Make sure the next rq can push to this rq */
- if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
+ if (rd->rto_loop == next)
break;
+
+ rd->rto_loop = next;
}
- return cpu;
+ return -1;
}
-#define RT_PUSH_IPI_EXECUTING 1
-#define RT_PUSH_IPI_RESTART 2
+static inline bool rto_start_trylock(atomic_t *v)
+{
+ return !atomic_cmpxchg_acquire(v, 0, 1);
+}
-/*
- * When a high priority task schedules out from a CPU and a lower priority
- * task is scheduled in, a check is made to see if there's any RT tasks
- * on other CPUs that are waiting to run because a higher priority RT task
- * is currently running on its CPU. In this case, the CPU with multiple RT
- * tasks queued on it (overloaded) needs to be notified that a CPU has opened
- * up that may be able to run one of its non-running queued RT tasks.
- *
- * On large CPU boxes, there's the case that several CPUs could schedule
- * a lower priority task at the same time, in which case it will look for
- * any overloaded CPUs that it could pull a task from. To do this, the runqueue
- * lock must be taken from that overloaded CPU. Having 10s of CPUs all fighting
- * for a single overloaded CPU's runqueue lock can produce a large latency.
- * (This has actually been observed on large boxes running cyclictest).
- * Instead of taking the runqueue lock of the overloaded CPU, each of the
- * CPUs that scheduled a lower priority task simply sends an IPI to the
- * overloaded CPU. An IPI is much cheaper than taking an runqueue lock with
- * lots of contention. The overloaded CPU will look to push its non-running
- * RT task off, and if it does, it can then ignore the other IPIs coming
- * in, and just pass those IPIs off to any other overloaded CPU.
- *
- * When a CPU schedules a lower priority task, it only sends an IPI to
- * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
- * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
- * RT overloaded tasks, would cause 100 IPIs to go out at once.
- *
- * The overloaded RT CPU, when receiving an IPI, will try to push off its
- * overloaded RT tasks and then send an IPI to the next CPU that has
- * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
- * have completed. Just because a CPU may have pushed off its own overloaded
- * RT task does not mean it should stop sending the IPI around to other
- * overloaded CPUs. There may be another RT task waiting to run on one of
- * those CPUs that are of higher priority than the one that was just
- * pushed.
- *
- * An optimization that could possibly be made is to make a CPU array similar
- * to the cpupri array mask of all running RT tasks, but for the overloaded
- * case, then the IPI could be sent to only the CPU with the highest priority
- * RT task waiting, and that CPU could send off further IPIs to the CPU with
- * the next highest waiting task. Since the overloaded case is much less likely
- * to happen, the complexity of this implementation may not be worth it.
- * Instead, just send an IPI around to all overloaded CPUs.
- *
- * The rq->rt.push_flags holds the status of the IPI that is going around.
- * A run queue can only send out a single IPI at a time. The possible flags
- * for rq->rt.push_flags are:
- *
- * (None or zero): No IPI is going around for the current rq
- * RT_PUSH_IPI_EXECUTING: An IPI for the rq is being passed around
- * RT_PUSH_IPI_RESTART: The priority of the running task for the rq
- * has changed, and the IPI should restart
- * circulating the overloaded CPUs again.
- *
- * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
- * before sending to the next CPU.
- *
- * Instead of having all CPUs that schedule a lower priority task send
- * an IPI to the same "first" CPU in the RT overload mask, they send it
- * to the next overloaded CPU after their own CPU. This helps distribute
- * the work when there's more than one overloaded CPU and multiple CPUs
- * scheduling in lower priority tasks.
- *
- * When a rq schedules a lower priority task than what was currently
- * running, the next CPU with overloaded RT tasks is examined first.
- * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
- * priority task, it will send an IPI first to CPU 5, then CPU 5 will
- * send to CPU 1 if it is still overloaded. CPU 1 will clear the
- * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
- *
- * The first CPU to notice IPI_RESTART is set, will clear that flag and then
- * send an IPI to the next overloaded CPU after the rq->cpu and not the next
- * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
- * schedules a lower priority task, and the IPI_RESTART gets set while the
- * handling is being done on CPU 5, it will clear the flag and send it back to
- * CPU 4 instead of CPU 1.
- *
- * Note, the above logic can be disabled by turning off the sched_feature
- * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
- * taken by the CPU requesting a pull and the waiting RT task will be pulled
- * by that CPU. This may be fine for machines with few CPUs.
- */
-static void tell_cpu_to_push(struct rq *rq)
+static inline void rto_start_unlock(atomic_t *v)
{
- int cpu;
+ atomic_set_release(v, 0);
+}
- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
- raw_spin_lock(&rq->rt.push_lock);
- /* Make sure it's still executing */
- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
- /*
- * Tell the IPI to restart the loop as things have
- * changed since it started.
- */
- rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
- raw_spin_unlock(&rq->rt.push_lock);
- return;
- }
- raw_spin_unlock(&rq->rt.push_lock);
- }
+static void tell_cpu_to_push(struct rq *rq)
+{
+ int cpu = -1;
- /* When here, there's no IPI going around */
+ /* Keep the loop going if the IPI is currently active */
+ atomic_inc(&rq->rd->rto_loop_next);
- rq->rt.push_cpu = rq->cpu;
- cpu = find_next_push_cpu(rq);
- if (cpu >= nr_cpu_ids)
+ /* Only one CPU can initiate a loop at a time */
+ if (!rto_start_trylock(&rq->rd->rto_loop_start))
return;
- rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+ raw_spin_lock(&rq->rd->rto_lock);
+
+ /*
+ * The rto_cpu is updated under the lock, if it has a valid cpu
+ * then the IPI is still running and will continue due to the
+ * update to loop_next, and nothing needs to be done here.
+ * Otherwise it is finishing up and an ipi needs to be sent.
+ */
+ if (rq->rd->rto_cpu < 0)
+ cpu = rto_next_cpu(rq);
- irq_work_queue_on(&rq->rt.push_work, cpu);
+ raw_spin_unlock(&rq->rd->rto_lock);
+
+ rto_start_unlock(&rq->rd->rto_loop_start);
+
+ if (cpu >= 0)
+ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
}
/* Called from hardirq context */
-static void try_to_push_tasks(void *arg)
+void rto_push_irq_work_func(struct irq_work *work)
{
- struct rt_rq *rt_rq = arg;
- struct rq *rq, *src_rq;
- int this_cpu;
+ struct rq *rq;
int cpu;
- this_cpu = rt_rq->push_cpu;
+ rq = this_rq();
- /* Paranoid check */
- BUG_ON(this_cpu != smp_processor_id());
-
- rq = cpu_rq(this_cpu);
- src_rq = rq_of_rt_rq(rt_rq);
-
-again:
+ /*
+ * We do not need to grab the lock to check for has_pushable_tasks.
+ * When it gets updated, a check is made if a push is possible.
+ */
if (has_pushable_tasks(rq)) {
raw_spin_lock(&rq->lock);
- push_rt_task(rq);
+ push_rt_tasks(rq);
raw_spin_unlock(&rq->lock);
}
- /* Pass the IPI to the next rt overloaded queue */
- raw_spin_lock(&rt_rq->push_lock);
- /*
- * If the source queue changed since the IPI went out,
- * we need to restart the search from that CPU again.
- */
- if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
- rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
- rt_rq->push_cpu = src_rq->cpu;
- }
+ raw_spin_lock(&rq->rd->rto_lock);
- cpu = find_next_push_cpu(src_rq);
+ /* Pass the IPI to the next rt overloaded queue */
+ cpu = rto_next_cpu(rq);
- if (cpu >= nr_cpu_ids)
- rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
- raw_spin_unlock(&rt_rq->push_lock);
+ raw_spin_unlock(&rq->rd->rto_lock);
- if (cpu >= nr_cpu_ids)
+ if (cpu < 0)
return;
- /*
- * It is possible that a restart caused this CPU to be
- * chosen again. Don't bother with an IPI, just see if we
- * have more to push.
- */
- if (unlikely(cpu == rq->cpu))
- goto again;
-
/* Try the next RT overloaded CPU */
- irq_work_queue_on(&rt_rq->push_work, cpu);
-}
-
-static void push_irq_work_func(struct irq_work *work)
-{
- struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
-
- try_to_push_tasks(rt_rq);
+ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
}
#endif /* HAVE_RT_PUSH_IPI */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a81c9782e98c..8aa24b41f652 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -505,7 +505,7 @@ static inline int rt_bandwidth_enabled(void)
}
/* RT IPI pull logic requires IRQ_WORK */
-#ifdef CONFIG_IRQ_WORK
+#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
# define HAVE_RT_PUSH_IPI
#endif
@@ -527,12 +527,6 @@ struct rt_rq {
unsigned long rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
-#ifdef HAVE_RT_PUSH_IPI
- int push_flags;
- int push_cpu;
- struct irq_work push_work;
- raw_spinlock_t push_lock;
-#endif
#endif /* CONFIG_SMP */
int rt_queued;
@@ -641,6 +635,19 @@ struct root_domain {
struct dl_bw dl_bw;
struct cpudl cpudl;
+#ifdef HAVE_RT_PUSH_IPI
+ /*
+ * For IPI pull requests, loop across the rto_mask.
+ */
+ struct irq_work rto_push_work;
+ raw_spinlock_t rto_lock;
+ /* These are only updated and read within rto_lock */
+ int rto_loop;
+ int rto_cpu;
+ /* These atomics are updated outside of a lock */
+ atomic_t rto_loop_next;
+ atomic_t rto_loop_start;
+#endif
/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
@@ -658,6 +665,9 @@ extern void init_defrootdomain(void);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+#ifdef HAVE_RT_PUSH_IPI
+extern void rto_push_irq_work_func(struct irq_work *work);
+#endif
#endif /* CONFIG_SMP */
/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f51d123f9fe1..e50450c2fed8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -268,6 +268,12 @@ static int init_rootdomain(struct root_domain *rd)
if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
goto free_dlo_mask;
+#ifdef HAVE_RT_PUSH_IPI
+ rd->rto_cpu = -1;
+ raw_spin_lock_init(&rd->rto_lock);
+ init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
+#endif
+
init_dl_bw(&rd->dl_bw);
if (cpudl_init(&rd->cpudl) != 0)
goto free_rto_mask;
The patch below was submitted to be applied to the 4.14-stable tree.
I fail to see how this patch meets the stable kernel rules as found at
Documentation/process/stable-kernel-rules.rst.
I could be totally wrong, and if so, please respond to
<stable(a)vger.kernel.org> and let me know why this patch should be
applied. Otherwise, it is now dropped from my patch queues, never to be
seen again.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8a7a8e1eab929eb3a5b735a788a23b9731139046 Mon Sep 17 00:00:00 2001
From: Dou Liyang <douly.fnst(a)cn.fujitsu.com>
Date: Mon, 13 Nov 2017 13:49:04 +0800
Subject: [PATCH] timekeeping: Eliminate the stale declaration of
ktime_get_raw_and_real_ts64()
Commit ba26621e63ce got rid of ktime_get_raw_and_real_ts64(), but left its
declaration behind.
Remove it.
Fixes: ba26621e63ce ("time: Remove duplicated code in ktime_get_raw_and_real()")
Signed-off-by: Dou Liyang <douly.fnst(a)cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Christopher S. Hall <christopher.s.hall(a)intel.com>
Cc: joelaf(a)google.com
Cc: arnd(a)arndb.de
Cc: gregkh(a)linuxfoundation.org
Cc: john.stultz(a)linaro.org
Cc: deepa.kernel(a)gmail.com
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/1510552144-20831-1-git-send-email-douly.fnst@cn.f…
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 0021575fe871..51293e1aa4da 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -272,12 +272,6 @@ extern bool timekeeping_rtc_skipresume(void);
extern void timekeeping_inject_sleeptime64(struct timespec64 *delta);
-/*
- * PPS accessor
- */
-extern void ktime_get_raw_and_real_ts64(struct timespec64 *ts_raw,
- struct timespec64 *ts_real);
-
/*
* struct system_time_snapshot - simultaneous raw/real time capture with
* counter value
This is a note to let you know that I've just added the patch titled
dm bufio: fix integer overflow when limiting maximum cache size
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 74d4108d9e681dbbe4a2940ed8fdff1f6868184c Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Wed, 15 Nov 2017 16:38:09 -0800
Subject: dm bufio: fix integer overflow when limiting maximum cache size
From: Eric Biggers <ebiggers(a)google.com>
commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-bufio.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -937,7 +937,8 @@ static void __get_memory_limit(struct dm
buffers = c->minimum_buffers;
*limit_buffers = buffers;
- *threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
+ *threshold_buffers = mult_frac(buffers,
+ DM_BUFIO_WRITEBACK_PERCENT, 100);
}
/*
@@ -1856,19 +1857,15 @@ static int __init dm_bufio_init(void)
memset(&dm_bufio_caches, 0, sizeof dm_bufio_caches);
memset(&dm_bufio_cache_names, 0, sizeof dm_bufio_cache_names);
- mem = (__u64)((totalram_pages - totalhigh_pages) *
- DM_BUFIO_MEMORY_PERCENT / 100) << PAGE_SHIFT;
+ mem = (__u64)mult_frac(totalram_pages - totalhigh_pages,
+ DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT;
if (mem > ULONG_MAX)
mem = ULONG_MAX;
#ifdef CONFIG_MMU
- /*
- * Get the size of vmalloc space the same way as VMALLOC_TOTAL
- * in fs/proc/internal.h
- */
- if (mem > (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100)
- mem = (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100;
+ if (mem > mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100))
+ mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
#endif
dm_bufio_default_cache_size = mem;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-4.9/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
queue-4.9/dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
This is a note to let you know that I've just added the patch titled
dm: allocate struct mapped_device with kvzalloc
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-allocate-struct-mapped_device-with-kvzalloc.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 856eb0916d181da6d043cc33e03f54d5c5bbe54a Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Tue, 31 Oct 2017 19:33:02 -0400
Subject: dm: allocate struct mapped_device with kvzalloc
From: Mikulas Patocka <mpatocka(a)redhat.com>
commit 856eb0916d181da6d043cc33e03f54d5c5bbe54a upstream.
The structure srcu_struct can be very big, its size is proportional to the
value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
io_barrier in the struct mapped_device has 84kB in the debugging kernel
and 50kB in the non-debugging kernel. The large size may result in failure
of the function kzalloc_node.
In order to avoid the allocation failure, we use the function
kvzalloc_node, this function falls back to vmalloc if a large contiguous
chunk of memory is not available. This patch also moves the field
io_barrier to the last position of struct mapped_device - the reason is
that on many processor architectures, short memory offsets result in
smaller code than long memory offsets - on x86-64 it reduces code size by
320 bytes.
Note to stable kernel maintainers - the kernels 4.11 and older don't have
the function kvzalloc_node, you can use the function vzalloc_node instead.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-core.h | 3 ++-
drivers/md/dm.c | 7 ++++---
2 files changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -29,7 +29,6 @@ struct dm_kobject_holder {
* DM targets must _not_ deference a mapped_device to directly access its members!
*/
struct mapped_device {
- struct srcu_struct io_barrier;
struct mutex suspend_lock;
/*
@@ -127,6 +126,8 @@ struct mapped_device {
struct blk_mq_tag_set *tag_set;
bool use_blk_mq:1;
bool init_tio_pdu:1;
+
+ struct srcu_struct io_barrier;
};
void dm_init_md_queue(struct mapped_device *md);
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -21,6 +21,7 @@
#include <linux/delay.h>
#include <linux/wait.h>
#include <linux/pr.h>
+#include <linux/vmalloc.h>
#define DM_MSG_PREFIX "core"
@@ -1511,7 +1512,7 @@ static struct mapped_device *alloc_dev(i
struct mapped_device *md;
void *old_md;
- md = kzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id);
+ md = vzalloc_node(sizeof(*md), numa_node_id);
if (!md) {
DMWARN("unable to allocate device, out of memory.");
return NULL;
@@ -1605,7 +1606,7 @@ bad_io_barrier:
bad_minor:
module_put(THIS_MODULE);
bad_module_get:
- kfree(md);
+ kvfree(md);
return NULL;
}
@@ -1624,7 +1625,7 @@ static void free_dev(struct mapped_devic
free_minor(minor);
module_put(THIS_MODULE);
- kfree(md);
+ kvfree(md);
}
static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
Patches currently in stable-queue which might be from mpatocka(a)redhat.com are
queue-4.9/dm-allocate-struct-mapped_device-with-kvzalloc.patch
This is a note to let you know that I've just added the patch titled
dm bufio: fix integer overflow when limiting maximum cache size
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 74d4108d9e681dbbe4a2940ed8fdff1f6868184c Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Wed, 15 Nov 2017 16:38:09 -0800
Subject: dm bufio: fix integer overflow when limiting maximum cache size
From: Eric Biggers <ebiggers(a)google.com>
commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-bufio.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -928,7 +928,8 @@ static void __get_memory_limit(struct dm
buffers = c->minimum_buffers;
*limit_buffers = buffers;
- *threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
+ *threshold_buffers = mult_frac(buffers,
+ DM_BUFIO_WRITEBACK_PERCENT, 100);
}
/*
@@ -1829,19 +1830,15 @@ static int __init dm_bufio_init(void)
memset(&dm_bufio_caches, 0, sizeof dm_bufio_caches);
memset(&dm_bufio_cache_names, 0, sizeof dm_bufio_cache_names);
- mem = (__u64)((totalram_pages - totalhigh_pages) *
- DM_BUFIO_MEMORY_PERCENT / 100) << PAGE_SHIFT;
+ mem = (__u64)mult_frac(totalram_pages - totalhigh_pages,
+ DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT;
if (mem > ULONG_MAX)
mem = ULONG_MAX;
#ifdef CONFIG_MMU
- /*
- * Get the size of vmalloc space the same way as VMALLOC_TOTAL
- * in fs/proc/internal.h
- */
- if (mem > (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100)
- mem = (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100;
+ if (mem > mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100))
+ mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
#endif
dm_bufio_default_cache_size = mem;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-4.4/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
queue-4.4/dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
This is a note to let you know that I've just added the patch titled
ovl: Put upperdentry if ovl_check_origin() fails
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ovl-put-upperdentry-if-ovl_check_origin-fails.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 5455f92b54e516995a9ca45bbf790d3629c27a93 Mon Sep 17 00:00:00 2001
From: Vivek Goyal <vgoyal(a)redhat.com>
Date: Wed, 1 Nov 2017 15:37:22 -0400
Subject: ovl: Put upperdentry if ovl_check_origin() fails
From: Vivek Goyal <vgoyal(a)redhat.com>
commit 5455f92b54e516995a9ca45bbf790d3629c27a93 upstream.
If ovl_check_origin() fails, we should put upperdentry. We have a reference
on it by now. So goto out_put_upper instead of out.
Fixes: a9d019573e88 ("ovl: lookup non-dir copy-up-origin by file handle")
Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/overlayfs/namei.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -630,7 +630,7 @@ struct dentry *ovl_lookup(struct inode *
err = ovl_check_origin(upperdentry, roe->lowerstack,
roe->numlower, &stack, &ctr);
if (err)
- goto out;
+ goto out_put_upper;
}
if (d.redirect) {
Patches currently in stable-queue which might be from vgoyal(a)redhat.com are
queue-4.14/ovl-put-upperdentry-if-ovl_check_origin-fails.patch
This is a note to let you know that I've just added the patch titled
dm zoned: ignore last smaller runt zone
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-zoned-ignore-last-smaller-runt-zone.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 114e025968b5990ad0b57bf60697ea64ee206aac Mon Sep 17 00:00:00 2001
From: Damien Le Moal <damien.lemoal(a)wdc.com>
Date: Sat, 28 Oct 2017 16:39:34 +0900
Subject: dm zoned: ignore last smaller runt zone
From: Damien Le Moal <damien.lemoal(a)wdc.com>
commit 114e025968b5990ad0b57bf60697ea64ee206aac upstream.
The SCSI layer allows ZBC drives to have a smaller last runt zone. For
such a device, specifying the entire capacity for a dm-zoned target
table entry fails because the specified capacity is not aligned on a
device zone size indicated in the request queue structure of the
device.
Fix this problem by ignoring the last runt zone in the entry length
when seting up the dm-zoned target (ctr method) and when iterating table
entries of the target (iterate_devices method). This allows dm-zoned
users to still easily setup a target using the entire device capacity
(as mandated by dm-zoned) or the aligned capacity excluding the last
runt zone.
While at it, replace direct references to the device queue chunk_sectors
limit with calls to the accessor blk_queue_zone_sectors().
Reported-by: Peter Desnoyers <pjd(a)ccs.neu.edu>
Signed-off-by: Damien Le Moal <damien.lemoal(a)wdc.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-zoned-target.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -660,6 +660,7 @@ static int dmz_get_zoned_device(struct d
struct dmz_target *dmz = ti->private;
struct request_queue *q;
struct dmz_dev *dev;
+ sector_t aligned_capacity;
int ret;
/* Get the target device */
@@ -685,15 +686,17 @@ static int dmz_get_zoned_device(struct d
goto err;
}
+ q = bdev_get_queue(dev->bdev);
dev->capacity = i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
- if (ti->begin || (ti->len != dev->capacity)) {
+ aligned_capacity = dev->capacity & ~(blk_queue_zone_sectors(q) - 1);
+ if (ti->begin ||
+ ((ti->len != dev->capacity) && (ti->len != aligned_capacity))) {
ti->error = "Partial mapping not supported";
ret = -EINVAL;
goto err;
}
- q = bdev_get_queue(dev->bdev);
- dev->zone_nr_sectors = q->limits.chunk_sectors;
+ dev->zone_nr_sectors = blk_queue_zone_sectors(q);
dev->zone_nr_sectors_shift = ilog2(dev->zone_nr_sectors);
dev->zone_nr_blocks = dmz_sect2blk(dev->zone_nr_sectors);
@@ -929,8 +932,10 @@ static int dmz_iterate_devices(struct dm
iterate_devices_callout_fn fn, void *data)
{
struct dmz_target *dmz = ti->private;
+ struct dmz_dev *dev = dmz->dev;
+ sector_t capacity = dev->capacity & ~(dev->zone_nr_sectors - 1);
- return fn(ti, dmz->ddev, 0, dmz->dev->capacity, data);
+ return fn(ti, dmz->ddev, 0, capacity, data);
}
static struct target_type dmz_type = {
Patches currently in stable-queue which might be from damien.lemoal(a)wdc.com are
queue-4.14/dm-zoned-ignore-last-smaller-runt-zone.patch
This is a note to let you know that I've just added the patch titled
dm mpath: remove annoying message of 'blk_get_request() returned -11'
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-mpath-remove-annoying-message-of-blk_get_request-returned-11.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 9dc112e2daf87b40607fd8d357e2d7de32290d45 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei(a)redhat.com>
Date: Sat, 30 Sep 2017 19:46:48 +0800
Subject: dm mpath: remove annoying message of 'blk_get_request() returned -11'
From: Ming Lei <ming.lei(a)redhat.com>
commit 9dc112e2daf87b40607fd8d357e2d7de32290d45 upstream.
It is very normal to see allocation failure, especially with blk-mq
request_queues, so it's unnecessary to report this error and annoy
people.
In practice this 'blk_get_request() returned -11' error gets logged
quite frequently when a blk-mq DM multipath device sees heavy IO.
This change is marked for stable@ because the annoying message in
question was included in stable@ commit 7083abbbf.
Fixes: 7083abbbf ("dm mpath: avoid that path removal can trigger an infinite loop")
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-mpath.c | 2 --
1 file changed, 2 deletions(-)
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -499,8 +499,6 @@ static int multipath_clone_and_map(struc
if (IS_ERR(clone)) {
/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
bool queue_dying = blk_queue_dying(q);
- DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
- PTR_ERR(clone), queue_dying ? " (path offline)" : "");
if (queue_dying) {
atomic_inc(&m->pg_init_in_progress);
activate_or_offline_path(pgpath);
Patches currently in stable-queue which might be from ming.lei(a)redhat.com are
queue-4.14/dm-mpath-remove-annoying-message-of-blk_get_request-returned-11.patch
This is a note to let you know that I've just added the patch titled
dm integrity: allow unaligned bv_offset
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-integrity-allow-unaligned-bv_offset.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 95b1369a9638cfa322ad1c0cde8efbe524059884 Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Tue, 7 Nov 2017 10:40:40 -0500
Subject: dm integrity: allow unaligned bv_offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Mikulas Patocka <mpatocka(a)redhat.com>
commit 95b1369a9638cfa322ad1c0cde8efbe524059884 upstream.
When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
this unaligned memory for its buffers (if an unaligned buffer crosses a
page, XFS frees it and allocates a full page instead - see the function
xfs_buf_allocate_memory).
dm-integrity checks if bv_offset is aligned on page size and this check
fail with slub_debug and XFS.
Fix this bug by removing the bv_offset check, leaving only the check for
bv_len.
Fixes: 7eada909bfd7 ("dm: add integrity target")
Reported-by: Bruno Prémont <bonbons(a)sysophe.eu>
Reviewed-by: Milan Broz <gmazyland(a)gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-integrity.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1376,7 +1376,7 @@ static int dm_integrity_map(struct dm_ta
struct bvec_iter iter;
struct bio_vec bv;
bio_for_each_segment(bv, bio, iter) {
- if (unlikely((bv.bv_offset | bv.bv_len) & ((ic->sectors_per_block << SECTOR_SHIFT) - 1))) {
+ if (unlikely(bv.bv_len & ((ic->sectors_per_block << SECTOR_SHIFT) - 1))) {
DMERR("Bio vector (%u,%u) is not aligned on %u-sector boundary",
bv.bv_offset, bv.bv_len, ic->sectors_per_block);
return DM_MAPIO_KILL;
Patches currently in stable-queue which might be from mpatocka(a)redhat.com are
queue-4.14/dm-allocate-struct-mapped_device-with-kvzalloc.patch
queue-4.14/dm-integrity-allow-unaligned-bv_offset.patch
queue-4.14/dm-crypt-allow-unaligned-bv_offset.patch
This is a note to let you know that I've just added the patch titled
dm crypt: allow unaligned bv_offset
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-crypt-allow-unaligned-bv_offset.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 0440d5c0ca9744b92a07aeb6df0a9a75db6f4280 Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Tue, 7 Nov 2017 10:35:57 -0500
Subject: dm crypt: allow unaligned bv_offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: Mikulas Patocka <mpatocka(a)redhat.com>
commit 0440d5c0ca9744b92a07aeb6df0a9a75db6f4280 upstream.
When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
this unaligned memory for its buffers (if an unaligned buffer crosses a
page, XFS frees it and allocates a full page instead - see the function
xfs_buf_allocate_memory).
dm-crypt checks if bv_offset is aligned on page size and these checks
fail with slub_debug and XFS.
Fix this bug by removing the bv_offset checks. Switch to checking if
bv_len is aligned instead of bv_offset (this check should be sufficient
to prevent overruns if a bio with too small bv_len is received).
Fixes: 8f0009a22517 ("dm crypt: optionally support larger encryption sector size")
Reported-by: Bruno Prémont <bonbons(a)sysophe.eu>
Tested-by: Bruno Prémont <bonbons(a)sysophe.eu>
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Reviewed-by: Milan Broz <gmazyland(a)gmail.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-crypt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1075,7 +1075,7 @@ static int crypt_convert_block_aead(stru
BUG_ON(cc->integrity_iv_size && cc->integrity_iv_size != cc->iv_size);
/* Reject unexpected unaligned bio. */
- if (unlikely(bv_in.bv_offset & (cc->sector_size - 1)))
+ if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
return -EIO;
dmreq = dmreq_of_req(cc, req);
@@ -1168,7 +1168,7 @@ static int crypt_convert_block_skcipher(
int r = 0;
/* Reject unexpected unaligned bio. */
- if (unlikely(bv_in.bv_offset & (cc->sector_size - 1)))
+ if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
return -EIO;
dmreq = dmreq_of_req(cc, req);
Patches currently in stable-queue which might be from mpatocka(a)redhat.com are
queue-4.14/dm-allocate-struct-mapped_device-with-kvzalloc.patch
queue-4.14/dm-integrity-allow-unaligned-bv_offset.patch
queue-4.14/dm-crypt-allow-unaligned-bv_offset.patch
This is a note to let you know that I've just added the patch titled
dm cache: fix race condition in the writeback mode overwrite_bio optimisation
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-cache-fix-race-condition-in-the-writeback-mode-overwrite_bio-optimisation.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d1260e2a3f85f4c1010510a15f89597001318b1b Mon Sep 17 00:00:00 2001
From: Joe Thornber <ejt(a)redhat.com>
Date: Fri, 10 Nov 2017 07:53:31 -0500
Subject: dm cache: fix race condition in the writeback mode overwrite_bio optimisation
From: Joe Thornber <ejt(a)redhat.com>
commit d1260e2a3f85f4c1010510a15f89597001318b1b upstream.
When a DM cache in writeback mode moves data between the slow and fast
device it can often avoid a copy if the triggering bio either:
i) covers the whole block (no point copying if we're about to overwrite it)
ii) the migration is a promotion and the origin block is currently discarded
Prior to this fix there was a race with case (ii). The discard status
was checked with a shared lock held (rather than exclusive). This meant
another bio could run in parallel and write data to the origin, removing
the discard state. After the promotion the parallel write would have
been lost.
With this fix the discard status is re-checked once the exclusive lock
has been aquired. If the block is no longer discarded it falls back to
the slower full copy path.
Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
Signed-off-by: Joe Thornber <ejt(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-cache-target.c | 86 ++++++++++++++++++++++++++-----------------
1 file changed, 53 insertions(+), 33 deletions(-)
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -1201,6 +1201,18 @@ static void background_work_end(struct c
/*----------------------------------------------------------------*/
+static bool bio_writes_complete_block(struct cache *cache, struct bio *bio)
+{
+ return (bio_data_dir(bio) == WRITE) &&
+ (bio->bi_iter.bi_size == (cache->sectors_per_block << SECTOR_SHIFT));
+}
+
+static bool optimisable_bio(struct cache *cache, struct bio *bio, dm_oblock_t block)
+{
+ return writeback_mode(&cache->features) &&
+ (is_discarded_oblock(cache, block) || bio_writes_complete_block(cache, bio));
+}
+
static void quiesce(struct dm_cache_migration *mg,
void (*continuation)(struct work_struct *))
{
@@ -1474,13 +1486,51 @@ static void mg_upgrade_lock(struct work_
}
}
+static void mg_full_copy(struct work_struct *ws)
+{
+ struct dm_cache_migration *mg = ws_to_mg(ws);
+ struct cache *cache = mg->cache;
+ struct policy_work *op = mg->op;
+ bool is_policy_promote = (op->op == POLICY_PROMOTE);
+
+ if ((!is_policy_promote && !is_dirty(cache, op->cblock)) ||
+ is_discarded_oblock(cache, op->oblock)) {
+ mg_upgrade_lock(ws);
+ return;
+ }
+
+ init_continuation(&mg->k, mg_upgrade_lock);
+
+ if (copy(mg, is_policy_promote)) {
+ DMERR_LIMIT("%s: migration copy failed", cache_device_name(cache));
+ mg->k.input = BLK_STS_IOERR;
+ mg_complete(mg, false);
+ }
+}
+
static void mg_copy(struct work_struct *ws)
{
- int r;
struct dm_cache_migration *mg = ws_to_mg(ws);
if (mg->overwrite_bio) {
/*
+ * No exclusive lock was held when we last checked if the bio
+ * was optimisable. So we have to check again in case things
+ * have changed (eg, the block may no longer be discarded).
+ */
+ if (!optimisable_bio(mg->cache, mg->overwrite_bio, mg->op->oblock)) {
+ /*
+ * Fallback to a real full copy after doing some tidying up.
+ */
+ bool rb = bio_detain_shared(mg->cache, mg->op->oblock, mg->overwrite_bio);
+ BUG_ON(rb); /* An exclussive lock must _not_ be held for this block */
+ mg->overwrite_bio = NULL;
+ inc_io_migrations(mg->cache);
+ mg_full_copy(ws);
+ return;
+ }
+
+ /*
* It's safe to do this here, even though it's new data
* because all IO has been locked out of the block.
*
@@ -1489,26 +1539,8 @@ static void mg_copy(struct work_struct *
*/
overwrite(mg, mg_update_metadata_after_copy);
- } else {
- struct cache *cache = mg->cache;
- struct policy_work *op = mg->op;
- bool is_policy_promote = (op->op == POLICY_PROMOTE);
-
- if ((!is_policy_promote && !is_dirty(cache, op->cblock)) ||
- is_discarded_oblock(cache, op->oblock)) {
- mg_upgrade_lock(ws);
- return;
- }
-
- init_continuation(&mg->k, mg_upgrade_lock);
-
- r = copy(mg, is_policy_promote);
- if (r) {
- DMERR_LIMIT("%s: migration copy failed", cache_device_name(cache));
- mg->k.input = BLK_STS_IOERR;
- mg_complete(mg, false);
- }
- }
+ } else
+ mg_full_copy(ws);
}
static int mg_lock_writes(struct dm_cache_migration *mg)
@@ -1748,18 +1780,6 @@ static void inc_miss_counter(struct cach
/*----------------------------------------------------------------*/
-static bool bio_writes_complete_block(struct cache *cache, struct bio *bio)
-{
- return (bio_data_dir(bio) == WRITE) &&
- (bio->bi_iter.bi_size == (cache->sectors_per_block << SECTOR_SHIFT));
-}
-
-static bool optimisable_bio(struct cache *cache, struct bio *bio, dm_oblock_t block)
-{
- return writeback_mode(&cache->features) &&
- (is_discarded_oblock(cache, block) || bio_writes_complete_block(cache, bio));
-}
-
static int map_bio(struct cache *cache, struct bio *bio, dm_oblock_t block,
bool *commit_needed)
{
Patches currently in stable-queue which might be from ejt(a)redhat.com are
queue-4.14/dm-cache-fix-race-condition-in-the-writeback-mode-overwrite_bio-optimisation.patch
This is a note to let you know that I've just added the patch titled
dm bufio: fix integer overflow when limiting maximum cache size
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 74d4108d9e681dbbe4a2940ed8fdff1f6868184c Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Wed, 15 Nov 2017 16:38:09 -0800
Subject: dm bufio: fix integer overflow when limiting maximum cache size
From: Eric Biggers <ebiggers(a)google.com>
commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-bufio.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -974,7 +974,8 @@ static void __get_memory_limit(struct dm
buffers = c->minimum_buffers;
*limit_buffers = buffers;
- *threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
+ *threshold_buffers = mult_frac(buffers,
+ DM_BUFIO_WRITEBACK_PERCENT, 100);
}
/*
@@ -1910,19 +1911,15 @@ static int __init dm_bufio_init(void)
memset(&dm_bufio_caches, 0, sizeof dm_bufio_caches);
memset(&dm_bufio_cache_names, 0, sizeof dm_bufio_cache_names);
- mem = (__u64)((totalram_pages - totalhigh_pages) *
- DM_BUFIO_MEMORY_PERCENT / 100) << PAGE_SHIFT;
+ mem = (__u64)mult_frac(totalram_pages - totalhigh_pages,
+ DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT;
if (mem > ULONG_MAX)
mem = ULONG_MAX;
#ifdef CONFIG_MMU
- /*
- * Get the size of vmalloc space the same way as VMALLOC_TOTAL
- * in fs/proc/internal.h
- */
- if (mem > (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100)
- mem = (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100;
+ if (mem > mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100))
+ mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
#endif
dm_bufio_default_cache_size = mem;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-4.14/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
queue-4.14/dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
This is a note to let you know that I've just added the patch titled
dm: allocate struct mapped_device with kvzalloc
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-allocate-struct-mapped_device-with-kvzalloc.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 856eb0916d181da6d043cc33e03f54d5c5bbe54a Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka(a)redhat.com>
Date: Tue, 31 Oct 2017 19:33:02 -0400
Subject: dm: allocate struct mapped_device with kvzalloc
From: Mikulas Patocka <mpatocka(a)redhat.com>
commit 856eb0916d181da6d043cc33e03f54d5c5bbe54a upstream.
The structure srcu_struct can be very big, its size is proportional to the
value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
io_barrier in the struct mapped_device has 84kB in the debugging kernel
and 50kB in the non-debugging kernel. The large size may result in failure
of the function kzalloc_node.
In order to avoid the allocation failure, we use the function
kvzalloc_node, this function falls back to vmalloc if a large contiguous
chunk of memory is not available. This patch also moves the field
io_barrier to the last position of struct mapped_device - the reason is
that on many processor architectures, short memory offsets result in
smaller code than long memory offsets - on x86-64 it reduces code size by
320 bytes.
Note to stable kernel maintainers - the kernels 4.11 and older don't have
the function kvzalloc_node, you can use the function vzalloc_node instead.
Signed-off-by: Mikulas Patocka <mpatocka(a)redhat.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-core.h | 3 ++-
drivers/md/dm.c | 6 +++---
2 files changed, 5 insertions(+), 4 deletions(-)
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -29,7 +29,6 @@ struct dm_kobject_holder {
* DM targets must _not_ deference a mapped_device to directly access its members!
*/
struct mapped_device {
- struct srcu_struct io_barrier;
struct mutex suspend_lock;
/*
@@ -127,6 +126,8 @@ struct mapped_device {
struct blk_mq_tag_set *tag_set;
bool use_blk_mq:1;
bool init_tio_pdu:1;
+
+ struct srcu_struct io_barrier;
};
void dm_init_md_queue(struct mapped_device *md);
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1695,7 +1695,7 @@ static struct mapped_device *alloc_dev(i
struct mapped_device *md;
void *old_md;
- md = kzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id);
+ md = kvzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id);
if (!md) {
DMWARN("unable to allocate device, out of memory.");
return NULL;
@@ -1795,7 +1795,7 @@ bad_io_barrier:
bad_minor:
module_put(THIS_MODULE);
bad_module_get:
- kfree(md);
+ kvfree(md);
return NULL;
}
@@ -1814,7 +1814,7 @@ static void free_dev(struct mapped_devic
free_minor(minor);
module_put(THIS_MODULE);
- kfree(md);
+ kvfree(md);
}
static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
Patches currently in stable-queue which might be from mpatocka(a)redhat.com are
queue-4.14/dm-allocate-struct-mapped_device-with-kvzalloc.patch
queue-4.14/dm-integrity-allow-unaligned-bv_offset.patch
queue-4.14/dm-crypt-allow-unaligned-bv_offset.patch
This is a note to let you know that I've just added the patch titled
dm bufio: fix integer overflow when limiting maximum cache size
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 74d4108d9e681dbbe4a2940ed8fdff1f6868184c Mon Sep 17 00:00:00 2001
From: Eric Biggers <ebiggers(a)google.com>
Date: Wed, 15 Nov 2017 16:38:09 -0800
Subject: dm bufio: fix integer overflow when limiting maximum cache size
From: Eric Biggers <ebiggers(a)google.com>
commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: Mike Snitzer <snitzer(a)redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/md/dm-bufio.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -876,7 +876,8 @@ static void __get_memory_limit(struct dm
buffers = c->minimum_buffers;
*limit_buffers = buffers;
- *threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
+ *threshold_buffers = mult_frac(buffers,
+ DM_BUFIO_WRITEBACK_PERCENT, 100);
}
/*
@@ -1764,19 +1765,15 @@ static int __init dm_bufio_init(void)
memset(&dm_bufio_caches, 0, sizeof dm_bufio_caches);
memset(&dm_bufio_cache_names, 0, sizeof dm_bufio_cache_names);
- mem = (__u64)((totalram_pages - totalhigh_pages) *
- DM_BUFIO_MEMORY_PERCENT / 100) << PAGE_SHIFT;
+ mem = (__u64)mult_frac(totalram_pages - totalhigh_pages,
+ DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT;
if (mem > ULONG_MAX)
mem = ULONG_MAX;
#ifdef CONFIG_MMU
- /*
- * Get the size of vmalloc space the same way as VMALLOC_TOTAL
- * in fs/proc/internal.h
- */
- if (mem > (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100)
- mem = (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100;
+ if (mem > mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100))
+ mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
#endif
dm_bufio_default_cache_size = mem;
Patches currently in stable-queue which might be from ebiggers(a)google.com are
queue-3.18/lib-mpi-call-cond_resched-from-mpi_powm-loop.patch
queue-3.18/dm-bufio-fix-integer-overflow-when-limiting-maximum-cache-size.patch
This is a note to let you know that I've just added the patch titled
PCI: Set Cavium ACS capability quirk flags to assert RR/CR/SV/UF
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
pci-set-cavium-acs-capability-quirk-flags-to-assert-rr-cr-sv-uf.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7f342678634f16795892677204366e835e450dda Mon Sep 17 00:00:00 2001
From: Vadim Lomovtsev <Vadim.Lomovtsev(a)cavium.com>
Date: Tue, 17 Oct 2017 05:47:38 -0700
Subject: PCI: Set Cavium ACS capability quirk flags to assert RR/CR/SV/UF
From: Vadim Lomovtsev <Vadim.Lomovtsev(a)cavium.com>
commit 7f342678634f16795892677204366e835e450dda upstream.
The Cavium ThunderX (CN8XXX) family of PCIe Root Ports does not advertise
an ACS capability. However, the RTL internally implements similar
protection as if ACS had Request Redirection, Completion Redirection,
Source Validation, and Upstream Forwarding features enabled.
Change Cavium ACS capabilities quirk flags accordingly.
Fixes: b404bcfbf035 ("PCI: Add ACS quirk for all Cavium devices")
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev(a)cavium.com>
[bhelgaas: tidy changelog, comment, stable tag]
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/pci/quirks.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4088,12 +4088,14 @@ static int pci_quirk_amd_sb_acs(struct p
static int pci_quirk_cavium_acs(struct pci_dev *dev, u16 acs_flags)
{
/*
- * Cavium devices matching this quirk do not perform peer-to-peer
- * with other functions, allowing masking out these bits as if they
- * were unimplemented in the ACS capability.
+ * Cavium root ports don't advertise an ACS capability. However,
+ * the RTL internally implements similar protection as if ACS had
+ * Request Redirection, Completion Redirection, Source Validation,
+ * and Upstream Forwarding features enabled. Assert that the
+ * hardware implements and enables equivalent ACS functionality for
+ * these flags.
*/
- acs_flags &= ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
- PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
+ acs_flags &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_SV | PCI_ACS_UF);
return acs_flags ? 0 : 1;
}
Patches currently in stable-queue which might be from Vadim.Lomovtsev(a)cavium.com are
queue-4.9/pci-set-cavium-acs-capability-quirk-flags-to-assert-rr-cr-sv-uf.patch
This is a note to let you know that I've just added the patch titled
ALSA: hda: Add Raven PCI ID
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
alsa-hda-add-raven-pci-id.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 9ceace3c9c18c67676e75141032a65a8e01f9a7a Mon Sep 17 00:00:00 2001
From: Vijendar Mukunda <Vijendar.Mukunda(a)amd.com>
Date: Thu, 23 Nov 2017 20:07:00 +0530
Subject: ALSA: hda: Add Raven PCI ID
From: Vijendar Mukunda <Vijendar.Mukunda(a)amd.com>
commit 9ceace3c9c18c67676e75141032a65a8e01f9a7a upstream.
This commit adds PCI ID for Raven platform
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda(a)amd.com>
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
sound/pci/hda/hda_intel.c | 3 +++
1 file changed, 3 insertions(+)
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -2305,6 +2305,9 @@ static const struct pci_device_id azx_id
/* AMD Hudson */
{ PCI_DEVICE(0x1022, 0x780d),
.driver_data = AZX_DRIVER_GENERIC | AZX_DCAPS_PRESET_ATI_SB },
+ /* AMD Raven */
+ { PCI_DEVICE(0x1022, 0x15e3),
+ .driver_data = AZX_DRIVER_GENERIC | AZX_DCAPS_PRESET_ATI_SB },
/* ATI HDMI */
{ PCI_DEVICE(0x1002, 0x0002),
.driver_data = AZX_DRIVER_ATIHDMI_NS | AZX_DCAPS_PRESET_ATI_HDMI_NS },
Patches currently in stable-queue which might be from Vijendar.Mukunda(a)amd.com are
queue-4.9/alsa-hda-add-raven-pci-id.patch
This is a note to let you know that I've just added the patch titled
ALSA: hda: Add Raven PCI ID
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
alsa-hda-add-raven-pci-id.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 9ceace3c9c18c67676e75141032a65a8e01f9a7a Mon Sep 17 00:00:00 2001
From: Vijendar Mukunda <Vijendar.Mukunda(a)amd.com>
Date: Thu, 23 Nov 2017 20:07:00 +0530
Subject: ALSA: hda: Add Raven PCI ID
From: Vijendar Mukunda <Vijendar.Mukunda(a)amd.com>
commit 9ceace3c9c18c67676e75141032a65a8e01f9a7a upstream.
This commit adds PCI ID for Raven platform
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda(a)amd.com>
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
sound/pci/hda/hda_intel.c | 3 +++
1 file changed, 3 insertions(+)
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -2316,6 +2316,9 @@ static const struct pci_device_id azx_id
/* AMD Hudson */
{ PCI_DEVICE(0x1022, 0x780d),
.driver_data = AZX_DRIVER_GENERIC | AZX_DCAPS_PRESET_ATI_SB },
+ /* AMD Raven */
+ { PCI_DEVICE(0x1022, 0x15e3),
+ .driver_data = AZX_DRIVER_GENERIC | AZX_DCAPS_PRESET_ATI_SB },
/* ATI HDMI */
{ PCI_DEVICE(0x1002, 0x0002),
.driver_data = AZX_DRIVER_ATIHDMI_NS | AZX_DCAPS_PRESET_ATI_HDMI_NS },
Patches currently in stable-queue which might be from Vijendar.Mukunda(a)amd.com are
queue-4.4/alsa-hda-add-raven-pci-id.patch
This is a note to let you know that I've just added the patch titled
PM / OPP: Add missing of_node_put(np)
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
pm-opp-add-missing-of_node_put-np.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7978db344719dab1e56d05e6fc04aaaddcde0a5e Mon Sep 17 00:00:00 2001
From: Tobias Jordan <Tobias.Jordan(a)elektrobit.com>
Date: Wed, 4 Oct 2017 11:35:03 +0530
Subject: PM / OPP: Add missing of_node_put(np)
From: Tobias Jordan <Tobias.Jordan(a)elektrobit.com>
commit 7978db344719dab1e56d05e6fc04aaaddcde0a5e upstream.
The for_each_available_child_of_node() loop in _of_add_opp_table_v2()
doesn't drop the reference to "np" on errors. Fix that.
Fixes: 274659029c9d (PM / OPP: Add support to parse "operating-points-v2" bindings)
Signed-off-by: Tobias Jordan <Tobias.Jordan(a)elektrobit.com>
[ VK: Improved commit log. ]
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
Reviewed-by: Stephen Boyd <sboyd(a)codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/base/power/opp/of.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/base/power/opp/of.c
+++ b/drivers/base/power/opp/of.c
@@ -397,6 +397,7 @@ static int _of_add_opp_table_v2(struct d
dev_err(dev, "%s: Failed to add OPP, %d\n", __func__,
ret);
_dev_pm_opp_remove_table(opp_table, dev, false);
+ of_node_put(np);
goto put_opp_table;
}
}
Patches currently in stable-queue which might be from Tobias.Jordan(a)elektrobit.com are
queue-4.14/pm-opp-add-missing-of_node_put-np.patch
This is a note to let you know that I've just added the patch titled
PCI: Set Cavium ACS capability quirk flags to assert RR/CR/SV/UF
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
pci-set-cavium-acs-capability-quirk-flags-to-assert-rr-cr-sv-uf.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 7f342678634f16795892677204366e835e450dda Mon Sep 17 00:00:00 2001
From: Vadim Lomovtsev <Vadim.Lomovtsev(a)cavium.com>
Date: Tue, 17 Oct 2017 05:47:38 -0700
Subject: PCI: Set Cavium ACS capability quirk flags to assert RR/CR/SV/UF
From: Vadim Lomovtsev <Vadim.Lomovtsev(a)cavium.com>
commit 7f342678634f16795892677204366e835e450dda upstream.
The Cavium ThunderX (CN8XXX) family of PCIe Root Ports does not advertise
an ACS capability. However, the RTL internally implements similar
protection as if ACS had Request Redirection, Completion Redirection,
Source Validation, and Upstream Forwarding features enabled.
Change Cavium ACS capabilities quirk flags accordingly.
Fixes: b404bcfbf035 ("PCI: Add ACS quirk for all Cavium devices")
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev(a)cavium.com>
[bhelgaas: tidy changelog, comment, stable tag]
Signed-off-by: Bjorn Helgaas <bhelgaas(a)google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/pci/quirks.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4215,12 +4215,14 @@ static int pci_quirk_amd_sb_acs(struct p
static int pci_quirk_cavium_acs(struct pci_dev *dev, u16 acs_flags)
{
/*
- * Cavium devices matching this quirk do not perform peer-to-peer
- * with other functions, allowing masking out these bits as if they
- * were unimplemented in the ACS capability.
+ * Cavium root ports don't advertise an ACS capability. However,
+ * the RTL internally implements similar protection as if ACS had
+ * Request Redirection, Completion Redirection, Source Validation,
+ * and Upstream Forwarding features enabled. Assert that the
+ * hardware implements and enables equivalent ACS functionality for
+ * these flags.
*/
- acs_flags &= ~(PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
- PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT);
+ acs_flags &= ~(PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_SV | PCI_ACS_UF);
if (!((dev->device >= 0xa000) && (dev->device <= 0xa0ff)))
return -ENOTTY;
Patches currently in stable-queue which might be from Vadim.Lomovtsev(a)cavium.com are
queue-4.14/pci-set-cavium-acs-capability-quirk-flags-to-assert-rr-cr-sv-uf.patch
queue-4.14/pci-apply-cavium-thunderx-acs-quirk-to-more-root-ports.patch